From 5594655fba2805812feebbe49e01c70fcdafa341 Mon Sep 17 00:00:00 2001 From: Felipe Cardoso Date: Mon, 29 Dec 2025 13:31:02 +0100 Subject: [PATCH] docs: add architecture spikes and deep analysis documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive spike research documents: - SPIKE-002: Agent Orchestration Pattern (LangGraph + Temporal hybrid) - SPIKE-006: Knowledge Base pgvector (RAG with hybrid search) - SPIKE-007: Agent Communication Protocol (JSON-RPC + Redis Streams) - SPIKE-008: Workflow State Machine (transitions lib + event sourcing) - SPIKE-009: Issue Synchronization (bi-directional sync with conflict resolution) - SPIKE-010: Cost Tracking (LiteLLM callbacks + budget enforcement) - SPIKE-011: Audit Logging (structured event sourcing) - SPIKE-012: Client Approval Flow (checkpoint-based approvals) Add architecture documentation: - ARCHITECTURE_DEEP_ANALYSIS.md: Memory management, security, testing strategy - IMPLEMENTATION_ROADMAP.md: 6-phase, 24-week implementation plan Closes #2, #6, #7, #8, #9, #10, #11, #12 πŸ€– Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 --- .../ARCHITECTURE_DEEP_ANALYSIS.md | 680 ++++++ docs/architecture/IMPLEMENTATION_ROADMAP.md | 339 +++ .../SPIKE-002-agent-orchestration-pattern.md | 1326 ++++++++++++ .../SPIKE-006-knowledge-base-pgvector.md | 1259 ++++++++++++ .../SPIKE-007-agent-communication-protocol.md | 1496 ++++++++++++++ .../SPIKE-008-workflow-state-machine.md | 1513 ++++++++++++++ .../spikes/SPIKE-009-issue-synchronization.md | 1494 ++++++++++++++ docs/spikes/SPIKE-010-cost-tracking.md | 1821 +++++++++++++++++ docs/spikes/SPIKE-011-audit-logging.md | 1064 ++++++++++ docs/spikes/SPIKE-012-client-approval-flow.md | 1662 +++++++++++++++ 10 files changed, 12654 insertions(+) create mode 100644 docs/architecture/ARCHITECTURE_DEEP_ANALYSIS.md create mode 100644 docs/architecture/IMPLEMENTATION_ROADMAP.md create mode 100644 docs/spikes/SPIKE-002-agent-orchestration-pattern.md create mode 100644 docs/spikes/SPIKE-006-knowledge-base-pgvector.md create mode 100644 docs/spikes/SPIKE-007-agent-communication-protocol.md create mode 100644 docs/spikes/SPIKE-008-workflow-state-machine.md create mode 100644 docs/spikes/SPIKE-009-issue-synchronization.md create mode 100644 docs/spikes/SPIKE-010-cost-tracking.md create mode 100644 docs/spikes/SPIKE-011-audit-logging.md create mode 100644 docs/spikes/SPIKE-012-client-approval-flow.md diff --git a/docs/architecture/ARCHITECTURE_DEEP_ANALYSIS.md b/docs/architecture/ARCHITECTURE_DEEP_ANALYSIS.md new file mode 100644 index 0000000..1cc3885 --- /dev/null +++ b/docs/architecture/ARCHITECTURE_DEEP_ANALYSIS.md @@ -0,0 +1,680 @@ +# Syndarix Architecture Deep Analysis + +**Version:** 1.0 +**Date:** 2025-12-29 +**Status:** Draft - Architectural Thinking + +--- + +## Executive Summary + +This document captures deep architectural thinking about Syndarix beyond the immediate spikes. It addresses complex challenges that arise when building a truly autonomous multi-agent system and proposes solutions based on first principles. + +--- + +## 1. Agent Memory and Context Management + +### The Challenge + +Agents in Syndarix may work on projects for weeks or months. LLM context windows are finite (128K-200K tokens), but project context grows unboundedly. How do we maintain coherent agent "memory" over time? 
+ +### Analysis + +**Context Window Constraints:** +| Model | Context Window | Practical Limit (with tools) | +|-------|---------------|------------------------------| +| Claude 3.5 Sonnet | 200K tokens | ~150K usable | +| GPT-4 Turbo | 128K tokens | ~100K usable | +| Llama 3 (70B) | 8K-128K tokens | ~80K usable | + +**Memory Types Needed:** +1. **Working Memory** - Current task context (fits in context window) +2. **Short-term Memory** - Recent conversation history (RAG-retrievable) +3. **Long-term Memory** - Project knowledge, past decisions (RAG + summarization) +4. **Episodic Memory** - Specific past events/mistakes to learn from + +### Proposed Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Agent Memory System β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Working β”‚ β”‚ Short-term β”‚ β”‚ Long-term β”‚ β”‚ +β”‚ β”‚ Memory β”‚ β”‚ Memory β”‚ β”‚ Memory β”‚ β”‚ +β”‚ β”‚ (Context) β”‚ β”‚ (Redis) β”‚ β”‚ (pgvector) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Context Assembler β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ 1. System prompt (agent personality, role) β”‚ β”‚ +β”‚ β”‚ 2. Project context (from long-term memory) β”‚ β”‚ +β”‚ β”‚ 3. Task context (current issue, requirements) β”‚ β”‚ +β”‚ β”‚ 4. Relevant history (from short-term memory) β”‚ β”‚ +β”‚ β”‚ 5. 
User message β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Total: Fit within context window limits β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Context Compression Strategy:** +```python +class ContextManager: + """Manages agent context to fit within LLM limits.""" + + MAX_CONTEXT_TOKENS = 100_000 # Leave room for response + + async def build_context( + self, + agent: AgentInstance, + task: Task, + user_message: str + ) -> list[Message]: + # Fixed costs + system_prompt = self._get_system_prompt(agent) # ~2K tokens + task_context = self._get_task_context(task) # ~1K tokens + + # Variable budget + remaining = self.MAX_CONTEXT_TOKENS - token_count(system_prompt, task_context, user_message) + + # Allocate remaining to memories + long_term = await self._query_long_term(agent, task, budget=remaining * 0.4) + short_term = await self._get_short_term(agent, budget=remaining * 0.4) + episodic = await self._get_relevant_episodes(agent, task, budget=remaining * 0.2) + + return self._assemble_messages( + system_prompt, task_context, long_term, short_term, episodic, user_message + ) +``` + +**Conversation Summarization:** +- After every N turns (e.g., 10), summarize conversation and archive +- Use smaller/cheaper model for summarization +- Store summaries in pgvector for semantic retrieval + +### Recommendation + +Implement a **tiered memory system** with automatic context compression and semantic retrieval. Use Redis for hot short-term memory, pgvector for cold long-term memory, and automatic summarization to prevent context overflow. + +--- + +## 2. 
Cross-Project Knowledge Sharing + +### The Challenge + +Each project has isolated knowledge, but agents could benefit from cross-project learnings: +- Common patterns (authentication, testing, CI/CD) +- Technology expertise (how to configure Kubernetes) +- Anti-patterns (what didn't work before) + +### Analysis + +**Privacy Considerations:** +- Client data must remain isolated (contractual, legal) +- Technical patterns are generally shareable +- Need clear data classification + +**Knowledge Categories:** +| Category | Scope | Examples | +|----------|-------|----------| +| **Client Data** | Project-only | Requirements, business logic, code | +| **Technical Patterns** | Global | Best practices, configurations | +| **Agent Learnings** | Global | What approaches worked/failed | +| **Anti-patterns** | Global | Common mistakes to avoid | + +### Proposed Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Knowledge Graph β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ GLOBAL KNOWLEDGE β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Patterns β”‚ β”‚ Anti-patternsβ”‚ β”‚ Expertise β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Library β”‚ β”‚ Library β”‚ β”‚ Index β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β–² β”‚ +β”‚ β”‚ Curated extraction β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Project A β”‚ β”‚ Project B β”‚ β”‚ Project C β”‚ β”‚ +β”‚ β”‚ Knowledge β”‚ β”‚ Knowledge β”‚ β”‚ Knowledge β”‚ β”‚ +β”‚ β”‚ (Isolated) β”‚ β”‚ (Isolated) β”‚ β”‚ (Isolated) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Knowledge Extraction Pipeline:** +```python +class KnowledgeExtractor: + """Extracts shareable learnings from project work.""" + + async def extract_learnings(self, project_id: str) -> list[Learning]: + """ + Run periodically or after sprints to extract learnings. + Human review required before promoting to global. 
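+
+        Pipeline, as implemented below: fetch completed issues, ask the
+        LLM to extract candidate patterns, classify each pattern's privacy
+        level, and return only patterns classified "public" for human review.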
+ """ + # Get completed work + completed_issues = await self.get_completed_issues(project_id) + + # Extract patterns using LLM + patterns = await self.llm.extract_patterns( + completed_issues, + categories=["architecture", "testing", "deployment", "security"] + ) + + # Classify privacy + for pattern in patterns: + pattern.privacy_level = await self.llm.classify_privacy(pattern) + + # Return only shareable patterns for review + return [p for p in patterns if p.privacy_level == "public"] +``` + +### Recommendation + +Implement **privacy-aware knowledge extraction** with human review gate. Project knowledge stays isolated by default; only explicitly approved patterns flow to global knowledge. + +--- + +## 3. Agent Specialization vs Generalization Trade-offs + +### The Challenge + +Should each agent type be highly specialized (depth) or have overlapping capabilities (breadth)? + +### Analysis + +**Specialization Benefits:** +- Deeper expertise in domain +- Cleaner system prompts +- Less confusion about responsibilities +- Easier to optimize prompts per role + +**Generalization Benefits:** +- Fewer agent types to maintain +- Smoother handoffs (shared context) +- More flexible team composition +- Graceful degradation if agent unavailable + +**Current Agent Types (10):** +| Role | Primary Domain | Potential Overlap | +|------|---------------|-------------------| +| Product Owner | Requirements | Business Analyst | +| Business Analyst | Documentation | Product Owner | +| Project Manager | Planning | Product Owner | +| Software Architect | Design | Senior Engineer | +| Software Engineer | Coding | Architect, QA | +| UI/UX Designer | Interface | Frontend Engineer | +| QA Engineer | Testing | Software Engineer | +| DevOps Engineer | Infrastructure | Senior Engineer | +| AI/ML Engineer | ML/AI | Software Engineer | +| Security Expert | Security | All | + +### Proposed Approach: Layered Specialization + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Agent Capability Layers β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ Layer 3: Role-Specific Expertise β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Product β”‚ β”‚ Architectβ”‚ β”‚Engineer β”‚ β”‚ QA β”‚ β”‚ +β”‚ β”‚ Owner β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ Layer 2: Shared Professional Skills β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Technical Communication | Code Understanding | Git β”‚ β”‚ +β”‚ β”‚ Documentation | Research | Problem Decomposition β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ Layer 1: Foundation Model Capabilities β”‚ +β”‚ 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Reasoning | Analysis | Writing | Coding (LLM Base) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Capability Inheritance:** +```python +class AgentTypeBuilder: + """Builds agent types with layered capabilities.""" + + BASE_CAPABILITIES = [ + "reasoning", "analysis", "writing", "coding_assist" + ] + + PROFESSIONAL_SKILLS = [ + "technical_communication", "code_understanding", + "git_operations", "documentation", "research" + ] + + ROLE_SPECIFIC = { + "ENGINEER": ["code_generation", "code_review", "testing", "debugging"], + "ARCHITECT": ["system_design", "adr_writing", "tech_selection"], + "QA": ["test_planning", "test_automation", "bug_reporting"], + # ... + } + + def build_capabilities(self, role: AgentRole) -> list[str]: + return ( + self.BASE_CAPABILITIES + + self.PROFESSIONAL_SKILLS + + self.ROLE_SPECIFIC[role] + ) +``` + +### Recommendation + +Adopt **layered specialization** where all agents share foundational and professional capabilities, with role-specific expertise on top. This enables smooth collaboration while maintaining clear responsibilities. + +--- + +## 4. Human-Agent Collaboration Model + +### The Challenge + +Beyond approval gates, how do humans effectively collaborate with autonomous agents during active work? + +### Interaction Patterns + +| Pattern | Use Case | Frequency | +|---------|----------|-----------| +| **Approval** | Confirm before action | Per checkpoint | +| **Guidance** | Steer direction | On-demand | +| **Override** | Correct mistake | Rare | +| **Pair Working** | Work together | Optional | +| **Review** | Evaluate output | Post-completion | + +### Proposed Collaboration Interface + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Human-Agent Collaboration Dashboard β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Activity Stream β”‚ β”‚ +β”‚ β”‚ ────────────────────────────────────────────────────── β”‚ β”‚ +β”‚ β”‚ [10:23] Dave (Engineer) is implementing login API β”‚ β”‚ +β”‚ β”‚ [10:24] Dave created auth/service.py β”‚ β”‚ +β”‚ β”‚ [10:25] Dave is writing unit tests β”‚ β”‚ +β”‚ β”‚ [LIVE] Dave: "I'm adding JWT validation. Using HS256..." 
β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Intervention Panel β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ [πŸ’¬ Chat] [⏸️ Pause] [↩️ Undo Last] [πŸ“ Guide] β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Quick Guidance: β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ "Use RS256 instead of HS256 for JWT signing" β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ [Send] πŸ“€ β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Intervention API:** +```python +@router.post("/agents/{agent_id}/intervene") +async def intervene( + agent_id: UUID, + intervention: InterventionRequest, + current_user: User = Depends(get_current_user) +): + """Allow human to intervene in agent work.""" + match intervention.type: + case "pause": + await orchestrator.pause_agent(agent_id) + case "resume": + await orchestrator.resume_agent(agent_id) + case "guide": + await orchestrator.send_guidance(agent_id, intervention.message) + case "undo": + await orchestrator.undo_last_action(agent_id) + case "override": + await orchestrator.override_decision(agent_id, intervention.decision) +``` + +### Recommendation + +Build a **real-time collaboration dashboard** with intervention capabilities. Humans should be able to observe, guide, pause, and correct agents without stopping the entire workflow. + +--- + +## 5. Testing Strategy for Autonomous AI Systems + +### The Challenge + +Traditional testing (unit, integration, E2E) doesn't capture autonomous agent behavior. How do we ensure quality? + +### Testing Pyramid for AI Agents + +``` + β–² + β•± β•² + β•± β•² + β•± E2E β•² Agent Scenarios + β•± Agent β•² (Full workflows) + ╱─────────╲ + β•± Integrationβ•² Tool + LLM Integration + β•± (with mocks) β•² (Deterministic responses) + ╱─────────────────╲ + β•± Unit Tests β•² Orchestrator, Services + β•± (no LLM needed) β•² (Pure logic) + ╱───────────────────────╲ + β•± Prompt Testing β•² System prompt evaluation + β•± (LLM evals) β•²(Quality metrics) + ╱─────────────────────────────╲ +``` + +### Test Categories + +**1. 
Prompt Testing (Eval Framework):** +```python +class PromptEvaluator: + """Evaluate system prompt quality.""" + + TEST_CASES = [ + EvalCase( + name="requirement_extraction", + input="Client wants a mobile app for food delivery", + expected_behaviors=[ + "asks clarifying questions", + "identifies stakeholders", + "considers non-functional requirements" + ] + ), + EvalCase( + name="code_review_thoroughness", + input="Review this PR: [vulnerable SQL code]", + expected_behaviors=[ + "identifies SQL injection", + "suggests parameterized queries", + "mentions security best practices" + ] + ) + ] + + async def evaluate(self, agent_type: AgentType) -> EvalReport: + results = [] + for case in self.TEST_CASES: + response = await self.llm.complete( + system=agent_type.system_prompt, + user=case.input + ) + score = await self.judge_response(response, case.expected_behaviors) + results.append(score) + return EvalReport(results) +``` + +**2. Integration Testing (Mock LLM):** +```python +@pytest.fixture +def mock_llm(): + """Deterministic LLM responses for integration tests.""" + responses = { + "analyze requirements": "...", + "generate code": "def hello(): return 'world'", + "review code": "LGTM" + } + return MockLLM(responses) + +async def test_story_implementation_workflow(mock_llm): + """Test full workflow with predictable responses.""" + orchestrator = AgentOrchestrator(llm=mock_llm) + + result = await orchestrator.execute_workflow( + workflow="implement_story", + inputs={"story_id": "TEST-123"} + ) + + assert result.status == "completed" + assert "hello" in result.artifacts["code"] +``` + +**3. Agent Scenario Testing:** +```python +class AgentScenarioTest: + """End-to-end agent behavior testing.""" + + @scenario("engineer_handles_bug_report") + async def test_bug_resolution(self): + """Engineer agent should fix bugs correctly.""" + # Setup + project = await create_test_project() + engineer = await spawn_agent("engineer", project) + + # Act + bug = await create_issue( + project, + title="Login button not working", + type="bug" + ) + result = await engineer.handle(bug) + + # Assert + assert result.pr_created + assert result.tests_pass + assert "button" in result.changes_summary.lower() +``` + +### Recommendation + +Implement a **multi-layer testing strategy** with prompt evals, deterministic integration tests, and scenario-based agent testing. Use LLM-as-judge for evaluating open-ended responses. + +--- + +## 6. Rollback and Recovery + +### The Challenge + +Autonomous agents will make mistakes. How do we recover gracefully? 
+ +### Error Categories + +| Category | Example | Recovery Strategy | +|----------|---------|-------------------| +| **Reversible** | Wrong code generated | Revert commit, regenerate | +| **Partially Reversible** | Merged bad PR | Revert PR, fix, re-merge | +| **Non-reversible** | Deployed to production | Forward-fix or rollback deploy | +| **External Side Effects** | Email sent to client | Apology + correction | + +### Recovery Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Recovery System β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Action Log β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Action ID | Agent | Type | Reversible | State β”‚ β”‚ β”‚ +β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ +β”‚ β”‚ β”‚ a-001 | Dave | commit | Yes | completed β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ a-002 | Dave | push | Yes | completed β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ a-003 | Dave | create_pr | Yes | completed β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ a-004 | Kate | merge_pr | Partial | completed β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Rollback Engine β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ rollback_to(action_id) -> Reverses all actions after β”‚ β”‚ +β”‚ β”‚ undo_action(action_id) -> Reverses single action β”‚ β”‚ +β”‚ β”‚ compensate(action_id) -> Creates compensating action β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Action Logging:** +```python +class ActionLog: + """Immutable log of all agent actions for recovery.""" + + async def record( + self, + agent_id: UUID, + action_type: str, + inputs: dict, + outputs: dict, + reversible: bool, + reverse_action: str | None = None + ) -> ActionRecord: + record = ActionRecord( + id=uuid4(), + agent_id=agent_id, + action_type=action_type, + inputs=inputs, + outputs=outputs, + reversible=reversible, + reverse_action=reverse_action, + timestamp=datetime.utcnow() + ) + await 
self.db.add(record) + return record + + async def rollback_to(self, action_id: UUID) -> RollbackResult: + """Rollback all actions after the given action.""" + actions = await self.get_actions_after(action_id) + + results = [] + for action in reversed(actions): + if action.reversible: + result = await self._execute_reverse(action) + results.append(result) + else: + results.append(RollbackSkipped(action, reason="non-reversible")) + + return RollbackResult(results) +``` + +**Compensation Pattern:** +```python +class CompensationEngine: + """Handles compensating actions for non-reversible operations.""" + + COMPENSATIONS = { + "email_sent": "send_correction_email", + "deployment": "rollback_deployment", + "external_api_call": "create_reversal_request" + } + + async def compensate(self, action: ActionRecord) -> CompensationResult: + if action.action_type in self.COMPENSATIONS: + compensation = self.COMPENSATIONS[action.action_type] + return await self._execute_compensation(compensation, action) + else: + return CompensationResult( + status="manual_required", + message=f"No automatic compensation for {action.action_type}" + ) +``` + +### Recommendation + +Implement **comprehensive action logging** with rollback capabilities. Define compensation strategies for non-reversible actions. Enable point-in-time recovery for project state. + +--- + +## 7. Security Considerations for Autonomous Agents + +### Threat Model + +| Threat | Risk | Mitigation | +|--------|------|------------| +| Agent executes malicious code | High | Sandboxed execution, code review gates | +| Agent exfiltrates data | High | Network isolation, output filtering | +| Prompt injection via user input | Medium | Input sanitization, prompt hardening | +| Agent credential abuse | Medium | Least-privilege tokens, short TTL | +| Agent collusion | Low | Independent agent instances, monitoring | + +### Security Architecture + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Security Layers β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ Layer 4: Output Filtering β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ - Code scan before commit β”‚ β”‚ +β”‚ β”‚ - Secrets detection β”‚ β”‚ +β”‚ β”‚ - Policy compliance check β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Layer 3: Action Authorization β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ - Role-based permissions β”‚ β”‚ +β”‚ β”‚ - Project scope enforcement β”‚ β”‚ +β”‚ β”‚ - Sensitive action approval β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Layer 2: Input Sanitization β”‚ +β”‚ 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ - Prompt injection detection β”‚ β”‚ +β”‚ β”‚ - Content filtering β”‚ β”‚ +β”‚ β”‚ - Schema validation β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β”‚ Layer 1: Infrastructure Isolation β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ - Container sandboxing β”‚ β”‚ +β”‚ β”‚ - Network segmentation β”‚ β”‚ +β”‚ β”‚ - File system restrictions β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Recommendation + +Implement **defense-in-depth** with multiple security layers. Assume agents can be compromised and design for containment. + +--- + +## Summary of Recommendations + +| Area | Recommendation | Priority | +|------|----------------|----------| +| Memory | Tiered memory with context compression | High | +| Knowledge | Privacy-aware extraction with human gate | Medium | +| Specialization | Layered capabilities with role-specific top | Medium | +| Collaboration | Real-time dashboard with intervention | High | +| Testing | Multi-layer with prompt evals | High | +| Recovery | Action logging with rollback engine | High | +| Security | Defense-in-depth, assume compromise | High | + +--- + +## Next Steps + +1. **Validate with spike research** - Update based on spike findings +2. **Create detailed ADRs** - For memory, recovery, security +3. **Prototype critical paths** - Memory system, rollback engine +4. **Security review** - External audit before production + +--- + +*This document captures architectural thinking to guide implementation. It should be updated as spikes complete and design evolves.* diff --git a/docs/architecture/IMPLEMENTATION_ROADMAP.md b/docs/architecture/IMPLEMENTATION_ROADMAP.md new file mode 100644 index 0000000..4abe008 --- /dev/null +++ b/docs/architecture/IMPLEMENTATION_ROADMAP.md @@ -0,0 +1,339 @@ +# Syndarix Implementation Roadmap + +**Version:** 1.0 +**Date:** 2025-12-29 +**Status:** Draft + +--- + +## Executive Summary + +This roadmap outlines the phased implementation approach for Syndarix, prioritizing foundational infrastructure before advanced features. Each phase builds upon the previous, with clear milestones and deliverables. 
+ +--- + +## Phase 0: Foundation (Weeks 1-2) +**Goal:** Establish development infrastructure and basic platform + +### 0.1 Repository Setup +- [x] Fork PragmaStack to Syndarix +- [x] Create spike backlog in Gitea +- [x] Complete architecture documentation +- [ ] Rebrand codebase (Issue #13 - in progress) +- [ ] Configure CI/CD pipelines +- [ ] Set up development environment documentation + +### 0.2 Core Infrastructure +- [ ] Configure Redis for cache + pub/sub +- [ ] Set up Celery worker infrastructure +- [ ] Configure pgvector extension +- [ ] Create MCP server directory structure +- [ ] Set up Docker Compose for local development + +### Deliverables +- Fully branded Syndarix repository +- Working local development environment +- CI/CD pipeline running tests + +--- + +## Phase 1: Core Platform (Weeks 3-6) +**Goal:** Basic project and agent management without LLM integration + +### 1.1 Data Model +- [ ] Create Project entity and CRUD +- [ ] Create AgentType entity and CRUD +- [ ] Create AgentInstance entity and CRUD +- [ ] Create Issue entity with external tracker fields +- [ ] Create Sprint entity and CRUD +- [ ] Database migrations with Alembic + +### 1.2 API Layer +- [ ] Project management endpoints +- [ ] Agent type configuration endpoints +- [ ] Agent instance management endpoints +- [ ] Issue CRUD endpoints +- [ ] Sprint management endpoints + +### 1.3 Real-time Infrastructure +- [ ] Implement EventBus with Redis Pub/Sub +- [ ] Create SSE endpoint for project events +- [ ] Implement event types enum +- [ ] Add keepalive mechanism +- [ ] Client-side SSE handling + +### 1.4 Frontend Foundation +- [ ] Project dashboard page +- [ ] Agent configuration UI +- [ ] Issue list and detail views +- [ ] Real-time activity feed component +- [ ] Basic navigation and layout + +### Deliverables +- CRUD operations for all core entities +- Real-time event streaming working +- Basic admin UI for configuration + +--- + +## Phase 2: MCP Integration (Weeks 7-10) +**Goal:** Build MCP servers for external integrations + +### 2.1 MCP Client Infrastructure +- [ ] Create MCPClientManager class +- [ ] Implement server registry +- [ ] Add connection management with reconnection +- [ ] Create tool call routing + +### 2.2 LLM Gateway MCP (Priority 1) +- [ ] Create FastMCP server structure +- [ ] Implement LiteLLM integration +- [ ] Add model group routing +- [ ] Implement failover chain +- [ ] Add cost tracking callbacks +- [ ] Create token usage logging + +### 2.3 Knowledge Base MCP (Priority 2) +- [ ] Create pgvector schema for embeddings +- [ ] Implement document ingestion pipeline +- [ ] Create chunking strategies (code, markdown, text) +- [ ] Implement semantic search +- [ ] Add hybrid search (vector + keyword) +- [ ] Per-project collection isolation + +### 2.4 Git MCP (Priority 3) +- [ ] Create Git operations wrapper +- [ ] Implement clone, commit, push operations +- [ ] Add branch management +- [ ] Create PR operations +- [ ] Add Gitea API integration +- [ ] Implement GitHub/GitLab adapters + +### 2.5 Issues MCP (Priority 4) +- [ ] Create issue sync service +- [ ] Implement Gitea issue operations +- [ ] Add GitHub issue adapter +- [ ] Add GitLab issue adapter +- [ ] Implement bi-directional sync +- [ ] Create conflict resolution logic + +### Deliverables +- 4 working MCP servers +- LLM calls routed through gateway +- RAG search functional +- Git operations working +- Issue sync with external trackers + +--- + +## Phase 3: Agent Orchestration (Weeks 11-14) +**Goal:** Enable agents to perform autonomous work 
+ +### 3.1 Agent Runner +- [ ] Create AgentRunner class +- [ ] Implement context assembly +- [ ] Add memory management (short-term, long-term) +- [ ] Implement action execution +- [ ] Add tool call handling +- [ ] Create agent error handling + +### 3.2 Agent Orchestrator +- [ ] Implement spawn_agent method +- [ ] Create terminate_agent method +- [ ] Implement send_message routing +- [ ] Add broadcast functionality +- [ ] Create agent status tracking +- [ ] Implement agent recovery + +### 3.3 Inter-Agent Communication +- [ ] Define message format schema +- [ ] Implement message persistence +- [ ] Create message routing logic +- [ ] Add @mention parsing +- [ ] Implement priority queues +- [ ] Add conversation threading + +### 3.4 Background Task Integration +- [ ] Create Celery task wrappers +- [ ] Implement progress reporting +- [ ] Add task chaining for workflows +- [ ] Create agent queue routing +- [ ] Implement task retry logic + +### Deliverables +- Agents can be spawned and communicate +- Agents can call MCP tools +- Background tasks for long operations +- Agent activity visible in real-time + +--- + +## Phase 4: Workflow Engine (Weeks 15-18) +**Goal:** Implement structured workflows for software delivery + +### 4.1 State Machine Foundation +- [ ] Create workflow state machine base +- [ ] Implement state persistence +- [ ] Add transition validation +- [ ] Create state history logging +- [ ] Implement compensation patterns + +### 4.2 Core Workflows +- [ ] Requirements Discovery workflow +- [ ] Architecture Spike workflow +- [ ] Sprint Planning workflow +- [ ] Story Implementation workflow +- [ ] Sprint Demo workflow + +### 4.3 Approval Gates +- [ ] Create approval checkpoint system +- [ ] Implement approval UI components +- [ ] Add notification triggers +- [ ] Create timeout handling +- [ ] Implement escalation logic + +### 4.4 Autonomy Levels +- [ ] Implement FULL_CONTROL mode +- [ ] Implement MILESTONE mode +- [ ] Implement AUTONOMOUS mode +- [ ] Create autonomy configuration UI +- [ ] Add per-action approval overrides + +### Deliverables +- Structured workflows executing +- Approval gates working +- Autonomy levels configurable +- Full sprint cycle possible + +--- + +## Phase 5: Advanced Features (Weeks 19-22) +**Goal:** Polish and production readiness + +### 5.1 Cost Management +- [ ] Real-time cost tracking dashboard +- [ ] Budget configuration per project +- [ ] Alert threshold system +- [ ] Cost optimization recommendations +- [ ] Historical cost analytics + +### 5.2 Audit & Compliance +- [ ] Comprehensive action logging +- [ ] Audit trail viewer UI +- [ ] Export functionality +- [ ] Retention policy implementation +- [ ] Compliance report generation + +### 5.3 Human-Agent Collaboration +- [ ] Live activity dashboard +- [ ] Intervention panel (pause, guide, undo) +- [ ] Agent chat interface +- [ ] Context inspector +- [ ] Decision explainer + +### 5.4 Additional MCP Servers +- [ ] File System MCP +- [ ] Code Analysis MCP +- [ ] CI/CD MCP + +### Deliverables +- Production-ready system +- Full observability +- Cost controls active +- Audit compliance + +--- + +## Phase 6: Polish & Launch (Weeks 23-24) +**Goal:** Production deployment + +### 6.1 Performance Optimization +- [ ] Load testing +- [ ] Query optimization +- [ ] Caching optimization +- [ ] Memory profiling + +### 6.2 Security Hardening +- [ ] Security audit +- [ ] Penetration testing +- [ ] Secrets management +- [ ] Rate limiting tuning + +### 6.3 Documentation +- [ ] User documentation +- [ ] API documentation +- [ ] 
Deployment guide +- [ ] Runbook + +### 6.4 Deployment +- [ ] Production environment setup +- [ ] Monitoring & alerting +- [ ] Backup & recovery +- [ ] Launch checklist + +--- + +## Risk Register + +| Risk | Impact | Probability | Mitigation | +|------|--------|-------------|------------| +| LLM API outages | High | Medium | Multi-provider failover | +| Cost overruns | High | Medium | Budget enforcement, local models | +| Agent hallucinations | High | Medium | Approval gates, code review | +| Performance bottlenecks | Medium | Medium | Load testing, caching | +| Integration failures | Medium | Low | Contract testing, mocks | + +--- + +## Success Metrics + +| Metric | Target | Measurement | +|--------|--------|-------------| +| Agent task success rate | >90% | Completed tasks / total tasks | +| Response time (P95) | <2s | API latency | +| Cost per project | <$50/sprint | LLM + compute costs | +| Time to first commit | <1 hour | From requirements to PR | +| Client satisfaction | >4/5 | Post-sprint survey | + +--- + +## Dependencies + +``` +Phase 0 ─────▢ Phase 1 ─────▢ Phase 2 ─────▢ Phase 3 ─────▢ Phase 4 ─────▢ Phase 5 ─────▢ Phase 6 +Foundation Core Platform MCP Integration Agent Orch Workflows Advanced Launch + β”‚ + β”‚ + Depends on: + - LLM Gateway + - Knowledge Base + - Real-time events +``` + +--- + +## Resource Requirements + +### Development Team +- 1 Backend Engineer (Python/FastAPI) +- 1 Frontend Engineer (React/Next.js) +- 0.5 DevOps Engineer +- 0.25 Product Manager + +### Infrastructure +- PostgreSQL (managed or self-hosted) +- Redis (managed or self-hosted) +- Celery workers (2-4 instances) +- MCP servers (7 containers) +- API server (2+ instances) +- Frontend (static hosting or SSR) + +### External Services +- Anthropic API (primary LLM) +- OpenAI API (fallback) +- Ollama (local models, optional) +- Gitea/GitHub/GitLab (issue tracking) + +--- + +*This roadmap will be refined as spikes complete and requirements evolve.* diff --git a/docs/spikes/SPIKE-002-agent-orchestration-pattern.md b/docs/spikes/SPIKE-002-agent-orchestration-pattern.md new file mode 100644 index 0000000..e55482c --- /dev/null +++ b/docs/spikes/SPIKE-002-agent-orchestration-pattern.md @@ -0,0 +1,1326 @@ +# SPIKE-002: Agent Orchestration Pattern + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #2 + +--- + +## Executive Summary + +After researching leading multi-agent orchestration frameworks (AutoGen, CrewAI, LangGraph) and enterprise patterns, we recommend a **hybrid architecture** for Syndarix that combines: + +1. **LangGraph** for core agent orchestration logic (graph-based state machines) +2. **Temporal** for durable, long-running workflow execution +3. **Event-driven communication** via Redis Streams with A2A protocol concepts +4. **Hierarchical supervisor pattern** with peer-to-peer collaboration within teams + +This architecture provides the flexibility, scalability, and durability required for 50+ concurrent agent instances while maintaining individual agent context, token tracking, and LLM failover capabilities. + +### Key Recommendation + +**Do not adopt a single framework wholesale.** Instead, build a custom orchestration layer using: + +- **LangGraph** for agent logic and state transitions (graph primitives) +- **Temporal** for durable execution and workflow persistence +- **Redis Streams** for event-driven inter-agent communication +- **LiteLLM** for unified LLM access with failover (per SPIKE-005) + +--- + +## Research Questions Addressed + +### 1. 
What are the leading patterns for multi-agent orchestration in 2024-2025? + +The multi-agent landscape has matured significantly. Key patterns include: + +| Pattern | Description | Best For | +|---------|-------------|----------| +| **Supervisor** | Central orchestrator coordinates specialized agents | Complex multi-domain workflows | +| **Hierarchical** | Tree structure with managers and workers | Large-scale, layered problems | +| **Peer-to-Peer** | Agents collaborate as equals without central control | Open-ended ideation, negotiation | +| **Orchestrator-Worker** | Planner breaks tasks, workers execute | Deterministic pipelines | +| **Blackboard** | Shared knowledge space, agents react to changes | Incremental problem-solving | +| **Market-Based** | Agents bid for tasks based on capabilities | Dynamic resource allocation | + +**Industry Trend (2025):** Event-driven architectures are becoming dominant, with protocols like Google's A2A (Agent-to-Agent), IBM's ACP, and AG-UI standardizing communication. + +### 2. Framework Comparison + +| Feature | AutoGen 0.4 | CrewAI | LangGraph | Custom | +|---------|-------------|--------|-----------|--------| +| **Architecture** | Event-driven, distributed | Crews + Flows dual-mode | Graph-based state machine | Flexible | +| **State Management** | Built-in, cross-language | Shared memory | Persistent, checkpointed | Custom (Temporal) | +| **Scalability** | High (Kubernetes-ready) | Medium-High | High | Very High | +| **Learning Curve** | Medium | Low-Medium | High | High | +| **Enterprise Adoption** | Microsoft ecosystem | 60% Fortune 500 | Klarna, Replit, Elastic | N/A | +| **Long-running Workflows** | Good | Limited | Good | Excellent (Temporal) | +| **Customization** | Medium | Limited | High | Full | +| **Token Tracking** | Basic | Basic | Via LangSmith | Custom | +| **Multi-Provider Failover** | Limited | Limited | Limited | Full (LiteLLM) | +| **50+ Concurrent Agents** | Possible | Challenging | Possible | Designed for | + +#### AutoGen 0.4 (Microsoft) + +**Pros:** +- Event-driven, async-first architecture +- Cross-language support (.NET, Python) +- Built-in observability (OpenTelemetry) +- Strong Microsoft ecosystem integration +- New Agent Framework combines AutoGen + Semantic Kernel + +**Cons:** +- Tied to Microsoft patterns +- Less flexible for custom orchestration +- Newer 0.4 version still maturing + +**Best For:** Microsoft-centric enterprises, standardized agent patterns + +#### CrewAI + +**Pros:** +- Easy to get started (role-based agents) +- Good for sequential/hierarchical workflows +- Strong enterprise traction ($18M Series A) +- LLM-agnostic design + +**Cons:** +- Limited ceiling for complex patterns +- Teams report hitting walls at 6-12 months +- Multi-agent complexity can cause loops +- Flows architecture adds learning curve + +**Best For:** Quick prototypes, straightforward workflows, teams new to agents + +#### LangGraph + +**Pros:** +- Fine-grained control over agent flow +- Excellent state management with persistence +- Time-travel debugging (LangSmith) +- Human-in-the-loop built-in +- Cycles, conditionals, parallel execution + +**Cons:** +- Steep learning curve (graph theory, state machines) +- Requires distributed systems knowledge +- Observability requires LangSmith subscription + +**Best For:** Complex, stateful workflows requiring precise control + +### 3. 
Enterprise Agent State Management + +Enterprise systems use these patterns: + +**Event Sourcing:** +```python +# All state changes stored as events +class AgentStateEvent: + event_type: str # "TASK_ASSIGNED", "STATE_CHANGED", etc. + agent_id: str + timestamp: datetime + data: dict + sequence_number: int +``` + +**Checkpoint/Snapshot Pattern:** +```python +# Periodic snapshots for fast recovery +class AgentCheckpoint: + agent_id: str + state: AgentState + memory: dict + context_window: list[Message] + checkpoint_time: datetime +``` + +**Temporal Durable Execution:** +```python +# Workflow state is automatically persistent +@workflow.defn +class AgentWorkflow: + @workflow.run + async def run(self, task: Task) -> Result: + # State survives crashes, restarts, deployments + result = await workflow.execute_activity( + execute_agent_task, + task, + start_to_close_timeout=timedelta(hours=24) + ) + return result +``` + +### 4. Agent-to-Agent Communication + +**Recommended: Event-Driven with Redis Streams** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Redis Streams β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ project:{project_id}:events β”‚ β”‚ +β”‚ β”‚ - agent_message β”‚ β”‚ +β”‚ β”‚ - task_assignment β”‚ β”‚ +β”‚ β”‚ - state_change β”‚ β”‚ +β”‚ β”‚ - artifact_produced β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β–² β–² β–² β–² + β”‚ β”‚ β”‚ β”‚ + β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β” + β”‚ PO β”‚ β”‚ Arch β”‚ β”‚ Dev-1 β”‚ β”‚ Dev-2 β”‚ + β”‚ Agent β”‚ β”‚ Agent β”‚ β”‚ Agent β”‚ β”‚ Agent β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Protocol Design (A2A-inspired):** + +```python +@dataclass +class AgentMessage: + """Agent-to-Agent message following A2A concepts.""" + id: UUID + source_agent_id: str + target_agent_id: str | None # None = broadcast + project_id: str + message_type: Literal[ + "TASK_HANDOFF", + "CONTEXT_SHARE", + "REVIEW_REQUEST", + "FEEDBACK", + "ARTIFACT", + "QUERY", + ] + payload: dict + correlation_id: UUID | None # For request/response + timestamp: datetime + +@dataclass +class AgentCard: + """Agent capability advertisement (A2A concept).""" + agent_id: str + agent_type: str + capabilities: list[str] + current_state: AgentState + project_id: str +``` + +### 5. 
Long-Running Agent Workflows + +**Recommended: Temporal for Durable Execution** + +Temporal provides: +- **Automatic state persistence** - survives crashes and restarts +- **Built-in retries** with exponential backoff +- **Long-running support** - hours, days, even months +- **Human-in-the-loop** - pause and wait for approval +- **Visibility** - full execution history + +```python +# workflows/sprint_workflow.py +from temporalio import workflow, activity +from datetime import timedelta + +@workflow.defn +class SprintWorkflow: + """Durable workflow for autonomous sprint execution.""" + + def __init__(self): + self._state = SprintState.PLANNING + self._agents: dict[str, AgentHandle] = {} + self._artifacts: list[Artifact] = [] + + @workflow.run + async def run(self, sprint: SprintConfig) -> SprintResult: + # Phase 1: Planning (may take hours) + backlog = await workflow.execute_activity( + plan_sprint, + sprint, + start_to_close_timeout=timedelta(hours=4), + ) + + # Phase 2: Parallel development (may take days) + dev_tasks = await self._execute_development(backlog) + + # Phase 3: Review checkpoint (human approval) + if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS: + await workflow.wait_condition( + lambda: self._human_approved + ) + + # Phase 4: QA and finalization + result = await workflow.execute_activity( + finalize_sprint, + dev_tasks, + start_to_close_timeout=timedelta(hours=2), + ) + + return result + + @workflow.signal + async def approve_checkpoint(self, approved: bool): + self._human_approved = approved + + @workflow.query + def get_state(self) -> SprintState: + return self._state +``` + +### 6. Orchestration Topology Comparison + +#### Supervisor Pattern + +``` + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Supervisor β”‚ + β”‚ (Orchestrator)β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ β”‚ β”‚ + β–Ό β–Ό β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Specialist 1 β”‚ β”‚ Specialist 2 β”‚ β”‚ Specialist 3 β”‚ +β”‚ (Dev Agent) β”‚ β”‚ (QA Agent) β”‚ β”‚ (DevOps) β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Pros:** Clear control, auditability, easier debugging +**Cons:** Bottleneck at supervisor, single point of failure +**Syndarix Use:** Project Manager as supervisor for sprints + +#### Hierarchical Pattern + +``` + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Project Lead β”‚ + β”‚ (PO) β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ β”‚ β”‚ + β–Ό β–Ό β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Dev Lead β”‚ β”‚ QA Lead β”‚ β”‚ Ops Lead β”‚ + β”‚ (Architect)β”‚ β”‚ β”‚ β”‚ β”‚ + β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” + β”‚ β”‚ β”‚ β”‚ β”‚ + β–Ό β–Ό β–Ό β–Ό β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” 
β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” +β”‚ Dave β”‚ β”‚ Ellisβ”‚ β”‚ Kate β”‚ β”‚ QA-1 β”‚ β”‚ QA-2 β”‚ +β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Pros:** Scalable, localized decision-making, mirrors real teams +**Cons:** Communication overhead, coordination complexity +**Syndarix Use:** Recommended primary pattern + +#### Peer-to-Peer Pattern + +``` + β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” + β”‚ Dave │◄───►│ Ellisβ”‚ + β””β”€β”€β”¬β”€β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”€β”˜ + β”‚ β”‚ + β”‚ β”Œβ”€β”€β”€β”€β”€β”€β” β”‚ + └──►│ Kate β”‚β—„β”˜ + β””β”€β”€β”¬β”€β”€β”€β”˜ + β”‚ + β–Ό + [Shared Blackboard] +``` + +**Pros:** Flexible, emergent collaboration, no bottleneck +**Cons:** Harder to control, potential infinite loops +**Syndarix Use:** Within teams for brainstorming/spikes + +--- + +## Proposed Architecture for Syndarix + +### High-Level Design + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Syndarix Orchestration Layer β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Temporal Workflow Engine β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Project β”‚ β”‚ Sprint β”‚ β”‚ Agent Task β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Workflow β”‚ β”‚ Workflow β”‚ β”‚ Workflow β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Agent Runtime (LangGraph) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Agent State Graph β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [BLOCKED] β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β–² β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Agent β”‚ β”‚ Agent β”‚ β”‚ Agent β”‚ β”‚ Agent β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Instance β”‚ β”‚ Instance β”‚ β”‚ Instance β”‚ β”‚ Instance β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ (PO) β”‚ β”‚ (Arch) β”‚ β”‚ (Dev-1) β”‚ β”‚ (Dev-2) β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Communication Layer β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Redis Streams β”‚ β”‚ WebSocket/SSE β”‚ β”‚ Event Bus β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ (Agent-Agent) β”‚ β”‚ (Agent-UI) β”‚ β”‚ (System) β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Infrastructure Layer β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ LiteLLM β”‚ β”‚ MCP β”‚ β”‚ PostgreSQL β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ (Gateway) β”‚ β”‚ (Servers) β”‚ β”‚ (pgvector) β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Agent Instance Model + +```python +# app/models/agent_instance.py +from sqlalchemy import Column, String, Enum, ForeignKey, JSON, DateTime +from sqlalchemy.dialects.postgresql import UUID +from app.db.base import Base +import enum + +class AgentState(str, enum.Enum): + IDLE = "idle" + THINKING = "thinking" + EXECUTING = "executing" + WAITING_INPUT = "waiting_input" + WAITING_REVIEW = "waiting_review" + BLOCKED = "blocked" + COMPLETED = "completed" + ERROR = "error" + +class AgentInstance(Base): + """Individual agent instance spawned from a type.""" + __tablename__ = "agent_instances" + + id = Column(UUID, primary_key=True) + name = 
Column(String(50)) # "Dave", "Ellis", "Kate" + + # Type relationship + agent_type_id = Column(UUID, ForeignKey("agent_types.id")) + + # Project assignment + project_id = Column(UUID, ForeignKey("projects.id")) + + # Current state + state = Column(Enum(AgentState), default=AgentState.IDLE) + current_task_id = Column(UUID, ForeignKey("tasks.id"), nullable=True) + + # Context/Memory + working_memory = Column(JSON, default=dict) # Short-term context + system_prompt_override = Column(String, nullable=True) + + # Token/Cost tracking (per instance) + total_tokens_used = Column(Integer, default=0) + total_cost_usd = Column(Float, default=0.0) + + # Timestamps + created_at = Column(DateTime, default=datetime.utcnow) + last_active_at = Column(DateTime, nullable=True) + + # Relationships + agent_type = relationship("AgentType", back_populates="instances") + project = relationship("Project", back_populates="agents") + conversations = relationship("Conversation", back_populates="agent") +``` + +### Agent State Machine (LangGraph) + +```python +# app/agents/state_machine.py +from langgraph.graph import StateGraph, END +from langgraph.checkpoint.postgres import PostgresSaver +from typing import TypedDict, Annotated, Literal +from operator import add + +class AgentGraphState(TypedDict): + """State shared across agent graph nodes.""" + agent_id: str + project_id: str + task: dict | None + messages: Annotated[list, add] # Append-only + artifacts: Annotated[list, add] + current_state: AgentState + iteration_count: int + error: str | None + +def create_agent_graph(agent_type: AgentType) -> StateGraph: + """Create a LangGraph state machine for an agent type.""" + + graph = StateGraph(AgentGraphState) + + # Define nodes + graph.add_node("think", think_node) + graph.add_node("execute", execute_node) + graph.add_node("review", review_node) + graph.add_node("handoff", handoff_node) + graph.add_node("error_handler", error_node) + + # Define edges + graph.set_entry_point("think") + + graph.add_conditional_edges( + "think", + route_after_think, + { + "execute": "execute", + "handoff": "handoff", + "review": "review", + "end": END, + } + ) + + graph.add_conditional_edges( + "execute", + route_after_execute, + { + "think": "think", # Continue reasoning + "review": "review", + "error": "error_handler", + } + ) + + graph.add_edge("review", "think") + graph.add_edge("error_handler", END) + graph.add_edge("handoff", END) + + return graph.compile( + checkpointer=PostgresSaver.from_conn_string(settings.DATABASE_URL) + ) + +async def think_node(state: AgentGraphState) -> AgentGraphState: + """Agent reasoning/planning node.""" + agent = await get_agent_instance(state["agent_id"]) + + # Get LLM response + response = await llm_gateway.complete( + agent_id=state["agent_id"], + project_id=state["project_id"], + messages=build_messages(agent, state), + model_preference=agent.agent_type.model_preference, + ) + + # Parse response for next action + action = parse_agent_action(response["content"]) + + return { + **state, + "messages": [{"role": "assistant", "content": response["content"]}], + "current_action": action, + "iteration_count": state["iteration_count"] + 1, + } + +async def execute_node(state: AgentGraphState) -> AgentGraphState: + """Execute agent action (tool calls, artifact creation).""" + action = state.get("current_action") + + if action["type"] == "tool_call": + result = await mcp_client.call_tool( + server=action["server"], + tool_name=action["tool"], + arguments={ + "project_id": state["project_id"], + "agent_id": 
state["agent_id"], + **action["arguments"], + } + ) + return { + **state, + "messages": [{"role": "tool", "content": str(result)}], + } + + elif action["type"] == "produce_artifact": + artifact = await create_artifact(action["artifact"]) + return { + **state, + "artifacts": [artifact], + } + + return state +``` + +### Inter-Agent Communication + +```python +# app/agents/communication.py +from redis import asyncio as aioredis +from dataclasses import dataclass, asdict +import json + +@dataclass +class AgentMessage: + id: str + source_agent_id: str + target_agent_id: str | None + project_id: str + message_type: str + payload: dict + timestamp: str + correlation_id: str | None = None + +class AgentMessageBus: + """Redis Streams-based message bus for agent communication.""" + + def __init__(self, redis: aioredis.Redis): + self.redis = redis + + async def publish(self, message: AgentMessage) -> str: + """Publish message to project stream.""" + stream_key = f"project:{message.project_id}:agent_events" + + message_id = await self.redis.xadd( + stream_key, + asdict(message), + maxlen=10000, # Keep last 10k messages + ) + + # Also publish to target-specific stream if targeted + if message.target_agent_id: + target_stream = f"agent:{message.target_agent_id}:inbox" + await self.redis.xadd(target_stream, asdict(message)) + + return message_id + + async def subscribe( + self, + agent_id: str, + project_id: str, + ) -> AsyncGenerator[AgentMessage, None]: + """Subscribe to messages for an agent.""" + streams = { + f"agent:{agent_id}:inbox": ">", + f"project:{project_id}:agent_events": ">", + } + + while True: + messages = await self.redis.xread( + streams, + count=10, + block=5000, # 5 second timeout + ) + + for stream_name, stream_messages in messages: + for msg_id, msg_data in stream_messages: + yield AgentMessage(**msg_data) + + async def handoff_task( + self, + source_agent_id: str, + target_agent_id: str, + project_id: str, + task: dict, + context: dict, + ) -> None: + """Hand off a task from one agent to another.""" + message = AgentMessage( + id=str(uuid4()), + source_agent_id=source_agent_id, + target_agent_id=target_agent_id, + project_id=project_id, + message_type="TASK_HANDOFF", + payload={ + "task": task, + "context": context, + "handoff_reason": "task_completion", + }, + timestamp=datetime.utcnow().isoformat(), + ) + await self.publish(message) +``` + +### Agent Supervisor (Hierarchical Control) + +```python +# app/agents/supervisor.py +from langgraph.graph import StateGraph, END + +class SupervisorState(TypedDict): + """State for supervisor agent managing a team.""" + project_id: str + team_agents: list[str] + pending_tasks: list[dict] + in_progress: dict[str, dict] # agent_id -> task + completed_tasks: list[dict] + current_phase: str + +def create_supervisor_graph(supervisor_type: str) -> StateGraph: + """Create supervisor graph for team coordination.""" + + graph = StateGraph(SupervisorState) + + graph.add_node("assess_state", assess_team_state) + graph.add_node("assign_tasks", assign_tasks_to_agents) + graph.add_node("monitor_progress", monitor_agent_progress) + graph.add_node("handle_completion", handle_task_completion) + graph.add_node("coordinate_handoff", coordinate_agent_handoff) + graph.add_node("escalate", escalate_to_human) + + graph.set_entry_point("assess_state") + + graph.add_conditional_edges( + "assess_state", + decide_supervisor_action, + { + "assign": "assign_tasks", + "monitor": "monitor_progress", + "coordinate": "coordinate_handoff", + "escalate": "escalate", + 
"complete": END, + } + ) + + # ... additional edges + + return graph.compile() + +async def assess_team_state(state: SupervisorState) -> SupervisorState: + """Assess current state of all team agents.""" + team_status = {} + + for agent_id in state["team_agents"]: + agent = await get_agent_instance(agent_id) + team_status[agent_id] = { + "state": agent.state, + "current_task": agent.current_task_id, + "last_active": agent.last_active_at, + } + + return { + **state, + "team_status": team_status, + } + +async def assign_tasks_to_agents(state: SupervisorState) -> SupervisorState: + """Assign pending tasks to available agents.""" + available_agents = [ + agent_id for agent_id, status in state["team_status"].items() + if status["state"] == AgentState.IDLE + ] + + assignments = [] + for task in state["pending_tasks"]: + if not available_agents: + break + + # Find best agent for task + best_agent = await select_best_agent(task, available_agents) + + # Assign task + await message_bus.publish(AgentMessage( + id=str(uuid4()), + source_agent_id="supervisor", + target_agent_id=best_agent, + project_id=state["project_id"], + message_type="TASK_ASSIGNMENT", + payload={"task": task}, + timestamp=datetime.utcnow().isoformat(), + )) + + assignments.append((best_agent, task)) + available_agents.remove(best_agent) + + return { + **state, + "in_progress": { + **state["in_progress"], + **{agent: task for agent, task in assignments}, + }, + "pending_tasks": [ + t for t in state["pending_tasks"] + if t not in [task for _, task in assignments] + ], + } +``` + +### Temporal Workflow Integration + +```python +# app/workflows/project_workflow.py +from temporalio import workflow, activity +from temporalio.common import RetryPolicy +from datetime import timedelta + +@workflow.defn +class ProjectWorkflow: + """Long-running workflow for project lifecycle.""" + + def __init__(self): + self._phase = ProjectPhase.DISCOVERY + self._sprints: list[SprintResult] = [] + self._human_feedback_pending = False + + @workflow.run + async def run(self, project: ProjectConfig) -> ProjectResult: + # Phase 1: Discovery (PO + BA brainstorm) + requirements = await workflow.execute_activity( + run_discovery_phase, + project, + start_to_close_timeout=timedelta(hours=8), + retry_policy=RetryPolicy(maximum_attempts=3), + ) + + # Phase 2: Architecture spike + architecture = await workflow.execute_activity( + run_architecture_spike, + requirements, + start_to_close_timeout=timedelta(hours=4), + ) + + # Checkpoint: Human review of architecture + if project.autonomy_level != AutonomyLevel.AUTONOMOUS: + self._human_feedback_pending = True + await workflow.wait_condition( + lambda: not self._human_feedback_pending + ) + + # Phase 3: Sprint execution (iterative) + while not self._is_complete(): + sprint_result = await workflow.execute_child_workflow( + SprintWorkflow.run, + self._next_sprint_config(), + id=f"{project.id}-sprint-{len(self._sprints) + 1}", + ) + self._sprints.append(sprint_result) + + # Milestone checkpoint if configured + if project.autonomy_level == AutonomyLevel.MILESTONE: + self._human_feedback_pending = True + await workflow.wait_condition( + lambda: not self._human_feedback_pending + ) + + return ProjectResult( + project_id=project.id, + sprints=self._sprints, + status="completed", + ) + + @workflow.signal + async def provide_feedback(self, feedback: HumanFeedback): + """Handle human feedback at checkpoints.""" + self._last_feedback = feedback + self._human_feedback_pending = False + + @workflow.query + def get_status(self) -> 
ProjectStatus: + return ProjectStatus( + phase=self._phase, + sprints_completed=len(self._sprints), + awaiting_feedback=self._human_feedback_pending, + ) +``` + +### Token Usage Tracking Per Agent + +```python +# app/services/agent_metrics.py +from sqlalchemy.ext.asyncio import AsyncSession +from app.models.token_usage import TokenUsage + +class AgentMetricsService: + """Track token usage and costs per agent instance.""" + + def __init__(self, db: AsyncSession): + self.db = db + + async def record_completion( + self, + agent_instance_id: str, + project_id: str, + model: str, + prompt_tokens: int, + completion_tokens: int, + latency_ms: int, + ) -> TokenUsage: + """Record a completion for an agent instance.""" + cost = self._calculate_cost(model, prompt_tokens, completion_tokens) + + usage = TokenUsage( + agent_instance_id=agent_instance_id, + project_id=project_id, + model=model, + prompt_tokens=prompt_tokens, + completion_tokens=completion_tokens, + total_tokens=prompt_tokens + completion_tokens, + cost_usd=cost, + latency_ms=latency_ms, + timestamp=datetime.utcnow(), + ) + + self.db.add(usage) + + # Update agent instance totals + agent = await self.db.get(AgentInstance, agent_instance_id) + agent.total_tokens_used += usage.total_tokens + agent.total_cost_usd += cost + agent.last_active_at = datetime.utcnow() + + await self.db.commit() + return usage + + async def get_agent_usage_summary( + self, + agent_instance_id: str, + period_days: int = 30, + ) -> dict: + """Get usage summary for an agent instance.""" + cutoff = datetime.utcnow() - timedelta(days=period_days) + + result = await self.db.execute( + select( + func.sum(TokenUsage.total_tokens).label("total_tokens"), + func.sum(TokenUsage.cost_usd).label("total_cost"), + func.count(TokenUsage.id).label("completion_count"), + func.avg(TokenUsage.latency_ms).label("avg_latency"), + ) + .where(TokenUsage.agent_instance_id == agent_instance_id) + .where(TokenUsage.timestamp >= cutoff) + ) + + return dict(result.first()) + + async def get_project_agent_breakdown( + self, + project_id: str, + ) -> list[dict]: + """Get token usage breakdown by agent for a project.""" + result = await self.db.execute( + select( + AgentInstance.id, + AgentInstance.name, + AgentType.role, + func.sum(TokenUsage.total_tokens).label("tokens"), + func.sum(TokenUsage.cost_usd).label("cost"), + ) + .join(TokenUsage, TokenUsage.agent_instance_id == AgentInstance.id) + .join(AgentType, AgentType.id == AgentInstance.agent_type_id) + .where(AgentInstance.project_id == project_id) + .group_by(AgentInstance.id, AgentInstance.name, AgentType.role) + .order_by(func.sum(TokenUsage.cost_usd).desc()) + ) + + return [dict(row) for row in result] +``` + +### Real-Time Agent Activity Visibility + +```python +# app/services/realtime_updates.py +from fastapi import WebSocket +from redis import asyncio as aioredis + +class AgentActivityStream: + """Real-time agent activity updates via WebSocket/SSE.""" + + def __init__(self, redis: aioredis.Redis): + self.redis = redis + self.pubsub = redis.pubsub() + + async def subscribe_project( + self, + websocket: WebSocket, + project_id: str, + ): + """Stream agent activities for a project.""" + await websocket.accept() + + channel = f"project:{project_id}:activities" + await self.pubsub.subscribe(channel) + + try: + async for message in self.pubsub.listen(): + if message["type"] == "message": + await websocket.send_json(json.loads(message["data"])) + except WebSocketDisconnect: + await self.pubsub.unsubscribe(channel) + + async def 
publish_activity( + self, + project_id: str, + activity: AgentActivity, + ): + """Publish an agent activity event.""" + channel = f"project:{project_id}:activities" + await self.redis.publish(channel, json.dumps({ + "type": "agent_activity", + "agent_id": activity.agent_id, + "agent_name": activity.agent_name, + "agent_role": activity.agent_role, + "activity_type": activity.activity_type, + "description": activity.description, + "state": activity.state, + "timestamp": activity.timestamp.isoformat(), + "metadata": activity.metadata, + })) + +# Activity types for real-time updates +class AgentActivityType(str, enum.Enum): + STATE_CHANGE = "state_change" + THINKING = "thinking" + TOOL_CALL = "tool_call" + MESSAGE_SENT = "message_sent" + ARTIFACT_CREATED = "artifact_created" + ERROR = "error" + TASK_STARTED = "task_started" + TASK_COMPLETED = "task_completed" +``` + +--- + +## Code Examples for Key Patterns + +### Pattern 1: Spawning Multiple Agent Instances + +```python +# app/services/agent_factory.py +class AgentFactory: + """Factory for creating agent instances from types.""" + + DEVELOPER_NAMES = ["Dave", "Ellis", "Kate", "Marcus", "Nina"] + QA_NAMES = ["Quinn", "Raja", "Sierra", "Tyler", "Uma"] + + async def spawn_agent_team( + self, + project_id: str, + team_config: TeamConfig, + ) -> list[AgentInstance]: + """Spawn a team of agents for a project.""" + agents = [] + name_counters = {} + + for agent_spec in team_config.agents: + agent_type = await self.db.get(AgentType, agent_spec.type_id) + + # Get next available name for this type + name = self._get_next_name(agent_type.role, name_counters) + + agent = AgentInstance( + id=uuid4(), + name=name, + agent_type_id=agent_type.id, + project_id=project_id, + state=AgentState.IDLE, + working_memory={ + "project_context": {}, + "conversation_history": [], + "current_focus": None, + }, + ) + + self.db.add(agent) + agents.append(agent) + + await self.db.commit() + + # Initialize agent graphs + for agent in agents: + await self._initialize_agent_graph(agent) + + return agents + + def _get_next_name(self, role: str, counters: dict) -> str: + """Get next name for an agent role.""" + names = { + "Software Engineer": self.DEVELOPER_NAMES, + "QA Engineer": self.QA_NAMES, + # ... 
other roles + }.get(role, ["Agent"]) + + idx = counters.get(role, 0) + counters[role] = idx + 1 + + if idx < len(names): + return names[idx] + return f"{names[0]}-{idx + 1}" +``` + +### Pattern 2: LLM Failover Per Agent Type + +```python +# app/agents/llm_config.py +from litellm import Router + +class AgentLLMConfig: + """Per-agent-type LLM configuration with failover.""" + + # High-stakes agents get premium models with reliable fallbacks + HIGH_REASONING_CHAIN = [ + {"model": "claude-sonnet-4-20250514", "provider": "anthropic"}, + {"model": "gpt-4-turbo", "provider": "openai"}, + {"model": "claude-3-5-sonnet-20241022", "provider": "anthropic"}, + ] + + # Fast agents get quick models + FAST_RESPONSE_CHAIN = [ + {"model": "claude-3-haiku-20240307", "provider": "anthropic"}, + {"model": "gpt-4o-mini", "provider": "openai"}, + ] + + # Cost-optimized for high-volume tasks + COST_OPTIMIZED_CHAIN = [ + {"model": "ollama/llama3", "provider": "ollama"}, + {"model": "gpt-4o-mini", "provider": "openai"}, + ] + + AGENT_TYPE_MAPPING = { + "Product Owner": HIGH_REASONING_CHAIN, + "Software Architect": HIGH_REASONING_CHAIN, + "Software Engineer": HIGH_REASONING_CHAIN, + "Business Analyst": HIGH_REASONING_CHAIN, + "QA Engineer": FAST_RESPONSE_CHAIN, + "Project Manager": FAST_RESPONSE_CHAIN, + "DevOps Engineer": FAST_RESPONSE_CHAIN, + } + + def build_router_for_agent(self, agent_type: str) -> Router: + """Build LiteLLM router with failover for agent type.""" + chain = self.AGENT_TYPE_MAPPING.get(agent_type, self.FAST_RESPONSE_CHAIN) + + model_list = [] + for i, model_config in enumerate(chain): + model_list.append({ + "model_name": f"agent-{agent_type.lower().replace(' ', '-')}", + "litellm_params": { + "model": model_config["model"], + "api_key": self._get_api_key(model_config["provider"]), + }, + "model_info": {"id": i, "priority": i}, + }) + + return Router( + model_list=model_list, + routing_strategy="simple-shuffle", + num_retries=3, + timeout=120, + fallbacks=[ + {f"agent-{agent_type.lower().replace(' ', '-')}": + [f"agent-{agent_type.lower().replace(' ', '-')}"]} + ], + ) +``` + +### Pattern 3: Agent Context Isolation + +```python +# app/agents/context_manager.py +class AgentContextManager: + """Manage individual agent context/memory.""" + + def __init__(self, redis: aioredis.Redis, db: AsyncSession): + self.redis = redis + self.db = db + + async def get_agent_context( + self, + agent_id: str, + max_messages: int = 50, + ) -> AgentContext: + """Get full context for an agent.""" + agent = await self.db.get(AgentInstance, agent_id) + + # Get recent conversation history + conversations = await self._get_recent_conversations( + agent_id, max_messages + ) + + # Get project context + project_context = await self._get_project_context(agent.project_id) + + # Get working memory from Redis (fast access) + working_memory = await self.redis.hgetall( + f"agent:{agent_id}:memory" + ) + + return AgentContext( + agent=agent, + system_prompt=self._build_system_prompt(agent), + conversation_history=conversations, + project_context=project_context, + working_memory=working_memory, + tools=await self._get_available_tools(agent), + ) + + def _build_system_prompt(self, agent: AgentInstance) -> str: + """Build personalized system prompt for agent.""" + base_prompt = agent.agent_type.system_prompt + + return f""" +{base_prompt} + +You are {agent.name}, a {agent.agent_type.role} working on project {agent.project.name}. 
+ +Your personality traits: +{agent.agent_type.personality_traits} + +Current focus areas: +{json.dumps(agent.working_memory.get('current_focus', []))} + +Remember: You have your own perspective and expertise. Collaborate with other agents +but bring your unique viewpoint to discussions. +""" + + async def update_working_memory( + self, + agent_id: str, + key: str, + value: Any, + ttl: int = 3600, + ): + """Update agent's working memory.""" + await self.redis.hset( + f"agent:{agent_id}:memory", + key, + json.dumps(value), + ) + await self.redis.expire(f"agent:{agent_id}:memory", ttl) +``` + +--- + +## Risks and Mitigations + +| Risk | Impact | Probability | Mitigation | +|------|--------|-------------|------------| +| **Temporal complexity** | High | Medium | Start with simple workflows, invest in team training | +| **LangGraph learning curve** | Medium | High | Create abstractions, use simpler patterns initially | +| **Agent coordination deadlocks** | High | Medium | Implement timeouts, circuit breakers, supervisor oversight | +| **Token cost explosion** | High | Medium | Budget caps, usage monitoring, caching strategies | +| **State consistency issues** | High | Low | Event sourcing, Temporal durability, comprehensive testing | +| **Debugging complexity** | Medium | High | Invest in observability (LangSmith, Temporal UI) | +| **Vendor lock-in (LangGraph)** | Medium | Low | Abstract core patterns, LangGraph is open source | + +### Mitigation Strategies + +**1. Complexity Management:** +- Start with 2-3 agent types, expand gradually +- Build comprehensive test harnesses +- Document all state transitions and message types + +**2. Cost Control:** +```python +class BudgetGuard: + async def check_before_completion( + self, + project_id: str, + estimated_tokens: int, + ) -> bool: + project = await self.db.get(Project, project_id) + current_spend = await self.get_project_spend(project_id) + estimated_cost = self._estimate_cost(estimated_tokens) + + if current_spend + estimated_cost > project.budget_limit: + await self.notify_budget_exceeded(project_id) + return False + return True +``` + +**3. 
Deadlock Prevention:** +```python +class AgentTimeoutGuard: + THINKING_TIMEOUT = 300 # 5 minutes + EXECUTION_TIMEOUT = 600 # 10 minutes + BLOCKED_TIMEOUT = 1800 # 30 minutes + + async def monitor_agent(self, agent_id: str): + agent = await self.db.get(AgentInstance, agent_id) + timeout = self._get_timeout(agent.state) + + if agent.last_active_at < datetime.utcnow() - timedelta(seconds=timeout): + await self.handle_timeout(agent) +``` + +--- + +## Implementation Roadmap + +### Phase 1: Foundation (Weeks 1-2) +- [ ] Set up Temporal server and workers +- [ ] Implement basic LangGraph agent state machine +- [ ] Create AgentInstance model and database schema +- [ ] Implement Redis Streams message bus + +### Phase 2: Core Orchestration (Weeks 3-4) +- [ ] Implement supervisor pattern with Project Manager +- [ ] Build agent spawning and team creation +- [ ] Add token tracking per agent instance +- [ ] Create basic inter-agent communication + +### Phase 3: Durability (Weeks 5-6) +- [ ] Implement Temporal workflows for projects and sprints +- [ ] Add checkpoint and recovery mechanisms +- [ ] Build human-in-the-loop approval flows +- [ ] Create agent context persistence + +### Phase 4: Observability (Week 7) +- [ ] Implement real-time activity streaming +- [ ] Add comprehensive logging and tracing +- [ ] Build cost monitoring dashboards +- [ ] Create agent performance analytics + +--- + +## References + +### Frameworks and Tools +- [AutoGen 0.4 Documentation](https://github.com/microsoft/autogen) +- [CrewAI Documentation](https://docs.crewai.com/) +- [LangGraph Documentation](https://www.langchain.com/langgraph) +- [Temporal.io Documentation](https://temporal.io/) +- [LiteLLM Documentation](https://docs.litellm.ai/) + +### Research and Articles +- [Microsoft AI Agent Design Patterns](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns) +- [Four Design Patterns for Event-Driven Multi-Agent Systems (Confluent)](https://www.confluent.io/blog/event-driven-multi-agent-systems/) +- [Google A2A Protocol](https://arxiv.org/html/2505.02279v2) +- [Temporal for AI Agents](https://temporal.io/blog/durable-execution-meets-ai-why-temporal-is-the-perfect-foundation-for-ai) +- [LangFuse Token Tracking](https://langfuse.com/docs/observability/features/token-and-cost-tracking) +- [Portkey LLM Failover Patterns](https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/) + +### Related Syndarix Documents +- [SPIKE-001: MCP Integration Pattern](./SPIKE-001-mcp-integration-pattern.md) +- [SPIKE-005: LLM Provider Abstraction](./SPIKE-005-llm-provider-abstraction.md) + +--- + +## Decision + +**Adopt a hybrid architecture** combining: + +1. **LangGraph** for agent state machines and logic (graph-based, stateful) +2. **Temporal** for durable, long-running workflow orchestration +3. **Redis Streams** for event-driven agent-to-agent communication +4. **Hierarchical supervisor pattern** with Project Manager coordinating teams +5. **LiteLLM** (per SPIKE-005) for unified LLM access with failover + +This provides the flexibility, durability, and scalability required for Syndarix's 50+ agent autonomous consulting platform while maintaining individual agent context, comprehensive token tracking, and real-time visibility. + +--- + +*Spike completed. 
Findings will inform ADR-002: Agent Orchestration Architecture.* diff --git a/docs/spikes/SPIKE-006-knowledge-base-pgvector.md b/docs/spikes/SPIKE-006-knowledge-base-pgvector.md new file mode 100644 index 0000000..d455fa1 --- /dev/null +++ b/docs/spikes/SPIKE-006-knowledge-base-pgvector.md @@ -0,0 +1,1259 @@ +# SPIKE-006: Knowledge Base with pgvector for RAG System + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #6 + +--- + +## Executive Summary + +This spike researches the optimal approach for implementing a knowledge base system to enable RAG (Retrieval-Augmented Generation) for Syndarix AI agents. After evaluating options, we recommend **pgvector with hybrid search** as the primary solution. + +### Key Recommendation + +**Use pgvector** for the following reasons: +- Already using PostgreSQL in the stack (operational simplicity) +- Handles 10-100M vectors effectively (sufficient for project-scoped knowledge) +- Transactional consistency with application data +- Native hybrid search with PostgreSQL full-text search +- Row-level security for multi-tenant isolation +- Integrates seamlessly with existing migrations and tooling + +For projects that scale beyond 100M vectors per tenant, consider migration to Qdrant (open-source, high-performance) or Pinecone (fully managed). + +--- + +## Research Questions & Findings + +### 1. pgvector vs Dedicated Vector Databases + +| Feature | pgvector | Pinecone | Qdrant | Weaviate | +|---------|----------|----------|--------|----------| +| **Max Scale** | 10-100M vectors | Billions | Billions | Billions | +| **Self-Hosted** | Yes | No | Yes | Yes | +| **Managed Option** | RDS, Neon, Supabase | Yes (only) | Yes | Yes | +| **Query Latency** | Good (<100ms) | Excellent | Excellent | Good | +| **Hybrid Search** | Native + pg_search | Sparse vectors | Native | Native | +| **Cost (1M vectors)** | ~$0 (existing DB) | $20-30/mo | ~$27/mo | Variable | +| **Operational Overhead** | Zero (existing) | None | Medium | Medium | +| **ACID Transactions** | Yes | No | No | No | + +**Why pgvector for Syndarix:** +- Per-project knowledge isolation means smaller vector sets (thousands to millions, not billions) +- Transactional ingest: embed and index in the same INSERT as application data +- Single database backup/restore story +- Migration path exists if scale requires dedicated solution + +**When to Consider Alternatives:** +- Pinecone: Zero-ops requirement, budget available, billions of vectors +- Qdrant: Need advanced filtering, high QPS, open-source preference +- Weaviate: Multi-modal (images, video), knowledge graph features + +### 2. 
Embedding Model Recommendations + +Based on [research from 2024-2025](https://elephas.app/blog/best-embedding-models) and [Modal's code embedding comparison](https://modal.com/blog/6-best-code-embedding-models-compared): + +| Model | Best For | Dimensions | Cost/1M tokens | Notes | +|-------|----------|------------|----------------|-------| +| **text-embedding-3-small** | General text, docs | 512-1536 | $0.02 | Good balance | +| **text-embedding-3-large** | High accuracy needs | 256-3072 | $0.13 | Dimension reduction | +| **voyage-code-3** | Code retrieval | 1024 | $0.06 | State-of-art for code | +| **voyage-3-large** | General + code | 1024 | $0.12 | Top leaderboard | +| **nomic-embed-text** | Open-source, local | 768 | Free | Ollama compatible | + +**Recommendation for Syndarix:** + +```python +# Content-type based model selection +EMBEDDING_MODELS = { + "code": "voyage/voyage-code-3", # Code files (.py, .js, etc.) + "documentation": "text-embedding-3-small", # Markdown, docs + "general": "text-embedding-3-small", # Default + "high_accuracy": "voyage/voyage-3-large", # Critical queries + "local": "ollama/nomic-embed-text", # Fallback / dev +} +``` + +**LiteLLM Integration:** + +```python +from litellm import embedding + +# Via LiteLLM (unified interface) +response = await embedding( + model="voyage/voyage-code-3", + input=["def hello(): return 'world'"], +) +vector = response.data[0].embedding +``` + +### 3. Chunking Strategies + +Based on [Weaviate's research](https://weaviate.io/blog/chunking-strategies-for-rag) and [Stack Overflow's analysis](https://stackoverflow.blog/2024/12/27/breaking-up-is-hard-to-do-chunking-in-rag-applications/): + +**Strategy by Content Type:** + +| Content Type | Strategy | Chunk Size | Overlap | Notes | +|--------------|----------|------------|---------|-------| +| **Code Files** | AST-based / Function | Per function/class | None | Preserve semantic units | +| **Markdown Docs** | Heading-based | Per section | 10% | Respect document structure | +| **PDF Specs** | Page-level + semantic | 1000 tokens | 15% | NVIDIA recommends page-level | +| **Conversations** | Turn-based | Per exchange | Context window | Preserve dialogue flow | +| **API Docs** | Endpoint-based | Per endpoint | None | Group by resource | + +**Implementation:** + +```python +# app/services/knowledge/chunkers.py +from abc import ABC, abstractmethod +from dataclasses import dataclass +from typing import List +import tree_sitter_python as tspython +from tree_sitter import Parser + +@dataclass +class Chunk: + content: str + metadata: dict + start_line: int | None = None + end_line: int | None = None + +class BaseChunker(ABC): + @abstractmethod + def chunk(self, content: str, metadata: dict) -> List[Chunk]: + pass + +class CodeChunker(BaseChunker): + """AST-based chunking for source code.""" + + def __init__(self, language: str = "python"): + self.parser = Parser() + if language == "python": + self.parser.set_language(tspython.language()) + + def chunk(self, content: str, metadata: dict) -> List[Chunk]: + tree = self.parser.parse(bytes(content, "utf8")) + chunks = [] + + for node in tree.root_node.children: + if node.type in ("function_definition", "class_definition"): + chunk_content = content[node.start_byte:node.end_byte] + chunks.append(Chunk( + content=chunk_content, + metadata={ + **metadata, + "type": node.type, + "name": self._get_name(node), + }, + start_line=node.start_point[0], + end_line=node.end_point[0], + )) + + # Handle module-level code + if not chunks: + 
chunks.append(Chunk(content=content, metadata=metadata)) + + return chunks + + def _get_name(self, node) -> str: + for child in node.children: + if child.type == "identifier": + return child.text.decode("utf8") + return "unknown" + +class MarkdownChunker(BaseChunker): + """Heading-based chunking for markdown.""" + + def __init__(self, max_tokens: int = 1000, overlap_ratio: float = 0.1): + self.max_tokens = max_tokens + self.overlap_ratio = overlap_ratio + + def chunk(self, content: str, metadata: dict) -> List[Chunk]: + import re + + # Split by headings + sections = re.split(r'^(#{1,6}\s+.+)$', content, flags=re.MULTILINE) + chunks = [] + current_heading = "" + + for i, section in enumerate(sections): + if section.startswith('#'): + current_heading = section.strip() + elif section.strip(): + chunks.append(Chunk( + content=f"{current_heading}\n\n{section.strip()}", + metadata={ + **metadata, + "heading": current_heading, + "section_index": i, + } + )) + + return self._apply_overlap(chunks) + + def _apply_overlap(self, chunks: List[Chunk]) -> List[Chunk]: + # Add overlap between chunks for context + for i in range(1, len(chunks)): + overlap_size = int(len(chunks[i-1].content) * self.overlap_ratio) + overlap_text = chunks[i-1].content[-overlap_size:] + chunks[i].content = f"[Context: ...{overlap_text}]\n\n{chunks[i].content}" + return chunks + +class SemanticChunker(BaseChunker): + """Semantic chunking based on embedding similarity.""" + + def __init__(self, embedding_model: str = "text-embedding-3-small"): + from litellm import embedding + self.embed = embedding + self.model = embedding_model + self.similarity_threshold = 0.7 + + async def chunk(self, content: str, metadata: dict) -> List[Chunk]: + import nltk + sentences = nltk.sent_tokenize(content) + + # Get embeddings for each sentence + response = await self.embed(model=self.model, input=sentences) + embeddings = [d.embedding for d in response.data] + + # Group sentences by semantic similarity + chunks = [] + current_chunk = [sentences[0]] + current_embedding = embeddings[0] + + for i in range(1, len(sentences)): + similarity = self._cosine_similarity(current_embedding, embeddings[i]) + if similarity > self.similarity_threshold: + current_chunk.append(sentences[i]) + else: + chunks.append(Chunk( + content=" ".join(current_chunk), + metadata=metadata + )) + current_chunk = [sentences[i]] + current_embedding = embeddings[i] + + if current_chunk: + chunks.append(Chunk(content=" ".join(current_chunk), metadata=metadata)) + + return chunks + + def _cosine_similarity(self, a, b): + import numpy as np + return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) +``` + +### 4. Hybrid Search (Semantic + Keyword) + +Hybrid search combines the precision of BM25 keyword matching with semantic vector similarity. 
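+The two ranked lists are then merged with Reciprocal Rank Fusion (RRF): each chunk scores 1/(k + rank) in every list that contains it, with k = 60 by convention and a large sentinel rank (1000) substituted when a list does not contain the chunk. As a minimal sketch of that fusion step in Python (names are illustrative; each input maps chunk id to its 1-based rank):
+
+```python
+def rrf_merge(
+    semantic_ranks: dict[str, int],
+    keyword_ranks: dict[str, int],
+    k: int = 60,
+    missing_rank: int = 1000,
+) -> list[tuple[str, float]]:
+    """Fuse two rank lists with Reciprocal Rank Fusion."""
+    ids = set(semantic_ranks) | set(keyword_ranks)
+    scored = {
+        chunk_id: 1.0 / (k + semantic_ranks.get(chunk_id, missing_rank))
+        + 1.0 / (k + keyword_ranks.get(chunk_id, missing_rank))
+        for chunk_id in ids
+    }
+    # Highest fused score first
+    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
+```
+
+The SQL version below performs the same fusion inside PostgreSQL, so only the fused top-k rows leave the database.
+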
Based on [ParadeDB's research](https://www.paradedb.com/blog/hybrid-search-in-postgresql-the-missing-manual): + +**Approach: Reciprocal Rank Fusion (RRF)** + +```sql +-- Hybrid search with RRF scoring +WITH semantic_results AS ( + SELECT id, content, + 1 - (embedding <=> $1::vector) as semantic_score, + ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) as semantic_rank + FROM knowledge_chunks + WHERE project_id = $2 + ORDER BY embedding <=> $1::vector + LIMIT 20 +), +keyword_results AS ( + SELECT id, content, + ts_rank(search_vector, plainto_tsquery('english', $3)) as keyword_score, + ROW_NUMBER() OVER (ORDER BY ts_rank(search_vector, plainto_tsquery('english', $3)) DESC) as keyword_rank + FROM knowledge_chunks + WHERE project_id = $2 + AND search_vector @@ plainto_tsquery('english', $3) + ORDER BY keyword_score DESC + LIMIT 20 +) +SELECT + COALESCE(s.id, k.id) as id, + COALESCE(s.content, k.content) as content, + -- RRF formula: 1/(k + rank) where k=60 is standard + (1.0 / (60 + COALESCE(s.semantic_rank, 1000))) + + (1.0 / (60 + COALESCE(k.keyword_rank, 1000))) as rrf_score +FROM semantic_results s +FULL OUTER JOIN keyword_results k ON s.id = k.id +ORDER BY rrf_score DESC +LIMIT 10; +``` + +**Implementation with SQLAlchemy:** + +```python +# app/services/knowledge/search.py +from sqlalchemy import text, func +from sqlalchemy.ext.asyncio import AsyncSession +from pgvector.sqlalchemy import Vector + +class HybridSearchService: + def __init__(self, db: AsyncSession): + self.db = db + + async def search( + self, + query: str, + query_embedding: list[float], + project_id: str, + agent_id: str | None = None, + limit: int = 10, + semantic_weight: float = 0.5, + ) -> list[dict]: + """ + Hybrid search combining semantic and keyword matching. + + Args: + query: Natural language query + query_embedding: Pre-computed query embedding + project_id: Project scope + agent_id: Optional agent-specific scope + limit: Max results + semantic_weight: 0-1, weight for semantic vs keyword + """ + keyword_weight = 1 - semantic_weight + + sql = text(""" + WITH semantic AS ( + SELECT id, content, metadata, + 1 - (embedding <=> :embedding::vector) as score, + ROW_NUMBER() OVER (ORDER BY embedding <=> :embedding::vector) as rank + FROM knowledge_chunks + WHERE project_id = :project_id + AND (:agent_id IS NULL OR agent_id = :agent_id OR agent_id IS NULL) + ORDER BY embedding <=> :embedding::vector + LIMIT 30 + ), + keyword AS ( + SELECT id, content, metadata, + ts_rank_cd(search_vector, websearch_to_tsquery('english', :query)) as score, + ROW_NUMBER() OVER ( + ORDER BY ts_rank_cd(search_vector, websearch_to_tsquery('english', :query)) DESC + ) as rank + FROM knowledge_chunks + WHERE project_id = :project_id + AND (:agent_id IS NULL OR agent_id = :agent_id OR agent_id IS NULL) + AND search_vector @@ websearch_to_tsquery('english', :query) + ORDER BY score DESC + LIMIT 30 + ) + SELECT + COALESCE(s.id, k.id) as id, + COALESCE(s.content, k.content) as content, + COALESCE(s.metadata, k.metadata) as metadata, + ( + :semantic_weight * (1.0 / (60 + COALESCE(s.rank, 1000))) + + :keyword_weight * (1.0 / (60 + COALESCE(k.rank, 1000))) + ) as combined_score, + s.score as semantic_score, + k.score as keyword_score + FROM semantic s + FULL OUTER JOIN keyword k ON s.id = k.id + ORDER BY combined_score DESC + LIMIT :limit + """) + + result = await self.db.execute(sql, { + "embedding": query_embedding, + "query": query, + "project_id": project_id, + "agent_id": agent_id, + "semantic_weight": semantic_weight, + "keyword_weight": 
keyword_weight, + "limit": limit, + }) + + return [dict(row._mapping) for row in result.fetchall()] +``` + +### 5. Multi-Tenant Vector Collections + +Based on [Timescale's research on multi-tenant RAG](https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach): + +**Recommended Pattern: Shared Table with Tenant ID** + +For Syndarix, use a shared table with `project_id` and `agent_id` columns: + +```python +# app/models/knowledge.py +from sqlalchemy import Column, String, Text, ForeignKey, Index +from sqlalchemy.dialects.postgresql import UUID, JSONB +from pgvector.sqlalchemy import Vector +from app.db.base import Base + +class KnowledgeChunk(Base): + __tablename__ = "knowledge_chunks" + + id = Column(UUID, primary_key=True, default=uuid.uuid4) + + # Multi-tenant isolation + project_id = Column(UUID, ForeignKey("projects.id"), nullable=False, index=True) + agent_id = Column(UUID, ForeignKey("agent_instances.id"), nullable=True, index=True) + + # Content + content = Column(Text, nullable=False) + content_type = Column(String(50), nullable=False) # code, markdown, pdf, etc. + + # Source tracking + source_uri = Column(String(512)) # file path, URL, etc. + source_type = Column(String(50)) # file, url, conversation, etc. + + # Vector embedding + embedding = Column(Vector(1536)) # Dimension depends on model + embedding_model = Column(String(100)) + + # Full-text search + search_vector = Column(TSVECTOR) + + # Metadata + metadata = Column(JSONB, default={}) + + # Timestamps + created_at = Column(DateTime, default=func.now()) + updated_at = Column(DateTime, onupdate=func.now()) + + __table_args__ = ( + # HNSW index for vector similarity (per-project partitioning) + Index( + 'ix_knowledge_chunks_embedding_hnsw', + 'embedding', + postgresql_using='hnsw', + postgresql_with={'m': 16, 'ef_construction': 64}, + postgresql_ops={'embedding': 'vector_cosine_ops'} + ), + # GIN index for full-text search + Index( + 'ix_knowledge_chunks_search_vector', + 'search_vector', + postgresql_using='gin' + ), + # Composite index for tenant isolation + Index('ix_knowledge_chunks_project_agent', 'project_id', 'agent_id'), + ) + +class KnowledgeCollection(Base): + """Groups of chunks for organizing knowledge.""" + __tablename__ = "knowledge_collections" + + id = Column(UUID, primary_key=True, default=uuid.uuid4) + project_id = Column(UUID, ForeignKey("projects.id"), nullable=False) + name = Column(String(100), nullable=False) + description = Column(Text) + collection_type = Column(String(50)) # codebase, documentation, specs, etc. 
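+    # Note: collection_type also drives the default embedding model picked by
+    # KnowledgeBaseService.create_collection for this collection.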
+ + # Configuration + chunking_strategy = Column(String(50), default="auto") + embedding_model = Column(String(100), default="text-embedding-3-small") + + created_at = Column(DateTime, default=func.now()) + updated_at = Column(DateTime, onupdate=func.now()) + +class ChunkCollectionAssociation(Base): + """Many-to-many: chunks can belong to multiple collections.""" + __tablename__ = "chunk_collection_associations" + + chunk_id = Column(UUID, ForeignKey("knowledge_chunks.id"), primary_key=True) + collection_id = Column(UUID, ForeignKey("knowledge_collections.id"), primary_key=True) +``` + +**Row-Level Security (Optional but Recommended):** + +```sql +-- Enable RLS on knowledge_chunks +ALTER TABLE knowledge_chunks ENABLE ROW LEVEL SECURITY; + +-- Policy: Users can only access chunks from their projects +CREATE POLICY knowledge_chunk_project_isolation ON knowledge_chunks + USING (project_id IN ( + SELECT project_id FROM project_members + WHERE user_id = current_setting('app.current_user_id')::uuid + )); +``` + +### 6. Indexing Strategies for Large Codebases + +**HNSW vs IVFFlat Selection:** + +| Factor | HNSW | IVFFlat | +|--------|------|---------| +| Query speed | Faster | Slower | +| Build time | Slower | Faster | +| Memory | Higher | Lower | +| Accuracy | Higher | Lower | +| Use when | <10M vectors, high recall needed | >10M vectors, memory constrained | + +**HNSW Parameter Guidelines:** + +```sql +-- Small collections (<100K vectors) +CREATE INDEX ON knowledge_chunks +USING hnsw (embedding vector_cosine_ops) +WITH (m = 16, ef_construction = 64); + +-- Medium collections (100K-1M vectors) +CREATE INDEX ON knowledge_chunks +USING hnsw (embedding vector_cosine_ops) +WITH (m = 24, ef_construction = 100); + +-- Large collections (1M-10M vectors) +CREATE INDEX ON knowledge_chunks +USING hnsw (embedding vector_cosine_ops) +WITH (m = 32, ef_construction = 128); + +-- Query-time tuning +SET hnsw.ef_search = 100; -- Higher = better recall, slower +``` + +**Partial Indexes for Multi-Tenant:** + +```sql +-- Create partial indexes per high-traffic project +CREATE INDEX ON knowledge_chunks +USING hnsw (embedding vector_cosine_ops) +WITH (m = 16, ef_construction = 64) +WHERE project_id = 'frequently-queried-project-id'; +``` + +**Build Performance:** + +```sql +-- Speed up index builds +SET maintenance_work_mem = '2GB'; -- Ensure graph fits in memory +SET max_parallel_maintenance_workers = 7; -- Parallel building +``` + +### 7. 
Real-Time vs Batch Embedding Updates + +**Recommendation: Hybrid Approach** + +| Scenario | Strategy | Why | +|----------|----------|-----| +| New file added | Real-time | Immediate availability | +| Bulk import | Batch (Celery) | Avoid blocking | +| File modified | Debounced real-time | Avoid churning | +| Conversation | Real-time | Context needed now | +| Codebase sync | Scheduled batch | Efficient | + +**Implementation:** + +```python +# app/services/knowledge/ingestion.py +from celery import shared_task +from app.core.celery import celery_app +from app.services.knowledge.embedder import EmbeddingService +from app.services.knowledge.chunkers import get_chunker + +class KnowledgeIngestionService: + def __init__(self, db: AsyncSession): + self.db = db + self.embedder = EmbeddingService() + + async def ingest_realtime( + self, + project_id: str, + content: str, + content_type: str, + source_uri: str, + agent_id: str | None = None, + ) -> list[str]: + """Real-time ingestion for immediate availability.""" + chunker = get_chunker(content_type) + chunks = chunker.chunk(content, {"source_uri": source_uri}) + + # Embed and store + chunk_ids = [] + for chunk in chunks: + embedding = await self.embedder.embed(chunk.content, content_type) + + db_chunk = KnowledgeChunk( + project_id=project_id, + agent_id=agent_id, + content=chunk.content, + content_type=content_type, + source_uri=source_uri, + embedding=embedding, + embedding_model=self.embedder.get_model(content_type), + metadata=chunk.metadata, + ) + self.db.add(db_chunk) + chunk_ids.append(str(db_chunk.id)) + + await self.db.commit() + return chunk_ids + + def schedule_batch_ingestion( + self, + project_id: str, + files: list[dict], # [{path, content_type}] + ) -> str: + """Schedule batch ingestion via Celery.""" + task = batch_ingest_files.delay(project_id, files) + return task.id + +@celery_app.task(bind=True, max_retries=3) +def batch_ingest_files(self, project_id: str, files: list[dict]): + """Celery task for batch file ingestion.""" + from app.core.database import get_sync_session + + with get_sync_session() as db: + ingestion = KnowledgeIngestionService(db) + + for file in files: + try: + # Read file content + with open(file["path"], "r") as f: + content = f.read() + + # Process (sync version) + ingestion.ingest_sync( + project_id=project_id, + content=content, + content_type=file["content_type"], + source_uri=file["path"], + ) + except Exception as e: + # Log and continue, don't fail entire batch + logger.error(f"Failed to ingest {file['path']}: {e}") + + db.commit() +``` + +**Debounced Updates:** + +```python +# app/services/knowledge/watcher.py +import asyncio +from collections import defaultdict + +class KnowledgeUpdateDebouncer: + """Debounce rapid file changes to avoid excessive re-embedding.""" + + def __init__(self, delay_seconds: float = 2.0): + self.delay = delay_seconds + self.pending: dict[str, asyncio.Task] = {} + + async def schedule_update( + self, + file_path: str, + update_callback: callable, + ): + """Schedule an update, canceling any pending update for the same file.""" + # Cancel existing pending update + if file_path in self.pending: + self.pending[file_path].cancel() + + # Schedule new update + self.pending[file_path] = asyncio.create_task( + self._delayed_update(file_path, update_callback) + ) + + async def _delayed_update(self, file_path: str, callback: callable): + await asyncio.sleep(self.delay) + await callback(file_path) + del self.pending[file_path] +``` + +--- + +## Schema Design + +### Database Schema + 
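+The DDL below assumes pgvector 0.5+ (the first release with HNSW indexes) and PostgreSQL 13+ (for built-in `gen_random_uuid()`); the `vector(1536)` column matches `text-embedding-3-small`, so adjust the dimension if a different default embedding model is chosen.
+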
+```sql +-- Enable required extensions +CREATE EXTENSION IF NOT EXISTS vector; +CREATE EXTENSION IF NOT EXISTS pg_trgm; -- For fuzzy text matching + +-- Knowledge chunks table +CREATE TABLE knowledge_chunks ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + + -- Multi-tenant isolation + project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, + agent_id UUID REFERENCES agent_instances(id) ON DELETE SET NULL, + collection_id UUID REFERENCES knowledge_collections(id) ON DELETE SET NULL, + + -- Content + content TEXT NOT NULL, + content_type VARCHAR(50) NOT NULL, + + -- Source tracking + source_uri VARCHAR(512), + source_type VARCHAR(50), + source_hash VARCHAR(64), -- For detecting changes + + -- Vector embedding (1536 for text-embedding-3-small) + embedding vector(1536), + embedding_model VARCHAR(100), + + -- Full-text search + search_vector tsvector GENERATED ALWAYS AS ( + setweight(to_tsvector('english', coalesce(content, '')), 'A') + ) STORED, + + -- Metadata + metadata JSONB DEFAULT '{}', + + -- Timestamps + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ DEFAULT NOW() +); + +-- Indexes +CREATE INDEX ix_knowledge_chunks_project ON knowledge_chunks(project_id); +CREATE INDEX ix_knowledge_chunks_agent ON knowledge_chunks(agent_id); +CREATE INDEX ix_knowledge_chunks_collection ON knowledge_chunks(collection_id); +CREATE INDEX ix_knowledge_chunks_source_hash ON knowledge_chunks(source_hash); + +-- HNSW vector index +CREATE INDEX ix_knowledge_chunks_embedding ON knowledge_chunks +USING hnsw (embedding vector_cosine_ops) +WITH (m = 16, ef_construction = 64); + +-- GIN index for full-text search +CREATE INDEX ix_knowledge_chunks_fts ON knowledge_chunks USING gin(search_vector); + +-- Knowledge collections +CREATE TABLE knowledge_collections ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, + name VARCHAR(100) NOT NULL, + description TEXT, + collection_type VARCHAR(50), + chunking_strategy VARCHAR(50) DEFAULT 'auto', + embedding_model VARCHAR(100) DEFAULT 'text-embedding-3-small', + created_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ DEFAULT NOW(), + + UNIQUE(project_id, name) +); + +-- Trigger for updated_at +CREATE OR REPLACE FUNCTION update_updated_at() +RETURNS TRIGGER AS $$ +BEGIN + NEW.updated_at = NOW(); + RETURN NEW; +END; +$$ LANGUAGE plpgsql; + +CREATE TRIGGER knowledge_chunks_updated_at + BEFORE UPDATE ON knowledge_chunks + FOR EACH ROW + EXECUTE FUNCTION update_updated_at(); +``` + +### Alembic Migration + +```python +# alembic/versions/xxxx_add_knowledge_base.py +"""Add knowledge base tables for RAG + +Revision ID: xxxx +""" +from alembic import op +import sqlalchemy as sa +from pgvector.sqlalchemy import Vector + +def upgrade(): + # Enable extensions + op.execute("CREATE EXTENSION IF NOT EXISTS vector") + op.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm") + + # Create knowledge_collections table + op.create_table( + 'knowledge_collections', + sa.Column('id', sa.dialects.postgresql.UUID(), primary_key=True), + sa.Column('project_id', sa.dialects.postgresql.UUID(), sa.ForeignKey('projects.id'), nullable=False), + sa.Column('name', sa.String(100), nullable=False), + sa.Column('description', sa.Text()), + sa.Column('collection_type', sa.String(50)), + sa.Column('chunking_strategy', sa.String(50), default='auto'), + sa.Column('embedding_model', sa.String(100), default='text-embedding-3-small'), + sa.Column('created_at', sa.DateTime(timezone=True), 
server_default=sa.func.now()), + sa.Column('updated_at', sa.DateTime(timezone=True)), + sa.UniqueConstraint('project_id', 'name', name='uq_knowledge_collections_project_name'), + ) + + # Create knowledge_chunks table + op.create_table( + 'knowledge_chunks', + sa.Column('id', sa.dialects.postgresql.UUID(), primary_key=True), + sa.Column('project_id', sa.dialects.postgresql.UUID(), sa.ForeignKey('projects.id', ondelete='CASCADE'), nullable=False), + sa.Column('agent_id', sa.dialects.postgresql.UUID(), sa.ForeignKey('agent_instances.id', ondelete='SET NULL')), + sa.Column('collection_id', sa.dialects.postgresql.UUID(), sa.ForeignKey('knowledge_collections.id', ondelete='SET NULL')), + sa.Column('content', sa.Text(), nullable=False), + sa.Column('content_type', sa.String(50), nullable=False), + sa.Column('source_uri', sa.String(512)), + sa.Column('source_type', sa.String(50)), + sa.Column('source_hash', sa.String(64)), + sa.Column('embedding', Vector(1536)), + sa.Column('embedding_model', sa.String(100)), + sa.Column('metadata', sa.dialects.postgresql.JSONB(), default={}), + sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.func.now()), + sa.Column('updated_at', sa.DateTime(timezone=True)), + ) + + # Create indexes + op.create_index('ix_knowledge_chunks_project', 'knowledge_chunks', ['project_id']) + op.create_index('ix_knowledge_chunks_agent', 'knowledge_chunks', ['agent_id']) + op.create_index('ix_knowledge_chunks_collection', 'knowledge_chunks', ['collection_id']) + op.create_index('ix_knowledge_chunks_source_hash', 'knowledge_chunks', ['source_hash']) + + # Create HNSW vector index + op.execute(""" + CREATE INDEX ix_knowledge_chunks_embedding ON knowledge_chunks + USING hnsw (embedding vector_cosine_ops) + WITH (m = 16, ef_construction = 64) + """) + + # Add full-text search column and index + op.execute(""" + ALTER TABLE knowledge_chunks + ADD COLUMN search_vector tsvector + GENERATED ALWAYS AS ( + setweight(to_tsvector('english', coalesce(content, '')), 'A') + ) STORED + """) + op.execute("CREATE INDEX ix_knowledge_chunks_fts ON knowledge_chunks USING gin(search_vector)") + +def downgrade(): + op.drop_table('knowledge_chunks') + op.drop_table('knowledge_collections') +``` + +--- + +## Complete Service Implementation + +```python +# app/services/knowledge/service.py +from dataclasses import dataclass +from typing import Optional +from sqlalchemy.ext.asyncio import AsyncSession +from litellm import embedding as litellm_embedding + +from app.models.knowledge import KnowledgeChunk, KnowledgeCollection +from app.services.knowledge.chunkers import get_chunker +from app.services.knowledge.search import HybridSearchService + +@dataclass +class SearchResult: + id: str + content: str + metadata: dict + score: float + source_uri: Optional[str] = None + +class KnowledgeBaseService: + """ + Main service for knowledge base operations. + Integrates with LiteLLM for embeddings. 
+ """ + + # Model selection by content type + EMBEDDING_MODELS = { + "code": "voyage/voyage-code-3", + "markdown": "text-embedding-3-small", + "pdf": "text-embedding-3-small", + "conversation": "text-embedding-3-small", + "default": "text-embedding-3-small", + } + + def __init__(self, db: AsyncSession): + self.db = db + self.search_service = HybridSearchService(db) + + async def create_collection( + self, + project_id: str, + name: str, + collection_type: str, + description: str = "", + chunking_strategy: str = "auto", + ) -> KnowledgeCollection: + """Create a new knowledge collection for a project.""" + collection = KnowledgeCollection( + project_id=project_id, + name=name, + description=description, + collection_type=collection_type, + chunking_strategy=chunking_strategy, + embedding_model=self.EMBEDDING_MODELS.get(collection_type, "text-embedding-3-small"), + ) + self.db.add(collection) + await self.db.commit() + await self.db.refresh(collection) + return collection + + async def ingest( + self, + project_id: str, + content: str, + content_type: str, + source_uri: str, + agent_id: Optional[str] = None, + collection_id: Optional[str] = None, + metadata: Optional[dict] = None, + ) -> list[str]: + """ + Ingest content into the knowledge base. + Automatically chunks and embeds the content. + """ + import hashlib + + # Check for existing content by hash + source_hash = hashlib.sha256(content.encode()).hexdigest() + existing = await self.db.execute( + select(KnowledgeChunk).where( + KnowledgeChunk.project_id == project_id, + KnowledgeChunk.source_hash == source_hash, + ) + ) + if existing.scalar_one_or_none(): + # Content unchanged, skip + return [] + + # Get appropriate chunker + chunker = get_chunker(content_type) + chunks = chunker.chunk(content, metadata or {}) + + # Get embedding model + model = self.EMBEDDING_MODELS.get(content_type, "text-embedding-3-small") + + # Embed all chunks in batch + chunk_texts = [c.content for c in chunks] + embeddings = await self._embed_batch(chunk_texts, model) + + # Store chunks + chunk_ids = [] + for chunk, emb in zip(chunks, embeddings): + db_chunk = KnowledgeChunk( + project_id=project_id, + agent_id=agent_id, + collection_id=collection_id, + content=chunk.content, + content_type=content_type, + source_uri=source_uri, + source_type=self._infer_source_type(source_uri), + source_hash=source_hash, + embedding=emb, + embedding_model=model, + metadata={ + **chunk.metadata, + **(metadata or {}), + }, + ) + self.db.add(db_chunk) + chunk_ids.append(str(db_chunk.id)) + + await self.db.commit() + return chunk_ids + + async def search( + self, + project_id: str, + query: str, + agent_id: Optional[str] = None, + collection_id: Optional[str] = None, + limit: int = 10, + content_types: Optional[list[str]] = None, + semantic_weight: float = 0.6, + ) -> list[SearchResult]: + """ + Search the knowledge base using hybrid search. 
+ + Args: + project_id: Project scope + query: Natural language query + agent_id: Optional agent-specific scope + collection_id: Optional collection scope + limit: Max results + content_types: Filter by content types + semantic_weight: 0-1, weight for semantic vs keyword + """ + # Get query embedding + query_embedding = await self._embed_query(query) + + # Perform hybrid search + results = await self.search_service.search( + query=query, + query_embedding=query_embedding, + project_id=project_id, + agent_id=agent_id, + limit=limit, + semantic_weight=semantic_weight, + ) + + return [ + SearchResult( + id=r["id"], + content=r["content"], + metadata=r.get("metadata", {}), + score=r["combined_score"], + source_uri=r.get("source_uri"), + ) + for r in results + ] + + async def delete_by_source( + self, + project_id: str, + source_uri: str, + ) -> int: + """Delete all chunks from a specific source.""" + result = await self.db.execute( + delete(KnowledgeChunk).where( + KnowledgeChunk.project_id == project_id, + KnowledgeChunk.source_uri == source_uri, + ) + ) + await self.db.commit() + return result.rowcount + + async def _embed_batch( + self, + texts: list[str], + model: str, + ) -> list[list[float]]: + """Embed multiple texts in a single API call.""" + response = await litellm_embedding( + model=model, + input=texts, + ) + return [d["embedding"] for d in response.data] + + async def _embed_query(self, query: str) -> list[float]: + """Embed a query string.""" + response = await litellm_embedding( + model="text-embedding-3-small", + input=[query], + ) + return response.data[0]["embedding"] + + def _infer_source_type(self, source_uri: str) -> str: + """Infer source type from URI.""" + if source_uri.startswith("http"): + return "url" + if source_uri.startswith("conversation:"): + return "conversation" + return "file" +``` + +--- + +## Performance Considerations + +### Query Latency Targets + +| Vector Count | Target Latency | Recommended Config | +|--------------|----------------|-------------------| +| <100K | <20ms | Default HNSW | +| 100K-1M | <50ms | m=24, ef_construction=100 | +| 1M-10M | <100ms | m=32, ef_construction=128, ef_search=100 | + +### Memory Requirements + +``` +HNSW memory β‰ˆ vectors Γ— dimensions Γ— 4 bytes Γ— (1 + m/8) + +Example: 1M vectors Γ— 1536 dims Γ— 4 bytes Γ— (1 + 16/8) = ~9.2 GB +``` + +### Batch Embedding Costs + +| Model | 1K chunks | 10K chunks | 100K chunks | +|-------|-----------|------------|-------------| +| text-embedding-3-small | $0.002 | $0.02 | $0.20 | +| voyage-code-3 | $0.006 | $0.06 | $0.60 | +| Local (nomic-embed) | $0 | $0 | $0 | + +### Optimization Tips + +1. **Use batch embedding** - Single API call for multiple chunks +2. **Cache query embeddings** - Same queries return same vectors +3. **Partial indexes** - Create per-project indexes for high-traffic projects +4. **Dimension reduction** - Use 512-dim with text-embedding-3-small for cost savings +5. **Connection pooling** - Use pgBouncer for high-concurrency scenarios + +--- + +## Integration with Syndarix Agents + +### Agent Context Retrieval + +```python +# app/services/agent/context.py +class AgentContextBuilder: + """Builds context for agent prompts using RAG.""" + + def __init__(self, kb_service: KnowledgeBaseService): + self.kb = kb_service + + async def build_context( + self, + agent_id: str, + project_id: str, + task_description: str, + max_context_tokens: int = 4000, + ) -> str: + """ + Build relevant context for an agent task. + + Returns formatted context string for inclusion in prompt. 
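+
+        Chunks are appended in relevance order until the approximate
+        max_context_tokens budget is exhausted.
+
+        Illustrative usage (the builder and variable names are hypothetical):
+
+            builder = AgentContextBuilder(kb_service)
+            context = await builder.build_context(
+                agent_id, project_id, "Implement password reset",
+                max_context_tokens=2000,
+            )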
+ """ + # Search for relevant knowledge + results = await self.kb.search( + project_id=project_id, + query=task_description, + agent_id=agent_id, # Prefer agent-specific knowledge + limit=10, + semantic_weight=0.7, + ) + + # Format context + context_parts = [] + current_tokens = 0 + + for result in results: + chunk_tokens = self._count_tokens(result.content) + if current_tokens + chunk_tokens > max_context_tokens: + break + + context_parts.append(f""" +### Source: {result.source_uri or 'Unknown'} +{result.content} +""") + current_tokens += chunk_tokens + + if not context_parts: + return "" + + return f""" +## Relevant Context + +The following information was retrieved from the project knowledge base: + +{"".join(context_parts)} + +--- +""" + + def _count_tokens(self, text: str) -> int: + """Approximate token count.""" + return len(text) // 4 # Rough estimate +``` + +### MCP Tool for Knowledge Access + +```python +# app/mcp/tools/knowledge.py +from mcp import Tool, ToolResult + +class KnowledgeSearchTool(Tool): + """MCP tool for agents to search project knowledge.""" + + name = "search_knowledge" + description = "Search the project knowledge base for relevant information" + + parameters = { + "type": "object", + "properties": { + "project_id": { + "type": "string", + "description": "The project ID to search within" + }, + "query": { + "type": "string", + "description": "Natural language search query" + }, + "content_types": { + "type": "array", + "items": {"type": "string"}, + "description": "Filter by content types (code, markdown, pdf)" + }, + "limit": { + "type": "integer", + "default": 5, + "description": "Maximum results to return" + } + }, + "required": ["project_id", "query"] + } + + async def execute(self, **params) -> ToolResult: + results = await self.kb_service.search( + project_id=params["project_id"], + query=params["query"], + content_types=params.get("content_types"), + limit=params.get("limit", 5), + ) + + return ToolResult( + content=[ + { + "source": r.source_uri, + "content": r.content[:500] + "..." 
if len(r.content) > 500 else r.content, + "relevance_score": r.score, + } + for r in results + ] + ) +``` + +--- + +## References + +### Vector Databases +- [Best Vector Databases 2025 - Firecrawl](https://www.firecrawl.dev/blog/best-vector-databases-2025) +- [pgvector vs Qdrant Comparison - MyScale](https://www.myscale.com/blog/comprehensive-comparison-pgvector-vs-qdrant-performance-vector-database-benchmarks/) +- [Multi-Tenancy in Vector Databases - Pinecone](https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/vector-database-multi-tenancy/) + +### Embedding Models +- [Best Embedding Models 2025 - Elephas](https://elephas.app/blog/best-embedding-models) +- [6 Best Code Embedding Models - Modal](https://modal.com/blog/6-best-code-embedding-models-compared) +- [LiteLLM Embedding Documentation](https://docs.litellm.ai/docs/embedding/supported_embedding) + +### Chunking & RAG +- [Chunking Strategies for RAG - Weaviate](https://weaviate.io/blog/chunking-strategies-for-rag) +- [Breaking Up is Hard to Do - Stack Overflow](https://stackoverflow.blog/2024/12/27/breaking-up-is-hard-to-do-chunking-in-rag-applications/) +- [Best Chunking Strategies 2025 - Firecrawl](https://www.firecrawl.dev/blog/best-chunking-strategies-rag-2025) + +### Hybrid Search +- [Hybrid Search in PostgreSQL - ParadeDB](https://www.paradedb.com/blog/hybrid-search-in-postgresql-the-missing-manual) +- [Hybrid Search with pgvector - Jonathan Katz](https://jkatz05.com/post/postgres/hybrid-search-postgres-pgvector/) +- [Stop the Hallucinations - Cloudurable](https://cloudurable.com/blog/stop-the-hallucinations-hybrid-retrieval-with-bm25-pgvector-embedding-rerank-llm-rubric-rerank-hyde/) + +### Multi-Tenant RAG +- [Multi-Tenant RAG with PostgreSQL - Timescale](https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach) +- [Building Multi-Tenancy RAG with Milvus](https://milvus.io/blog/build-multi-tenancy-rag-with-milvus-best-practices-part-one.md) + +### pgvector +- [pgvector GitHub](https://github.com/pgvector/pgvector) +- [HNSW Indexes with pgvector - Crunchy Data](https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector) +- [Optimize pgvector - Neon](https://neon.com/docs/ai/ai-vector-search-optimization) + +--- + +## Decision + +**Adopt pgvector with hybrid search** as the knowledge base solution for Syndarix RAG: + +1. **pgvector** for vector storage and similarity search +2. **PostgreSQL full-text search** (tsvector) for keyword matching +3. **Reciprocal Rank Fusion (RRF)** for combining results +4. **LiteLLM** for unified embedding API +5. **Content-type-aware chunking** with AST parsing for code +6. **Shared table with tenant isolation** via project_id/agent_id + +**Migration Path:** If any project exceeds 10M vectors or requires sub-10ms latency, evaluate Qdrant as a dedicated vector store while keeping metadata in PostgreSQL. + +--- + +*Spike completed. 
Findings will inform ADR-006: Knowledge Base Architecture.* diff --git a/docs/spikes/SPIKE-007-agent-communication-protocol.md b/docs/spikes/SPIKE-007-agent-communication-protocol.md new file mode 100644 index 0000000..9258251 --- /dev/null +++ b/docs/spikes/SPIKE-007-agent-communication-protocol.md @@ -0,0 +1,1496 @@ +# SPIKE-007: Agent Communication Protocol + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #7 + +--- + +## Executive Summary + +This spike researches inter-agent communication protocols for Syndarix, where 10+ specialized AI agents need to collaborate on software projects. After analyzing industry standards (A2A, ACP, MCP) and multi-agent system patterns, we recommend a **structured message-based protocol** built on the existing Redis Pub/Sub event bus. + +### Recommendation + +Adopt a **hybrid communication model** combining: +1. **Structured JSON-RPC messages** for request-response patterns +2. **Redis Pub/Sub channels** for broadcasts and topic-based routing +3. **Database-backed message persistence** for auditability and context recovery +4. **Priority queues via Celery** for async task delegation + +This approach aligns with Google's A2A protocol principles while leveraging Syndarix's existing infrastructure (Redis, Celery, SSE events). + +--- + +## Research Questions + +### 1. What communication patterns work for AI multi-agent systems? + +Industry research identifies four primary patterns: + +| Pattern | Use Case | Syndarix Application | +|---------|----------|---------------------| +| **Request-Response** | Direct task delegation | Engineer asks Architect for guidance | +| **Publish-Subscribe** | Broadcasts, notifications | PO announces sprint goals to all agents | +| **Task Queue** | Async work delegation | PO assigns issues to Engineers | +| **Streaming** | Long-running updates | Agent streaming progress to observers | + +**Key Insight:** Syndarix needs all four patterns. A2A protocol supports request-response and streaming; Celery handles task queues; Redis Pub/Sub provides pub-sub. + +### 2. Structured message formats vs natural language between agents? + +**Recommendation: Structured messages with natural language payload.** + +```python +# Structured envelope with natural language content +{ + "id": "msg-uuid-123", + "type": "request", + "from": {"agent_id": "eng-001", "role": "Engineer", "name": "Dave"}, + "to": {"agent_id": "arch-001", "role": "Architect", "name": "Alex"}, + "action": "request_guidance", + "priority": "normal", + "context": { + "project_id": "proj-123", + "issue_id": "issue-456", + "conversation_id": "conv-789" + }, + "content": "I'm implementing the auth module. Should I use JWT with refresh tokens or session-based auth?", + "metadata": { + "created_at": "2025-12-29T10:00:00Z", + "expires_at": "2025-12-29T11:00:00Z", + "requires_response": true + } +} +``` + +**Rationale:** +- Structured envelope enables routing, filtering, and auditing +- Natural language content preserves LLM reasoning capabilities +- Matches A2A/ACP hybrid approach (structured headers, flexible payload) +- Enables agent-to-agent understanding without rigid schemas + +### 3. How to handle async vs sync communication? 
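+
+As a quick sketch of how the different timings look from the sender's side (illustrative only: `send_and_wait`, `delegate_task`, and `broadcast` refer to the `MessageRouter` API specified later in this spike, and the wrapper function and message arguments are hypothetical `AgentMessage` instances; the mapping table below gives the full picture):
+
+```python
+async def illustrate_timings(message_router, guidance_request, implementation_request, sprint_announcement):
+    # Sync-like: block (with a timeout) until the recipient answers.
+    reply = await message_router.send_and_wait(guidance_request, timeout=30)
+    if reply is None:
+        ...  # timed out; fall back to async handling or escalate
+
+    # Async: hand the work to the Celery queue and continue; the reply arrives later.
+    task_id = await message_router.delegate_task(implementation_request)
+
+    # Fire-and-forget: broadcast to the whole project, no response expected.
+    await message_router.broadcast(sprint_announcement)
+    return task_id
+```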
+ +**Pattern Mapping:** + +| Scenario | Timing | Implementation | +|----------|--------|----------------| +| Quick clarification | Sync-like (< 30s) | Request-response with timeout | +| Code review request | Async (minutes-hours) | Task queue + callback | +| Broadcast announcement | Fire-and-forget | Pub/Sub | +| Long-running analysis | Streaming | SSE with progress events | + +**Implementation Strategy:** + +```python +class MessageMode(str, Enum): + SYNC = "sync" # Await response (with timeout) + ASYNC = "async" # Queue for later, callback on completion + FIRE_AND_FORGET = "fire_and_forget" # No response expected + STREAM = "stream" # Continuous updates +``` + +### 4. Message routing strategies? + +**Recommended: Hierarchical routing with three strategies.** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Message Router β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ DIRECT β”‚ β”‚ ROLE-BASED β”‚ β”‚ BROADCAST β”‚ β”‚ +β”‚ β”‚ Routing β”‚ β”‚ Routing β”‚ β”‚ Routing β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ to: agent β”‚ β”‚ to: @role β”‚ β”‚ to: @all β”‚ β”‚ +β”‚ β”‚ "arch-001" β”‚ β”‚ "@engineers"β”‚ β”‚ "@project" β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +| Strategy | Syntax | Use Case | +|----------|--------|----------| +| **Direct** | `to: "agent-123"` | Specific agent communication | +| **Role-based** | `to: "@engineers"` | All agents of a role | +| **Broadcast** | `to: "@all"` | Project-wide announcements | +| **Topic-based** | `to: "#auth-module"` | Agents subscribed to topic | + +### 5. How to maintain conversation context across agent interactions? 
+ +**Three-tier context management:** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Context Hierarchy β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ CONVERSATION CONTEXT (Short-term) β”‚ β”‚ +β”‚ β”‚ - Current message thread β”‚ β”‚ +β”‚ β”‚ - Last N exchanges between agents β”‚ β”‚ +β”‚ β”‚ - Active topic/issue context β”‚ β”‚ +β”‚ β”‚ Storage: In-memory + Redis (TTL: 1 hour) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ SESSION CONTEXT (Medium-term) β”‚ β”‚ +β”‚ β”‚ - Sprint goals and decisions β”‚ β”‚ +β”‚ β”‚ - Shared agreements between agents β”‚ β”‚ +β”‚ β”‚ - Recent artifacts and references β”‚ β”‚ +β”‚ β”‚ Storage: PostgreSQL AgentMessage table β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ PROJECT CONTEXT (Long-term) β”‚ β”‚ +β”‚ β”‚ - Architecture decisions (ADRs) β”‚ β”‚ +β”‚ β”‚ - Requirements and constraints β”‚ β”‚ +β”‚ β”‚ - Knowledge base documents β”‚ β”‚ +β”‚ β”‚ Storage: PostgreSQL + pgvector (RAG) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Context injection pattern:** + +```python +async def prepare_agent_context( + agent_id: str, + conversation_id: str, + project_id: str +) -> list[dict]: + """Build context for LLM call.""" + context = [] + + # 1. Conversation context (recent exchanges) + recent_messages = await get_conversation_history( + conversation_id, limit=10 + ) + context.extend(format_as_messages(recent_messages)) + + # 2. Session context (relevant decisions) + session_context = await get_session_context(agent_id, project_id) + if session_context: + context.append({ + "role": "system", + "content": f"Session context:\n{session_context}" + }) + + # 3. 
Project context (RAG retrieval) + query = extract_query_from_conversation(recent_messages) + rag_results = await search_knowledge_base(project_id, query) + if rag_results: + context.append({ + "role": "system", + "content": f"Relevant project context:\n{rag_results}" + }) + + return context +``` + +### 6. Conflict resolution when agents disagree? + +**Hierarchical escalation model:** + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Conflict Resolution Hierarchy β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”‚ +β”‚ Level 1: Direct Negotiation β”‚ +β”‚ β”œβ”€β”€ Agents exchange arguments (max 3 rounds) β”‚ +β”‚ β”œβ”€β”€ If consensus β†’ Resolution recorded β”‚ +β”‚ └── If no consensus β†’ Escalate to Level 2 β”‚ +β”‚ β”‚ +β”‚ Level 2: Expert Arbitration β”‚ +β”‚ β”œβ”€β”€ Relevant expert agent weighs in (Architect for tech) β”‚ +β”‚ β”œβ”€β”€ Expert makes binding recommendation β”‚ +β”‚ └── If expert unavailable β†’ Escalate to Level 3 β”‚ +β”‚ β”‚ +β”‚ Level 3: Human Decision β”‚ +β”‚ β”œβ”€β”€ Conflict summarized and sent to human client β”‚ +β”‚ β”œβ”€β”€ Approval event triggers resolution β”‚ +β”‚ └── Decision recorded for future reference β”‚ +β”‚ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Implementation:** + +```python +class ConflictResolution(str, Enum): + NEGOTIATION = "negotiation" + EXPERT_ARBITRATION = "expert_arbitration" + HUMAN_DECISION = "human_decision" + +class ConflictTracker: + async def record_conflict( + self, + participants: list[str], + topic: str, + positions: dict[str, str] + ) -> Conflict: + """Record a conflict between agents.""" + return await self.db.create(Conflict( + participants=participants, + topic=topic, + positions=positions, + status=ConflictStatus.OPEN, + resolution_level=ConflictResolution.NEGOTIATION, + negotiation_rounds=0 + )) + + async def escalate(self, conflict_id: str) -> Conflict: + """Escalate conflict to next resolution level.""" + conflict = await self.get(conflict_id) + + if conflict.resolution_level == ConflictResolution.NEGOTIATION: + # Find relevant expert + expert = await self.find_expert_for_topic(conflict.topic) + conflict.arbitrator_id = expert.id + conflict.resolution_level = ConflictResolution.EXPERT_ARBITRATION + + elif conflict.resolution_level == ConflictResolution.EXPERT_ARBITRATION: + # Create human approval request + await self.create_approval_request(conflict) + conflict.resolution_level = ConflictResolution.HUMAN_DECISION + + return await self.db.update(conflict) +``` + +### 7. How to audit/log inter-agent communication? 
+ +**Comprehensive audit trail:** + +```python +class MessageAuditLog: + """All agent messages are persisted for audit.""" + + id = Column(UUID, primary_key=True) + message_id = Column(UUID, index=True) + + # Routing info + from_agent_id = Column(UUID, ForeignKey("agent_instances.id")) + to_agent_id = Column(UUID, nullable=True) # NULL for broadcasts + to_role = Column(String, nullable=True) + + # Message details + message_type = Column(Enum(MessageType)) + action = Column(String(50)) + content_hash = Column(String(64)) # SHA-256 of content + content = Column(Text) # Full content (encrypted at rest) + + # Context + project_id = Column(UUID, ForeignKey("projects.id"), index=True) + conversation_id = Column(UUID, index=True) + parent_message_id = Column(UUID, nullable=True) + + # Response tracking + response_to_id = Column(UUID, nullable=True) + response_received_at = Column(DateTime, nullable=True) + + # Timestamps + created_at = Column(DateTime, default=datetime.utcnow) + delivered_at = Column(DateTime, nullable=True) + read_at = Column(DateTime, nullable=True) +``` + +--- + +## Message Format Specification + +### Core Message Schema + +```python +from pydantic import BaseModel, Field +from typing import Optional, Literal +from datetime import datetime +from uuid import UUID + +class AgentIdentifier(BaseModel): + """Identifies an agent in communication.""" + agent_id: UUID + role: str + name: str + +class MessageContext(BaseModel): + """Contextual information for message routing.""" + project_id: UUID + conversation_id: UUID + issue_id: Optional[UUID] = None + sprint_id: Optional[UUID] = None + parent_message_id: Optional[UUID] = None + +class MessageMetadata(BaseModel): + """Message metadata for processing.""" + created_at: datetime = Field(default_factory=datetime.utcnow) + expires_at: Optional[datetime] = None + priority: Literal["urgent", "high", "normal", "low"] = "normal" + requires_response: bool = False + response_timeout_seconds: Optional[int] = None + retry_count: int = 0 + max_retries: int = 3 + +class AgentMessage(BaseModel): + """Primary message format for inter-agent communication.""" + + # Identity + id: UUID = Field(default_factory=uuid4) + type: Literal["request", "response", "notification", "broadcast", "stream"] + + # Routing + sender: AgentIdentifier + recipient: Optional[AgentIdentifier] = None # None for broadcasts + recipient_role: Optional[str] = None # For role-based routing + recipient_all: bool = False # For project-wide broadcast + + # Action + action: str # e.g., "request_review", "report_bug", "share_finding" + + # Content + content: str # Natural language message + attachments: list[Attachment] = [] # Files, code snippets, etc. 
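+    # (Attachment, defined below, carries either inline content or a url reference)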
+ + # Context + context: MessageContext + + # Metadata + metadata: MessageMetadata = Field(default_factory=MessageMetadata) + + # Response linking + in_response_to: Optional[UUID] = None + +class Attachment(BaseModel): + """Attachments to messages.""" + type: Literal["code", "file", "image", "document", "reference"] + name: str + content: Optional[str] = None # For inline content + url: Optional[str] = None # For file references + mime_type: Optional[str] = None +``` + +### Action Types + +```python +class MessageAction(str, Enum): + # Request actions + REQUEST_REVIEW = "request_review" + REQUEST_GUIDANCE = "request_guidance" + REQUEST_CLARIFICATION = "request_clarification" + REQUEST_APPROVAL = "request_approval" + DELEGATE_TASK = "delegate_task" + + # Response actions + PROVIDE_REVIEW = "provide_review" + PROVIDE_GUIDANCE = "provide_guidance" + PROVIDE_CLARIFICATION = "provide_clarification" + APPROVE = "approve" + REJECT = "reject" + + # Notification actions + REPORT_BUG = "report_bug" + REPORT_PROGRESS = "report_progress" + REPORT_BLOCKER = "report_blocker" + SHARE_FINDING = "share_finding" + ANNOUNCE_DECISION = "announce_decision" + + # Collaboration actions + PROPOSE_APPROACH = "propose_approach" + CHALLENGE_APPROACH = "challenge_approach" + AGREE_WITH = "agree_with" + DISAGREE_WITH = "disagree_with" +``` + +--- + +## Communication Patterns + +### Pattern 1: Request-Response + +**Use Case:** Engineer asks Architect for design guidance. + +```python +# Engineer sends request +request = AgentMessage( + type="request", + sender=AgentIdentifier(agent_id=eng_id, role="Engineer", name="Dave"), + recipient=AgentIdentifier(agent_id=arch_id, role="Architect", name="Alex"), + action="request_guidance", + content="I need to implement caching for the user service. Should I use Redis with write-through or write-behind strategy?", + context=MessageContext( + project_id=project_id, + conversation_id=conv_id, + issue_id=issue_id + ), + metadata=MessageMetadata( + priority="normal", + requires_response=True, + response_timeout_seconds=300 + ) +) + +# Send and await response +response = await message_router.send_and_wait(request, timeout=300) + +# Architect's response +response = AgentMessage( + type="response", + sender=AgentIdentifier(agent_id=arch_id, role="Architect", name="Alex"), + recipient=request.sender, + action="provide_guidance", + content="Given our consistency requirements, use write-through caching with Redis. Here's the pattern...", + in_response_to=request.id, + context=request.context, + attachments=[ + Attachment(type="code", name="cache_pattern.py", content="...") + ] +) +``` + +### Pattern 2: Broadcast + +**Use Case:** Product Owner announces sprint goals. + +```python +broadcast = AgentMessage( + type="broadcast", + sender=AgentIdentifier(agent_id=po_id, role="ProductOwner", name="Sarah"), + recipient_all=True, + action="announce_decision", + content="Sprint 3 goals: 1) Complete auth module 2) Add user settings page 3) Fix critical bugs from QA", + context=MessageContext( + project_id=project_id, + conversation_id=conv_id, + sprint_id=sprint_id + ), + metadata=MessageMetadata( + priority="high", + requires_response=False + ) +) + +await message_router.broadcast(broadcast) +``` + +### Pattern 3: Role-Based Routing + +**Use Case:** QA reports a bug to all Engineers. 
+ +```python +bug_report = AgentMessage( + type="notification", + sender=AgentIdentifier(agent_id=qa_id, role="QA", name="Quinn"), + recipient_role="Engineer", # All engineers receive this + action="report_bug", + content="Found a critical bug in the auth flow: users can bypass 2FA by...", + context=MessageContext( + project_id=project_id, + conversation_id=conv_id, + issue_id=bug_issue_id + ), + metadata=MessageMetadata( + priority="urgent", + requires_response=True + ) +) + +await message_router.send_to_role(bug_report) +``` + +### Pattern 4: Task Delegation + +**Use Case:** Product Owner assigns work to Engineer. + +```python +delegation = AgentMessage( + type="request", + sender=AgentIdentifier(agent_id=po_id, role="ProductOwner", name="Sarah"), + recipient=AgentIdentifier(agent_id=eng_id, role="Engineer", name="Dave"), + action="delegate_task", + content="Please implement issue #45: Add password reset functionality", + context=MessageContext( + project_id=project_id, + conversation_id=conv_id, + issue_id=issue_45_id + ), + metadata=MessageMetadata( + priority="normal", + requires_response=True + ) +) + +# This creates a Celery task for async execution +await message_router.delegate_task(delegation) +``` + +### Pattern 5: Streaming Updates + +**Use Case:** Engineer shares progress during long task. + +```python +async def stream_progress(agent_id: str, task_id: str): + """Stream progress updates during task execution.""" + + for step in task_steps: + # Execute step + result = await execute_step(step) + + # Stream update + update = AgentMessage( + type="stream", + sender=AgentIdentifier(agent_id=agent_id, ...), + recipient_all=True, + action="report_progress", + content=f"Completed: {step.name}. Progress: {step.percent}%", + context=MessageContext(project_id=project_id, ...), + ) + + await message_router.stream(update) +``` + +--- + +## Database Schema + +### Message Storage + +```python +# app/models/agent_message.py +from sqlalchemy import Column, String, Text, DateTime, Enum, ForeignKey, Index +from sqlalchemy.dialects.postgresql import UUID, JSONB +from app.db.base import Base + +class AgentMessage(Base): + """Persistent storage for all agent messages.""" + + __tablename__ = "agent_messages" + + # Primary key + id = Column(UUID, primary_key=True, default=uuid4) + + # Message type + type = Column( + Enum("request", "response", "notification", "broadcast", "stream", + name="message_type"), + nullable=False + ) + + # Sender + sender_agent_id = Column(UUID, ForeignKey("agent_instances.id"), nullable=False) + sender_role = Column(String(50), nullable=False) + sender_name = Column(String(100), nullable=False) + + # Recipient (nullable for broadcasts) + recipient_agent_id = Column(UUID, ForeignKey("agent_instances.id"), nullable=True) + recipient_role = Column(String(50), nullable=True) # For role-based routing + is_broadcast = Column(Boolean, default=False) + + # Action + action = Column(String(50), nullable=False) + + # Content + content = Column(Text, nullable=False) + content_hash = Column(String(64), nullable=False) # SHA-256 + attachments = Column(JSONB, default=list) + + # Context + project_id = Column(UUID, ForeignKey("projects.id"), nullable=False) + conversation_id = Column(UUID, nullable=False) + issue_id = Column(UUID, ForeignKey("issues.id"), nullable=True) + sprint_id = Column(UUID, ForeignKey("sprints.id"), nullable=True) + parent_message_id = Column(UUID, ForeignKey("agent_messages.id"), nullable=True) + + # Response linking + in_response_to_id = Column(UUID, 
ForeignKey("agent_messages.id"), nullable=True) + + # Metadata + priority = Column( + Enum("urgent", "high", "normal", "low", name="priority_level"), + default="normal" + ) + requires_response = Column(Boolean, default=False) + response_timeout_seconds = Column(Integer, nullable=True) + + # Status tracking + status = Column( + Enum("pending", "delivered", "read", "responded", "expired", "failed", + name="message_status"), + default="pending" + ) + + # Timestamps + created_at = Column(DateTime, default=datetime.utcnow, nullable=False) + delivered_at = Column(DateTime, nullable=True) + read_at = Column(DateTime, nullable=True) + responded_at = Column(DateTime, nullable=True) + expires_at = Column(DateTime, nullable=True) + + # Indexes + __table_args__ = ( + Index("ix_agent_messages_project_conversation", "project_id", "conversation_id"), + Index("ix_agent_messages_sender", "sender_agent_id", "created_at"), + Index("ix_agent_messages_recipient", "recipient_agent_id", "status"), + Index("ix_agent_messages_conversation", "conversation_id", "created_at"), + ) + + +class Conversation(Base): + """Groups related messages into conversations.""" + + __tablename__ = "conversations" + + id = Column(UUID, primary_key=True, default=uuid4) + + # Context + project_id = Column(UUID, ForeignKey("projects.id"), nullable=False) + issue_id = Column(UUID, ForeignKey("issues.id"), nullable=True) + sprint_id = Column(UUID, ForeignKey("sprints.id"), nullable=True) + + # Participants + participant_ids = Column(JSONB, default=list) # List of agent IDs + + # Topic + topic = Column(String(200), nullable=True) + + # Status + status = Column( + Enum("active", "resolved", "archived", name="conversation_status"), + default="active" + ) + + # Timestamps + created_at = Column(DateTime, default=datetime.utcnow) + updated_at = Column(DateTime, onupdate=datetime.utcnow) + resolved_at = Column(DateTime, nullable=True) + + # Summary (generated) + summary = Column(Text, nullable=True) + + +class MessageDelivery(Base): + """Tracks delivery status for broadcast messages.""" + + __tablename__ = "message_deliveries" + + id = Column(UUID, primary_key=True, default=uuid4) + message_id = Column(UUID, ForeignKey("agent_messages.id"), nullable=False) + recipient_agent_id = Column(UUID, ForeignKey("agent_instances.id"), nullable=False) + + # Status + delivered_at = Column(DateTime, nullable=True) + read_at = Column(DateTime, nullable=True) + responded_at = Column(DateTime, nullable=True) +``` + +### Alembic Migration + +```python +# migrations/versions/xxx_add_agent_messages.py +def upgrade(): + # Create enum types + op.execute(""" + CREATE TYPE message_type AS ENUM ( + 'request', 'response', 'notification', 'broadcast', 'stream' + ) + """) + op.execute(""" + CREATE TYPE priority_level AS ENUM ('urgent', 'high', 'normal', 'low') + """) + op.execute(""" + CREATE TYPE message_status AS ENUM ( + 'pending', 'delivered', 'read', 'responded', 'expired', 'failed' + ) + """) + op.execute(""" + CREATE TYPE conversation_status AS ENUM ('active', 'resolved', 'archived') + """) + + # Create tables + op.create_table( + 'conversations', + sa.Column('id', UUID, primary_key=True), + sa.Column('project_id', UUID, sa.ForeignKey('projects.id'), nullable=False), + sa.Column('issue_id', UUID, sa.ForeignKey('issues.id'), nullable=True), + sa.Column('sprint_id', UUID, sa.ForeignKey('sprints.id'), nullable=True), + sa.Column('participant_ids', JSONB, default=[]), + sa.Column('topic', sa.String(200), nullable=True), + sa.Column('status', sa.Enum('active', 
'resolved', 'archived', + name='conversation_status'), default='active'), + sa.Column('created_at', sa.DateTime, default=datetime.utcnow), + sa.Column('updated_at', sa.DateTime, onupdate=datetime.utcnow), + sa.Column('resolved_at', sa.DateTime, nullable=True), + sa.Column('summary', sa.Text, nullable=True), + ) + + op.create_table( + 'agent_messages', + # ... columns as defined above + ) + + op.create_table( + 'message_deliveries', + # ... columns as defined above + ) +``` + +--- + +## Code Examples + +### Message Router Service + +```python +# app/services/message_router.py +from typing import Optional, AsyncIterator +from uuid import UUID +import asyncio +import hashlib +import json + +from app.models.agent_message import AgentMessage as AgentMessageModel +from app.schemas.messages import AgentMessage, AgentIdentifier +from app.services.events import EventBus +from app.db.session import AsyncSession +from app.core.celery_app import celery_app + +class MessageRouter: + """Routes messages between agents.""" + + def __init__( + self, + db: AsyncSession, + event_bus: EventBus, + orchestrator: "AgentOrchestrator" + ): + self.db = db + self.event_bus = event_bus + self.orchestrator = orchestrator + + async def send(self, message: AgentMessage) -> AgentMessageModel: + """Send a message to a specific agent.""" + # Validate recipient exists and is active + recipient = await self.orchestrator.get_instance(message.recipient.agent_id) + if not recipient or recipient.status != "active": + raise AgentNotAvailableError(f"Agent {message.recipient.agent_id} not available") + + # Persist message + db_message = await self._persist_message(message) + + # Publish to recipient's channel + await self.event_bus.publish( + f"agent:{message.recipient.agent_id}", + { + "type": "message_received", + "message_id": str(db_message.id), + "sender": message.sender.dict(), + "action": message.action, + "priority": message.metadata.priority, + "preview": message.content[:100] + } + ) + + # Also publish to project channel for monitoring + await self.event_bus.publish( + f"project:{message.context.project_id}", + { + "type": "agent_message", + "from": message.sender.name, + "to": message.recipient.name, + "action": message.action + } + ) + + return db_message + + async def send_and_wait( + self, + message: AgentMessage, + timeout: int = 300 + ) -> Optional[AgentMessage]: + """Send a message and wait for response.""" + db_message = await self.send(message) + + # Subscribe to response channel + response_channel = f"response:{db_message.id}" + subscriber = await self.event_bus.subscribe(response_channel) + + try: + response_event = await asyncio.wait_for( + subscriber.get_event(), + timeout=timeout + ) + response_id = response_event.data["response_id"] + return await self.get_message(response_id) + except asyncio.TimeoutError: + # Update message status to expired + await self._mark_expired(db_message.id) + return None + finally: + await subscriber.unsubscribe() + + async def send_to_role(self, message: AgentMessage) -> list[AgentMessageModel]: + """Send message to all agents of a specific role.""" + # Get all active agents with the target role + recipients = await self.orchestrator.get_instances_by_role( + project_id=message.context.project_id, + role=message.recipient_role + ) + + messages = [] + for recipient in recipients: + msg = message.copy() + msg.recipient = AgentIdentifier( + agent_id=recipient.id, + role=recipient.agent_type.role, + name=recipient.name + ) + db_message = await self.send(msg) + 
messages.append(db_message) + + return messages + + async def broadcast(self, message: AgentMessage) -> AgentMessageModel: + """Broadcast message to all project agents.""" + # Persist the broadcast message + db_message = await self._persist_message(message) + + # Get all active agents in project + agents = await self.orchestrator.get_project_agents( + message.context.project_id + ) + + # Create delivery records + for agent in agents: + if agent.id != message.sender.agent_id: + await self._create_delivery_record(db_message.id, agent.id) + + # Publish to project channel + await self.event_bus.publish( + f"project:{message.context.project_id}", + { + "type": "broadcast", + "message_id": str(db_message.id), + "sender": message.sender.dict(), + "action": message.action, + "content": message.content + } + ) + + return db_message + + async def delegate_task(self, message: AgentMessage) -> str: + """Delegate a task for async execution via Celery.""" + # Persist message + db_message = await self._persist_message(message) + + # Create Celery task + task = celery_app.send_task( + "app.tasks.agent.execute_delegated_task", + args=[str(db_message.id), str(message.recipient.agent_id)], + queue="agent_tasks", + priority=self._priority_to_int(message.metadata.priority) + ) + + return task.id + + async def respond( + self, + original_message_id: UUID, + response: AgentMessage + ) -> AgentMessageModel: + """Send a response to a previous message.""" + original = await self.get_message(original_message_id) + + response.in_response_to = original_message_id + response.context = original.context + + db_response = await self.send(response) + + # Update original message status + await self._mark_responded(original_message_id) + + # Publish response notification + await self.event_bus.publish( + f"response:{original_message_id}", + { + "type": "response_received", + "response_id": str(db_response.id) + } + ) + + return db_response + + async def get_conversation_history( + self, + conversation_id: UUID, + limit: int = 50, + before: Optional[datetime] = None + ) -> list[AgentMessageModel]: + """Get messages in a conversation.""" + query = select(AgentMessageModel).where( + AgentMessageModel.conversation_id == conversation_id + ) + + if before: + query = query.where(AgentMessageModel.created_at < before) + + query = query.order_by(AgentMessageModel.created_at.desc()).limit(limit) + + result = await self.db.execute(query) + return list(reversed(result.scalars().all())) + + async def _persist_message(self, message: AgentMessage) -> AgentMessageModel: + """Persist message to database.""" + content_hash = hashlib.sha256(message.content.encode()).hexdigest() + + db_message = AgentMessageModel( + id=message.id, + type=message.type, + sender_agent_id=message.sender.agent_id, + sender_role=message.sender.role, + sender_name=message.sender.name, + recipient_agent_id=message.recipient.agent_id if message.recipient else None, + recipient_role=message.recipient_role, + is_broadcast=message.recipient_all, + action=message.action, + content=message.content, + content_hash=content_hash, + attachments=[a.dict() for a in message.attachments], + project_id=message.context.project_id, + conversation_id=message.context.conversation_id, + issue_id=message.context.issue_id, + sprint_id=message.context.sprint_id, + parent_message_id=message.context.parent_message_id, + in_response_to_id=message.in_response_to, + priority=message.metadata.priority, + requires_response=message.metadata.requires_response, + 
response_timeout_seconds=message.metadata.response_timeout_seconds, + expires_at=message.metadata.expires_at, + ) + + self.db.add(db_message) + await self.db.commit() + await self.db.refresh(db_message) + + return db_message + + def _priority_to_int(self, priority: str) -> int: + """Convert priority to Celery priority (0=highest).""" + return {"urgent": 0, "high": 3, "normal": 6, "low": 9}.get(priority, 6) +``` + +### Receiving and Processing Messages + +```python +# app/services/agent_inbox.py +from typing import Optional, Callable, Awaitable + +class AgentInbox: + """Manages incoming messages for an agent.""" + + def __init__( + self, + agent_id: UUID, + event_bus: EventBus, + message_router: MessageRouter + ): + self.agent_id = agent_id + self.event_bus = event_bus + self.message_router = message_router + self._handlers: dict[str, Callable] = {} + + def on_action( + self, + action: str, + handler: Callable[[AgentMessage], Awaitable[Optional[AgentMessage]]] + ): + """Register a handler for a specific action type.""" + self._handlers[action] = handler + + async def start_listening(self): + """Start listening for incoming messages.""" + channel = f"agent:{self.agent_id}" + subscriber = await self.event_bus.subscribe(channel) + + while True: + event = await subscriber.get_event() + + if event.type == "message_received": + message = await self.message_router.get_message( + event.data["message_id"] + ) + await self._process_message(message) + + async def _process_message(self, message: AgentMessage): + """Process an incoming message.""" + # Mark as read + await self.message_router.mark_read(message.id) + + # Find handler + handler = self._handlers.get(message.action) + if not handler: + handler = self._handlers.get("default") + + if handler: + response = await handler(message) + + if response and message.requires_response: + await self.message_router.respond(message.id, response) + + +# Usage in agent runner +class AgentRunner: + async def setup_message_handlers(self): + """Setup handlers for different message types.""" + + @self.inbox.on_action("request_review") + async def handle_review_request(message: AgentMessage) -> AgentMessage: + # Execute code review + review_result = await self.execute("review_code", { + "content": message.content, + "attachments": message.attachments + }) + + return AgentMessage( + type="response", + sender=self.identifier, + recipient=message.sender, + action="provide_review", + content=review_result["content"], + attachments=review_result.get("attachments", []) + ) + + @self.inbox.on_action("delegate_task") + async def handle_task_delegation(message: AgentMessage) -> AgentMessage: + # Accept and start working on task + await self.update_status("working") + + # Execute task + result = await self.execute("implement", { + "task_description": message.content, + "issue_id": message.context.issue_id + }) + + return AgentMessage( + type="response", + sender=self.identifier, + action="task_completed", + content=f"Task completed. 
{result['summary']}" + ) +``` + +### API Endpoints + +```python +# app/api/v1/messages.py +from fastapi import APIRouter, Depends, HTTPException +from uuid import UUID + +from app.schemas.messages import AgentMessage, MessageCreate, MessageResponse +from app.services.message_router import MessageRouter +from app.api.deps import get_message_router, get_current_user + +router = APIRouter(prefix="/messages", tags=["messages"]) + +@router.post("/send", response_model=MessageResponse) +async def send_message( + message_in: MessageCreate, + router: MessageRouter = Depends(get_message_router), + current_user: User = Depends(get_current_user) +): + """Send a message between agents (admin/system use).""" + message = AgentMessage(**message_in.dict()) + result = await router.send(message) + return MessageResponse.from_orm(result) + +@router.post("/broadcast", response_model=MessageResponse) +async def broadcast_message( + message_in: MessageCreate, + router: MessageRouter = Depends(get_message_router), + current_user: User = Depends(get_current_user) +): + """Broadcast a message to all project agents.""" + message = AgentMessage(**message_in.dict(), recipient_all=True) + result = await router.broadcast(message) + return MessageResponse.from_orm(result) + +@router.get("/conversation/{conversation_id}", response_model=list[MessageResponse]) +async def get_conversation( + conversation_id: UUID, + limit: int = 50, + router: MessageRouter = Depends(get_message_router), + current_user: User = Depends(get_current_user) +): + """Get messages in a conversation.""" + messages = await router.get_conversation_history(conversation_id, limit=limit) + return [MessageResponse.from_orm(m) for m in messages] + +@router.get("/agent/{agent_id}/inbox", response_model=list[MessageResponse]) +async def get_agent_inbox( + agent_id: UUID, + status: Optional[str] = None, + limit: int = 50, + router: MessageRouter = Depends(get_message_router), + current_user: User = Depends(get_current_user) +): + """Get messages in an agent's inbox.""" + messages = await router.get_agent_messages(agent_id, status=status, limit=limit) + return [MessageResponse.from_orm(m) for m in messages] +``` + +--- + +## SSE Integration + +### Real-time Message Events + +Messages integrate with the existing SSE event system (ADR-002): + +```python +# Extended event types for messages +class EventType(str, Enum): + # Existing events... 
+ + # Message Events + MESSAGE_SENT = "message_sent" + MESSAGE_RECEIVED = "message_received" + MESSAGE_READ = "message_read" + MESSAGE_RESPONDED = "message_responded" + + # Conversation Events + CONVERSATION_STARTED = "conversation_started" + CONVERSATION_RESOLVED = "conversation_resolved" + + # Conflict Events + CONFLICT_DETECTED = "conflict_detected" + CONFLICT_ESCALATED = "conflict_escalated" + CONFLICT_RESOLVED = "conflict_resolved" +``` + +### Message Event Stream + +```python +# app/api/v1/events.py (extended) + +@router.get("/projects/{project_id}/messages/events") +async def message_events( + project_id: UUID, + request: Request, + current_user: User = Depends(get_current_user) +): + """Stream message events for a project.""" + + async def event_generator(): + subscriber = await event_bus.subscribe(f"project:{project_id}:messages") + + try: + while not await request.is_disconnected(): + try: + event = await asyncio.wait_for( + subscriber.get_event(), + timeout=30.0 + ) + + # Filter by event type + if event.type in [ + "message_sent", "message_received", + "conversation_started", "conflict_detected" + ]: + yield f"event: {event.type}\ndata: {json.dumps(event.data)}\n\n" + + except asyncio.TimeoutError: + yield ": keepalive\n\n" + finally: + await subscriber.unsubscribe() + + return StreamingResponse( + event_generator(), + media_type="text/event-stream" + ) +``` + +### Frontend Integration + +```typescript +// frontend/lib/messageEvents.ts +import { useEffect, useCallback } from 'react'; + +interface MessageEvent { + type: 'message_sent' | 'message_received' | 'conversation_started'; + data: { + message_id?: string; + from?: AgentIdentifier; + to?: AgentIdentifier; + action?: string; + preview?: string; + }; +} + +export function useMessageEvents( + projectId: string, + onMessage: (event: MessageEvent) => void +) { + useEffect(() => { + const eventSource = new EventSource( + `/api/v1/projects/${projectId}/messages/events`, + { withCredentials: true } + ); + + eventSource.addEventListener('message_sent', (e) => { + onMessage({ type: 'message_sent', data: JSON.parse(e.data) }); + }); + + eventSource.addEventListener('message_received', (e) => { + onMessage({ type: 'message_received', data: JSON.parse(e.data) }); + }); + + eventSource.addEventListener('conversation_started', (e) => { + onMessage({ type: 'conversation_started', data: JSON.parse(e.data) }); + }); + + eventSource.onerror = () => { + // Reconnection handled by EventSource + console.log('Message SSE reconnecting...'); + }; + + return () => eventSource.close(); + }, [projectId, onMessage]); +} +``` + +--- + +## Syndarix-Specific Requirements + +### @Mentions Support + +```python +class MentionParser: + """Parse @mentions in message content.""" + + MENTION_PATTERN = r'@(\w+)' + ROLE_PATTERN = r'@(engineers?|architects?|qa|po|pm|devops)' + + def parse(self, content: str) -> list[Mention]: + mentions = [] + + # Parse agent mentions + for match in re.finditer(self.MENTION_PATTERN, content): + name = match.group(1) + mentions.append(Mention(type="agent", name=name, position=match.start())) + + # Parse role mentions + for match in re.finditer(self.ROLE_PATTERN, content, re.IGNORECASE): + role = match.group(1) + mentions.append(Mention(type="role", role=role, position=match.start())) + + return mentions + + async def resolve_mentions( + self, + mentions: list[Mention], + project_id: UUID, + orchestrator: AgentOrchestrator + ) -> list[UUID]: + """Resolve mentions to agent IDs.""" + agent_ids = set() + + for mention in mentions: + if 
mention.type == "agent": + agent = await orchestrator.get_by_name(project_id, mention.name) + if agent: + agent_ids.add(agent.id) + elif mention.type == "role": + agents = await orchestrator.get_instances_by_role( + project_id, mention.role + ) + agent_ids.update(a.id for a in agents) + + return list(agent_ids) +``` + +### Priority Message Handling + +```python +class PriorityMessageHandler: + """Handle urgent messages with priority.""" + + PRIORITY_INTERRUPTS = {"urgent", "high"} + + async def process_with_priority( + self, + agent_id: UUID, + message: AgentMessage + ): + """Process message based on priority.""" + if message.metadata.priority in self.PRIORITY_INTERRUPTS: + # Interrupt current task + await self.orchestrator.interrupt_agent(agent_id) + + # Publish urgent notification + await self.event_bus.publish( + f"project:{message.context.project_id}", + { + "type": "urgent_message", + "agent_id": str(agent_id), + "message_id": str(message.id), + "from": message.sender.name, + "preview": message.content[:100] + } + ) + + # Process immediately + await self.process_immediately(agent_id, message) + else: + # Queue for normal processing + await self.queue_for_processing(agent_id, message) +``` + +### Communication Traceability + +```python +class CommunicationTracer: + """Trace all inter-agent communication.""" + + async def trace_message_flow( + self, + conversation_id: UUID + ) -> MessageFlowTrace: + """Generate a trace of message flow in a conversation.""" + messages = await self.get_conversation_messages(conversation_id) + + nodes = [] + edges = [] + + for msg in messages: + nodes.append({ + "id": str(msg.id), + "agent": msg.sender_name, + "role": msg.sender_role, + "action": msg.action, + "timestamp": msg.created_at.isoformat() + }) + + if msg.in_response_to_id: + edges.append({ + "from": str(msg.in_response_to_id), + "to": str(msg.id), + "type": "response" + }) + + if msg.parent_message_id: + edges.append({ + "from": str(msg.parent_message_id), + "to": str(msg.id), + "type": "thread" + }) + + return MessageFlowTrace( + conversation_id=conversation_id, + nodes=nodes, + edges=edges, + summary=await self.generate_summary(messages) + ) +``` + +--- + +## Performance Considerations + +### Message Throughput + +| Scenario | Target | Implementation | +|----------|--------|----------------| +| Direct messages | 100/sec | Redis Pub/Sub | +| Broadcasts | 10/sec | Batched delivery | +| Task delegation | 50/sec | Celery queue | +| Message persistence | 200/sec | Batch inserts | + +### Scalability + +```python +# Message batching for high-throughput scenarios +class MessageBatcher: + def __init__(self, batch_size: int = 100, flush_interval: float = 0.1): + self.batch_size = batch_size + self.flush_interval = flush_interval + self._buffer: list[AgentMessage] = [] + self._lock = asyncio.Lock() + + async def add(self, message: AgentMessage): + async with self._lock: + self._buffer.append(message) + + if len(self._buffer) >= self.batch_size: + await self._flush() + + async def _flush(self): + if not self._buffer: + return + + messages = self._buffer + self._buffer = [] + + # Batch insert + await self.db.execute( + insert(AgentMessageModel), + [self._to_dict(m) for m in messages] + ) + await self.db.commit() +``` + +### Memory Management + +- Conversation history limited to last 50 messages in-memory +- Older messages retrieved on-demand from database +- Redis TTL of 1 hour for active conversation cache +- Periodic archival of resolved conversations + +--- + +## References + +### Industry Protocols 
Researched + +- [Google A2A Protocol](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/) - Agent-to-Agent Protocol +- [IBM ACP](https://arxiv.org/html/2505.02279v1) - Agent Communication Protocol +- [Anthropic MCP](https://spec.modelcontextprotocol.io/) - Model Context Protocol +- [Linux Foundation A2A Project](https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents) + +### Multi-Agent System Research + +- [Multi-Agent Collaboration Mechanisms Survey](https://arxiv.org/html/2501.06322v1) +- [Conflict Resolution in Multi-Agent Systems](https://zilliz.com/ai-faq/how-do-multiagent-systems-manage-conflict-resolution) +- [Agent Memory Management](https://www.letta.com/blog/agent-memory) +- [Context Engineering for LLM Agents](https://developers.googleblog.com/architecting-efficient-context-aware-multi-agent-framework-for-production/) + +### Syndarix Architecture + +- [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md) +- [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md) +- [SPIKE-003: Real-time Updates](./SPIKE-003-realtime-updates.md) + +--- + +## Decision + +**Adopt a hybrid message-based communication protocol** with: + +1. **Structured JSON messages** with natural language content +2. **Three routing strategies**: Direct, Role-based, Broadcast +3. **Redis Pub/Sub** for real-time delivery +4. **PostgreSQL** for message persistence and audit +5. **Celery** for async task delegation +6. **SSE** for client notifications +7. **Three-tier context management** for conversation continuity +8. **Hierarchical conflict resolution** with human escalation + +This protocol integrates seamlessly with existing Syndarix infrastructure while following industry best practices from A2A and ACP protocols. + +--- + +*Spike completed. Findings will inform implementation of inter-agent communication in the Agent Orchestration layer.* diff --git a/docs/spikes/SPIKE-008-workflow-state-machine.md b/docs/spikes/SPIKE-008-workflow-state-machine.md new file mode 100644 index 0000000..06f4513 --- /dev/null +++ b/docs/spikes/SPIKE-008-workflow-state-machine.md @@ -0,0 +1,1513 @@ +# SPIKE-008: Workflow State Machine Architecture + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #8 + +--- + +## Executive Summary + +Syndarix requires durable state machine capabilities to orchestrate long-lived workflows spanning hours to days (sprint execution, story implementation, PR review cycles). After evaluating multiple approaches, we recommend a **hybrid architecture**: + +1. **`transitions` library** for state machine logic (lightweight, Pythonic, well-tested) +2. **PostgreSQL** for state persistence with event sourcing for audit trail +3. **Celery** for task execution (already planned in SPIKE-004) +4. **Custom workflow engine** built on these primitives + +This approach balances simplicity with durability, avoiding the operational complexity of dedicated workflow engines like Temporal while providing the reliability Syndarix requires. + +--- + +## Research Questions & Findings + +### 1. 
Best Python State Machine Libraries (2024-2025) + +| Library | Stars | Last Update | Async | Persistence | Visualization | Best For | +|---------|-------|-------------|-------|-------------|---------------|----------| +| [transitions](https://github.com/pytransitions/transitions) | 5.5k+ | Active | Yes | Manual | Graphviz | General FSM | +| [python-statemachine](https://github.com/fgmacedo/python-statemachine) | 800+ | Active | Yes | Django mixin | Graphviz | Django projects | +| [sismic](https://github.com/AlexandreDecan/sismic) | 400+ | Active | No | Manual | PlantUML | Complex statecharts | +| [automat](https://github.com/glyph/automat) | 300+ | Mature | No | No | No | Protocol implementations | + +**Recommendation:** `transitions` - Most mature, flexible, excellent documentation, supports hierarchical states and callbacks. + +### 2. Framework Comparison + +#### transitions (Recommended for State Logic) + +**Pros:** +- Lightweight, no external dependencies +- Hierarchical (nested) states support +- Rich callback system (before/after/on_enter/on_exit) +- Machine factory for persistence +- Graphviz diagram generation +- Active maintenance, well-tested + +**Cons:** +- No built-in persistence (by design - flexible) +- No distributed coordination +- Manual integration with task queues + +```python +from transitions import Machine + +class StoryWorkflow: + states = ['analysis', 'design', 'implementation', 'review', 'testing', 'done'] + + def __init__(self, story_id: str): + self.story_id = story_id + self.machine = Machine(model=self, states=self.states, initial='analysis') + + # Define transitions + self.machine.add_transition('design_ready', 'analysis', 'design') + self.machine.add_transition('implementation_ready', 'design', 'implementation') + self.machine.add_transition('submit_for_review', 'implementation', 'review') + self.machine.add_transition('request_changes', 'review', 'implementation') + self.machine.add_transition('approve', 'review', 'testing') + self.machine.add_transition('tests_pass', 'testing', 'done') + self.machine.add_transition('tests_fail', 'testing', 'implementation') +``` + +#### Temporal (Considered but Not Recommended) + +**Pros:** +- Durable execution out of the box +- Handles long-running workflows (months/years) +- Built-in retries, timeouts, versioning +- Excellent Python SDK with asyncio integration +- Automatic state persistence + +**Cons:** +- **Heavy infrastructure requirement** (Temporal Server cluster) +- Vendor lock-in to Temporal's model +- Learning curve for Temporal-specific patterns +- Overkill for Syndarix's scale +- Additional operational burden + +**When to Choose Temporal:** +- Mission-critical financial workflows +- 10,000+ concurrent workflows +- Team with dedicated infrastructure capacity + +#### Prefect (Not Recommended for This Use Case) + +**Pros:** +- Great for ETL/data pipelines +- Nice UI for workflow visualization +- Good scheduling capabilities + +**Cons:** +- Designed for batch data processing, not interactive workflows +- State model doesn't map well to business workflows +- Would require significant adaptation + +#### Custom + Celery (Recommended Approach) + +Combine `transitions` state machine logic with Celery task execution: +- State machine handles workflow logic +- PostgreSQL persists state +- Celery executes tasks +- Redis Pub/Sub broadcasts state changes + +### 3. 
State Persistence Strategy
+
+#### Database Schema
+
+```python
+# app/models/workflow.py
+from enum import Enum
+from sqlalchemy import Column, String, Enum as SQLEnum, JSON, ForeignKey, Integer
+from sqlalchemy.dialects.postgresql import UUID
+from sqlalchemy.orm import relationship
+from app.models.base import Base, TimestampMixin, UUIDMixin
+
+class WorkflowType(str, Enum):
+    SPRINT = "sprint"
+    STORY = "story"
+    PULL_REQUEST = "pull_request"
+    AGENT_TASK = "agent_task"
+
+class WorkflowInstance(Base, UUIDMixin, TimestampMixin):
+    """Represents a running workflow instance."""
+    __tablename__ = "workflow_instances"
+
+    workflow_type = Column(SQLEnum(WorkflowType), nullable=False, index=True)
+    current_state = Column(String(100), nullable=False, index=True)
+    entity_id = Column(String(100), nullable=False, index=True)  # story_id, sprint_id, etc.
+    project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False)
+    context = Column(JSON, default=dict)  # Workflow-specific context
+    error = Column(String(1000), nullable=True)
+    retry_count = Column(Integer, default=0)
+
+    # Relationships
+    project = relationship("Project", back_populates="workflows")
+    transitions = relationship("WorkflowTransition", back_populates="workflow", order_by="WorkflowTransition.created_at")
+
+class WorkflowTransition(Base, UUIDMixin, TimestampMixin):
+    """Event sourcing table for workflow state changes."""
+    __tablename__ = "workflow_transitions"
+
+    workflow_id = Column(UUID(as_uuid=True), ForeignKey("workflow_instances.id"), nullable=False, index=True)
+    from_state = Column(String(100), nullable=False)
+    to_state = Column(String(100), nullable=False)
+    trigger = Column(String(100), nullable=False)  # The transition name
+    triggered_by = Column(String(100), nullable=True)  # agent_id, user_id, or "system"
+    # "metadata" is a reserved attribute name on SQLAlchemy declarative models,
+    # so the attribute is exposed as transition_metadata and mapped to the "metadata" column.
+    transition_metadata = Column("metadata", JSON, default=dict)  # Additional context
+
+    # Relationship
+    workflow = relationship("WorkflowInstance", back_populates="transitions")
+```
+
+#### Migration Example
+
+```python
+# alembic/versions/xxx_add_workflow_tables.py
+from alembic import op
+import sqlalchemy as sa
+from sqlalchemy.dialects import postgresql
+
+def upgrade():
+    op.create_table(
+        'workflow_instances',
+        sa.Column('id', postgresql.UUID(as_uuid=True), primary_key=True),
+        sa.Column('workflow_type', sa.Enum('sprint', 'story', 'pull_request', 'agent_task', name='workflowtype'), nullable=False),
+        sa.Column('current_state', sa.String(100), nullable=False),
+        sa.Column('entity_id', sa.String(100), nullable=False),
+        sa.Column('project_id', postgresql.UUID(as_uuid=True), sa.ForeignKey('projects.id'), nullable=False),
+        sa.Column('context', postgresql.JSON, server_default='{}'),
+        sa.Column('error', sa.String(1000), nullable=True),
+        sa.Column('retry_count', sa.Integer, server_default='0'),
+        sa.Column('created_at', sa.DateTime(timezone=True), nullable=False),
+        sa.Column('updated_at', sa.DateTime(timezone=True), nullable=False),
+    )
+    op.create_index('ix_workflow_instances_type_state', 'workflow_instances', ['workflow_type', 'current_state'])
+    op.create_index('ix_workflow_instances_entity', 'workflow_instances', ['entity_id'])
+
+    op.create_table(
+        'workflow_transitions',
+        sa.Column('id', postgresql.UUID(as_uuid=True), primary_key=True),
+        sa.Column('workflow_id', postgresql.UUID(as_uuid=True), sa.ForeignKey('workflow_instances.id'), nullable=False),
+        sa.Column('from_state', sa.String(100), nullable=False),
+        sa.Column('to_state', sa.String(100), nullable=False),
+        sa.Column('trigger', sa.String(100), nullable=False),
+        sa.Column('triggered_by', sa.String(100), nullable=True),
+        sa.Column('metadata', postgresql.JSON, server_default='{}'),
+
sa.Column('created_at', sa.DateTime(timezone=True), nullable=False), + sa.Column('updated_at', sa.DateTime(timezone=True), nullable=False), + ) + op.create_index('ix_workflow_transitions_workflow', 'workflow_transitions', ['workflow_id']) +``` + +### 4. Syndarix Workflow State Machines + +#### Sprint Workflow + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” start β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” complete β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Planning β”‚ ─────────────► β”‚Developmentβ”‚ ──────────────► β”‚ Testing β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ + β”‚ β”‚ block β”‚ pass + β”‚ β–Ό β–Ό + β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ β”‚ Blocked β”‚ β”‚ Demo β”‚ + β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ unblock β”‚ + β”‚ β–Ό β”‚ accept + β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β–Ό + β”‚ β”‚Developmentβ”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚Retrospective β”‚ + β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ cancel β”‚ + β–Ό β”‚ complete +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β–Ό +β”‚ Cancelled β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Completed β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +```python +# app/workflows/sprint_workflow.py +from transitions import Machine +from typing import Optional +from app.models.workflow import WorkflowInstance, WorkflowTransition + +class SprintWorkflow: + states = [ + 'planning', + 'development', + 'blocked', + 'testing', + 'demo', + 'retrospective', + 'completed', + 'cancelled' + ] + + transitions = [ + # Normal flow + {'trigger': 'start', 'source': 'planning', 'dest': 'development'}, + {'trigger': 'complete_development', 'source': 'development', 'dest': 'testing'}, + {'trigger': 'tests_pass', 'source': 'testing', 'dest': 'demo'}, + {'trigger': 'demo_accepted', 'source': 'demo', 'dest': 'retrospective'}, + {'trigger': 'complete', 'source': 'retrospective', 'dest': 'completed'}, + + # Blocking + {'trigger': 'block', 'source': 'development', 'dest': 'blocked'}, + {'trigger': 'unblock', 'source': 'blocked', 'dest': 'development'}, + + # Test failures + {'trigger': 'tests_fail', 'source': 'testing', 'dest': 'development'}, + + # Demo rejection + {'trigger': 'demo_rejected', 'source': 'demo', 'dest': 'development'}, + + # Cancellation (from any active state) + {'trigger': 'cancel', 'source': ['planning', 'development', 'blocked', 'testing', 'demo'], 'dest': 'cancelled'}, + ] + + def __init__(self, sprint_id: str, project_id: str, initial_state: str = 'planning'): + self.sprint_id = sprint_id + self.project_id = project_id + self.machine = Machine( + model=self, + states=self.states, + transitions=self.transitions, + initial=initial_state, + auto_transitions=False, + send_event=True, # Pass EventData to callbacks + ) + + # Register callbacks for persistence + self.machine.on_enter_development(self._on_enter_development) + self.machine.on_enter_completed(self._on_enter_completed) + self.machine.on_enter_blocked(self._on_enter_blocked) + + def _on_enter_development(self, event): + """Trigger when entering development state.""" + # Could dispatch Celery task to notify agents + pass + + def _on_enter_completed(self, event): + """Trigger when sprint is completed.""" + # Generate sprint report, notify stakeholders + pass + + def _on_enter_blocked(self, 
event): + """Trigger when sprint is blocked.""" + # Alert human operator + pass +``` + +#### Story Workflow + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” ready β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” ready β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Analysis β”‚ ────────► β”‚ Design β”‚ ────────► β”‚ Implementation β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”‚ submit + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β” tests_pass β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” approve β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Done β”‚ ◄─────────── β”‚ Testing β”‚ ◄──────── β”‚ Review β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ + β”‚ tests_fail β”‚ request_changes + β–Ό β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Implementation β”‚ β”‚ Implementation β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +```python +# app/workflows/story_workflow.py +from transitions import Machine + +class StoryWorkflow: + states = [ + {'name': 'backlog', 'on_enter': '_notify_backlog'}, + {'name': 'analysis', 'on_enter': '_start_analysis_task'}, + {'name': 'design', 'on_enter': '_start_design_task'}, + {'name': 'implementation', 'on_enter': '_start_implementation_task'}, + {'name': 'review', 'on_enter': '_create_pr'}, + {'name': 'testing', 'on_enter': '_run_tests'}, + {'name': 'done', 'on_enter': '_notify_completion'}, + {'name': 'blocked', 'on_enter': '_escalate_block'}, + ] + + transitions = [ + # Happy path + {'trigger': 'start_analysis', 'source': 'backlog', 'dest': 'analysis'}, + {'trigger': 'analysis_complete', 'source': 'analysis', 'dest': 'design'}, + {'trigger': 'design_complete', 'source': 'design', 'dest': 'implementation'}, + {'trigger': 'submit_for_review', 'source': 'implementation', 'dest': 'review'}, + {'trigger': 'approve', 'source': 'review', 'dest': 'testing'}, + {'trigger': 'tests_pass', 'source': 'testing', 'dest': 'done'}, + + # Revision loops + {'trigger': 'request_changes', 'source': 'review', 'dest': 'implementation'}, + {'trigger': 'tests_fail', 'source': 'testing', 'dest': 'implementation'}, + + # Blocking (from any active state) + {'trigger': 'block', 'source': ['analysis', 'design', 'implementation', 'review', 'testing'], 'dest': 'blocked'}, + {'trigger': 'unblock', 'source': 'blocked', 'dest': 'implementation', 'before': '_restore_previous_state'}, + + # Skip to done (for trivial stories) + {'trigger': 'skip_to_done', 'source': '*', 'dest': 'done', 'conditions': '_is_trivial'}, + ] + + def __init__(self, story_id: str, project_id: str, initial_state: str = 'backlog'): + self.story_id = story_id + self.project_id = project_id + self.previous_state = None + + self.machine = Machine( + model=self, + states=self.states, + transitions=self.transitions, + initial=initial_state, + auto_transitions=False, + ) + + def _is_trivial(self): + """Condition: Check if story is marked as trivial.""" + return False # Would check story metadata + + def _start_analysis_task(self): + """Dispatch analysis to BA agent via Celery.""" + from app.tasks.agent_tasks import run_agent_action + run_agent_action.delay( + agent_type="business_analyst", + project_id=self.project_id, + action="analyze_story", + context={"story_id": self.story_id} + ) + + def _start_design_task(self): + """Dispatch design to Architect agent via Celery.""" + from 
app.tasks.agent_tasks import run_agent_action + run_agent_action.delay( + agent_type="architect", + project_id=self.project_id, + action="design_solution", + context={"story_id": self.story_id} + ) + + def _start_implementation_task(self): + """Dispatch implementation to Engineer agent via Celery.""" + from app.tasks.agent_tasks import run_agent_action + run_agent_action.delay( + agent_type="engineer", + project_id=self.project_id, + action="implement_story", + context={"story_id": self.story_id} + ) + + def _create_pr(self): + """Create pull request for review.""" + pass + + def _run_tests(self): + """Trigger test suite via Celery.""" + from app.tasks.cicd_tasks import run_test_suite + run_test_suite.delay( + project_id=self.project_id, + story_id=self.story_id + ) + + def _notify_completion(self): + """Notify stakeholders of story completion.""" + pass + + def _escalate_block(self): + """Escalate blocked story to human.""" + pass + + def _notify_backlog(self): + """Story added to backlog notification.""" + pass + + def _restore_previous_state(self): + """Restore state before block.""" + pass +``` + +#### PR Workflow + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” submit β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” approve β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Created β”‚ ────────► β”‚ Review β”‚ ─────────► β”‚ Approved β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ + β”‚ request_changes β”‚ merge + β–Ό β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Changes Requested β”‚ β”‚ Merged β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”‚ resubmit + β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Review β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +```python +# app/workflows/pr_workflow.py +from transitions import Machine + +class PRWorkflow: + states = ['created', 'review', 'changes_requested', 'approved', 'merged', 'closed'] + + transitions = [ + {'trigger': 'submit_for_review', 'source': 'created', 'dest': 'review'}, + {'trigger': 'request_changes', 'source': 'review', 'dest': 'changes_requested'}, + {'trigger': 'resubmit', 'source': 'changes_requested', 'dest': 'review'}, + {'trigger': 'approve', 'source': 'review', 'dest': 'approved'}, + {'trigger': 'merge', 'source': 'approved', 'dest': 'merged'}, + {'trigger': 'close', 'source': ['created', 'review', 'changes_requested', 'approved'], 'dest': 'closed'}, + ] + + def __init__(self, pr_id: str, project_id: str, initial_state: str = 'created'): + self.pr_id = pr_id + self.project_id = project_id + self.machine = Machine( + model=self, + states=self.states, + transitions=self.transitions, + initial=initial_state, + ) +``` + +#### Agent Task Workflow + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” start β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” complete β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Assigned β”‚ ───────► β”‚ In Progress β”‚ ──────────► β”‚ Completed β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”‚ block / fail + β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Blocked β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β”‚ retry / escalate + β–Ό + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” or β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ In Progress β”‚ β”‚ Escalated β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +```python +# 
app/workflows/agent_task_workflow.py +from transitions import Machine + +class AgentTaskWorkflow: + states = ['assigned', 'in_progress', 'blocked', 'completed', 'failed', 'escalated'] + + transitions = [ + {'trigger': 'start', 'source': 'assigned', 'dest': 'in_progress'}, + {'trigger': 'complete', 'source': 'in_progress', 'dest': 'completed'}, + {'trigger': 'block', 'source': 'in_progress', 'dest': 'blocked'}, + {'trigger': 'fail', 'source': 'in_progress', 'dest': 'failed'}, + {'trigger': 'retry', 'source': ['blocked', 'failed'], 'dest': 'in_progress', 'conditions': '_can_retry'}, + {'trigger': 'escalate', 'source': ['blocked', 'failed'], 'dest': 'escalated'}, + ] + + def __init__(self, task_id: str, agent_id: str, initial_state: str = 'assigned'): + self.task_id = task_id + self.agent_id = agent_id + self.retry_count = 0 + self.max_retries = 3 + + self.machine = Machine( + model=self, + states=self.states, + transitions=self.transitions, + initial=initial_state, + ) + + def _can_retry(self): + """Check if retry is allowed.""" + return self.retry_count < self.max_retries +``` + +### 5. Durable Workflow Engine + +```python +# app/services/workflow_engine.py +from typing import Type, Optional +from uuid import UUID +from sqlalchemy.ext.asyncio import AsyncSession +from transitions import Machine + +from app.models.workflow import WorkflowInstance, WorkflowTransition, WorkflowType +from app.workflows.sprint_workflow import SprintWorkflow +from app.workflows.story_workflow import StoryWorkflow +from app.workflows.pr_workflow import PRWorkflow +from app.workflows.agent_task_workflow import AgentTaskWorkflow +from app.services.events import EventBus + +WORKFLOW_CLASSES = { + WorkflowType.SPRINT: SprintWorkflow, + WorkflowType.STORY: StoryWorkflow, + WorkflowType.PULL_REQUEST: PRWorkflow, + WorkflowType.AGENT_TASK: AgentTaskWorkflow, +} + +class WorkflowEngine: + """ + Durable workflow engine that persists state to PostgreSQL. 
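+
+    Each accepted trigger is also appended to the workflow_transitions table
+    (event sourcing), so a workflow's history is auditable and its state can
+    be reconstructed after a crash.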
+ + Usage: + engine = WorkflowEngine(session, event_bus) + + # Create new workflow + workflow = await engine.create( + workflow_type=WorkflowType.STORY, + entity_id="story-123", + project_id=project.id + ) + + # Load existing workflow + workflow = await engine.load(workflow_id) + + # Trigger transition + await engine.trigger(workflow_id, "start_analysis", triggered_by="agent-456") + """ + + def __init__(self, session: AsyncSession, event_bus: Optional[EventBus] = None): + self.session = session + self.event_bus = event_bus + + async def create( + self, + workflow_type: WorkflowType, + entity_id: str, + project_id: UUID, + initial_context: dict = None + ) -> WorkflowInstance: + """Create a new workflow instance.""" + + workflow_class = WORKFLOW_CLASSES[workflow_type] + initial_state = workflow_class.states[0] + if isinstance(initial_state, dict): + initial_state = initial_state['name'] + + instance = WorkflowInstance( + workflow_type=workflow_type, + current_state=initial_state, + entity_id=entity_id, + project_id=project_id, + context=initial_context or {} + ) + + self.session.add(instance) + await self.session.commit() + await self.session.refresh(instance) + + # Publish creation event + if self.event_bus: + await self.event_bus.publish(f"project:{project_id}", { + "type": "workflow_created", + "workflow_id": str(instance.id), + "workflow_type": workflow_type.value, + "entity_id": entity_id, + "state": initial_state + }) + + return instance + + async def load(self, workflow_id: UUID) -> Optional[WorkflowInstance]: + """Load a workflow instance from the database.""" + return await self.session.get(WorkflowInstance, workflow_id) + + async def get_machine(self, instance: WorkflowInstance) -> Machine: + """Reconstruct the state machine from persisted instance.""" + workflow_class = WORKFLOW_CLASSES[instance.workflow_type] + workflow = workflow_class( + entity_id=instance.entity_id, + project_id=str(instance.project_id), + initial_state=instance.current_state + ) + return workflow + + async def trigger( + self, + workflow_id: UUID, + trigger: str, + triggered_by: str = "system", + metadata: dict = None + ) -> bool: + """ + Trigger a state transition on a workflow. + + Returns True if transition succeeded, False if not valid. 
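+
+        Unknown triggers, and triggers that are not allowed from the current
+        state, return False without modifying the workflow.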
+        """
+        instance = await self.load(workflow_id)
+        if not instance:
+            raise ValueError(f"Workflow {workflow_id} not found")
+
+        workflow = await self.get_machine(instance)
+        from_state = instance.current_state
+
+        # Check if transition is valid
+        trigger_method = getattr(workflow, trigger, None)
+        if not trigger_method or not callable(trigger_method):
+            return False
+
+        # Check if transition is allowed from the current state
+        if trigger not in workflow.machine.get_triggers(from_state):
+            return False
+
+        # Execute transition
+        try:
+            trigger_method()
+            to_state = workflow.state
+        except Exception as e:
+            # Transition failed
+            instance.error = str(e)
+            await self.session.commit()
+            return False
+
+        # Persist new state
+        instance.current_state = to_state
+        instance.error = None
+
+        # Record transition (event sourcing)
+        transition = WorkflowTransition(
+            workflow_id=instance.id,
+            from_state=from_state,
+            to_state=to_state,
+            trigger=trigger,
+            triggered_by=triggered_by,
+            transition_metadata=metadata or {}
+        )
+        self.session.add(transition)
+        await self.session.commit()
+
+        # Publish state change event
+        if self.event_bus:
+            await self.event_bus.publish(f"project:{instance.project_id}", {
+                "type": "workflow_state_changed",
+                "workflow_id": str(instance.id),
+                "workflow_type": instance.workflow_type.value,
+                "entity_id": instance.entity_id,
+                "from_state": from_state,
+                "to_state": to_state,
+                "trigger": trigger,
+                "triggered_by": triggered_by
+            })
+
+        return True
+
+    async def get_history(self, workflow_id: UUID) -> list[WorkflowTransition]:
+        """Get full transition history for a workflow."""
+        instance = await self.load(workflow_id)
+        if not instance:
+            raise ValueError(f"Workflow {workflow_id} not found")
+
+        await self.session.refresh(instance, ["transitions"])
+        return instance.transitions
+
+    async def get_active_by_type(
+        self,
+        project_id: UUID,
+        workflow_type: WorkflowType
+    ) -> list[WorkflowInstance]:
+        """Get all active workflows of a type for a project."""
+        from sqlalchemy import select
+
+        workflow_class = WORKFLOW_CLASSES[workflow_type]
+        terminal_states = ['completed', 'cancelled', 'merged', 'closed', 'done']
+
+        result = await self.session.execute(
+            select(WorkflowInstance)
+            .where(WorkflowInstance.project_id == project_id)
+            .where(WorkflowInstance.workflow_type == workflow_type)
+            .where(~WorkflowInstance.current_state.in_(terminal_states))
+        )
+        return result.scalars().all()
+```
+
+### 6. 
Retry and Compensation Patterns + +#### Retry Configuration + +```python +# app/workflows/retry_config.py +from dataclasses import dataclass +from typing import Callable, Optional + +@dataclass +class RetryPolicy: + """Configuration for retry behavior.""" + max_retries: int = 3 + initial_delay: float = 1.0 # seconds + max_delay: float = 300.0 # 5 minutes + exponential_base: float = 2.0 + jitter: bool = True + retryable_errors: tuple = (ConnectionError, TimeoutError) + +class RetryableWorkflow: + """Mixin for workflows with retry support.""" + + retry_policy: RetryPolicy = RetryPolicy() + + def calculate_retry_delay(self, attempt: int) -> float: + """Calculate delay for next retry attempt.""" + delay = self.retry_policy.initial_delay * (self.retry_policy.exponential_base ** attempt) + delay = min(delay, self.retry_policy.max_delay) + + if self.retry_policy.jitter: + import random + delay = delay * (0.5 + random.random()) + + return delay + + def should_retry(self, error: Exception, attempt: int) -> bool: + """Determine if error should trigger retry.""" + if attempt >= self.retry_policy.max_retries: + return False + return isinstance(error, self.retry_policy.retryable_errors) +``` + +#### Saga Pattern for Compensation + +```python +# app/workflows/saga.py +from dataclasses import dataclass +from typing import Callable, Any, Optional +from abc import ABC, abstractmethod + +@dataclass +class SagaStep: + """A single step in a saga with its compensation.""" + name: str + action: Callable[..., Any] + compensation: Callable[..., Any] + +class Saga: + """ + Implements the Saga pattern for distributed transactions. + + If a step fails, compensating actions are executed in reverse order. + """ + + def __init__(self, steps: list[SagaStep]): + self.steps = steps + self.completed_steps: list[SagaStep] = [] + self.context: dict = {} + + async def execute(self, initial_context: dict = None) -> dict: + """Execute the saga, with automatic compensation on failure.""" + self.context = initial_context or {} + + for step in self.steps: + try: + result = await step.action(self.context) + self.context.update(result or {}) + self.completed_steps.append(step) + except Exception as e: + # Compensate in reverse order + await self._compensate() + raise SagaFailure( + failed_step=step.name, + original_error=e, + compensation_results=self.context.get('_compensation_results', []) + ) + + return self.context + + async def _compensate(self): + """Execute compensation for all completed steps.""" + compensation_results = [] + + for step in reversed(self.completed_steps): + try: + await step.compensation(self.context) + compensation_results.append({ + 'step': step.name, + 'status': 'compensated' + }) + except Exception as e: + compensation_results.append({ + 'step': step.name, + 'status': 'failed', + 'error': str(e) + }) + + self.context['_compensation_results'] = compensation_results + +class SagaFailure(Exception): + """Raised when a saga fails and compensation is executed.""" + + def __init__(self, failed_step: str, original_error: Exception, compensation_results: list): + self.failed_step = failed_step + self.original_error = original_error + self.compensation_results = compensation_results + super().__init__(f"Saga failed at step '{failed_step}': {original_error}") + +# Example: Story implementation saga +async def create_story_implementation_saga(story_id: str, project_id: str) -> Saga: + """Create saga for implementing a story with compensation.""" + + steps = [ + SagaStep( + name="create_branch", + action=lambda ctx: 
create_feature_branch(ctx['story_id']), + compensation=lambda ctx: delete_branch(ctx.get('branch_name')) + ), + SagaStep( + name="implement_code", + action=lambda ctx: generate_code(ctx['story_id'], ctx['branch_name']), + compensation=lambda ctx: revert_commits(ctx.get('commit_shas', [])) + ), + SagaStep( + name="run_tests", + action=lambda ctx: run_test_suite(ctx['branch_name']), + compensation=lambda ctx: None # Tests don't need compensation + ), + SagaStep( + name="create_pr", + action=lambda ctx: create_pull_request(ctx['branch_name'], ctx['story_id']), + compensation=lambda ctx: close_pull_request(ctx.get('pr_id')) + ), + ] + + return Saga(steps) +``` + +### 7. Celery Integration + +```python +# app/tasks/workflow_tasks.py +from celery import Task +from app.core.celery import celery_app +from app.core.database import async_session_maker +from app.services.workflow_engine import WorkflowEngine +from app.services.events import EventBus +from app.models.workflow import WorkflowType + +class WorkflowTask(Task): + """Base task for workflow operations with database session.""" + + _session = None + _event_bus = None + + @property + def session(self): + if self._session is None: + self._session = async_session_maker() + return self._session + + @property + def event_bus(self): + if self._event_bus is None: + self._event_bus = EventBus() + return self._event_bus + + def after_return(self, status, retval, task_id, args, kwargs, einfo): + """Cleanup after task completion.""" + if self._session: + self._session.close() + +@celery_app.task(bind=True, base=WorkflowTask) +async def trigger_workflow_transition( + self, + workflow_id: str, + trigger: str, + triggered_by: str = "system", + metadata: dict = None +): + """ + Trigger a workflow transition as a background task. + + Used when transitions need to happen asynchronously. + """ + from uuid import UUID + + async with self.session as session: + engine = WorkflowEngine(session, self.event_bus) + success = await engine.trigger( + workflow_id=UUID(workflow_id), + trigger=trigger, + triggered_by=triggered_by, + metadata=metadata + ) + + if not success: + raise ValueError(f"Transition '{trigger}' failed for workflow {workflow_id}") + + return {"workflow_id": workflow_id, "new_trigger": trigger, "success": True} + +@celery_app.task(bind=True, base=WorkflowTask) +async def process_story_workflow_step( + self, + workflow_id: str, + step: str, + context: dict +): + """ + Process a single step in a story workflow. + + This task runs the actual work for each state. + """ + from uuid import UUID + + async with self.session as session: + engine = WorkflowEngine(session, self.event_bus) + instance = await engine.load(UUID(workflow_id)) + + if not instance: + raise ValueError(f"Workflow {workflow_id} not found") + + # Execute step-specific logic + if step == "analysis": + await run_analysis(instance.entity_id, context) + await engine.trigger(UUID(workflow_id), "analysis_complete", "agent:ba") + + elif step == "design": + await run_design(instance.entity_id, context) + await engine.trigger(UUID(workflow_id), "design_complete", "agent:architect") + + elif step == "implementation": + await run_implementation(instance.entity_id, context) + await engine.trigger(UUID(workflow_id), "submit_for_review", "agent:engineer") + + return {"workflow_id": workflow_id, "step_completed": step} + +@celery_app.task +def check_stalled_workflows(): + """ + Periodic task to check for stalled workflows. + + Runs via Celery Beat to identify workflows stuck in non-terminal states. 
+ """ + from datetime import datetime, timedelta + from sqlalchemy import select + from app.models.workflow import WorkflowInstance + + # Consider workflows stalled if no transition in 1 hour + stale_threshold = datetime.utcnow() - timedelta(hours=1) + + # Query for potentially stalled workflows + # (Implementation would check updated_at and escalate) + pass +``` + +### 8. Visualization Approach + +#### Graphviz Diagram Generation + +```python +# app/services/workflow_visualizer.py +from transitions.extensions import GraphMachine +from typing import Optional +import io + +class WorkflowVisualizer: + """Generate visual diagrams of workflow state machines.""" + + @staticmethod + def generate_diagram( + workflow_class, + current_state: Optional[str] = None, + format: str = 'svg' + ) -> bytes: + """ + Generate a diagram for a workflow. + + Args: + workflow_class: The workflow class to visualize + current_state: Highlight current state (optional) + format: Output format ('svg', 'png', 'pdf') + + Returns: + Diagram as bytes + """ + class DiagramModel: + pass + + model = DiagramModel() + machine = GraphMachine( + model=model, + states=workflow_class.states, + transitions=workflow_class.transitions, + initial=workflow_class.states[0] if isinstance(workflow_class.states[0], str) else workflow_class.states[0]['name'], + show_conditions=True, + show_state_attributes=True, + title=workflow_class.__name__ + ) + + # Highlight current state + if current_state: + machine.model_graphs[id(model)].custom_styles['node'][current_state] = { + 'fillcolor': '#90EE90', + 'style': 'filled' + } + + # Generate graph + graph = machine.get_graph() + + # Render to bytes + output = io.BytesIO() + graph.draw(output, format=format, prog='dot') + return output.getvalue() + + @staticmethod + def get_mermaid_definition(workflow_class) -> str: + """ + Generate Mermaid.js state diagram definition. + + Useful for embedding in markdown documentation or web UIs. 
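+
+        For the PR workflow, for example, the generated definition starts with
+        "stateDiagram-v2", "[*] --> created" and
+        "created --> review: submit_for_review".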
+ """ + lines = ["stateDiagram-v2"] + + # Get initial state + initial = workflow_class.states[0] + if isinstance(initial, dict): + initial = initial['name'] + lines.append(f" [*] --> {initial}") + + # Add transitions + for t in workflow_class.transitions: + sources = t['source'] if isinstance(t['source'], list) else [t['source']] + for source in sources: + if source == '*': + continue + lines.append(f" {source} --> {t['dest']}: {t['trigger']}") + + # Mark terminal states + terminal_states = ['completed', 'cancelled', 'done', 'merged', 'closed'] + for state in workflow_class.states: + state_name = state if isinstance(state, str) else state['name'] + if state_name in terminal_states: + lines.append(f" {state_name} --> [*]") + + return "\n".join(lines) + +# API endpoint for diagram generation +# app/api/v1/workflows.py +from fastapi import APIRouter +from fastapi.responses import Response + +router = APIRouter() + +@router.get("/workflows/{workflow_type}/diagram") +async def get_workflow_diagram( + workflow_type: WorkflowType, + format: str = "svg", + current_state: Optional[str] = None +): + """Get visual diagram of a workflow state machine.""" + + workflow_class = WORKFLOW_CLASSES[workflow_type] + diagram_bytes = WorkflowVisualizer.generate_diagram( + workflow_class, + current_state=current_state, + format=format + ) + + media_types = { + 'svg': 'image/svg+xml', + 'png': 'image/png', + 'pdf': 'application/pdf' + } + + return Response( + content=diagram_bytes, + media_type=media_types.get(format, 'application/octet-stream') + ) + +@router.get("/workflows/{workflow_type}/mermaid") +async def get_workflow_mermaid(workflow_type: WorkflowType): + """Get Mermaid.js definition for a workflow.""" + + workflow_class = WORKFLOW_CLASSES[workflow_type] + mermaid_def = WorkflowVisualizer.get_mermaid_definition(workflow_class) + + return {"mermaid": mermaid_def} +``` + +#### Frontend Visualization Component + +```typescript +// frontend/components/WorkflowDiagram.tsx +import React from 'react'; +import mermaid from 'mermaid'; + +interface WorkflowDiagramProps { + workflowType: 'sprint' | 'story' | 'pull_request' | 'agent_task'; + currentState?: string; +} + +export function WorkflowDiagram({ workflowType, currentState }: WorkflowDiagramProps) { + const [diagram, setDiagram] = React.useState(''); + + React.useEffect(() => { + async function fetchDiagram() { + const response = await fetch(`/api/v1/workflows/${workflowType}/mermaid`); + const data = await response.json(); + + // Highlight current state + let mermaidDef = data.mermaid; + if (currentState) { + mermaidDef += `\n style ${currentState} fill:#90EE90`; + } + + const { svg } = await mermaid.render('workflow-diagram', mermaidDef); + setDiagram(svg); + } + + fetchDiagram(); + }, [workflowType, currentState]); + + return ( +
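+    {/* Render the SVG produced by mermaid.render(); assumes the generated markup is trusted */}
+    <div dangerouslySetInnerHTML={{ __html: diagram }} />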
+ ); +} +``` + +### 9. Long-Running Workflow Patterns + +#### Handling Hours/Days Duration + +```python +# app/services/long_running_workflow.py +from datetime import datetime, timedelta +from typing import Optional +import asyncio + +class LongRunningWorkflowManager: + """ + Manager for workflows that span hours or days. + + Key patterns: + 1. Checkpoint persistence - Save progress frequently + 2. Heartbeat monitoring - Detect stalled workflows + 3. Resume capability - Continue from last checkpoint + 4. Timeout handling - Auto-escalate on SLA breach + """ + + def __init__(self, workflow_engine: WorkflowEngine): + self.engine = workflow_engine + self.sla_configs = { + WorkflowType.SPRINT: timedelta(weeks=2), # 2-week sprint + WorkflowType.STORY: timedelta(days=5), # 5 days for a story + WorkflowType.PULL_REQUEST: timedelta(hours=24), # 24h for PR review + WorkflowType.AGENT_TASK: timedelta(hours=1), # 1h for agent task + } + + async def check_sla_breaches(self): + """Check for workflows that have breached their SLA.""" + from sqlalchemy import select + from app.models.workflow import WorkflowInstance + + breached = [] + + for workflow_type, sla in self.sla_configs.items(): + threshold = datetime.utcnow() - sla + + # Find active workflows created before threshold + # with no recent transitions + result = await self.engine.session.execute( + select(WorkflowInstance) + .where(WorkflowInstance.workflow_type == workflow_type) + .where(WorkflowInstance.created_at < threshold) + .where(~WorkflowInstance.current_state.in_([ + 'completed', 'cancelled', 'done', 'merged', 'closed' + ])) + ) + + for instance in result.scalars(): + breached.append({ + 'workflow_id': instance.id, + 'type': workflow_type.value, + 'entity_id': instance.entity_id, + 'current_state': instance.current_state, + 'age': datetime.utcnow() - instance.created_at, + 'sla': sla + }) + + return breached + + async def create_checkpoint( + self, + workflow_id: UUID, + checkpoint_data: dict + ): + """ + Save a checkpoint for long-running workflow. + + Allows resumption from this point if workflow is interrupted. + """ + instance = await self.engine.load(workflow_id) + if instance: + instance.context = { + **instance.context, + '_checkpoint': { + 'data': checkpoint_data, + 'timestamp': datetime.utcnow().isoformat() + } + } + await self.engine.session.commit() + + async def resume_from_checkpoint(self, workflow_id: UUID) -> Optional[dict]: + """ + Resume a workflow from its last checkpoint. + + Returns checkpoint data if available. + """ + instance = await self.engine.load(workflow_id) + if instance and instance.context.get('_checkpoint'): + return instance.context['_checkpoint']['data'] + return None +``` + +#### Sprint Workflow with Checkpoints + +```python +# app/workflows/sprint_workflow_runner.py +from celery import chain, group +from app.tasks.workflow_tasks import trigger_workflow_transition + +class SprintWorkflowRunner: + """ + Orchestrates a full sprint lifecycle. + + A sprint runs for ~2 weeks, with daily standups and continuous work. + This runner manages the long-duration process with checkpoints. + """ + + def __init__(self, sprint_id: str, project_id: str, workflow_id: str): + self.sprint_id = sprint_id + self.project_id = project_id + self.workflow_id = workflow_id + + async def start_sprint(self, stories: list[str]): + """ + Start the sprint with initial stories. + + Creates story workflows for each story and begins development. 
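+
+        Assumes a `create_story_workflow` Celery task (not defined in this
+        spike) that creates the per-story workflow instances.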
+ """ + # Transition sprint to development + await trigger_workflow_transition.delay( + self.workflow_id, + trigger="start", + triggered_by="system", + metadata={"stories": stories} + ) + + # Create story workflows for each story + story_tasks = [] + for story_id in stories: + story_tasks.append( + create_story_workflow.s(story_id, self.project_id) + ) + + # Execute story creations in parallel + group(story_tasks).apply_async() + + async def run_daily_standup(self): + """ + Daily standup checkpoint. + + Collects status from all active story workflows. + """ + # Get all active story workflows + active_stories = await self.get_active_stories() + + report = { + 'date': datetime.utcnow().isoformat(), + 'sprint_id': self.sprint_id, + 'stories': [] + } + + for story in active_stories: + report['stories'].append({ + 'story_id': story.entity_id, + 'state': story.current_state, + 'blocked': story.current_state == 'blocked' + }) + + # Save checkpoint + await self.save_checkpoint(report) + + return report + + async def complete_sprint(self): + """ + Complete the sprint and generate retrospective data. + """ + # Collect all transitions for analysis + history = await self.engine.get_history(UUID(self.workflow_id)) + + # Calculate metrics + metrics = { + 'total_transitions': len(history), + 'duration': (datetime.utcnow() - history[0].created_at).days, + 'blocks_encountered': sum(1 for t in history if t.to_state == 'blocked'), + } + + await trigger_workflow_transition.delay( + self.workflow_id, + trigger="complete", + triggered_by="system", + metadata={"metrics": metrics} + ) +``` + +--- + +## Dependencies + +Add to `pyproject.toml`: + +```toml +[project.dependencies] +transitions = "^0.9.0" +graphviz = "^0.20.1" # For diagram generation + +[project.optional-dependencies] +diagrams = [ + "pygraphviz>=1.11", # Alternative Graphviz binding +] +``` + +--- + +## Implementation Roadmap + +### Phase 1: Foundation (Week 1) +1. Add `transitions` library dependency +2. Create workflow database models and migrations +3. Implement basic `WorkflowEngine` class +4. Write unit tests for state machines + +### Phase 2: Core Workflows (Week 2) +1. Implement `StoryWorkflow` with all transitions +2. Implement `SprintWorkflow` with checkpoints +3. Implement `PRWorkflow` for code review +4. Integrate with Celery tasks + +### Phase 3: Durability (Week 3) +1. Add retry and compensation patterns +2. Implement SLA monitoring +3. Add checkpoint/resume capability +4. Integrate with EventBus for real-time updates + +### Phase 4: Visualization (Week 4) +1. Add diagram generation endpoints +2. Create frontend visualization component +3. Add workflow monitoring dashboard +4. 
Documentation and examples + +--- + +## Testing Strategy + +```python +# tests/workflows/test_story_workflow.py +import pytest +from app.workflows.story_workflow import StoryWorkflow + +class TestStoryWorkflow: + def test_happy_path(self): + """Test normal story progression.""" + workflow = StoryWorkflow("story-1", "project-1") + + assert workflow.state == "backlog" + + workflow.start_analysis() + assert workflow.state == "analysis" + + workflow.analysis_complete() + assert workflow.state == "design" + + workflow.design_complete() + assert workflow.state == "implementation" + + workflow.submit_for_review() + assert workflow.state == "review" + + workflow.approve() + assert workflow.state == "testing" + + workflow.tests_pass() + assert workflow.state == "done" + + def test_review_rejection(self): + """Test review rejection loop.""" + workflow = StoryWorkflow("story-1", "project-1", initial_state="review") + + workflow.request_changes() + assert workflow.state == "implementation" + + workflow.submit_for_review() + assert workflow.state == "review" + + def test_invalid_transition(self): + """Test that invalid transitions are rejected.""" + workflow = StoryWorkflow("story-1", "project-1") + + # Can't go from backlog to review + with pytest.raises(Exception): + workflow.approve() + + assert workflow.state == "backlog" # State unchanged + + def test_blocking(self): + """Test blocking from any active state.""" + for initial_state in ['analysis', 'design', 'implementation']: + workflow = StoryWorkflow("story-1", "project-1", initial_state=initial_state) + workflow.block() + assert workflow.state == "blocked" +``` + +--- + +## Risks and Mitigations + +| Risk | Impact | Likelihood | Mitigation | +|------|--------|------------|------------| +| State corruption on crash | High | Low | Event sourcing allows state reconstruction | +| Long-running task timeout | Medium | Medium | Celery soft limits + checkpointing | +| Race conditions on concurrent transitions | High | Medium | PostgreSQL row-level locking | +| Complex workflow debugging | Medium | High | Comprehensive logging + visualization | + +--- + +## Decision + +**Adopt `transitions` library + PostgreSQL persistence** for Syndarix workflow state machines. + +**Rationale:** +1. **Simplicity** - No additional infrastructure (vs Temporal) +2. **Flexibility** - Full control over persistence and task execution +3. **Integration** - Natural fit with existing Celery + Redis stack (SPIKE-004) +4. **Durability** - Event sourcing provides audit trail and recovery +5. **Visualization** - Built-in Graphviz support + Mermaid for web + +**Trade-offs Accepted:** +- More custom code vs using Temporal's built-in features +- Manual handling of distributed coordination +- Custom SLA monitoring implementation + +--- + +## References + +- [transitions Documentation](https://github.com/pytransitions/transitions) +- [Temporal Python SDK](https://github.com/temporalio/sdk-python) +- [Managing Long-Running Workflows with Temporal](https://temporal.io/blog/very-long-running-workflows) +- [Saga Pattern](https://microservices.io/patterns/data/saga.html) +- [Event Sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) +- [Celery Dyrygent](https://github.com/ovh/celery-dyrygent) - Complex workflow orchestration +- [Selinon](https://github.com/selinon/selinon) - Advanced flow management on Celery +- [Toptal Celery Orchestration Guide](https://www.toptal.com/python/orchestrating-celery-python-background-jobs) + +--- + +*Spike completed. 
Findings will inform ADR-008: Workflow State Machine Architecture.* diff --git a/docs/spikes/SPIKE-009-issue-synchronization.md b/docs/spikes/SPIKE-009-issue-synchronization.md new file mode 100644 index 0000000..d3df824 --- /dev/null +++ b/docs/spikes/SPIKE-009-issue-synchronization.md @@ -0,0 +1,1494 @@ +# SPIKE-009: Issue Synchronization with External Trackers + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #9 + +--- + +## Executive Summary + +This spike researches bi-directional issue synchronization between Syndarix and external issue trackers (Gitea, GitHub, GitLab). After analyzing sync patterns, conflict resolution strategies, and API capabilities of each platform, we recommend: + +**Primary Recommendation:** Implement a **webhook-first, polling-fallback** architecture with **Last-Writer-Wins (LWW)** conflict resolution using vector clocks for causality tracking. External trackers serve as the **source of truth** with Syndarix maintaining local mirrors for unified agent access. + +**Key Decisions:** +1. Use webhooks for real-time sync; polling for reconciliation and initial import +2. Implement version vectors for conflict detection with LWW resolution +3. Store sync metadata in dedicated `issue_sync_log` table for audit and recovery +4. Abstract provider differences behind a unified `IssueProvider` interface +5. Use Redis for webhook event queuing and deduplication + +--- + +## Table of Contents + +1. [Research Questions & Answers](#research-questions--answers) +2. [Sync Architecture](#sync-architecture) +3. [Conflict Resolution Strategy](#conflict-resolution-strategy) +4. [Webhook Handling Design](#webhook-handling-design) +5. [Provider API Comparison](#provider-api-comparison) +6. [Database Schema](#database-schema) +7. [Field Mapping Specification](#field-mapping-specification) +8. [Code Examples](#code-examples) +9. [Error Handling & Recovery](#error-handling--recovery) +10. [Implementation Roadmap](#implementation-roadmap) +11. [References](#references) + +--- + +## Research Questions & Answers + +### 1. Best patterns for bi-directional data sync? + +**Answer:** The **Hub-and-Spoke** pattern with Syndarix as the hub is recommended. This pattern: +- Designates external trackers as authoritative sources +- Maintains local mirrors for unified access +- Synchronizes changes bidirectionally through a central sync engine +- Uses explicit ownership rules per field to prevent conflicts + +Modern implementations leverage **Conflict-Free Replicated Data Types (CRDTs)** for automatic conflict resolution, but for issue tracking where human review may be desired, **version vectors with LWW** provides better control. + +### 2. Handling sync conflicts (edited in both places)? + +**Answer:** Implement a **tiered conflict resolution strategy**: + +| Scenario | Resolution | +|----------|------------| +| Same field, different times | Last-Writer-Wins (LWW) | +| Same field, concurrent edits | Mark as conflict, notify user | +| Different fields | Merge both changes | +| Delete vs Update | Delete wins (configurable) | + +Use **version vectors** to detect concurrent modifications. Each system maintains a version counter, and conflicts are identified when neither version dominates the other. + +### 3. Webhook vs polling strategies? + +**Answer:** **Hybrid approach** - webhooks primary, polling secondary. 
+ +| Strategy | Use Case | Latency | Resource Cost | +|----------|----------|---------|---------------| +| Webhooks | Real-time sync | <1s | Low | +| Polling | Initial import, reconciliation, fallback | Minutes | Medium | +| On-demand | User-triggered refresh | Immediate | Minimal | + +**Rationale:** Webhooks provide real-time updates but may miss events during outages. Periodic polling (every 15-30 minutes) ensures eventual consistency. + +### 4. Rate limiting and API quota management? + +**Answer:** Implement a **token bucket with adaptive throttling**: + +| Provider | Auth Rate Limit | Unauthenticated | +|----------|-----------------|-----------------| +| GitHub | 5,000/hour | 60/hour | +| GitLab | 600/minute | 10/minute | +| Gitea | Configurable (default: 50 items/response) | N/A | + +**Strategies:** +- Use conditional requests (`If-None-Match`, `If-Modified-Since`) to avoid counting unchanged responses +- Implement exponential backoff on 429/403 responses +- Cache responses with ETags +- Batch operations where possible +- Monitor `X-RateLimit-Remaining` headers + +### 5. Eventual consistency vs strong consistency tradeoffs? + +**Answer:** **Eventual consistency** is acceptable and recommended for issue sync. + +| Consistency | Pros | Cons | +|-------------|------|------| +| **Strong** | Always accurate | Higher latency, complex implementation | +| **Eventual** | Better performance, simpler | Temporary inconsistency | + +**Rationale:** Issue tracking tolerates brief inconsistency windows (seconds to minutes). Users can manually refresh if needed. The simplicity and performance gains outweigh the drawbacks. + +### 6. How to map different field schemas? + +**Answer:** Use a **canonical field model** with provider-specific adapters. + +``` +External Field β†’ Provider Adapter β†’ Canonical Model β†’ Local Storage +Local Field β†’ Canonical Model β†’ Provider Adapter β†’ External Field +``` + +See [Field Mapping Specification](#field-mapping-specification) for detailed mappings. + +### 7. Handling offline/disconnected scenarios? + +**Answer:** Implement an **outbox pattern** with retry queue: + +1. Queue all outgoing changes in local `sync_outbox` table +2. Background worker processes queue with exponential backoff +3. Mark items as `pending`, `in_progress`, `completed`, or `failed` +4. Dead letter queue for items exceeding max retries +5. 
Manual reconciliation UI for failed items + +--- + +## Sync Architecture + +### High-Level Architecture Diagram + +``` + External Issue Trackers + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Gitea β”‚ GitHub β”‚ GitLab β”‚ + β”‚ (Primary) β”‚ β”‚ β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ + β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β” + β”‚ Webhooks β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Syndarix Backend β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Webhook Handler │◀──────────│ Redis Queue │──────────▢│ Polling Worker β”‚ β”‚ +β”‚ β”‚ (FastAPI Route) β”‚ β”‚ (Event Dedup/Buffer) β”‚ β”‚ (Celery Beat) β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Sync Engine β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Provider Factory β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚GiteaProviderβ”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚GitHubProviderβ”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚GitLabProviderβ”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚Conflict Resolver β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Field Mapper β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ PostgreSQL β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ issues β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ issue_sync_log β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ sync_outbox β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ external_links β”‚ β”‚ β”‚ +β”‚ 
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Component Responsibilities + +| Component | Responsibility | +|-----------|----------------| +| **Webhook Handler** | Receive, validate, and queue incoming webhooks | +| **Redis Queue** | Buffer events, deduplicate, handle backpressure | +| **Polling Worker** | Periodic reconciliation, initial import, fallback | +| **Sync Engine** | Orchestrate sync operations, apply business logic | +| **Provider Factory** | Create provider-specific clients | +| **Conflict Resolver** | Detect and resolve sync conflicts | +| **Field Mapper** | Transform between canonical and provider schemas | + +### Data Flow + +**Inbound (External to Syndarix):** +``` +1. Webhook received β†’ Validate signature β†’ Queue in Redis +2. Worker dequeues β†’ Parse payload β†’ Transform to canonical model +3. Check for conflicts β†’ Apply resolution strategy +4. Update local database β†’ Log sync operation +5. Notify connected clients via SSE (per SPIKE-003) +``` + +**Outbound (Syndarix to External):** +``` +1. Local change detected β†’ Transform to provider format +2. Queue in sync_outbox β†’ Worker picks up +3. Call external API β†’ Handle response +4. Update sync metadata β†’ Mark as synced +5. Handle failures β†’ Retry with backoff +``` + +--- + +## Conflict Resolution Strategy + +### Version Vector Implementation + +Each issue maintains a version vector tracking modifications across systems: + +```python +# Version vector structure +{ + "syndarix": 5, # Local modification count + "gitea": 3, # Gitea modification count + "github": 0, # Not synced with GitHub + "gitlab": 2 # GitLab modification count +} +``` + +### Conflict Detection Algorithm + +```python +def detect_conflict(local_version: dict, remote_version: dict) -> str: + """ + Compare version vectors to detect conflicts. 
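+
+    Example (counters for "syndarix" and "gitea" only):
+        local  = {"syndarix": 5, "gitea": 3}
+        remote = {"syndarix": 5, "gitea": 4}   -> "remote_wins"
+        local  = {"syndarix": 6, "gitea": 3}
+        remote = {"syndarix": 5, "gitea": 4}   -> "conflict" (concurrent edits)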
+ + Returns: + - "local_wins": Local version dominates + - "remote_wins": Remote version dominates + - "conflict": Concurrent modification (neither dominates) + - "equal": No changes + """ + local_dominates = all( + local_version.get(k, 0) >= v + for k, v in remote_version.items() + ) + remote_dominates = all( + remote_version.get(k, 0) >= v + for k, v in local_version.items() + ) + + if local_version == remote_version: + return "equal" + elif local_dominates and not remote_dominates: + return "local_wins" + elif remote_dominates and not local_dominates: + return "remote_wins" + else: + return "conflict" +``` + +### Resolution Strategies + +```python +from enum import Enum + +class ConflictStrategy(str, Enum): + REMOTE_WINS = "remote_wins" # External tracker is source of truth + LOCAL_WINS = "local_wins" # Syndarix changes take precedence + LAST_WRITE_WINS = "lww" # Most recent timestamp wins + MANUAL = "manual" # Flag for human review + MERGE = "merge" # Attempt field-level merge + +# Default strategy per field +FIELD_STRATEGIES = { + "title": ConflictStrategy.LAST_WRITE_WINS, + "description": ConflictStrategy.MERGE, + "status": ConflictStrategy.REMOTE_WINS, + "assignees": ConflictStrategy.MERGE, + "labels": ConflictStrategy.MERGE, + "comments": ConflictStrategy.MERGE, # Append both + "priority": ConflictStrategy.REMOTE_WINS, +} +``` + +### Merge Algorithm for Complex Fields + +```python +def merge_labels(local: list[str], remote: list[str], base: list[str]) -> list[str]: + """Three-way merge for labels.""" + local_added = set(local) - set(base) + local_removed = set(base) - set(local) + remote_added = set(remote) - set(base) + remote_removed = set(base) - set(remote) + + result = set(base) + result |= local_added | remote_added + result -= local_removed | remote_removed + + return sorted(result) +``` + +--- + +## Webhook Handling Design + +### Webhook Endpoint Architecture + +```python +# app/api/v1/webhooks/issues.py +from fastapi import APIRouter, Request, HTTPException, BackgroundTasks +from app.services.sync.webhook_handler import WebhookHandler +from app.core.redis import redis_client +import hashlib +import hmac + +router = APIRouter() + +@router.post("/webhooks/{provider}/{project_id}") +async def receive_webhook( + provider: str, + project_id: str, + request: Request, + background_tasks: BackgroundTasks +): + """ + Unified webhook endpoint for all providers. 
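+
+    Events are signature-checked, deduplicated by event id, and buffered in a
+    Redis stream for asynchronous processing by the sync worker.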
+ + Path: /api/v1/webhooks/{provider}/{project_id} + Providers: gitea, github, gitlab + """ + body = await request.body() + headers = dict(request.headers) + + # Validate webhook signature + handler = WebhookHandler.get_handler(provider) + if not handler.verify_signature(body, headers): + raise HTTPException(status_code=401, detail="Invalid signature") + + # Parse event type + event_type = handler.get_event_type(headers) + if event_type not in handler.supported_events: + return {"status": "ignored", "reason": "unsupported_event"} + + # Deduplicate using event ID + event_id = handler.get_event_id(headers, body) + if await is_duplicate(event_id): + return {"status": "duplicate"} + + # Queue for processing + await redis_client.xadd( + f"webhooks:{project_id}", + { + "provider": provider, + "event_type": event_type, + "payload": body, + "received_at": datetime.utcnow().isoformat() + } + ) + + return {"status": "queued", "event_id": event_id} + + +async def is_duplicate(event_id: str, ttl: int = 3600) -> bool: + """Check if event was already processed (Redis-based dedup).""" + key = f"webhook:processed:{event_id}" + result = await redis_client.set(key, "1", ex=ttl, nx=True) + return result is None # None means key existed +``` + +### Provider-Specific Signature Validation + +```python +# app/services/sync/webhook_validators.py + +class GiteaWebhookValidator: + @staticmethod + def verify_signature(body: bytes, headers: dict, secret: str) -> bool: + """Gitea uses X-Gitea-Signature with HMAC-SHA256.""" + signature = headers.get("x-gitea-signature", "") + expected = hmac.new( + secret.encode(), + body, + hashlib.sha256 + ).hexdigest() + return hmac.compare_digest(signature, expected) + + @staticmethod + def get_event_type(headers: dict) -> str: + return headers.get("x-gitea-event", "") + + +class GitHubWebhookValidator: + @staticmethod + def verify_signature(body: bytes, headers: dict, secret: str) -> bool: + """GitHub uses X-Hub-Signature-256 with sha256=HMAC.""" + signature = headers.get("x-hub-signature-256", "") + if not signature.startswith("sha256="): + return False + expected = "sha256=" + hmac.new( + secret.encode(), + body, + hashlib.sha256 + ).hexdigest() + return hmac.compare_digest(signature, expected) + + @staticmethod + def get_event_type(headers: dict) -> str: + return headers.get("x-github-event", "") + + +class GitLabWebhookValidator: + @staticmethod + def verify_signature(body: bytes, headers: dict, secret: str) -> bool: + """GitLab uses X-Gitlab-Token for simple token matching.""" + token = headers.get("x-gitlab-token", "") + return hmac.compare_digest(token, secret) + + @staticmethod + def get_event_type(headers: dict) -> str: + return headers.get("x-gitlab-event", "") +``` + +### Webhook Event Processing (Celery Worker) + +```python +# app/workers/sync_worker.py +from celery import Celery +from app.services.sync.sync_engine import SyncEngine +from app.core.redis import redis_client + +celery_app = Celery("syndarix") + +@celery_app.task(bind=True, max_retries=3) +def process_webhook_event(self, project_id: str): + """Process queued webhook events for a project.""" + try: + # Read from Redis stream + events = redis_client.xread( + {f"webhooks:{project_id}": "0"}, + count=10, + block=5000 + ) + + sync_engine = SyncEngine() + + for stream_name, messages in events: + for message_id, data in messages: + try: + sync_engine.process_inbound_event( + provider=data["provider"], + event_type=data["event_type"], + payload=data["payload"] + ) + # Acknowledge processed + 
redis_client.xdel(stream_name, message_id) + except Exception as e: + log.error(f"Failed to process {message_id}: {e}") + # Will retry on next run + + except Exception as exc: + self.retry(exc=exc, countdown=60 * (2 ** self.request.retries)) +``` + +--- + +## Provider API Comparison + +### Issue Field Support Matrix + +| Field | Gitea | GitHub | GitLab | Syndarix Canonical | +|-------|-------|--------|--------|-------------------| +| ID | `id` (int) | `id` (int) | `id` (int) | `external_id` (str) | +| Number | `number` | `number` | `iid` | `external_number` | +| Title | `title` | `title` | `title` | `title` | +| Body | `body` | `body` | `description` | `description` | +| State | `state` (open/closed) | `state` (open/closed) | `state` (opened/closed) | `status` (enum) | +| Assignees | `assignees[]` | `assignees[]` | `assignees[]` | `assignee_ids[]` | +| Labels | `labels[].name` | `labels[].name` | `labels[]` (strings) | `labels[]` | +| Milestone | `milestone.title` | `milestone.title` | `milestone.title` | `milestone` | +| Created | `created_at` | `created_at` | `created_at` | `created_at` | +| Updated | `updated_at` | `updated_at` | `updated_at` | `updated_at` | +| Due Date | `due_date` | N/A | `due_date` | `due_date` | +| Priority | N/A (via labels) | N/A (via labels) | N/A (via labels) | `priority` | +| URL | `html_url` | `html_url` | `web_url` | `remote_url` | + +### Webhook Event Mapping + +| Action | Gitea Event | GitHub Event | GitLab Event | +|--------|-------------|--------------|--------------| +| Create | `issues:opened` | `issues:opened` | `Issue Hook:open` | +| Update | `issues:edited` | `issues:edited` | `Issue Hook:update` | +| Close | `issues:closed` | `issues:closed` | `Issue Hook:close` | +| Reopen | `issues:reopened` | `issues:reopened` | `Issue Hook:reopen` | +| Assign | `issues:assigned` | `issues:assigned` | `Issue Hook:update` | +| Label | `issues:label_updated` | `issues:labeled` | `Issue Hook:update` | +| Comment | `issue_comment:created` | `issue_comment:created` | `Note Hook` | + +### Rate Limits Comparison + +| Provider | Authenticated | Pagination | Conditional Requests | +|----------|---------------|------------|---------------------| +| GitHub | 5,000/hour | 100/page max | ETag, If-Modified-Since | +| GitLab | 600/minute | 100/page max | ETag | +| Gitea | Configurable | 50/page default | Link header | + +--- + +## Database Schema + +### Core Tables + +```sql +-- Extended issues table with sync metadata +ALTER TABLE issues ADD COLUMN IF NOT EXISTS external_id VARCHAR(255); +ALTER TABLE issues ADD COLUMN IF NOT EXISTS external_number INTEGER; +ALTER TABLE issues ADD COLUMN IF NOT EXISTS remote_url TEXT; +ALTER TABLE issues ADD COLUMN IF NOT EXISTS provider VARCHAR(50); +ALTER TABLE issues ADD COLUMN IF NOT EXISTS provider_repo_id VARCHAR(255); +ALTER TABLE issues ADD COLUMN IF NOT EXISTS sync_status VARCHAR(20) DEFAULT 'pending'; +ALTER TABLE issues ADD COLUMN IF NOT EXISTS version_vector JSONB DEFAULT '{}'; +ALTER TABLE issues ADD COLUMN IF NOT EXISTS last_synced_at TIMESTAMP WITH TIME ZONE; +ALTER TABLE issues ADD COLUMN IF NOT EXISTS external_updated_at TIMESTAMP WITH TIME ZONE; + +-- Sync status enum values: synced, pending, conflict, error + +CREATE INDEX idx_issues_external_id ON issues(provider, external_id); +CREATE INDEX idx_issues_sync_status ON issues(sync_status); +``` + +```sql +-- Issue sync log for audit trail +CREATE TABLE issue_sync_log ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + issue_id UUID REFERENCES issues(id) ON DELETE CASCADE, + 
project_id UUID REFERENCES projects(id) ON DELETE CASCADE, + + -- Sync details + direction VARCHAR(10) NOT NULL, -- 'inbound' or 'outbound' + provider VARCHAR(50) NOT NULL, + event_type VARCHAR(100) NOT NULL, + + -- Change tracking + previous_state JSONB, + new_state JSONB, + diff JSONB, + + -- Conflict info + had_conflict BOOLEAN DEFAULT FALSE, + conflict_resolution VARCHAR(50), + conflict_details JSONB, + + -- Status + status VARCHAR(20) NOT NULL, -- success, failed, skipped + error_message TEXT, + + -- Timestamps + created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), + processed_at TIMESTAMP WITH TIME ZONE, + + -- Webhook metadata + webhook_event_id VARCHAR(255), + webhook_delivery_id VARCHAR(255) +); + +CREATE INDEX idx_sync_log_issue ON issue_sync_log(issue_id); +CREATE INDEX idx_sync_log_project ON issue_sync_log(project_id); +CREATE INDEX idx_sync_log_status ON issue_sync_log(status); +CREATE INDEX idx_sync_log_created ON issue_sync_log(created_at); +``` + +```sql +-- Outbox for pending outbound syncs +CREATE TABLE sync_outbox ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + issue_id UUID REFERENCES issues(id) ON DELETE CASCADE, + project_id UUID REFERENCES projects(id) ON DELETE CASCADE, + + -- Sync target + provider VARCHAR(50) NOT NULL, + operation VARCHAR(20) NOT NULL, -- create, update, delete + payload JSONB NOT NULL, + + -- Processing status + status VARCHAR(20) DEFAULT 'pending', -- pending, in_progress, completed, failed, dead_letter + attempts INTEGER DEFAULT 0, + max_attempts INTEGER DEFAULT 5, + last_error TEXT, + + -- Scheduling + created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), + scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), + processed_at TIMESTAMP WITH TIME ZONE, + + -- Idempotency + idempotency_key VARCHAR(255) UNIQUE +); + +CREATE INDEX idx_outbox_status ON sync_outbox(status, scheduled_at); +CREATE INDEX idx_outbox_issue ON sync_outbox(issue_id); +``` + +```sql +-- External provider connections +CREATE TABLE external_connections ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + project_id UUID REFERENCES projects(id) ON DELETE CASCADE, + organization_id UUID REFERENCES organizations(id) ON DELETE CASCADE, + + -- Provider details + provider VARCHAR(50) NOT NULL, + provider_url TEXT NOT NULL, -- Base URL (e.g., https://gitea.example.com) + repo_owner VARCHAR(255) NOT NULL, + repo_name VARCHAR(255) NOT NULL, + + -- Authentication + auth_type VARCHAR(20) NOT NULL, -- token, oauth, app + credentials_encrypted TEXT, -- Encrypted token/credentials + + -- Webhook configuration + webhook_secret_encrypted TEXT, + webhook_id VARCHAR(255), -- ID from provider + webhook_active BOOLEAN DEFAULT TRUE, + + -- Sync settings + sync_enabled BOOLEAN DEFAULT TRUE, + sync_direction VARCHAR(20) DEFAULT 'bidirectional', -- inbound, outbound, bidirectional + sync_labels_filter JSONB, -- Only sync issues with these labels + sync_milestones_filter JSONB, + + -- Status tracking + last_sync_at TIMESTAMP WITH TIME ZONE, + last_error TEXT, + status VARCHAR(20) DEFAULT 'active', -- active, paused, error + + created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), + updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), + + UNIQUE(project_id, provider, repo_owner, repo_name) +); + +CREATE INDEX idx_connections_project ON external_connections(project_id); +CREATE INDEX idx_connections_status ON external_connections(status); +``` + +### SQLAlchemy Models + +```python +# app/models/issue_sync.py +from sqlalchemy import Column, String, Integer, Boolean, ForeignKey, Enum, JSON +from 
sqlalchemy.dialects.postgresql import UUID, JSONB +from sqlalchemy.orm import relationship +from app.models.base import Base, TimestampMixin, UUIDMixin +import enum + +class SyncStatus(str, enum.Enum): + SYNCED = "synced" + PENDING = "pending" + CONFLICT = "conflict" + ERROR = "error" + +class SyncDirection(str, enum.Enum): + INBOUND = "inbound" + OUTBOUND = "outbound" + BIDIRECTIONAL = "bidirectional" + +class Provider(str, enum.Enum): + GITEA = "gitea" + GITHUB = "github" + GITLAB = "gitlab" + + +class ExternalConnection(Base, UUIDMixin, TimestampMixin): + __tablename__ = "external_connections" + + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False) + organization_id = Column(UUID(as_uuid=True), ForeignKey("organizations.id")) + + provider = Column(String(50), nullable=False) + provider_url = Column(String, nullable=False) + repo_owner = Column(String(255), nullable=False) + repo_name = Column(String(255), nullable=False) + + auth_type = Column(String(20), nullable=False) + credentials_encrypted = Column(String) + + webhook_secret_encrypted = Column(String) + webhook_id = Column(String(255)) + webhook_active = Column(Boolean, default=True) + + sync_enabled = Column(Boolean, default=True) + sync_direction = Column(String(20), default="bidirectional") + sync_labels_filter = Column(JSONB) + sync_milestones_filter = Column(JSONB) + + last_sync_at = Column(DateTime(timezone=True)) + last_error = Column(String) + status = Column(String(20), default="active") + + +class IssueSyncLog(Base, UUIDMixin): + __tablename__ = "issue_sync_log" + + issue_id = Column(UUID(as_uuid=True), ForeignKey("issues.id", ondelete="CASCADE")) + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE")) + + direction = Column(String(10), nullable=False) + provider = Column(String(50), nullable=False) + event_type = Column(String(100), nullable=False) + + previous_state = Column(JSONB) + new_state = Column(JSONB) + diff = Column(JSONB) + + had_conflict = Column(Boolean, default=False) + conflict_resolution = Column(String(50)) + conflict_details = Column(JSONB) + + status = Column(String(20), nullable=False) + error_message = Column(String) + + created_at = Column(DateTime(timezone=True), server_default=func.now()) + processed_at = Column(DateTime(timezone=True)) + + webhook_event_id = Column(String(255)) + webhook_delivery_id = Column(String(255)) + + +class SyncOutbox(Base, UUIDMixin): + __tablename__ = "sync_outbox" + + issue_id = Column(UUID(as_uuid=True), ForeignKey("issues.id", ondelete="CASCADE")) + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id", ondelete="CASCADE")) + + provider = Column(String(50), nullable=False) + operation = Column(String(20), nullable=False) + payload = Column(JSONB, nullable=False) + + status = Column(String(20), default="pending") + attempts = Column(Integer, default=0) + max_attempts = Column(Integer, default=5) + last_error = Column(String) + + created_at = Column(DateTime(timezone=True), server_default=func.now()) + scheduled_at = Column(DateTime(timezone=True), server_default=func.now()) + processed_at = Column(DateTime(timezone=True)) + + idempotency_key = Column(String(255), unique=True) +``` + +--- + +## Field Mapping Specification + +### Canonical Issue Model + +```python +# app/schemas/sync/canonical.py +from pydantic import BaseModel, Field +from datetime import datetime +from typing import Optional +from enum import Enum + +class IssueStatus(str, Enum): + OPEN = "open" + CLOSED = "closed" + 
IN_PROGRESS = "in_progress" # Syndarix-only + +class CanonicalIssue(BaseModel): + """Canonical issue representation for sync operations.""" + + # Identity + external_id: str + external_number: int + remote_url: str + + # Core fields + title: str + description: Optional[str] = None + status: IssueStatus + + # Relationships + assignee_ids: list[str] = Field(default_factory=list) + assignee_usernames: list[str] = Field(default_factory=list) + labels: list[str] = Field(default_factory=list) + milestone: Optional[str] = None + + # Metadata + created_at: datetime + updated_at: datetime + closed_at: Optional[datetime] = None + due_date: Optional[datetime] = None + + # Provider info + provider: str + raw_data: dict = Field(default_factory=dict) # Original payload +``` + +### Provider Adapters + +```python +# app/services/sync/adapters/gitea.py +from app.schemas.sync.canonical import CanonicalIssue, IssueStatus + +class GiteaAdapter: + """Transform between Gitea API format and canonical model.""" + + @staticmethod + def to_canonical(gitea_issue: dict, base_url: str) -> CanonicalIssue: + return CanonicalIssue( + external_id=str(gitea_issue["id"]), + external_number=gitea_issue["number"], + remote_url=gitea_issue["html_url"], + title=gitea_issue["title"], + description=gitea_issue.get("body"), + status=IssueStatus.OPEN if gitea_issue["state"] == "open" else IssueStatus.CLOSED, + assignee_ids=[str(a["id"]) for a in gitea_issue.get("assignees", [])], + assignee_usernames=[a["login"] for a in gitea_issue.get("assignees", [])], + labels=[label["name"] for label in gitea_issue.get("labels", [])], + milestone=gitea_issue.get("milestone", {}).get("title"), + created_at=gitea_issue["created_at"], + updated_at=gitea_issue["updated_at"], + closed_at=gitea_issue.get("closed_at"), + due_date=gitea_issue.get("due_date"), + provider="gitea", + raw_data=gitea_issue + ) + + @staticmethod + def from_canonical(issue: CanonicalIssue) -> dict: + """Convert canonical to Gitea API format for updates.""" + return { + "title": issue.title, + "body": issue.description, + "state": "open" if issue.status == IssueStatus.OPEN else "closed", + "assignees": issue.assignee_usernames, + "labels": issue.labels, + "milestone": issue.milestone, + "due_date": issue.due_date.isoformat() if issue.due_date else None, + } + + +# app/services/sync/adapters/github.py +class GitHubAdapter: + """Transform between GitHub API format and canonical model.""" + + @staticmethod + def to_canonical(github_issue: dict, base_url: str) -> CanonicalIssue: + return CanonicalIssue( + external_id=str(github_issue["id"]), + external_number=github_issue["number"], + remote_url=github_issue["html_url"], + title=github_issue["title"], + description=github_issue.get("body"), + status=IssueStatus.OPEN if github_issue["state"] == "open" else IssueStatus.CLOSED, + assignee_ids=[str(a["id"]) for a in github_issue.get("assignees", [])], + assignee_usernames=[a["login"] for a in github_issue.get("assignees", [])], + labels=[label["name"] for label in github_issue.get("labels", [])], + milestone=github_issue.get("milestone", {}).get("title") if github_issue.get("milestone") else None, + created_at=github_issue["created_at"], + updated_at=github_issue["updated_at"], + closed_at=github_issue.get("closed_at"), + due_date=None, # GitHub doesn't have native due dates + provider="github", + raw_data=github_issue + ) + + @staticmethod + def from_canonical(issue: CanonicalIssue) -> dict: + return { + "title": issue.title, + "body": issue.description, + "state": "open" if 
issue.status == IssueStatus.OPEN else "closed", + "assignees": issue.assignee_usernames, + "labels": issue.labels, + "milestone": issue.milestone, + } + + +# app/services/sync/adapters/gitlab.py +class GitLabAdapter: + """Transform between GitLab API format and canonical model.""" + + @staticmethod + def to_canonical(gitlab_issue: dict, base_url: str) -> CanonicalIssue: + # GitLab uses 'opened' instead of 'open' + state = gitlab_issue["state"] + status = IssueStatus.OPEN if state == "opened" else IssueStatus.CLOSED + + return CanonicalIssue( + external_id=str(gitlab_issue["id"]), + external_number=gitlab_issue["iid"], # GitLab uses iid for project-scoped number + remote_url=gitlab_issue["web_url"], + title=gitlab_issue["title"], + description=gitlab_issue.get("description"), + status=status, + assignee_ids=[str(a["id"]) for a in gitlab_issue.get("assignees", [])], + assignee_usernames=[a["username"] for a in gitlab_issue.get("assignees", [])], + labels=gitlab_issue.get("labels", []), # GitLab returns labels as strings + milestone=gitlab_issue.get("milestone", {}).get("title") if gitlab_issue.get("milestone") else None, + created_at=gitlab_issue["created_at"], + updated_at=gitlab_issue["updated_at"], + closed_at=gitlab_issue.get("closed_at"), + due_date=gitlab_issue.get("due_date"), + provider="gitlab", + raw_data=gitlab_issue + ) + + @staticmethod + def from_canonical(issue: CanonicalIssue) -> dict: + return { + "title": issue.title, + "description": issue.description, + "state_event": "reopen" if issue.status == IssueStatus.OPEN else "close", + "assignee_ids": issue.assignee_ids, + "labels": ",".join(issue.labels), + "milestone_id": issue.milestone, # Requires ID lookup + "due_date": issue.due_date.isoformat() if issue.due_date else None, + } +``` + +--- + +## Code Examples + +### Provider Interface + +```python +# app/services/sync/providers/base.py +from abc import ABC, abstractmethod +from typing import AsyncIterator +from app.schemas.sync.canonical import CanonicalIssue + +class IssueProvider(ABC): + """Abstract base class for issue tracker providers.""" + + def __init__(self, connection: ExternalConnection): + self.connection = connection + self.base_url = connection.provider_url + self.repo_owner = connection.repo_owner + self.repo_name = connection.repo_name + + @abstractmethod + async def get_issue(self, issue_number: int) -> CanonicalIssue: + """Fetch a single issue by number.""" + pass + + @abstractmethod + async def list_issues( + self, + state: str = "all", + since: datetime = None, + labels: list[str] = None + ) -> AsyncIterator[CanonicalIssue]: + """List issues with optional filters.""" + pass + + @abstractmethod + async def create_issue(self, issue: CanonicalIssue) -> CanonicalIssue: + """Create a new issue.""" + pass + + @abstractmethod + async def update_issue(self, issue_number: int, issue: CanonicalIssue) -> CanonicalIssue: + """Update an existing issue.""" + pass + + @abstractmethod + async def add_comment(self, issue_number: int, body: str) -> dict: + """Add a comment to an issue.""" + pass + + @abstractmethod + async def setup_webhook(self, callback_url: str, secret: str) -> str: + """Configure webhook for issue events. 
Returns webhook ID.""" + pass + + @abstractmethod + async def verify_webhook_signature(self, payload: bytes, headers: dict) -> bool: + """Verify webhook signature.""" + pass +``` + +### Gitea Provider Implementation + +```python +# app/services/sync/providers/gitea.py +import httpx +from app.services.sync.providers.base import IssueProvider +from app.services.sync.adapters.gitea import GiteaAdapter + +class GiteaProvider(IssueProvider): + """Gitea issue tracker provider.""" + + def __init__(self, connection: ExternalConnection): + super().__init__(connection) + self.adapter = GiteaAdapter() + self._client = None + + @property + def api_url(self) -> str: + return f"{self.base_url}/api/v1" + + async def _get_client(self) -> httpx.AsyncClient: + if self._client is None: + token = decrypt(self.connection.credentials_encrypted) + self._client = httpx.AsyncClient( + base_url=self.api_url, + headers={ + "Authorization": f"token {token}", + "Accept": "application/json", + }, + timeout=30.0 + ) + return self._client + + async def get_issue(self, issue_number: int) -> CanonicalIssue: + client = await self._get_client() + response = await client.get( + f"/repos/{self.repo_owner}/{self.repo_name}/issues/{issue_number}" + ) + response.raise_for_status() + return self.adapter.to_canonical(response.json(), self.base_url) + + async def list_issues( + self, + state: str = "all", + since: datetime = None, + labels: list[str] = None + ) -> AsyncIterator[CanonicalIssue]: + client = await self._get_client() + page = 1 + + while True: + params = { + "state": state, + "page": page, + "limit": 50, # Gitea default max + } + if since: + params["since"] = since.isoformat() + if labels: + params["labels"] = ",".join(labels) + + response = await client.get( + f"/repos/{self.repo_owner}/{self.repo_name}/issues", + params=params + ) + response.raise_for_status() + + issues = response.json() + if not issues: + break + + for issue in issues: + yield self.adapter.to_canonical(issue, self.base_url) + + # Check pagination + link_header = response.headers.get("link", "") + if 'rel="next"' not in link_header: + break + page += 1 + + async def create_issue(self, issue: CanonicalIssue) -> CanonicalIssue: + client = await self._get_client() + payload = self.adapter.from_canonical(issue) + + response = await client.post( + f"/repos/{self.repo_owner}/{self.repo_name}/issues", + json=payload + ) + response.raise_for_status() + return self.adapter.to_canonical(response.json(), self.base_url) + + async def update_issue(self, issue_number: int, issue: CanonicalIssue) -> CanonicalIssue: + client = await self._get_client() + payload = self.adapter.from_canonical(issue) + + response = await client.patch( + f"/repos/{self.repo_owner}/{self.repo_name}/issues/{issue_number}", + json=payload + ) + response.raise_for_status() + return self.adapter.to_canonical(response.json(), self.base_url) + + async def add_comment(self, issue_number: int, body: str) -> dict: + client = await self._get_client() + response = await client.post( + f"/repos/{self.repo_owner}/{self.repo_name}/issues/{issue_number}/comments", + json={"body": body} + ) + response.raise_for_status() + return response.json() + + async def setup_webhook(self, callback_url: str, secret: str) -> str: + client = await self._get_client() + response = await client.post( + f"/repos/{self.repo_owner}/{self.repo_name}/hooks", + json={ + "type": "gitea", + "active": True, + "events": ["issues", "issue_comment"], + "config": { + "url": callback_url, + "content_type": "json", + "secret": secret, 
+ } + } + ) + response.raise_for_status() + return str(response.json()["id"]) + + async def verify_webhook_signature(self, payload: bytes, headers: dict) -> bool: + secret = decrypt(self.connection.webhook_secret_encrypted) + return GiteaWebhookValidator.verify_signature(payload, headers, secret) +``` + +### Sync Engine + +```python +# app/services/sync/sync_engine.py +from app.services.sync.providers.factory import ProviderFactory +from app.services.sync.conflict_resolver import ConflictResolver +from app.models.issue_sync import IssueSyncLog, SyncOutbox + +class SyncEngine: + """Orchestrates synchronization between Syndarix and external trackers.""" + + def __init__(self, db: AsyncSession): + self.db = db + self.conflict_resolver = ConflictResolver() + + async def sync_inbound( + self, + connection: ExternalConnection, + external_issue: CanonicalIssue + ) -> Issue: + """Sync an issue from external tracker to Syndarix.""" + + # Find existing local issue + local_issue = await self._find_by_external_id( + connection.project_id, + external_issue.provider, + external_issue.external_id + ) + + if local_issue is None: + # Create new local issue + local_issue = await self._create_local_issue( + connection.project_id, + external_issue + ) + await self._log_sync( + local_issue, + direction="inbound", + event_type="created", + status="success" + ) + else: + # Check for conflicts + conflict_result = self.conflict_resolver.check( + local_issue.version_vector, + external_issue.raw_data.get("_version_vector", {}), + local_issue.updated_at, + external_issue.updated_at + ) + + if conflict_result.has_conflict: + # Apply resolution strategy + resolved = self.conflict_resolver.resolve( + local_issue, + external_issue, + conflict_result + ) + await self._update_local_issue(local_issue, resolved) + await self._log_sync( + local_issue, + direction="inbound", + event_type="conflict_resolved", + status="success", + conflict_details=conflict_result.to_dict() + ) + else: + # Normal update + await self._update_local_issue(local_issue, external_issue) + await self._log_sync( + local_issue, + direction="inbound", + event_type="updated", + status="success" + ) + + return local_issue + + async def sync_outbound( + self, + connection: ExternalConnection, + local_issue: Issue + ): + """Queue a local issue for sync to external tracker.""" + + provider = ProviderFactory.get_provider(connection) + canonical = self._to_canonical(local_issue) + + outbox_entry = SyncOutbox( + issue_id=local_issue.id, + project_id=connection.project_id, + provider=connection.provider, + operation="update" if local_issue.external_id else "create", + payload=canonical.dict(), + idempotency_key=f"{local_issue.id}:{local_issue.updated_at.isoformat()}" + ) + + self.db.add(outbox_entry) + await self.db.commit() + + async def initial_import( + self, + connection: ExternalConnection, + since: datetime = None, + labels: list[str] = None + ) -> int: + """Import all issues from external tracker.""" + + provider = ProviderFactory.get_provider(connection) + imported = 0 + + async for external_issue in provider.list_issues( + state="all", + since=since, + labels=labels + ): + await self.sync_inbound(connection, external_issue) + imported += 1 + + connection.last_sync_at = datetime.utcnow() + await self.db.commit() + + return imported + + async def reconcile(self, connection: ExternalConnection): + """ + Periodic reconciliation to catch missed webhooks. + Runs via Celery Beat every 15-30 minutes. 
+ """ + since = connection.last_sync_at or (datetime.utcnow() - timedelta(hours=24)) + + provider = ProviderFactory.get_provider(connection) + + async for external_issue in provider.list_issues( + state="all", + since=since + ): + local_issue = await self._find_by_external_id( + connection.project_id, + external_issue.provider, + external_issue.external_id + ) + + if local_issue: + # Check if external is newer + if external_issue.updated_at > local_issue.external_updated_at: + await self.sync_inbound(connection, external_issue) + else: + await self.sync_inbound(connection, external_issue) + + connection.last_sync_at = datetime.utcnow() + await self.db.commit() +``` + +--- + +## Error Handling & Recovery + +### Error Categories and Handling + +| Error Type | Handling Strategy | Retry | Alert | +|------------|-------------------|-------|-------| +| Network timeout | Exponential backoff | Yes (3x) | After 3 failures | +| Rate limit (429) | Wait for reset | Yes | No | +| Auth error (401/403) | Mark connection as error | No | Yes | +| Not found (404) | Mark issue as deleted | No | No | +| Conflict (409) | Apply resolution strategy | No | If unresolved | +| Server error (5xx) | Exponential backoff | Yes (5x) | After 5 failures | +| Validation error | Log and skip | No | Yes | + +### Retry Strategy + +```python +# app/services/sync/retry.py +import asyncio +from functools import wraps + +def with_retry( + max_attempts: int = 3, + base_delay: float = 1.0, + max_delay: float = 60.0, + exponential_base: float = 2.0, + retryable_exceptions: tuple = (httpx.TimeoutException, httpx.NetworkError) +): + """Decorator for retry with exponential backoff.""" + + def decorator(func): + @wraps(func) + async def wrapper(*args, **kwargs): + last_exception = None + + for attempt in range(max_attempts): + try: + return await func(*args, **kwargs) + except retryable_exceptions as e: + last_exception = e + if attempt < max_attempts - 1: + delay = min( + base_delay * (exponential_base ** attempt), + max_delay + ) + await asyncio.sleep(delay) + + raise last_exception + + return wrapper + return decorator +``` + +### Dead Letter Queue Handling + +```python +# app/workers/dead_letter_worker.py + +@celery_app.task +def process_dead_letter_queue(): + """Process failed sync items for manual review or retry.""" + + dead_items = db.query(SyncOutbox).filter( + SyncOutbox.status == "dead_letter", + SyncOutbox.created_at > datetime.utcnow() - timedelta(days=7) + ).all() + + for item in dead_items: + # Create notification for admin review + notify_admin( + f"Sync failed for issue {item.issue_id}", + details={ + "provider": item.provider, + "operation": item.operation, + "attempts": item.attempts, + "last_error": item.last_error, + } + ) + + # Optionally attempt one more retry with manual intervention + if should_retry(item): + item.status = "pending" + item.attempts = 0 + item.scheduled_at = datetime.utcnow() +``` + +### Health Monitoring + +```python +# app/services/sync/health.py + +async def check_sync_health(project_id: str) -> dict: + """Check sync health for a project.""" + + connections = await get_connections(project_id) + + health = { + "status": "healthy", + "connections": [], + "pending_syncs": 0, + "failed_syncs_24h": 0, + "conflicts_24h": 0, + } + + for conn in connections: + conn_health = { + "provider": conn.provider, + "status": conn.status, + "last_sync": conn.last_sync_at, + "webhook_active": conn.webhook_active, + } + + # Check if sync is stale + if conn.last_sync_at: + stale_threshold = datetime.utcnow() - 
timedelta(hours=1) + if conn.last_sync_at < stale_threshold: + conn_health["status"] = "stale" + health["status"] = "degraded" + + health["connections"].append(conn_health) + + # Count pending and failed + health["pending_syncs"] = await count_pending_syncs(project_id) + health["failed_syncs_24h"] = await count_failed_syncs(project_id, hours=24) + health["conflicts_24h"] = await count_conflicts(project_id, hours=24) + + if health["failed_syncs_24h"] > 10: + health["status"] = "unhealthy" + + return health +``` + +--- + +## Implementation Roadmap + +### Phase 1: Foundation (Week 1-2) +- [ ] Database schema and migrations +- [ ] Core models (ExternalConnection, IssueSyncLog, SyncOutbox) +- [ ] Provider interface and Gitea implementation +- [ ] Canonical issue model and field mapping + +### Phase 2: Inbound Sync (Week 2-3) +- [ ] Webhook endpoint and signature validation +- [ ] Redis queue for webhook events +- [ ] Celery worker for event processing +- [ ] Initial import functionality +- [ ] Basic conflict detection + +### Phase 3: Outbound Sync (Week 3-4) +- [ ] Outbox pattern implementation +- [ ] Outbound sync worker +- [ ] Retry and dead letter queue +- [ ] Bidirectional sync testing + +### Phase 4: GitHub & GitLab (Week 4-5) +- [ ] GitHub provider implementation +- [ ] GitLab provider implementation +- [ ] Provider factory and dynamic selection +- [ ] Cross-provider field mapping + +### Phase 5: Conflict Resolution (Week 5-6) +- [ ] Version vector implementation +- [ ] Conflict resolution strategies +- [ ] Merge algorithms for complex fields +- [ ] Conflict notification UI + +### Phase 6: Production Readiness (Week 6-7) +- [ ] Health monitoring and alerting +- [ ] Admin UI for connection management +- [ ] Comprehensive test coverage +- [ ] Performance optimization +- [ ] Documentation + +--- + +## References + +### Research Sources + +- [Two-Way Sync Tools 2025: Best Platforms for Real-Time Data Integration](https://www.stacksync.com/blog/2025-best-two-way-sync-tools-a-comprehensive-guide-for-data-integration) - StackSync +- [The Architect's Guide to Data Integration Patterns](https://medium.com/@prayagvakharia/the-architects-guide-to-data-integration-patterns-migration-broadcast-bi-directional-a4c92b5f908d) - Medium +- [System Design Pattern: Conflict Resolution in Distributed Systems](https://medium.com/@priyasrivastava18official/system-design-pattern-from-chaos-to-consistency-the-art-of-conflict-resolution-in-distributed-9d631028bdb4) - Medium +- [Eventual Consistency in Distributed Systems](https://www.geeksforgeeks.org/system-design/eventual-consistency-in-distributive-systems-learn-system-design/) - GeeksforGeeks +- [Bidirectional Synchronization: What It Is and Examples](https://www.workato.com/the-connector/bidirectional-synchronization/) - Workato +- [Data Integration Patterns: Bi-Directional Sync](https://blogs.mulesoft.com/api-integration/patterns/data-integration-patterns-bi-directional-sync/) - MuleSoft + +### API Documentation + +- [GitHub REST API - Issues](https://docs.github.com/en/rest/issues/issues) +- [GitHub Webhook Events](https://docs.github.com/en/webhooks/webhook-events-and-payloads) +- [GitLab Issues API](https://docs.gitlab.com/api/issues.html) +- [GitLab Webhooks](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html) +- [Gitea API Usage](https://docs.gitea.com/development/api-usage) + +### Related Syndarix Spikes + +- [SPIKE-001: MCP Integration Pattern](./SPIKE-001-mcp-integration-pattern.md) - MCP architecture and FastMCP usage +- [SPIKE-003: 
Real-time Updates](./SPIKE-003-realtime-updates.md) - SSE for event streaming +- [SPIKE-004: Celery Redis Integration](./SPIKE-004-celery-redis-integration.md) - Background job infrastructure + +--- + +## Decision + +**Adopt a webhook-first, polling-fallback synchronization architecture** with: + +1. **Last-Writer-Wins (LWW)** conflict resolution using version vectors +2. **External tracker as source of truth** with local mirrors +3. **Unified provider interface** abstracting Gitea, GitHub, GitLab differences +4. **Outbox pattern** for reliable outbound sync +5. **Redis Streams** for webhook event queuing and deduplication +6. **Celery Beat** for periodic reconciliation + +This approach balances real-time responsiveness with eventual consistency, providing a robust foundation for bidirectional issue synchronization. + +--- + +*Spike completed. Findings will inform ADR-009: Issue Synchronization Architecture.* diff --git a/docs/spikes/SPIKE-010-cost-tracking.md b/docs/spikes/SPIKE-010-cost-tracking.md new file mode 100644 index 0000000..596386f --- /dev/null +++ b/docs/spikes/SPIKE-010-cost-tracking.md @@ -0,0 +1,1821 @@ +# SPIKE-010: Cost Tracking for Syndarix + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #10 + +--- + +## Executive Summary + +Syndarix requires comprehensive LLM cost tracking to manage expenses across multiple providers (Anthropic, OpenAI, local Ollama). This spike researches token usage monitoring, budget enforcement, cost optimization strategies, and real-time alerting. + +### Recommendation + +**Adopt a multi-layered cost tracking architecture:** + +1. **LiteLLM Callbacks** for real-time usage capture at the gateway level +2. **PostgreSQL** for persistent usage records with time-series aggregation +3. **Redis** for real-time budget enforcement and rate limiting +4. **Celery Beat** for scheduled budget checks and alert processing +5. **SSE Events** for real-time dashboard updates + +**Expected Cost Savings:** 60-80% reduction through combined optimization strategies (semantic caching, model cascading, prompt compression). + +--- + +## Table of Contents + +1. [Research Questions & Findings](#research-questions--findings) +2. [Cost Tracking Architecture](#cost-tracking-architecture) +3. [Database Schema Design](#database-schema-design) +4. [LiteLLM Callback Implementation](#litellm-callback-implementation) +5. [Budget Management System](#budget-management-system) +6. [Alert System Integration](#alert-system-integration) +7. [Cost Optimization Strategies](#cost-optimization-strategies) +8. [Cost Estimation Before Execution](#cost-estimation-before-execution) +9. [Reporting Dashboard Requirements](#reporting-dashboard-requirements) +10. [Implementation Roadmap](#implementation-roadmap) + +--- + +## Research Questions & Findings + +### 1. How does LiteLLM track token usage and costs? + +LiteLLM provides built-in cost tracking through multiple mechanisms: + +**Response Usage Object:** +```python +response = await litellm.acompletion( + model="claude-3-5-sonnet-20241022", + messages=[...] +) +# Access usage data +print(response.usage.prompt_tokens) # Input tokens +print(response.usage.completion_tokens) # Output tokens +print(response.usage.total_tokens) # Total tokens +``` + +**Automatic Cost Calculation:** +LiteLLM maintains a centralized pricing database (`model_prices_and_context_window.json`) with costs for 100+ models. Cost is accessible via `kwargs["response_cost"]` in callbacks. 
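+
+For example, the same figures can be read directly from a response outside of callbacks. The sketch below uses LiteLLM's `token_counter()` and `completion_cost()` helpers; the exact signatures should be verified against the installed LiteLLM version, and provider API keys are assumed to be configured in the environment:
+
+```python
+# Minimal sketch: token counting and cost lookup without a custom callback.
+import litellm
+
+async def run_with_cost(messages: list[dict]) -> float:
+    model = "claude-3-5-sonnet-20241022"
+
+    # Provider-agnostic input token count before the call
+    prompt_tokens = litellm.token_counter(model=model, messages=messages)
+
+    response = await litellm.acompletion(model=model, messages=messages)
+
+    # Cost derived from LiteLLM's bundled pricing database
+    cost_usd = litellm.completion_cost(completion_response=response)
+    print(f"{prompt_tokens} prompt tokens, total cost ${cost_usd:.6f}")
+    return cost_usd
+```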
+ +**Custom Callbacks:** +```python +from litellm.integrations.custom_logger import CustomLogger + +class SyndarixCostLogger(CustomLogger): + async def async_log_success_event(self, kwargs, response_obj, start_time, end_time): + cost = kwargs.get("response_cost", 0) + model = kwargs.get("model") + usage = response_obj.usage + # Store in database +``` + +**Custom Pricing Override:** +```python +response = await litellm.acompletion( + model="custom-model", + messages=[...], + input_cost_per_token=0.000003, # $3 per 1M tokens + output_cost_per_token=0.000015 # $15 per 1M tokens +) +``` + +### 2. Best practices for LLM cost attribution in multi-tenant systems? + +**Key Patterns Identified:** + +1. **Hierarchical Attribution:** Track costs at multiple levels (organization > project > agent type > agent instance > request) +2. **Metadata Tagging:** Include cost center, budget ID, and user context in every request +3. **Tenant-Aware Budgets:** Implement per-customer tier budgets with different limits +4. **Chargeback/Showback:** Enable detailed cost allocation for billing integration + +**Multi-Tenant Architecture for Syndarix:** +``` +Organization (Billing Entity) + └── Project (Cost Center) + └── Sprint (Time-bounded Budget) + └── Agent Instance (Worker) + └── LLM Request (Atomic Cost Unit) +``` + +### 3. Real-time cost monitoring approaches? + +**Recommended Approach:** Hybrid Redis + PostgreSQL + +| Layer | Technology | Purpose | +|-------|------------|---------| +| Hot data | Redis | Real-time budget counters, rate limiting | +| Warm data | PostgreSQL | Hourly/daily aggregates, analytics | +| Cold data | S3/Archive | Monthly exports, long-term retention | + +**Real-time Pipeline:** +``` +LLM Request β†’ LiteLLM Callback β†’ Redis INCR β†’ Budget Check + ↓ + Async Queue β†’ PostgreSQL Insert β†’ SSE Event +``` + +### 4. Budget enforcement strategies (soft vs hard limits)? + +**Soft Limits (Recommended for most cases):** +- Send alerts at threshold percentages (50%, 80%, 100%) +- Allow continued usage with warnings +- Automatic model downgrade (e.g., Sonnet β†’ Haiku) +- Require approval for continued expensive operations + +**Hard Limits (For critical cost control):** +- Block requests once budget exhausted +- Reject LLM calls at gateway level +- Require manual budget increase + +**Syndarix Recommendation:** +- **Daily budgets:** Hard limits (prevent runaway costs) +- **Weekly/Monthly budgets:** Soft limits with alerts and escalation +- **Per-request limits:** Enforce max tokens per call + +### 5. Cost optimization techniques? + +**Tier 1: Immediate Impact (0 code changes)** +- Semantic caching: 15-30% cost reduction +- Response caching for deterministic queries + +**Tier 2: Model Selection (Architecture changes)** +- Model cascading: Start with cheaper models, escalate as needed +- 87% cost reduction possible (90% of queries handled by smaller models) +- Dynamic routing based on query complexity + +**Tier 3: Prompt Engineering (Ongoing optimization)** +- Prompt compression with LLMLingua: Up to 20x compression with <5% quality loss +- Extractive compression for RAG: 2-10x compression, often improves accuracy + +**Tier 4: Infrastructure (Long-term)** +- Self-hosted models for high-volume, latency-tolerant tasks +- Fine-tuned smaller models for domain-specific tasks + +### 6. Reporting and visualization requirements? + +**Dashboard Requirements:** +1. Real-time spend ticker (updated via SSE) +2. Cost breakdown by: project, agent type, model, time period +3. 
Budget utilization gauges with threshold indicators +4. Historical trend charts (daily/weekly/monthly) +5. Anomaly detection alerts (spending spikes) +6. Forecast projections based on current burn rate +7. Cost per task/feature analysis +8. Comparative efficiency metrics (cost per successful completion) + +### 7. Cost estimation before execution? + +**Pre-execution Estimation Pattern:** +```python +def estimate_cost(messages: list, model: str) -> CostEstimate: + """Estimate cost before making LLM call.""" + input_tokens = count_tokens(messages, model) + estimated_output = estimate_output_tokens(messages, model) + + costs = MODEL_COSTS[model] + input_cost = (input_tokens / 1_000_000) * costs["input"] + output_cost = (estimated_output / 1_000_000) * costs["output"] + + return CostEstimate( + input_tokens=input_tokens, + estimated_output_tokens=estimated_output, + estimated_cost_usd=input_cost + output_cost, + confidence="medium" # Based on output estimation accuracy + ) +``` + +**Token Counting:** +- Use `tiktoken` for OpenAI models +- Use `anthropic.count_tokens()` for Claude models +- Use LiteLLM's `token_counter()` for unified counting + +--- + +## Cost Tracking Architecture + +### System Overview + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Syndarix Backend β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ LLM Gateway (LiteLLM) β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Pre-Request β”‚ β”‚ Router β”‚ β”‚ Post-Requestβ”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Callback │──▢│ (Failover) │──▢│ Callback β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β–Ό β–Ό β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Budget β”‚ β”‚ Cost Tracker β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Check β”‚ β”‚ Service β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ +β”‚ β–Ό β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Redis β”‚ β”‚ PostgreSQL β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ Budget β”‚ β”‚ β”‚ β”‚ token_usage β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ Counters β”‚ β”‚ β”‚ β”‚ (time-series) β”‚ β”‚ β”‚ +β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ +β”‚ β”‚ β”‚ Rate Limits β”‚ β”‚ β”‚ β”‚ budgets β”‚ β”‚ β”‚ +β”‚ β”‚ 
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ +β”‚ β”‚ β”‚ Cache Keys β”‚ β”‚ β”‚ β”‚ alerts_log β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ daily_summary β”‚ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β–Ό β–Ό β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ Event Bus (SSE) β”‚ β”‚ Celery Beat β”‚ β”‚ +β”‚ β”‚ - cost_update β”‚ β”‚ - budget_check β”‚ β”‚ +β”‚ β”‚ - budget_alert β”‚ β”‚ - daily_rollup β”‚ β”‚ +β”‚ β”‚ - threshold_warn β”‚ β”‚ - monthly_report β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Component Responsibilities + +| Component | Responsibility | +|-----------|----------------| +| LLM Gateway | Route requests, capture usage, enforce limits | +| Cost Tracker | Calculate costs, persist records, update counters | +| Redis | Real-time budget counters, rate limiting, caching | +| PostgreSQL | Persistent usage records, aggregations, analytics | +| Event Bus | Real-time notifications to dashboards | +| Celery Beat | Scheduled tasks (rollups, alerts, reports) | + +--- + +## Database Schema Design + +### Core Tables + +```python +# app/models/cost_tracking.py +from sqlalchemy import Column, String, Integer, Float, DateTime, ForeignKey, Enum, Index +from sqlalchemy.dialects.postgresql import UUID, JSONB +from sqlalchemy.orm import relationship +from app.models.base import Base, TimestampMixin, UUIDMixin +import enum + +class BudgetPeriod(str, enum.Enum): + DAILY = "daily" + WEEKLY = "weekly" + MONTHLY = "monthly" + +class AlertSeverity(str, enum.Enum): + INFO = "info" # 50% threshold + WARNING = "warning" # 80% threshold + CRITICAL = "critical" # 100% threshold + +class AlertStatus(str, enum.Enum): + PENDING = "pending" + ACKNOWLEDGED = "acknowledged" + RESOLVED = "resolved" + + +class TokenUsage(Base, UUIDMixin, TimestampMixin): + """Individual LLM request usage record.""" + __tablename__ = "token_usage" + + # Attribution + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False) + agent_instance_id = Column(UUID(as_uuid=True), ForeignKey("agent_instances.id"), nullable=True) + agent_type_id = Column(UUID(as_uuid=True), ForeignKey("agent_types.id"), nullable=True) + user_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=True) + + # Request details + request_id = Column(String(64), unique=True, nullable=False) # Correlation ID + model = Column(String(100), nullable=False) + provider = Column(String(50), nullable=False) # anthropic, openai, ollama + + # Token counts + prompt_tokens = Column(Integer, nullable=False) + completion_tokens = 
Column(Integer, nullable=False) + total_tokens = Column(Integer, nullable=False) + cached_tokens = Column(Integer, default=0) # Tokens served from cache + + # Cost calculation + input_cost_usd = Column(Float, nullable=False) + output_cost_usd = Column(Float, nullable=False) + total_cost_usd = Column(Float, nullable=False) + + # Timing + latency_ms = Column(Integer, nullable=True) + request_timestamp = Column(DateTime(timezone=True), nullable=False) + + # Metadata + task_type = Column(String(50), nullable=True) # reasoning, coding, chat, etc. + success = Column(Boolean, default=True) + error_type = Column(String(100), nullable=True) + metadata = Column(JSONB, default={}) # Additional context + + # Indexes for common queries + __table_args__ = ( + Index("ix_token_usage_project_timestamp", "project_id", "request_timestamp"), + Index("ix_token_usage_agent_instance_timestamp", "agent_instance_id", "request_timestamp"), + Index("ix_token_usage_model_timestamp", "model", "request_timestamp"), + Index("ix_token_usage_request_timestamp", "request_timestamp"), + ) + + +class Budget(Base, UUIDMixin, TimestampMixin): + """Budget definition for cost control.""" + __tablename__ = "budgets" + + # Scope (one of these will be set) + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=True) + agent_type_id = Column(UUID(as_uuid=True), ForeignKey("agent_types.id"), nullable=True) + organization_id = Column(UUID(as_uuid=True), ForeignKey("organizations.id"), nullable=True) + + # Budget configuration + name = Column(String(100), nullable=False) + period = Column(Enum(BudgetPeriod), nullable=False) + limit_usd = Column(Float, nullable=False) + + # Thresholds (percentages) + warn_threshold = Column(Float, default=0.5) # 50% + alert_threshold = Column(Float, default=0.8) # 80% + critical_threshold = Column(Float, default=1.0) # 100% + + # Enforcement + hard_limit = Column(Boolean, default=False) # Block requests at limit + auto_downgrade = Column(Boolean, default=True) # Downgrade to cheaper model + + # Status + is_active = Column(Boolean, default=True) + current_spend = Column(Float, default=0.0) # Cached from Redis + period_start = Column(DateTime(timezone=True), nullable=False) + period_end = Column(DateTime(timezone=True), nullable=False) + + __table_args__ = ( + Index("ix_budgets_project_active", "project_id", "is_active"), + Index("ix_budgets_period_end", "period_end"), + ) + + +class BudgetAlert(Base, UUIDMixin, TimestampMixin): + """Alert record for budget threshold breaches.""" + __tablename__ = "budget_alerts" + + budget_id = Column(UUID(as_uuid=True), ForeignKey("budgets.id"), nullable=False) + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=True) + + severity = Column(Enum(AlertSeverity), nullable=False) + status = Column(Enum(AlertStatus), default=AlertStatus.PENDING) + + threshold_percent = Column(Float, nullable=False) + current_spend = Column(Float, nullable=False) + budget_limit = Column(Float, nullable=False) + + message = Column(String(500), nullable=False) + acknowledged_by = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=True) + acknowledged_at = Column(DateTime(timezone=True), nullable=True) + + # Notification tracking + notifications_sent = Column(JSONB, default=[]) # [{"channel": "email", "sent_at": "..."}] + + budget = relationship("Budget", backref="alerts") + + +class DailyCostSummary(Base, UUIDMixin): + """Materialized daily cost aggregations for fast reporting.""" + __tablename__ = "daily_cost_summaries" + + date = 
Column(DateTime(timezone=True), nullable=False) + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=True) + agent_type_id = Column(UUID(as_uuid=True), ForeignKey("agent_types.id"), nullable=True) + model = Column(String(100), nullable=True) + + # Aggregated metrics + request_count = Column(Integer, default=0) + total_prompt_tokens = Column(Integer, default=0) + total_completion_tokens = Column(Integer, default=0) + total_tokens = Column(Integer, default=0) + total_cost_usd = Column(Float, default=0.0) + + # Performance metrics + avg_latency_ms = Column(Float, nullable=True) + cache_hit_rate = Column(Float, default=0.0) + success_rate = Column(Float, default=1.0) + + __table_args__ = ( + Index("ix_daily_summary_date_project", "date", "project_id"), + Index("ix_daily_summary_date_model", "date", "model"), + ) + + +class ModelPricing(Base, UUIDMixin, TimestampMixin): + """Custom model pricing overrides.""" + __tablename__ = "model_pricing" + + model = Column(String(100), unique=True, nullable=False) + provider = Column(String(50), nullable=False) + + input_cost_per_million = Column(Float, nullable=False) + output_cost_per_million = Column(Float, nullable=False) + + # Context limits + max_input_tokens = Column(Integer, nullable=True) + max_output_tokens = Column(Integer, nullable=True) + + # Validity period + effective_from = Column(DateTime(timezone=True), nullable=False) + effective_until = Column(DateTime(timezone=True), nullable=True) + + is_active = Column(Boolean, default=True) +``` + +### Redis Key Structure + +```python +# Redis key patterns for real-time tracking + +REDIS_KEYS = { + # Budget counters (reset on period start) + "budget_spend": "budget:{budget_id}:spend", # INCRBYFLOAT + "budget_requests": "budget:{budget_id}:requests", # INCR + + # Project-level aggregations + "project_daily_spend": "project:{project_id}:daily:{date}:spend", + "project_monthly_spend": "project:{project_id}:monthly:{year_month}:spend", + + # Rate limiting + "rate_limit": "ratelimit:{project_id}:{model}:{window}", + + # Semantic cache + "semantic_cache": "cache:semantic:{hash}", + + # Real-time metrics (for dashboard) + "realtime_spend": "metrics:realtime:spend:{project_id}", +} +``` + +--- + +## LiteLLM Callback Implementation + +### Custom Callback Handler + +```python +# app/services/cost_tracking/callbacks.py +from typing import Any +from datetime import datetime, UTC +import asyncio +from litellm.integrations.custom_logger import CustomLogger +from app.core.config import settings +from app.services.cost_tracking.tracker import CostTracker +from app.services.cost_tracking.budget import BudgetEnforcer +from app.services.events import EventBus + +class SyndarixCostCallback(CustomLogger): + """ + LiteLLM callback handler for cost tracking and budget enforcement. 
+ + Integrates with: + - Redis for real-time budget counters + - PostgreSQL for persistent usage records + - Event bus for real-time dashboard updates + """ + + def __init__(self): + self.tracker = CostTracker() + self.budget_enforcer = BudgetEnforcer() + self.event_bus = EventBus() + + def log_pre_api_call(self, model: str, messages: list, kwargs: dict): + """Pre-request validation and cost estimation.""" + project_id = kwargs.get("metadata", {}).get("project_id") + agent_id = kwargs.get("metadata", {}).get("agent_id") + + if not project_id: + return # Skip tracking for internal calls + + # Check budget before making request + budget_status = self.budget_enforcer.check_budget_sync(project_id) + + if budget_status.exceeded and budget_status.hard_limit: + raise BudgetExceededError( + f"Budget exceeded for project {project_id}. " + f"Limit: ${budget_status.limit:.2f}, " + f"Spent: ${budget_status.spent:.2f}" + ) + + if budget_status.exceeded and budget_status.auto_downgrade: + # Suggest model downgrade + kwargs["model"] = self._get_fallback_model(model) + + async def async_log_success_event( + self, + kwargs: dict, + response_obj: Any, + start_time: float, + end_time: float + ): + """Record successful request usage and costs.""" + metadata = kwargs.get("metadata", {}) + project_id = metadata.get("project_id") + agent_id = metadata.get("agent_id") + agent_type_id = metadata.get("agent_type_id") + + if not project_id: + return + + # Extract usage data + usage = response_obj.usage + model = kwargs.get("model") + cost = kwargs.get("response_cost", 0) + + # Calculate costs if not provided + if cost == 0: + cost = self._calculate_cost(model, usage) + + # Create usage record + usage_record = { + "request_id": kwargs.get("litellm_call_id", str(uuid.uuid4())), + "project_id": project_id, + "agent_instance_id": agent_id, + "agent_type_id": agent_type_id, + "model": model, + "provider": self._get_provider(model), + "prompt_tokens": usage.prompt_tokens, + "completion_tokens": usage.completion_tokens, + "total_tokens": usage.total_tokens, + "input_cost_usd": self._calculate_input_cost(model, usage.prompt_tokens), + "output_cost_usd": self._calculate_output_cost(model, usage.completion_tokens), + "total_cost_usd": cost, + "latency_ms": int((end_time - start_time) * 1000), + "request_timestamp": datetime.now(UTC), + "task_type": metadata.get("task_type"), + "success": True, + } + + # Async operations in parallel + await asyncio.gather( + self.tracker.record_usage(usage_record), + self.budget_enforcer.increment_spend(project_id, cost), + self._publish_cost_event(project_id, usage_record), + ) + + async def async_log_failure_event( + self, + kwargs: dict, + response_obj: Any, + start_time: float, + end_time: float + ): + """Record failed request (still costs input tokens).""" + metadata = kwargs.get("metadata", {}) + project_id = metadata.get("project_id") + + if not project_id: + return + + # Failed requests still consume input tokens + input_tokens = kwargs.get("input_tokens", 0) + model = kwargs.get("model") + + if input_tokens > 0: + input_cost = self._calculate_input_cost(model, input_tokens) + + usage_record = { + "request_id": kwargs.get("litellm_call_id", str(uuid.uuid4())), + "project_id": project_id, + "model": model, + "prompt_tokens": input_tokens, + "completion_tokens": 0, + "total_tokens": input_tokens, + "total_cost_usd": input_cost, + "success": False, + "error_type": type(response_obj).__name__, + "request_timestamp": datetime.now(UTC), + } + + await self.tracker.record_usage(usage_record) + 
+ async def _publish_cost_event(self, project_id: str, usage: dict): + """Publish real-time cost event to dashboard.""" + await self.event_bus.publish(f"project:{project_id}", { + "type": "cost_update", + "data": { + "model": usage["model"], + "tokens": usage["total_tokens"], + "cost_usd": usage["total_cost_usd"], + "timestamp": usage["request_timestamp"].isoformat(), + } + }) + + def _calculate_cost(self, model: str, usage) -> float: + """Calculate total cost from usage.""" + input_cost = self._calculate_input_cost(model, usage.prompt_tokens) + output_cost = self._calculate_output_cost(model, usage.completion_tokens) + return input_cost + output_cost + + def _calculate_input_cost(self, model: str, tokens: int) -> float: + """Calculate input token cost.""" + costs = MODEL_COSTS.get(model, {"input": 0, "output": 0}) + return (tokens / 1_000_000) * costs["input"] + + def _calculate_output_cost(self, model: str, tokens: int) -> float: + """Calculate output token cost.""" + costs = MODEL_COSTS.get(model, {"input": 0, "output": 0}) + return (tokens / 1_000_000) * costs["output"] + + def _get_provider(self, model: str) -> str: + """Extract provider from model name.""" + if model.startswith("claude"): + return "anthropic" + elif model.startswith("gpt") or model.startswith("o1"): + return "openai" + elif model.startswith("ollama/"): + return "ollama" + return "unknown" + + def _get_fallback_model(self, model: str) -> str: + """Get cheaper fallback model.""" + fallbacks = { + "claude-3-5-sonnet-20241022": "claude-3-haiku-20240307", + "gpt-4-turbo": "gpt-4o-mini", + "gpt-4o": "gpt-4o-mini", + } + return fallbacks.get(model, model) + + +# Model cost database (per 1M tokens) +MODEL_COSTS = { + # Anthropic + "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00}, + "claude-3-opus-20240229": {"input": 15.00, "output": 75.00}, + "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25}, + + # OpenAI + "gpt-4-turbo": {"input": 10.00, "output": 30.00}, + "gpt-4o": {"input": 2.50, "output": 10.00}, + "gpt-4o-mini": {"input": 0.15, "output": 0.60}, + "o1-preview": {"input": 15.00, "output": 60.00}, + "o1-mini": {"input": 3.00, "output": 12.00}, + + # Local (compute-only, no API cost) + "ollama/llama3": {"input": 0.00, "output": 0.00}, + "ollama/mixtral": {"input": 0.00, "output": 0.00}, +} +``` + +### Registering the Callback + +```python +# app/services/llm_gateway.py +import litellm +from app.services.cost_tracking.callbacks import SyndarixCostCallback + +# Register the callback globally +syndarix_callback = SyndarixCostCallback() +litellm.callbacks = [syndarix_callback] + +# Or register specific callbacks +litellm.success_callback = [syndarix_callback.async_log_success_event] +litellm.failure_callback = [syndarix_callback.async_log_failure_event] +``` + +--- + +## Budget Management System + +### Budget Enforcer + +```python +# app/services/cost_tracking/budget.py +from dataclasses import dataclass +from datetime import datetime, UTC, timedelta +from typing import Optional +import redis.asyncio as redis +from sqlalchemy.ext.asyncio import AsyncSession +from app.models.cost_tracking import Budget, BudgetPeriod, BudgetAlert, AlertSeverity + +@dataclass +class BudgetStatus: + budget_id: str + limit: float + spent: float + remaining: float + percent_used: float + exceeded: bool + hard_limit: bool + auto_downgrade: bool + threshold_breached: Optional[str] = None # warn, alert, critical + + +class BudgetEnforcer: + """Real-time budget enforcement using Redis.""" + + def __init__(self, redis_url: str = 
None): + self.redis = redis.from_url(redis_url or settings.REDIS_URL) + + async def check_budget(self, project_id: str) -> BudgetStatus: + """Check current budget status for a project.""" + # Get active budgets for project + budgets = await self._get_active_budgets(project_id) + + if not budgets: + return BudgetStatus( + budget_id=None, + limit=float("inf"), + spent=0, + remaining=float("inf"), + percent_used=0, + exceeded=False, + hard_limit=False, + auto_downgrade=False, + ) + + # Check most restrictive budget + for budget in budgets: + spend_key = f"budget:{budget.id}:spend" + spent = float(await self.redis.get(spend_key) or 0) + + percent_used = (spent / budget.limit_usd) * 100 + exceeded = spent >= budget.limit_usd + + # Determine threshold breach + threshold_breached = None + if percent_used >= budget.critical_threshold * 100: + threshold_breached = "critical" + elif percent_used >= budget.alert_threshold * 100: + threshold_breached = "alert" + elif percent_used >= budget.warn_threshold * 100: + threshold_breached = "warn" + + return BudgetStatus( + budget_id=str(budget.id), + limit=budget.limit_usd, + spent=spent, + remaining=max(0, budget.limit_usd - spent), + percent_used=percent_used, + exceeded=exceeded, + hard_limit=budget.hard_limit, + auto_downgrade=budget.auto_downgrade, + threshold_breached=threshold_breached, + ) + + async def increment_spend(self, project_id: str, amount: float) -> BudgetStatus: + """Increment budget spend and check thresholds.""" + budgets = await self._get_active_budgets(project_id) + + for budget in budgets: + spend_key = f"budget:{budget.id}:spend" + + # Atomic increment + new_spend = await self.redis.incrbyfloat(spend_key, amount) + + # Check and trigger alerts + await self._check_thresholds(budget, new_spend) + + return await self.check_budget(project_id) + + async def _check_thresholds(self, budget: Budget, current_spend: float): + """Check thresholds and create alerts if needed.""" + percent = (current_spend / budget.limit_usd) * 100 + + thresholds = [ + (budget.warn_threshold * 100, AlertSeverity.INFO), + (budget.alert_threshold * 100, AlertSeverity.WARNING), + (budget.critical_threshold * 100, AlertSeverity.CRITICAL), + ] + + for threshold, severity in thresholds: + if percent >= threshold: + # Check if alert already exists for this threshold + alert_key = f"budget:{budget.id}:alert:{severity.value}" + if not await self.redis.exists(alert_key): + await self._create_alert(budget, severity, percent, current_spend) + # Mark alert as sent + await self.redis.setex(alert_key, 86400, "1") # 24h TTL + + async def _create_alert( + self, + budget: Budget, + severity: AlertSeverity, + percent: float, + current_spend: float + ): + """Create budget alert and trigger notifications.""" + alert = BudgetAlert( + budget_id=budget.id, + project_id=budget.project_id, + severity=severity, + threshold_percent=percent, + current_spend=current_spend, + budget_limit=budget.limit_usd, + message=f"Budget '{budget.name}' at {percent:.1f}% (${current_spend:.2f}/${budget.limit_usd:.2f})" + ) + + # Persist alert + async with get_async_session() as session: + session.add(alert) + await session.commit() + + # Publish alert event + await EventBus().publish(f"project:{budget.project_id}", { + "type": "budget_alert", + "data": { + "severity": severity.value, + "message": alert.message, + "percent_used": percent, + "budget_name": budget.name, + } + }) + + async def reset_budget_period(self, budget_id: str): + """Reset budget counter for new period.""" + spend_key = 
f"budget:{budget_id}:spend" + await self.redis.delete(spend_key) + + # Clear alert markers + for severity in AlertSeverity: + alert_key = f"budget:{budget_id}:alert:{severity.value}" + await self.redis.delete(alert_key) + + def check_budget_sync(self, project_id: str) -> BudgetStatus: + """Synchronous version for pre-request checks.""" + import asyncio + loop = asyncio.get_event_loop() + return loop.run_until_complete(self.check_budget(project_id)) +``` + +### Budget Period Management (Celery Task) + +```python +# app/tasks/budget_tasks.py +from celery import shared_task +from datetime import datetime, UTC, timedelta +from app.services.cost_tracking.budget import BudgetEnforcer + +@shared_task +def check_budget_periods(): + """ + Scheduled task to manage budget periods. + Run every hour via Celery Beat. + """ + now = datetime.now(UTC) + enforcer = BudgetEnforcer() + + # Find budgets that need period reset + with get_session() as session: + expired_budgets = session.query(Budget).filter( + Budget.is_active == True, + Budget.period_end <= now + ).all() + + for budget in expired_budgets: + # Archive current period spend + archive_budget_period(budget) + + # Calculate new period + if budget.period == BudgetPeriod.DAILY: + period_start = now.replace(hour=0, minute=0, second=0) + period_end = period_start + timedelta(days=1) + elif budget.period == BudgetPeriod.WEEKLY: + period_start = now - timedelta(days=now.weekday()) + period_end = period_start + timedelta(weeks=1) + elif budget.period == BudgetPeriod.MONTHLY: + period_start = now.replace(day=1) + next_month = period_start + timedelta(days=32) + period_end = next_month.replace(day=1) + + # Update budget + budget.period_start = period_start + budget.period_end = period_end + budget.current_spend = 0 + + session.commit() + + # Reset Redis counter + asyncio.run(enforcer.reset_budget_period(str(budget.id))) + +@shared_task +def daily_cost_rollup(): + """ + Aggregate daily costs into summary table. + Run at 1 AM daily via Celery Beat. 
+ """ + yesterday = datetime.now(UTC).date() - timedelta(days=1) + + with get_session() as session: + # Aggregate by project, agent_type, model + results = session.execute(""" + INSERT INTO daily_cost_summaries + (date, project_id, agent_type_id, model, + request_count, total_prompt_tokens, total_completion_tokens, + total_tokens, total_cost_usd, avg_latency_ms, + cache_hit_rate, success_rate) + SELECT + DATE(request_timestamp) as date, + project_id, + agent_type_id, + model, + COUNT(*) as request_count, + SUM(prompt_tokens) as total_prompt_tokens, + SUM(completion_tokens) as total_completion_tokens, + SUM(total_tokens) as total_tokens, + SUM(total_cost_usd) as total_cost_usd, + AVG(latency_ms) as avg_latency_ms, + SUM(cached_tokens)::float / NULLIF(SUM(total_tokens), 0) as cache_hit_rate, + AVG(CASE WHEN success THEN 1 ELSE 0 END) as success_rate + FROM token_usage + WHERE DATE(request_timestamp) = :yesterday + GROUP BY DATE(request_timestamp), project_id, agent_type_id, model + ON CONFLICT DO NOTHING + """, {"yesterday": yesterday}) + + session.commit() +``` + +--- + +## Alert System Integration + +### Alert Configuration + +```python +# app/services/cost_tracking/alerts.py +from dataclasses import dataclass +from enum import Enum +from typing import List, Optional + +class AlertChannel(str, Enum): + EMAIL = "email" + SLACK = "slack" + WEBHOOK = "webhook" + SSE = "sse" # Real-time dashboard + +@dataclass +class AlertRule: + budget_id: str + channels: List[AlertChannel] + recipients: List[str] # Email addresses, Slack channels, etc. + escalation_delay: int = 0 # Minutes before escalating + +class AlertDispatcher: + """Dispatch budget alerts to configured channels.""" + + async def dispatch(self, alert: BudgetAlert, rules: List[AlertRule]): + """Send alert through all configured channels.""" + for rule in rules: + if rule.budget_id == str(alert.budget_id): + for channel in rule.channels: + await self._send_alert(channel, alert, rule.recipients) + + async def _send_alert( + self, + channel: AlertChannel, + alert: BudgetAlert, + recipients: List[str] + ): + """Send alert to specific channel.""" + notification = { + "channel": channel.value, + "sent_at": datetime.now(UTC).isoformat(), + "recipients": recipients, + } + + if channel == AlertChannel.EMAIL: + await self._send_email_alert(alert, recipients) + elif channel == AlertChannel.SLACK: + await self._send_slack_alert(alert, recipients) + elif channel == AlertChannel.WEBHOOK: + await self._send_webhook_alert(alert, recipients) + elif channel == AlertChannel.SSE: + # Already handled via EventBus + pass + + # Track notification + alert.notifications_sent.append(notification) + + async def _send_slack_alert(self, alert: BudgetAlert, channels: List[str]): + """Send alert to Slack.""" + color = { + AlertSeverity.INFO: "#36a64f", # Green + AlertSeverity.WARNING: "#ff9800", # Orange + AlertSeverity.CRITICAL: "#f44336", # Red + }[alert.severity] + + payload = { + "attachments": [{ + "color": color, + "title": f"Budget Alert: {alert.message}", + "fields": [ + {"title": "Budget", "value": f"${alert.budget_limit:.2f}", "short": True}, + {"title": "Current Spend", "value": f"${alert.current_spend:.2f}", "short": True}, + {"title": "Usage", "value": f"{alert.threshold_percent:.1f}%", "short": True}, + {"title": "Severity", "value": alert.severity.value.upper(), "short": True}, + ], + "ts": datetime.now(UTC).timestamp(), + }] + } + + for channel in channels: + await send_slack_message(channel, payload) +``` + +--- + +## Cost Optimization Strategies + 
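+The three strategies below compose naturally in the gateway's request path: compress the prompt, let the cascade pick the cheapest capable model, then consult the semantic cache before paying for a completion. The sketch below is illustrative only; it assumes the `SemanticCache`, `ModelCascade`, and `PromptOptimizer` classes defined in the subsections that follow (including a hypothetical `pipeline.py` module), and leaves cost attribution to the LiteLLM callback described earlier.
+
+```python
+# app/services/cost_tracking/pipeline.py
+# Illustrative sketch only: wiring and class names follow the component
+# sketches in this section and are not a final API.
+import litellm
+
+class CostOptimizedCompletion:
+    """Chain compression -> cascade routing -> semantic cache -> LLM call."""
+
+    def __init__(self, cache, cascade, optimizer):
+        self.cache = cache          # SemanticCache (see 1.)
+        self.cascade = cascade      # ModelCascade (see 2.)
+        self.optimizer = optimizer  # PromptOptimizer (see 3.)
+
+    async def complete(self, messages: list, project_id: str) -> dict:
+        # Compress first so cache keys and routing see the same text
+        compressed = self.optimizer.compress(messages, target_ratio=0.5)
+
+        # Route to the cheapest model judged capable of the query
+        model = await self.cascade.route(compressed)
+
+        # Serve from cache if a semantically similar query was answered before
+        cached = await self.cache.get(compressed, model, project_id)
+        if cached:
+            response_text, cache_info = cached
+            return {"response": response_text, "cache": cache_info, "model_used": model}
+
+        # Cache miss: call the LLM (usage/cost recording happens in the LiteLLM callback)
+        response = await litellm.acompletion(
+            model=model,
+            messages=compressed,
+            metadata={"project_id": project_id},
+        )
+
+        # Store the answer for future similar queries
+        await self.cache.set(
+            compressed, model, project_id,
+            response.choices[0].message.content,
+            usage={
+                "total_tokens": response.usage.total_tokens,
+                "total_cost_usd": 0.0,  # populated from callback data when available
+            },
+        )
+        return {"response": response, "model_used": model}
+```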
+### 1. Semantic Caching + +```python +# app/services/cost_tracking/cache.py +import hashlib +import numpy as np +from typing import Optional, Tuple +import redis.asyncio as redis + +class SemanticCache: + """ + Cache LLM responses based on semantic similarity. + Uses embedding vectors to find similar queries. + """ + + def __init__(self, redis_url: str, similarity_threshold: float = 0.95): + self.redis = redis.from_url(redis_url) + self.threshold = similarity_threshold + self.embedding_model = "text-embedding-3-small" # OpenAI + + async def get( + self, + messages: list, + model: str, + project_id: str + ) -> Optional[Tuple[str, dict]]: + """ + Check cache for semantically similar query. + + Returns: + Tuple of (cached_response, usage_info) or None + """ + query_embedding = await self._get_embedding(messages) + cache_key_pattern = f"cache:semantic:{project_id}:{model}:*" + + # Search for similar embeddings + async for key in self.redis.scan_iter(match=cache_key_pattern): + cached = await self.redis.hgetall(key) + if cached: + cached_embedding = np.frombuffer(cached[b"embedding"], dtype=np.float32) + similarity = self._cosine_similarity(query_embedding, cached_embedding) + + if similarity >= self.threshold: + return ( + cached[b"response"].decode(), + { + "cache_hit": True, + "similarity": similarity, + "tokens_saved": int(cached[b"tokens"]), + "cost_saved": float(cached[b"cost"]), + } + ) + + return None + + async def set( + self, + messages: list, + model: str, + project_id: str, + response: str, + usage: dict, + ttl: int = 3600 + ): + """Cache response with embedding for future similarity matching.""" + query_embedding = await self._get_embedding(messages) + + # Create unique key + content_hash = hashlib.sha256(str(messages).encode()).hexdigest()[:16] + cache_key = f"cache:semantic:{project_id}:{model}:{content_hash}" + + await self.redis.hset(cache_key, mapping={ + "embedding": query_embedding.tobytes(), + "response": response, + "tokens": usage["total_tokens"], + "cost": usage["total_cost_usd"], + "model": model, + "created_at": datetime.now(UTC).isoformat(), + }) + await self.redis.expire(cache_key, ttl) + + async def _get_embedding(self, messages: list) -> np.ndarray: + """Generate embedding for messages.""" + # Extract text content + text = " ".join(m.get("content", "") for m in messages) + + # Call embedding API + response = await litellm.aembedding( + model=self.embedding_model, + input=text + ) + + return np.array(response.data[0].embedding, dtype=np.float32) + + @staticmethod + def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float: + """Calculate cosine similarity between vectors.""" + return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) +``` + +### 2. Model Cascading + +```python +# app/services/cost_tracking/cascade.py +from dataclasses import dataclass +from typing import List, Optional + +@dataclass +class CascadeModel: + model: str + cost_per_1k_tokens: float + max_complexity: float # 0-1 score + timeout: int = 30 + +class ModelCascade: + """ + Route queries to cheapest capable model. + Escalate to more expensive models only when needed. 
+ """ + + def __init__(self): + self.models = [ + CascadeModel("claude-3-haiku-20240307", 0.00075, 0.3, timeout=10), + CascadeModel("gpt-4o-mini", 0.000375, 0.4, timeout=15), + CascadeModel("claude-3-5-sonnet-20241022", 0.009, 0.8, timeout=30), + CascadeModel("gpt-4-turbo", 0.02, 1.0, timeout=60), + ] + self.complexity_classifier = None # TODO: Train classifier + + async def route( + self, + messages: list, + required_quality: float = 0.7 + ) -> str: + """ + Determine the best model for the query. + + Args: + messages: The query messages + required_quality: Minimum quality threshold (0-1) + + Returns: + Model identifier to use + """ + complexity = await self._estimate_complexity(messages) + + # Find cheapest model that meets requirements + for model in self.models: + if model.max_complexity >= complexity and model.max_complexity >= required_quality: + return model.model + + # Fallback to most capable + return self.models[-1].model + + async def _estimate_complexity(self, messages: list) -> float: + """ + Estimate query complexity (0-1). + + Factors: + - Message length + - Technical terms + - Multi-step reasoning required + - Code generation requested + """ + text = " ".join(m.get("content", "") for m in messages) + + # Simple heuristics (replace with ML model) + complexity = 0.0 + + # Length factor + word_count = len(text.split()) + if word_count > 500: + complexity += 0.2 + elif word_count > 200: + complexity += 0.1 + + # Technical indicators + technical_terms = ["implement", "architect", "optimize", "debug", "refactor"] + if any(term in text.lower() for term in technical_terms): + complexity += 0.3 + + # Code indicators + if "```" in text or "code" in text.lower(): + complexity += 0.2 + + # Multi-step indicators + if "step" in text.lower() or "first" in text.lower() or "then" in text.lower(): + complexity += 0.2 + + return min(1.0, complexity) + + async def execute_with_fallback( + self, + messages: list, + required_quality: float = 0.7, + confidence_threshold: float = 0.8 + ) -> dict: + """ + Execute query with automatic fallback to stronger models. + """ + for model in self.models: + if model.max_complexity < required_quality: + continue + + try: + response = await litellm.acompletion( + model=model.model, + messages=messages, + timeout=model.timeout, + ) + + # Check response quality (simplified) + confidence = await self._evaluate_response(response, messages) + + if confidence >= confidence_threshold: + return { + "response": response, + "model_used": model.model, + "fallback_count": self.models.index(model), + } + + except Exception as e: + # Log and try next model + continue + + # Final fallback + return await self._execute_premium(messages) +``` + +### 3. Prompt Compression (LLMLingua Integration) + +```python +# app/services/cost_tracking/compression.py +from llmlingua import PromptCompressor + +class PromptOptimizer: + """ + Compress prompts to reduce token count while preserving meaning. + Uses LLMLingua for intelligent compression. + """ + + def __init__(self): + self.compressor = PromptCompressor( + model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank", + use_llmlingua2=True + ) + + def compress( + self, + messages: list, + target_ratio: float = 0.5, # Compress to 50% of original + preserve_instructions: bool = True + ) -> list: + """ + Compress messages while preserving critical information. 
+ + Args: + messages: Original messages + target_ratio: Target compression ratio + preserve_instructions: Whether to preserve system instructions + + Returns: + Compressed messages + """ + compressed_messages = [] + + for msg in messages: + role = msg["role"] + content = msg["content"] + + # Don't compress system instructions if requested + if role == "system" and preserve_instructions: + compressed_messages.append(msg) + continue + + # Apply compression + compressed = self.compressor.compress_prompt( + content, + rate=target_ratio, + force_tokens=self._extract_critical_tokens(content), + ) + + compressed_messages.append({ + "role": role, + "content": compressed["compressed_prompt"] + }) + + return compressed_messages + + def _extract_critical_tokens(self, text: str) -> list: + """Extract tokens that should never be removed.""" + # Preserve code blocks, variable names, etc. + import re + + critical = [] + + # Preserve code blocks + code_blocks = re.findall(r'```[\s\S]*?```', text) + for block in code_blocks: + critical.extend(block.split()) + + # Preserve quoted strings + quotes = re.findall(r'"[^"]*"', text) + critical.extend(quotes) + + return critical + + def estimate_savings(self, original: list, compressed: list) -> dict: + """Calculate token savings from compression.""" + original_tokens = sum( + len(m["content"].split()) * 1.3 # Rough token estimate + for m in original + ) + compressed_tokens = sum( + len(m["content"].split()) * 1.3 + for m in compressed + ) + + return { + "original_tokens": int(original_tokens), + "compressed_tokens": int(compressed_tokens), + "tokens_saved": int(original_tokens - compressed_tokens), + "compression_ratio": compressed_tokens / original_tokens, + } +``` + +--- + +## Cost Estimation Before Execution + +```python +# app/services/cost_tracking/estimator.py +from dataclasses import dataclass +from typing import Optional +import tiktoken + +@dataclass +class CostEstimate: + input_tokens: int + estimated_output_tokens: int + min_cost_usd: float + max_cost_usd: float + expected_cost_usd: float + model: str + confidence: str # low, medium, high + +class CostEstimator: + """ + Estimate LLM request cost before execution. + """ + + def __init__(self): + self.encoders = {} + + def estimate( + self, + messages: list, + model: str, + max_tokens: Optional[int] = None + ) -> CostEstimate: + """ + Estimate cost for a completion request. 
+ + Args: + messages: Input messages + model: Target model + max_tokens: Maximum output tokens (if known) + """ + # Count input tokens + input_tokens = self._count_tokens(messages, model) + + # Estimate output tokens + if max_tokens: + estimated_output = max_tokens + confidence = "high" + else: + estimated_output = self._estimate_output_tokens(messages, model) + confidence = "medium" + + # Get model costs + costs = MODEL_COSTS.get(model, {"input": 0, "output": 0}) + + # Calculate costs + input_cost = (input_tokens / 1_000_000) * costs["input"] + + min_output = estimated_output * 0.3 # Conservative + max_output = estimated_output * 1.5 # Liberal + expected_output = estimated_output + + min_output_cost = (min_output / 1_000_000) * costs["output"] + max_output_cost = (max_output / 1_000_000) * costs["output"] + expected_output_cost = (expected_output / 1_000_000) * costs["output"] + + return CostEstimate( + input_tokens=input_tokens, + estimated_output_tokens=estimated_output, + min_cost_usd=input_cost + min_output_cost, + max_cost_usd=input_cost + max_output_cost, + expected_cost_usd=input_cost + expected_output_cost, + model=model, + confidence=confidence, + ) + + def _count_tokens(self, messages: list, model: str) -> int: + """Count tokens for input messages.""" + # Get or create encoder + if model not in self.encoders: + try: + if "gpt" in model or "o1" in model: + self.encoders[model] = tiktoken.encoding_for_model(model) + else: + # Fallback to cl100k_base for non-OpenAI + self.encoders[model] = tiktoken.get_encoding("cl100k_base") + except Exception: + self.encoders[model] = tiktoken.get_encoding("cl100k_base") + + encoder = self.encoders[model] + + total = 0 + for msg in messages: + total += len(encoder.encode(msg.get("content", ""))) + total += 4 # Message overhead tokens + + return total + 2 # Priming tokens + + def _estimate_output_tokens(self, messages: list, model: str) -> int: + """ + Estimate expected output tokens based on input. + Uses heuristics and historical data. 
+ """ + input_tokens = self._count_tokens(messages, model) + last_message = messages[-1].get("content", "") if messages else "" + + # Heuristics based on request type + if "explain" in last_message.lower() or "describe" in last_message.lower(): + # Explanatory responses tend to be longer + return min(2000, input_tokens * 1.5) + + if "code" in last_message.lower() or "implement" in last_message.lower(): + # Code generation can be substantial + return min(4000, input_tokens * 2) + + if "yes or no" in last_message.lower() or "brief" in last_message.lower(): + # Short responses + return 100 + + if "list" in last_message.lower(): + # Lists are medium length + return min(1000, input_tokens * 0.8) + + # Default: output roughly equal to input + return min(2000, input_tokens) + + def should_warn( + self, + estimate: CostEstimate, + budget_remaining: float, + threshold: float = 0.1 # Warn if request is >10% of remaining budget + ) -> bool: + """Check if user should be warned about expensive request.""" + if budget_remaining <= 0: + return True + + return estimate.max_cost_usd > (budget_remaining * threshold) +``` + +### Pre-Execution Check Integration + +```python +# app/services/llm_gateway.py + +class LLMGateway: + async def complete( + self, + agent_id: str, + project_id: str, + messages: list, + model_preference: str = "high-reasoning", + require_estimate: bool = False, + **kwargs + ) -> dict: + """Generate completion with optional cost estimation.""" + + model = self._resolve_model(model_preference) + + if require_estimate: + estimator = CostEstimator() + estimate = estimator.estimate(messages, model, kwargs.get("max_tokens")) + + # Check against budget + budget_status = await self.budget_enforcer.check_budget(project_id) + + if estimator.should_warn(estimate, budget_status.remaining): + return { + "type": "cost_warning", + "estimate": estimate, + "budget_remaining": budget_status.remaining, + "message": f"This request may cost ${estimate.expected_cost_usd:.4f}. " + f"Remaining budget: ${budget_status.remaining:.2f}. " + f"Proceed?", + "alternatives": self._suggest_alternatives(estimate, model), + } + + # Proceed with request + return await self._execute_completion(...) 
+ + def _suggest_alternatives(self, estimate: CostEstimate, model: str) -> list: + """Suggest cost-saving alternatives.""" + alternatives = [] + + # Cheaper model + if model == "claude-3-5-sonnet-20241022": + alternatives.append({ + "action": "use_haiku", + "model": "claude-3-haiku-20240307", + "estimated_savings": estimate.expected_cost_usd * 0.9, + "quality_impact": "May reduce quality for complex tasks", + }) + + # Compression + alternatives.append({ + "action": "compress_prompt", + "estimated_savings": estimate.expected_cost_usd * 0.3, + "quality_impact": "Minimal for most use cases", + }) + + return alternatives +``` + +--- + +## Reporting Dashboard Requirements + +### Dashboard Components + +| Component | Data Source | Update Frequency | +|-----------|-------------|------------------| +| Real-time Spend Ticker | Redis + SSE | Real-time | +| Budget Utilization Gauges | Redis | 1 second | +| Daily Cost Chart | PostgreSQL (daily_cost_summaries) | 1 minute | +| Model Breakdown Pie Chart | PostgreSQL | 5 minutes | +| Agent Efficiency Table | PostgreSQL | 5 minutes | +| Cost Forecast | PostgreSQL + ML | 1 hour | +| Alert Feed | SSE + PostgreSQL | Real-time | + +### API Endpoints + +```python +# app/api/v1/costs.py +from fastapi import APIRouter, Query +from datetime import datetime, timedelta + +router = APIRouter(prefix="/costs", tags=["costs"]) + +@router.get("/summary/{project_id}") +async def get_cost_summary( + project_id: str, + period: str = Query("day", regex="^(day|week|month)$") +) -> CostSummaryResponse: + """Get cost summary for a project.""" + pass + +@router.get("/breakdown/{project_id}") +async def get_cost_breakdown( + project_id: str, + group_by: str = Query("model", regex="^(model|agent_type|day)$"), + start_date: datetime = None, + end_date: datetime = None +) -> CostBreakdownResponse: + """Get detailed cost breakdown.""" + pass + +@router.get("/trends/{project_id}") +async def get_cost_trends( + project_id: str, + days: int = Query(30, ge=1, le=365) +) -> CostTrendsResponse: + """Get historical cost trends.""" + pass + +@router.get("/forecast/{project_id}") +async def get_cost_forecast( + project_id: str, + days_ahead: int = Query(7, ge=1, le=30) +) -> CostForecastResponse: + """Get projected costs based on current trends.""" + pass + +@router.get("/efficiency/{project_id}") +async def get_efficiency_metrics( + project_id: str +) -> EfficiencyMetricsResponse: + """Get cost efficiency metrics (cost per task, cache hit rate, etc.).""" + pass + +@router.get("/budgets/{project_id}") +async def get_budgets(project_id: str) -> List[BudgetResponse]: + """Get all budgets for a project.""" + pass + +@router.post("/budgets") +async def create_budget(budget: BudgetCreate) -> BudgetResponse: + """Create a new budget.""" + pass + +@router.put("/budgets/{budget_id}") +async def update_budget(budget_id: str, budget: BudgetUpdate) -> BudgetResponse: + """Update budget configuration.""" + pass + +@router.get("/alerts/{project_id}") +async def get_alerts( + project_id: str, + status: str = Query(None, regex="^(pending|acknowledged|resolved)$") +) -> List[AlertResponse]: + """Get budget alerts for a project.""" + pass + +@router.post("/alerts/{alert_id}/acknowledge") +async def acknowledge_alert(alert_id: str) -> AlertResponse: + """Acknowledge a budget alert.""" + pass +``` + +### Dashboard Wireframe + +``` ++------------------------------------------------------------------+ +| Syndarix Cost Dashboard | ++------------------------------------------------------------------+ +| | +| 
+------------------+ +------------------+ +-------------------+ | +| | DAILY SPEND | | WEEKLY BUDGET | | MONTHLY BUDGET | | +| | $12.47 | | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘ 76% | | β–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘ 42% | | +| | β–² 15% vs avg | | $152/$200 | | $421/$1000 | | +| +------------------+ +------------------+ +-------------------+ | +| | +| +-------------------------------+ +----------------------------+ | +| | COST BY MODEL (7 days) | | DAILY COST TREND | | +| | β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” | | $ | | +| | β”‚ Claude Sonnet 64% β”‚ | | 20β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ | | +| | β”‚ GPT-4o-mini 22% β”‚ | | 15β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–ˆβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€| | +| | β”‚ Claude Haiku 9% β”‚ | | 10β”œβ”€β”€β”€β”€β”€β”€β”€β”€β–ˆβ”€β”€β”€β–ˆβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€| | +| | β”‚ Other 5% β”‚ | | 5β”œβ”€β”€β–ˆβ”€β”€β”€β–ˆβ”€β”€β”€β–ˆβ”€β”€β”€β–ˆβ”€β”€β”€β”€β”€β”€β”€β”€β”€| | +| | β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ | | 0└───────────────────────>| | +| +-------------------------------+ | Mon Tue Wed Thu Fri | | +| +----------------------------+ | +| | +| +---------------------------------------------------------------+ | +| | AGENT EFFICIENCY | | +| +----------+----------+----------+-----------+------------------+ | +| | Agent | Requests | Tokens | Cost | Cost/Task | | +| +----------+----------+----------+-----------+------------------+ | +| | Arch | 145 | 2.3M | $18.50 | $0.13 | | +| | Engineer | 892 | 12.1M | $96.80 | $0.11 | | +| | QA | 234 | 1.2M | $4.20 | $0.02 | | +| +----------+----------+----------+-----------+------------------+ | +| | +| +---------------------------------------------------------------+ | +| | RECENT ALERTS [Clear] | | +| +---------------------------------------------------------------+ | +| | ! WARNING Daily budget at 80% ($16/$20) 2 mins ago | | +| | i INFO Weekly budget at 50% 1 hour ago | | +| +---------------------------------------------------------------+ | ++------------------------------------------------------------------+ +``` + +--- + +## Implementation Roadmap + +### Phase 1: Core Infrastructure (Week 1-2) + +1. **Database Schema** + - [ ] Create migration for `token_usage`, `budgets`, `budget_alerts` + - [ ] Create migration for `daily_cost_summaries`, `model_pricing` + - [ ] Add indexes for performance + +2. **Redis Setup** + - [ ] Configure Redis keys for budget counters + - [ ] Implement budget increment/decrement operations + - [ ] Set up TTL for temporary data + +3. **LiteLLM Callback** + - [ ] Implement `SyndarixCostCallback` class + - [ ] Integrate with existing LLM Gateway + - [ ] Add metadata propagation for attribution + +### Phase 2: Budget Management (Week 2-3) + +4. **Budget Enforcer** + - [ ] Implement `BudgetEnforcer` class + - [ ] Add pre-request budget checks + - [ ] Implement soft/hard limit logic + +5. **Alert System** + - [ ] Implement threshold detection + - [ ] Create alert dispatcher + - [ ] Integrate with SSE event bus + +6. **Celery Tasks** + - [ ] Budget period management task + - [ ] Daily cost rollup task + - [ ] Alert digest task + +### Phase 3: Cost Optimization (Week 3-4) + +7. **Semantic Cache** + - [ ] Implement `SemanticCache` with Redis + - [ ] Integrate embedding generation + - [ ] Add cache hit/miss tracking + +8. **Model Cascading** + - [ ] Implement `ModelCascade` router + - [ ] Add complexity estimation + - [ ] Integrate with LLM Gateway + +9. 
**Cost Estimation** + - [ ] Implement `CostEstimator` + - [ ] Add pre-execution warnings + - [ ] Create alternative suggestions + +### Phase 4: Reporting (Week 4-5) + +10. **API Endpoints** + - [ ] Implement cost summary endpoints + - [ ] Add breakdown and trends endpoints + - [ ] Create forecast endpoint + +11. **Dashboard** + - [ ] Design and implement cost dashboard UI + - [ ] Add real-time updates via SSE + - [ ] Create alert management interface + +### Phase 5: Testing & Documentation (Week 5-6) + +12. **Testing** + - [ ] Unit tests for cost tracking + - [ ] Integration tests for budget enforcement + - [ ] Load tests for high-volume scenarios + +13. **Documentation** + - [ ] API documentation + - [ ] User guide for budget management + - [ ] Operations runbook + +--- + +## References + +### Research Sources + +- [LiteLLM Custom Callbacks Documentation](https://docs.litellm.ai/docs/observability/custom_callback) +- [LiteLLM Cost Calculation](https://deepwiki.com/BerriAI/litellm/2.5-cost-calculation-and-model-pricing) +- [LiteLLM Spend Tracking](https://docs.litellm.ai/docs/proxy/cost_tracking) +- [LLM Cost Optimization Guide 2025](https://ai.koombea.com/blog/llm-cost-optimization) +- [Dynamic Load Balancing for Multi-Tenant LLMs](https://latitude-blog.ghost.io/blog/dynamic-load-balancing-for-multi-tenant-llms/) +- [Multi-Tenant AI Architecture - Azure](https://learn.microsoft.com/en-us/azure/architecture/guide/multitenant/approaches/ai-ml) +- [LLMLingua Prompt Compression](https://www.llmlingua.com/) +- [Model Cascading for Cost Reduction](https://arxiv.org/abs/2410.10347) +- [BlockLLM Multi-tenant LLM Serving](https://arxiv.org/html/2404.18322v1) + +### Related Spikes + +- [SPIKE-005: LLM Provider Abstraction](./SPIKE-005-llm-provider-abstraction.md) - LiteLLM integration baseline +- [SPIKE-003: Real-time Updates](./SPIKE-003-realtime-updates.md) - SSE event architecture +- [SPIKE-004: Celery + Redis Integration](./SPIKE-004-celery-redis-integration.md) - Background task infrastructure + +--- + +## Decision + +**Implement a comprehensive cost tracking system** with: + +1. **LiteLLM callbacks** for real-time usage capture +2. **Redis + PostgreSQL** for hybrid hot/warm data storage +3. **Hierarchical budgets** with soft/hard limit enforcement +4. **Multi-channel alerts** via SSE, email, and Slack +5. **Cost optimization** through caching, cascading, and compression + +**Expected Outcomes:** +- Full cost visibility per project, agent, and model +- Real-time budget enforcement preventing cost overruns +- 60-80% cost reduction through optimization strategies +- Actionable insights for cost-aware decision making + +--- + +*Spike completed. Findings will inform ADR-010: Cost Tracking Architecture.* diff --git a/docs/spikes/SPIKE-011-audit-logging.md b/docs/spikes/SPIKE-011-audit-logging.md new file mode 100644 index 0000000..dc3a231 --- /dev/null +++ b/docs/spikes/SPIKE-011-audit-logging.md @@ -0,0 +1,1064 @@ +# SPIKE-011: Audit Logging for Syndarix + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #11 + +--- + +## Executive Summary + +Syndarix, as an autonomous AI-powered consulting agency, requires comprehensive audit logging to ensure compliance, enable debugging, and build client trust. This spike researches best practices for audit logging in autonomous AI systems and provides concrete recommendations for implementation. 
+ +**Recommendation:** Implement a structured, OpenTelemetry-compatible audit logging system using: +- **Structlog** for structured JSON logging with contextual enrichment +- **PostgreSQL** with TimescaleDB extension for hot storage (0-90 days) +- **S3-compatible object storage** for cold archival (90+ days) +- **Cryptographic hash chaining** for immutability verification +- **OpenTelemetry** integration for correlation with traces and metrics + +--- + +## Objective + +Research the optimal audit logging architecture for Syndarix, focusing on: +1. Comprehensive event capture for autonomous AI agent actions +2. Compliance with SOC2/GDPR requirements +3. Searchable, queryable audit trails +4. Immutable, tamper-evident logging +5. Scalable storage architecture + +--- + +## Research Questions & Findings + +### 1. What to Log in Autonomous AI Systems + +Based on LLM observability best practices and Syndarix-specific requirements, the following event categories must be logged: + +#### Agent Actions +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `agent.spawned` | Agent instance created | agent_type, agent_id, project_id, config | +| `agent.action.started` | Agent begins action | action_type, input_params, before_state | +| `agent.action.completed` | Agent completes action | action_type, output, after_state, duration_ms | +| `agent.action.failed` | Agent action failed | action_type, error, stack_trace | +| `agent.decision` | Agent makes autonomous decision | decision_type, options, chosen, reasoning | +| `agent.terminated` | Agent instance destroyed | reason, final_state | + +#### LLM Interactions +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `llm.request` | Prompt sent to LLM | model, prompt_template, variables, token_count | +| `llm.response` | LLM response received | model, response, token_count, latency_ms | +| `llm.error` | LLM call failed | model, error, retry_count | +| `llm.tool_call` | LLM invokes tool | tool_name, arguments | + +#### MCP Tool Invocations +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `mcp.tool.invoked` | MCP tool called | server, tool_name, arguments, project_id | +| `mcp.tool.result` | MCP tool returned | server, tool_name, result, duration_ms | +| `mcp.tool.error` | MCP tool failed | server, tool_name, error | + +#### Human Approvals +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `approval.requested` | System requests human approval | action_type, context, options | +| `approval.granted` | Human approves action | approver_id, action_id, comments | +| `approval.rejected` | Human rejects action | approver_id, action_id, reason | +| `approval.timeout` | Approval request timed out | action_id, timeout_ms | + +#### Git Operations +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `git.commit` | Commit created | repo, branch, commit_sha, message, files | +| `git.branch.created` | Branch created | repo, branch_name, base_branch | +| `git.pr.created` | Pull request opened | repo, pr_number, title, head, base | +| `git.pr.merged` | Pull request merged | repo, pr_number, merge_commit | + +#### Project Lifecycle +| Event Type | Description | Critical Fields | +|------------|-------------|-----------------| +| `project.created` | New project started | project_id, client_id, autonomy_level | +| `project.sprint.started` | Sprint 
begins | sprint_id, goals, assigned_agents | +| `project.milestone.completed` | Milestone achieved | milestone_id, deliverables | +| `project.checkpoint` | Client checkpoint | checkpoint_type, feedback | + +### 2. Structured Logging Format + +**Recommendation:** Use OpenTelemetry-compatible structured JSON logging. + +#### Base Event Schema + +```python +from pydantic import BaseModel +from datetime import datetime +from typing import Any, Optional +from enum import Enum + +class AuditEventSeverity(str, Enum): + DEBUG = "DEBUG" + INFO = "INFO" + WARNING = "WARNING" + ERROR = "ERROR" + CRITICAL = "CRITICAL" + +class AuditEvent(BaseModel): + """Base schema for all audit events.""" + + # Identity & Correlation + event_id: str # UUID v7 (time-ordered) + trace_id: Optional[str] = None # OpenTelemetry trace ID + span_id: Optional[str] = None # OpenTelemetry span ID + parent_event_id: Optional[str] = None # For event chains + + # Timestamp + timestamp: datetime # ISO 8601 with timezone + timestamp_unix_ms: int # Unix millis for indexing + + # Event Classification + event_type: str # e.g., "agent.action.completed" + event_category: str # e.g., "agent", "llm", "mcp" + severity: AuditEventSeverity + + # Context + project_id: Optional[str] = None + agent_id: Optional[str] = None + agent_type: Optional[str] = None + user_id: Optional[str] = None # Human actor if applicable + session_id: Optional[str] = None + + # Event Data + action: str # Human-readable action description + data: dict[str, Any] # Event-specific payload + + # State Tracking + before_state: Optional[dict] = None # State before action + after_state: Optional[dict] = None # State after action + + # Technical Metadata + service: str = "syndarix" + service_version: str + environment: str # production, staging, development + hostname: str + + # Immutability + previous_hash: Optional[str] = None # Hash of previous event (chain) + event_hash: Optional[str] = None # SHA-256 of this event + + class Config: + json_schema_extra = { + "example": { + "event_id": "019373a8-9b2e-7f4c-8d1a-2b3c4d5e6f7a", + "trace_id": "abc123def456", + "timestamp": "2025-12-29T14:30:00.000Z", + "timestamp_unix_ms": 1735480200000, + "event_type": "agent.action.completed", + "event_category": "agent", + "severity": "INFO", + "project_id": "proj-001", + "agent_id": "agent-123", + "agent_type": "software_engineer", + "action": "Created feature branch", + "data": { + "branch_name": "feature/user-auth", + "base_branch": "main" + }, + "service": "syndarix", + "service_version": "1.0.0", + "environment": "production" + } + } +``` + +#### LLM-Specific Event Schema + +```python +class LLMRequestEvent(AuditEvent): + """Schema for LLM request events.""" + event_type: str = "llm.request" + event_category: str = "llm" + + data: dict = { + "model": str, # e.g., "claude-3-5-sonnet" + "provider": str, # e.g., "anthropic", "openai" + "prompt_template_id": str, # Reference to template + "prompt_template_version": str, + "prompt_variables": dict, # Variables substituted + "system_prompt_hash": str, # Hash of system prompt + "user_prompt": str, # Full user prompt (may be truncated) + "token_count_estimate": int, + "max_tokens": int, + "temperature": float, + "tools_available": list[str], + } + +class LLMResponseEvent(AuditEvent): + """Schema for LLM response events.""" + event_type: str = "llm.response" + event_category: str = "llm" + + data: dict = { + "model": str, + "provider": str, + "response_text": str, # May be truncated for storage + "response_hash": str, # Full response hash + 
"input_tokens": int, + "output_tokens": int, + "total_tokens": int, + "latency_ms": int, + "finish_reason": str, # "stop", "max_tokens", "tool_use" + "tool_calls": list[dict], # If tools were invoked + "cost_usd": float, # Estimated cost + } +``` + +### 3. Storage Architecture + +**Recommendation:** Tiered storage with hot, warm, and cold layers. + +``` + Query Latency + ◄────────────► + Fast Slow +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ HOT STORAGE β”‚ +β”‚ (0-30 days, ~10TB/month) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ PostgreSQL + TimescaleDB β”‚ β”‚ +β”‚ β”‚ - Hypertables partitioned by day β”‚ β”‚ +β”‚ β”‚ - Native compression after 7 days β”‚ β”‚ +β”‚ β”‚ - Full-text search on action/data fields β”‚ β”‚ +β”‚ β”‚ - B-tree indexes on project_id, agent_id, event_type β”‚ β”‚ +β”‚ β”‚ - GIN index on JSONB data field β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό (30 days) β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ WARM STORAGE β”‚ +β”‚ (30-90 days, compressed) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ TimescaleDB Continuous Aggregates β”‚ β”‚ +β”‚ β”‚ - Hourly/daily rollups of metrics β”‚ β”‚ +β”‚ β”‚ - Detailed logs in highly compressed chunks β”‚ β”‚ +β”‚ β”‚ - Query via same SQL interface β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ +β”‚ β–Ό (90 days) β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ COLD STORAGE β”‚ +β”‚ (90+ days, archival, 7 year retention) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ S3-Compatible Object Storage (MinIO / AWS S3) β”‚ β”‚ +β”‚ β”‚ - Parquet files partitioned by date/project β”‚ β”‚ +β”‚ β”‚ - Glacier-class storage after 1 year β”‚ β”‚ +β”‚ β”‚ - Queryable via Trino/Athena for investigations β”‚ β”‚ +β”‚ β”‚ - Cryptographic manifest for integrity verification β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +#### TimescaleDB Schema + +```sql +-- Enable TimescaleDB extension +CREATE 
EXTENSION IF NOT EXISTS timescaledb; + +-- Main audit events table +CREATE TABLE audit_events ( + event_id UUID PRIMARY KEY, + trace_id VARCHAR(32), + span_id VARCHAR(16), + parent_event_id UUID REFERENCES audit_events(event_id), + + timestamp TIMESTAMPTZ NOT NULL, + timestamp_unix_ms BIGINT NOT NULL, + + event_type VARCHAR(100) NOT NULL, + event_category VARCHAR(50) NOT NULL, + severity VARCHAR(20) NOT NULL, + + project_id VARCHAR(50), + agent_id VARCHAR(50), + agent_type VARCHAR(50), + user_id VARCHAR(50), + session_id VARCHAR(50), + + action TEXT NOT NULL, + data JSONB NOT NULL, + + before_state JSONB, + after_state JSONB, + + service VARCHAR(50) NOT NULL, + service_version VARCHAR(20) NOT NULL, + environment VARCHAR(20) NOT NULL, + hostname VARCHAR(100) NOT NULL, + + previous_hash VARCHAR(64), + event_hash VARCHAR(64) NOT NULL, + + -- Denormalized for efficient queries + created_at TIMESTAMPTZ DEFAULT NOW() +); + +-- Convert to hypertable for time-series optimization +SELECT create_hypertable( + 'audit_events', + 'timestamp', + chunk_time_interval => INTERVAL '1 day' +); + +-- Enable compression for chunks older than 7 days +ALTER TABLE audit_events SET ( + timescaledb.compress, + timescaledb.compress_segmentby = 'project_id, event_category', + timescaledb.compress_orderby = 'timestamp DESC' +); + +SELECT add_compression_policy('audit_events', INTERVAL '7 days'); + +-- Retention policy: move to cold storage after 90 days +SELECT add_retention_policy('audit_events', INTERVAL '90 days'); + +-- Indexes for common query patterns +CREATE INDEX idx_audit_project_time ON audit_events (project_id, timestamp DESC); +CREATE INDEX idx_audit_agent_time ON audit_events (agent_id, timestamp DESC); +CREATE INDEX idx_audit_type_time ON audit_events (event_type, timestamp DESC); +CREATE INDEX idx_audit_category_time ON audit_events (event_category, timestamp DESC); +CREATE INDEX idx_audit_trace ON audit_events (trace_id) WHERE trace_id IS NOT NULL; + +-- GIN index for JSONB queries +CREATE INDEX idx_audit_data ON audit_events USING GIN (data); + +-- Full-text search on action field +CREATE INDEX idx_audit_action_fts ON audit_events USING GIN (to_tsvector('english', action)); +``` + +#### Cold Storage Archive Job + +```python +# jobs/archive_audit_logs.py +import asyncio +from datetime import datetime, timedelta +import pyarrow as pa +import pyarrow.parquet as pq +from minio import Minio +import hashlib + +async def archive_old_audit_logs(): + """Archive audit logs older than 90 days to S3/MinIO.""" + + cutoff_date = datetime.utcnow() - timedelta(days=90) + + # Query logs to archive + query = """ + SELECT * FROM audit_events + WHERE timestamp < %s + ORDER BY timestamp ASC + """ + + async with get_db_session() as session: + result = await session.execute(query, [cutoff_date]) + records = result.fetchall() + + if not records: + return + + # Convert to Parquet + table = pa.Table.from_pylist([dict(r) for r in records]) + + # Partition by date and project + partition_key = cutoff_date.strftime("%Y/%m/%d") + + # Write to buffer + buffer = pa.BufferOutputStream() + pq.write_table(table, buffer, compression='zstd') + + # Calculate manifest hash + data = buffer.getvalue().to_pybytes() + content_hash = hashlib.sha256(data).hexdigest() + + # Upload to MinIO/S3 + client = Minio( + settings.MINIO_ENDPOINT, + access_key=settings.MINIO_ACCESS_KEY, + secret_key=settings.MINIO_SECRET_KEY, + ) + + object_name = f"audit-logs/{partition_key}/events.parquet" + client.put_object( + bucket_name="syndarix-audit-archive", + 
object_name=object_name, + data=io.BytesIO(data), + length=len(data), + metadata={"content-hash": content_hash} + ) + + # Update manifest for integrity verification + await update_archive_manifest(partition_key, object_name, content_hash) + + # Delete archived records from TimescaleDB + await session.execute( + "DELETE FROM audit_events WHERE timestamp < %s", + [cutoff_date] + ) +``` + +### 4. Immutability & Integrity + +**Recommendation:** Use cryptographic hash chaining for tamper evidence. + +```python +# app/audit/integrity.py +import hashlib +import json +from typing import Optional + +class AuditIntegrity: + """Cryptographic hash chaining for audit log integrity.""" + + def __init__(self, redis_client): + self.redis = redis_client + self._last_hash_key = "audit:last_hash:{project_id}" + + async def compute_event_hash( + self, + event: AuditEvent, + previous_hash: Optional[str] = None + ) -> str: + """ + Compute SHA-256 hash of event including previous hash. + Creates a blockchain-like chain of events. + """ + # Canonical JSON representation (sorted keys, no whitespace) + canonical = json.dumps( + { + "event_id": str(event.event_id), + "timestamp_unix_ms": event.timestamp_unix_ms, + "event_type": event.event_type, + "project_id": event.project_id, + "agent_id": event.agent_id, + "action": event.action, + "data": event.data, + "previous_hash": previous_hash or "", + }, + sort_keys=True, + separators=(",", ":") + ) + + return hashlib.sha256(canonical.encode()).hexdigest() + + async def chain_event(self, event: AuditEvent) -> AuditEvent: + """Add event to the hash chain.""" + project_key = self._last_hash_key.format(project_id=event.project_id) + + # Get previous hash atomically + previous_hash = await self.redis.get(project_key) + + # Compute new hash + event.previous_hash = previous_hash + event.event_hash = await self.compute_event_hash(event, previous_hash) + + # Update last hash atomically + await self.redis.set(project_key, event.event_hash) + + return event + + async def verify_chain( + self, + project_id: str, + start_event_id: str, + end_event_id: str + ) -> tuple[bool, Optional[str]]: + """ + Verify integrity of event chain between two events. + Returns (is_valid, first_invalid_event_id). + """ + events = await self.get_events_in_range( + project_id, start_event_id, end_event_id + ) + + previous_hash = events[0].previous_hash + + for event in events: + expected_hash = await self.compute_event_hash(event, previous_hash) + + if event.event_hash != expected_hash: + return (False, str(event.event_id)) + + previous_hash = event.event_hash + + return (True, None) +``` + +### 5. 
Query Patterns & Indexing + +#### Common Query Patterns + +```python +# app/audit/queries.py + +class AuditQueries: + """Optimized audit log queries.""" + + async def get_project_timeline( + self, + project_id: str, + start_time: datetime, + end_time: datetime, + event_types: Optional[list[str]] = None, + limit: int = 1000 + ) -> list[AuditEvent]: + """Get chronological audit trail for a project.""" + query = """ + SELECT * FROM audit_events + WHERE project_id = $1 + AND timestamp BETWEEN $2 AND $3 + {type_filter} + ORDER BY timestamp DESC + LIMIT $4 + """ + type_filter = "" + if event_types: + type_filter = f"AND event_type = ANY($5)" + + return await self.db.fetch(query, project_id, start_time, end_time, limit) + + async def get_agent_actions( + self, + agent_id: str, + hours: int = 24 + ) -> list[AuditEvent]: + """Get all actions by a specific agent.""" + query = """ + SELECT * FROM audit_events + WHERE agent_id = $1 + AND timestamp > NOW() - INTERVAL '%s hours' + AND event_category = 'agent' + ORDER BY timestamp DESC + """ + return await self.db.fetch(query, agent_id, hours) + + async def get_llm_usage_summary( + self, + project_id: str, + days: int = 30 + ) -> dict: + """Get LLM usage statistics for billing/monitoring.""" + query = """ + SELECT + data->>'model' as model, + data->>'provider' as provider, + COUNT(*) as request_count, + SUM((data->>'total_tokens')::int) as total_tokens, + SUM((data->>'cost_usd')::float) as total_cost, + AVG((data->>'latency_ms')::int) as avg_latency_ms + FROM audit_events + WHERE project_id = $1 + AND event_type = 'llm.response' + AND timestamp > NOW() - INTERVAL '%s days' + GROUP BY data->>'model', data->>'provider' + """ + return await self.db.fetch(query, project_id, days) + + async def search_actions( + self, + query_text: str, + project_id: Optional[str] = None, + limit: int = 100 + ) -> list[AuditEvent]: + """Full-text search on action descriptions.""" + query = """ + SELECT *, ts_rank(to_tsvector('english', action), query) as rank + FROM audit_events, plainto_tsquery('english', $1) query + WHERE to_tsvector('english', action) @@ query + {project_filter} + ORDER BY rank DESC, timestamp DESC + LIMIT $2 + """ + project_filter = f"AND project_id = '{project_id}'" if project_id else "" + return await self.db.fetch(query.format(project_filter=project_filter), query_text, limit) + + async def trace_event_chain( + self, + event_id: str + ) -> list[AuditEvent]: + """Trace full event chain using parent_event_id.""" + query = """ + WITH RECURSIVE event_chain AS ( + SELECT * FROM audit_events WHERE event_id = $1 + UNION ALL + SELECT e.* FROM audit_events e + JOIN event_chain ec ON e.event_id = ec.parent_event_id + ) + SELECT * FROM event_chain ORDER BY timestamp ASC + """ + return await self.db.fetch(query, event_id) +``` + +### 6. Logging Decorators & Implementation + +```python +# app/audit/decorators.py +import functools +import time +from uuid import uuid7 +from datetime import datetime, timezone +from typing import Callable, Any +import structlog + +logger = structlog.get_logger() + +def audit_agent_action(action_type: str): + """ + Decorator to audit agent actions with before/after state capture. + + Usage: + @audit_agent_action("create_branch") + async def create_branch(self, branch_name: str) -> Branch: + ... 
+ """ + def decorator(func: Callable) -> Callable: + @functools.wraps(func) + async def wrapper(self, *args, **kwargs): + # Capture before state + before_state = await self.get_state() if hasattr(self, 'get_state') else None + + event_id = str(uuid7()) + start_time = time.perf_counter() + + try: + # Execute the action + result = await func(self, *args, **kwargs) + + # Capture after state + after_state = await self.get_state() if hasattr(self, 'get_state') else None + + # Log success + await audit_logger.log_event( + AuditEvent( + event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type=f"agent.action.completed", + event_category="agent", + severity=AuditEventSeverity.INFO, + project_id=self.project_id, + agent_id=self.agent_id, + agent_type=self.agent_type, + action=f"{action_type}: {func.__name__}", + data={ + "action_type": action_type, + "function": func.__name__, + "args": _serialize_args(args), + "kwargs": _serialize_kwargs(kwargs), + "result_summary": _summarize_result(result), + "duration_ms": int((time.perf_counter() - start_time) * 1000), + }, + before_state=before_state, + after_state=after_state, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + ) + + return result + + except Exception as e: + # Log failure + await audit_logger.log_event( + AuditEvent( + event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type=f"agent.action.failed", + event_category="agent", + severity=AuditEventSeverity.ERROR, + project_id=self.project_id, + agent_id=self.agent_id, + agent_type=self.agent_type, + action=f"{action_type}: {func.__name__} (FAILED)", + data={ + "action_type": action_type, + "function": func.__name__, + "error": str(e), + "error_type": type(e).__name__, + "duration_ms": int((time.perf_counter() - start_time) * 1000), + }, + before_state=before_state, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + ) + raise + + return wrapper + return decorator + + +def audit_llm_call(func: Callable) -> Callable: + """ + Decorator to audit LLM calls with prompt/response logging. + + Usage: + @audit_llm_call + async def generate_response(self, prompt: str, **kwargs) -> str: + ... 
+ """ + @functools.wraps(func) + async def wrapper(self, *args, **kwargs): + event_id = str(uuid7()) + start_time = time.perf_counter() + + # Log request + request_event = AuditEvent( + event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="llm.request", + event_category="llm", + severity=AuditEventSeverity.INFO, + project_id=getattr(self, 'project_id', None), + agent_id=getattr(self, 'agent_id', None), + action="LLM request initiated", + data={ + "model": kwargs.get('model', self.model), + "provider": self.provider, + "prompt_hash": hashlib.sha256(str(args).encode()).hexdigest()[:16], + "max_tokens": kwargs.get('max_tokens', 4096), + "temperature": kwargs.get('temperature', 0.7), + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + await audit_logger.log_event(request_event) + + try: + result = await func(self, *args, **kwargs) + + # Log response + response_event = AuditEvent( + event_id=str(uuid7()), + parent_event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="llm.response", + event_category="llm", + severity=AuditEventSeverity.INFO, + project_id=getattr(self, 'project_id', None), + agent_id=getattr(self, 'agent_id', None), + action="LLM response received", + data={ + "model": kwargs.get('model', self.model), + "provider": self.provider, + "response_hash": hashlib.sha256(str(result).encode()).hexdigest()[:16], + "input_tokens": result.usage.input_tokens if hasattr(result, 'usage') else None, + "output_tokens": result.usage.output_tokens if hasattr(result, 'usage') else None, + "latency_ms": int((time.perf_counter() - start_time) * 1000), + "finish_reason": result.finish_reason if hasattr(result, 'finish_reason') else None, + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + await audit_logger.log_event(response_event) + + return result + + except Exception as e: + # Log error + error_event = AuditEvent( + event_id=str(uuid7()), + parent_event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="llm.error", + event_category="llm", + severity=AuditEventSeverity.ERROR, + project_id=getattr(self, 'project_id', None), + agent_id=getattr(self, 'agent_id', None), + action="LLM request failed", + data={ + "model": kwargs.get('model', getattr(self, 'model', None)), + "error": str(e), + "error_type": type(e).__name__, + "latency_ms": int((time.perf_counter() - start_time) * 1000), + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + await audit_logger.log_event(error_event) + raise + + return wrapper + + +def audit_mcp_tool(func: Callable) -> Callable: + """ + Decorator to audit MCP tool invocations. + + Usage: + @audit_mcp_tool + async def call_tool(self, server: str, tool: str, args: dict): + ... 
+ """ + @functools.wraps(func) + async def wrapper(self, server: str, tool: str, arguments: dict, **kwargs): + event_id = str(uuid7()) + start_time = time.perf_counter() + + # Log invocation + await audit_logger.log_event( + AuditEvent( + event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="mcp.tool.invoked", + event_category="mcp", + severity=AuditEventSeverity.INFO, + project_id=arguments.get('project_id'), + agent_id=arguments.get('agent_id'), + action=f"MCP tool invoked: {server}/{tool}", + data={ + "server": server, + "tool_name": tool, + "arguments": _redact_sensitive(arguments), + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + ) + + try: + result = await func(self, server, tool, arguments, **kwargs) + + # Log result + await audit_logger.log_event( + AuditEvent( + event_id=str(uuid7()), + parent_event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="mcp.tool.result", + event_category="mcp", + severity=AuditEventSeverity.INFO, + project_id=arguments.get('project_id'), + agent_id=arguments.get('agent_id'), + action=f"MCP tool completed: {server}/{tool}", + data={ + "server": server, + "tool_name": tool, + "result_summary": _summarize_result(result), + "duration_ms": int((time.perf_counter() - start_time) * 1000), + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + ) + + return result + + except Exception as e: + await audit_logger.log_event( + AuditEvent( + event_id=str(uuid7()), + parent_event_id=event_id, + timestamp=datetime.now(timezone.utc), + timestamp_unix_ms=int(time.time() * 1000), + event_type="mcp.tool.error", + event_category="mcp", + severity=AuditEventSeverity.ERROR, + project_id=arguments.get('project_id'), + agent_id=arguments.get('agent_id'), + action=f"MCP tool failed: {server}/{tool}", + data={ + "server": server, + "tool_name": tool, + "error": str(e), + "error_type": type(e).__name__, + "duration_ms": int((time.perf_counter() - start_time) * 1000), + }, + service="syndarix", + service_version=settings.VERSION, + environment=settings.ENVIRONMENT, + hostname=socket.gethostname(), + ) + ) + raise + + return wrapper +``` + +### 7. 
OpenTelemetry Integration + +```python +# app/audit/otel_integration.py +from opentelemetry import trace +from opentelemetry.trace import Span +from opentelemetry.sdk.trace import TracerProvider +from opentelemetry.sdk.trace.export import BatchSpanProcessor +from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter + +# Configure OpenTelemetry +trace.set_tracer_provider(TracerProvider()) +tracer = trace.get_tracer("syndarix.audit") + +otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317") +trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter)) + + +class AuditLoggerWithOTel: + """Audit logger with OpenTelemetry correlation.""" + + def __init__(self, db_writer, integrity_checker): + self.db_writer = db_writer + self.integrity = integrity_checker + + async def log_event(self, event: AuditEvent) -> None: + """Log event with OpenTelemetry trace correlation.""" + # Get current span context + current_span = trace.get_current_span() + span_context = current_span.get_span_context() + + if span_context.is_valid: + event.trace_id = format(span_context.trace_id, '032x') + event.span_id = format(span_context.span_id, '016x') + + # Add to hash chain for immutability + event = await self.integrity.chain_event(event) + + # Write to database + await self.db_writer.write(event) + + # Add event as span event for trace correlation + current_span.add_event( + name=event.event_type, + attributes={ + "audit.event_id": str(event.event_id), + "audit.project_id": event.project_id or "", + "audit.agent_id": event.agent_id or "", + "audit.action": event.action, + } + ) +``` + +### 8. Retention Policy Recommendations + +Based on compliance requirements (SOC2, GDPR) and operational needs: + +| Data Category | Hot Storage | Warm Storage | Cold Archive | Total Retention | +|---------------|-------------|--------------|--------------|-----------------| +| Agent actions | 30 days | 60 days | 7 years | 7 years | +| LLM requests/responses | 30 days | 60 days | 7 years | 7 years | +| MCP tool calls | 30 days | 60 days | 7 years | 7 years | +| Human approvals | 30 days | 60 days | 7 years | 7 years | +| Git operations | 30 days | 60 days | 7 years | 7 years | +| System events | 7 days | 23 days | 1 year | 1 year | +| Debug/trace logs | 3 days | 4 days | N/A | 7 days | + +**GDPR Considerations:** +- PII must be redacted or encrypted before archival +- Implement right-to-deletion capability (pseudonymization in archives) +- Document lawful basis for retention (legitimate business interest, legal compliance) + +**SOC2 Considerations:** +- Audit logs must be tamper-evident (hash chaining) +- Access to audit logs must be logged (audit of audits) +- Retention must be documented in data retention policy + +--- + +## Recommendations + +### Implementation Phases + +#### Phase 1: Foundation (Week 1-2) +1. Set up TimescaleDB with audit_events hypertable +2. Implement base AuditEvent schema and writer +3. Create `@audit_agent_action` decorator +4. Add basic project/agent filtering + +#### Phase 2: LLM & MCP Logging (Week 3-4) +1. Implement `@audit_llm_call` decorator +2. Implement `@audit_mcp_tool` decorator +3. Add prompt/response logging with redaction +4. Integrate with OpenTelemetry + +#### Phase 3: Immutability & Compliance (Week 5-6) +1. Implement hash chaining for tamper evidence +2. Add integrity verification endpoints +3. Implement cold storage archival job +4. Document retention policies + +#### Phase 4: Query & Investigation (Week 7-8) +1. 
Build audit query API endpoints +2. Implement full-text search +3. Add trace correlation queries +4. Create audit dashboard + +### Technology Stack + +| Component | Recommendation | Alternative | +|-----------|----------------|-------------| +| Structured Logging | structlog | python-json-logger | +| Hot Storage | PostgreSQL + TimescaleDB | ClickHouse | +| Cold Storage | MinIO (S3-compatible) | AWS S3 | +| Archive Format | Parquet + ZSTD | ORC | +| Tracing | OpenTelemetry | Jaeger | +| Query Engine | Native SQL + Trino | Apache Druid | + +### Cost Estimates (Self-Hosted) + +| Resource | Specification | Monthly Cost | +|----------|---------------|--------------| +| TimescaleDB (hot) | 500GB SSD, 16GB RAM | Infrastructure only | +| MinIO (cold) | 10TB HDD | Infrastructure only | +| Estimated log volume | ~500GB/month | - | +| Compression ratio | ~10:1 | - | +| Effective storage | ~50GB/month | - | + +--- + +## References + +- [OpenTelemetry Logging Documentation](https://opentelemetry.io/docs/specs/otel/logs/) +- [How to Structure Logs Properly in OpenTelemetry](https://oneuptime.com/blog/post/2025-08-28-how-to-structure-logs-properly-in-opentelemetry/view) +- [Langfuse LLM Observability & Tracing](https://langfuse.com/docs/tracing) +- [LLM Observability: Tutorial & Best Practices](https://www.patronus.ai/llm-testing/llm-observability) +- [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/) +- [Security Log Retention Best Practices](https://auditboard.com/blog/security-log-retention-best-practices-guide) +- [SOC 2 Data Security and Retention Requirements](https://www.bytebase.com/blog/soc2-data-security-and-retention-requirements/) +- [Immutable Audit Log Architecture](https://www.emergentmind.com/topics/immutable-audit-log) +- [What Are Immutable Logs? A Complete Guide](https://www.hubifi.com/blog/immutable-audit-log-guide) +- [TimescaleDB Documentation](https://docs.timescale.com/) + +--- + +## Decision + +**Adopt a tiered audit logging architecture** with: +1. **Structlog + OpenTelemetry** for structured, correlated logging +2. **PostgreSQL + TimescaleDB** for hot storage with time-series optimization +3. **S3/MinIO + Parquet** for cold archival +4. **Cryptographic hash chaining** for immutability +5. **7-year retention** for compliance with SOC2/financial regulations + +--- + +*Spike completed. Findings will inform ADR-007: Audit Logging Architecture.* diff --git a/docs/spikes/SPIKE-012-client-approval-flow.md b/docs/spikes/SPIKE-012-client-approval-flow.md new file mode 100644 index 0000000..a27a007 --- /dev/null +++ b/docs/spikes/SPIKE-012-client-approval-flow.md @@ -0,0 +1,1662 @@ +# SPIKE-012: Client Approval Flow + +**Status:** Completed +**Date:** 2025-12-29 +**Author:** Architecture Team +**Related Issue:** #12 + +--- + +## Executive Summary + +This spike researches the optimal patterns for implementing client approval workflows in Syndarix, enabling human oversight at configurable checkpoints within autonomous agent operations. Based on industry research and Syndarix's existing architecture, we recommend a **checkpoint-based approval system** with confidence-aware routing, multi-channel notifications, and timeout escalation. + +### Key Recommendations + +1. **Adopt checkpoint-based approval pattern** aligned with Syndarix's three autonomy levels +2. **Use confidence-based routing** to automatically escalate uncertain AI decisions +3. **Implement batch approval UI** for FULL_CONTROL mode efficiency +4. 
**Leverage existing SSE infrastructure** for real-time approval notifications +5. **Support mobile-responsive approval interface** with push notification integration +6. **Design flexible timeout and escalation policies** per project configuration + +--- + +## Table of Contents + +1. [Research Questions](#1-research-questions) +2. [Syndarix Autonomy Levels](#2-syndarix-autonomy-levels) +3. [Approval UX Patterns](#3-approval-ux-patterns) +4. [Presenting AI Decisions for Review](#4-presenting-ai-decisions-for-review) +5. [Timeout and Escalation Handling](#5-timeout-and-escalation-handling) +6. [Delegation Patterns](#6-delegation-patterns) +7. [Batch vs Individual Approval](#7-batch-vs-individual-approval) +8. [Mobile-Friendly Interface](#8-mobile-friendly-interface) +9. [Audit Trail Design](#9-audit-trail-design) +10. [Notification System Integration](#10-notification-system-integration) +11. [Database Schema](#11-database-schema) +12. [Code Examples](#12-code-examples) +13. [UI Mockup Descriptions](#13-ui-mockup-descriptions) +14. [Implementation Roadmap](#14-implementation-roadmap) + +--- + +## 1. Research Questions + +| Question | Summary Answer | +|----------|----------------| +| Best UX patterns for approval workflows? | Checkpoint-based with queue management and clear status visibility | +| How to present AI decisions for human review? | Explainable AI with confidence scores, reasoning chains, and impact summaries | +| Timeout handling for pending approvals? | Configurable escalation paths with auto-approve/block policies | +| Delegation and escalation patterns? | Role-based delegation with backup approvers and team escalation | +| Batch approval vs individual approval? | Context-dependent: batch for routine decisions, individual for high-stakes | +| Mobile-friendly approval interfaces? | One-touch approve/reject with push notifications and minimal context display | +| Audit trail for approval decisions? | Immutable event log with decision rationale and full context snapshot | + +--- + +## 2. 
Syndarix Autonomy Levels + +Syndarix supports three autonomy levels that determine when client approval is required: + +### 2.1 Autonomy Level Matrix + +| Action | FULL_CONTROL | MILESTONE | AUTONOMOUS | +|--------|--------------|-----------|------------| +| Requirements approval | Required | Required | Required | +| Architecture approval | Required | Required | Required | +| Sprint start | Required | Required | Auto | +| Story implementation | Required | Auto | Auto | +| PR merge | Required | Auto | Auto | +| Sprint completion | Required | Required | Auto | +| Bug fixes | Required | Auto | Auto | +| Documentation updates | Required | Auto | Auto | +| Budget threshold exceeded | Required | Required | Required | +| Production deployment | Required | Required | Required | +| Agent conflict resolution | Required | Required | Required | + +### 2.2 Approval Checkpoint Categories + +```python +class ApprovalCategory(str, Enum): + """Categories of approval checkpoints.""" + + # Always require approval regardless of autonomy level + CRITICAL = "critical" # Budget, production, architecture + + # Require approval at MILESTONE and FULL_CONTROL + MILESTONE = "milestone" # Sprint boundaries, major features + + # Only require approval at FULL_CONTROL + ROUTINE = "routine" # Individual stories, bug fixes + + # Agent-initiated (confidence below threshold) + UNCERTAINTY = "uncertainty" # Low confidence decisions + + # Human expertise explicitly requested + EXPERTISE = "expertise" # Agent needs human input +``` + +### 2.3 Approval Triggers + +| Scenario | Category | Description | +|----------|----------|-------------| +| Before starting a new sprint | MILESTONE | Sprint scope and goal confirmation | +| Before merging PR to main branch | ROUTINE | Code review approval | +| Before deploying to production | CRITICAL | Release sign-off | +| When budget threshold exceeded | CRITICAL | Cost authorization | +| When agent requests human expertise | EXPERTISE | Knowledge gap | +| When conflicting decisions between agents | UNCERTAINTY | Conflict resolution | +| Low confidence AI decision | UNCERTAINTY | Automatic escalation | + +--- + +## 3. Approval UX Patterns + +### 3.1 Recommended Pattern: Queue-Based Checkpoint System + +Based on industry research, we recommend a **queue-based approval system** with the following characteristics: + +``` ++------------------+ +------------------+ +------------------+ +| Approval Queue | --> | Review Panel | --> | Decision Log | ++------------------+ +------------------+ +------------------+ +| - Priority lanes | | - Context view | | - Audit trail | +| - Grouping | | - Action buttons | | - Metrics | +| - SLA indicators | | - History | | - Analytics | ++------------------+ +------------------+ +------------------+ +``` + +### 3.2 Queue Pattern Components + +**Single Queue with Priority Lanes:** +``` ++---------------------------------------------------------------+ +| APPROVAL QUEUE | ++---------------------------------------------------------------+ +| [CRITICAL - 2] | [MILESTONE - 5] | [ROUTINE - 12] | ++---------------------------------------------------------------+ +| > Budget overrun (2h ago) | SLA: 4h | URGENT | +| > Sprint 3 start request | SLA: 24h | PENDING | +| > PR #45: Add user authentication | SLA: 48h | PENDING | +| > PR #46: Fix login validation | SLA: 48h | PENDING | +| > [Batch Select: 8 routine items] | | BULK | ++---------------------------------------------------------------+ +``` + +### 3.3 Key UX Principles + +1. 
**Status Visibility**: Clear indicators showing approved, pending, rejected states with color coding +2. **Context Preservation**: Full decision context without requiring navigation away +3. **One-Touch Actions**: Approve/Reject buttons immediately visible +4. **Batch Operations**: Select multiple items for bulk approval +5. **SLA Tracking**: Visual indicators for time-sensitive decisions +6. **Undo Capability**: Brief window to reverse accidental approvals + +--- + +## 4. Presenting AI Decisions for Review + +### 4.1 Explainable AI Interface + +AI decisions must be presented with sufficient context for informed human review: + +``` ++---------------------------------------------------------------+ +| APPROVAL REQUEST: Architecture Decision | ++---------------------------------------------------------------+ +| | +| DECISION: Use PostgreSQL with pgvector for knowledge base | +| | +| CONFIDENCE: 87% [=========> ] | +| | +| REASONING: | +| 1. Requirement: Vector similarity search for RAG | +| 2. Constraint: Must integrate with existing PostgreSQL | +| 3. Alternative considered: Pinecone (rejected: external dep) | +| 4. Alternative considered: Milvus (rejected: operational cost)| +| | +| IMPACT ASSESSMENT: | +| - Cost: $0 additional (existing infrastructure) | +| - Complexity: Low (native PostgreSQL extension) | +| - Risk: Low (mature, well-documented) | +| | +| RECOMMENDED BY: Architect Agent (Sofia) | +| SUPPORTED BY: DevOps Agent (Marcus) - "Easy to maintain" | +| | +| [APPROVE] [REJECT] [REQUEST MORE INFO] [DELEGATE] | ++---------------------------------------------------------------+ +``` + +### 4.2 Information Hierarchy + +| Level | Content | Always Shown | +|-------|---------|--------------| +| 1. Decision Summary | One-line description | Yes | +| 2. Confidence Score | Visual indicator (0-100%) | Yes | +| 3. Key Reasoning | Top 3 factors | Yes | +| 4. Impact Assessment | Cost, risk, complexity | Yes | +| 5. Alternatives Considered | Other options evaluated | Expandable | +| 6. Full Agent Conversation | Complete discussion thread | Expandable | +| 7. Related Artifacts | Code, documents, diagrams | Linked | + +### 4.3 Confidence-Based Presentation + +```python +def get_presentation_level(confidence: float, category: ApprovalCategory) -> str: + """Determine how much detail to present based on confidence and category.""" + + if category == ApprovalCategory.CRITICAL: + return "full" # Always show full context for critical decisions + + if confidence >= 0.9: + return "summary" # High confidence: summary with expand option + elif confidence >= 0.7: + return "detailed" # Medium confidence: show reasoning + else: + return "full" # Low confidence: full context with alternatives +``` + +--- + +## 5. 
Timeout and Escalation Handling + +### 5.1 Escalation Policy Configuration + +```python +@dataclass +class EscalationPolicy: + """Configuration for approval timeout and escalation.""" + + # Initial timeout before first escalation + initial_timeout_hours: int = 24 + + # Maximum escalation levels + max_escalation_levels: int = 3 + + # Escalation path + escalation_path: list[str] = field(default_factory=lambda: [ + "project_owner", + "team_lead", + "admin" + ]) + + # Reminder intervals before timeout + reminder_intervals: list[int] = field(default_factory=lambda: [ + 4, # 4 hours before timeout + 1, # 1 hour before timeout + ]) + + # Action on final timeout + final_action: str = "block" # "block", "auto_approve", "auto_reject" + + # Category-specific overrides + category_overrides: dict[ApprovalCategory, dict] = field(default_factory=dict) +``` + +### 5.2 Timeout Behavior by Category + +| Category | Default Timeout | Escalation Path | Final Action | +|----------|-----------------|-----------------|--------------| +| CRITICAL | 4 hours | Owner -> Admin -> Block | Block (halt work) | +| MILESTONE | 24 hours | Owner -> Team Lead | Block | +| ROUTINE | 48 hours | Owner only | Auto-approve (with flag) | +| UNCERTAINTY | 12 hours | Owner -> Architect | Request more info | +| EXPERTISE | 24 hours | Owner -> External | Block | + +### 5.3 Escalation Flow + +``` ++-------------+ +-------------+ +-------------+ +| Created | --> | Pending | --> | Escalated | ++-------------+ +-------------+ +-------------+ + | | + [Timeout] [Timeout] + | | + v v + +-------------+ +-------------+ + | Reminder | | Final | + | Sent | | Action | + +-------------+ +-------------+ +``` + +### 5.4 Notification Sequence + +```python +async def handle_approval_timeout(approval: ApprovalRequest): + """Process approval timeout with escalation.""" + + current_level = approval.escalation_level + policy = get_escalation_policy(approval.project_id, approval.category) + + if current_level < policy.max_escalation_levels: + # Escalate to next level + next_approver = policy.escalation_path[current_level] + + await notify_escalation( + approval=approval, + new_approver=next_approver, + reason="timeout", + previous_approver=approval.current_approver + ) + + approval.current_approver = next_approver + approval.escalation_level += 1 + approval.escalated_at = datetime.utcnow() + + else: + # Final action + if policy.final_action == "block": + await block_workflow(approval) + await notify_workflow_blocked(approval) + elif policy.final_action == "auto_approve": + await auto_approve(approval, reason="timeout_policy") + await notify_auto_approved(approval) + elif policy.final_action == "auto_reject": + await auto_reject(approval, reason="timeout_policy") + await notify_auto_rejected(approval) +``` + +--- + +## 6. 
Delegation Patterns + +### 6.1 Delegation Types + +| Type | Description | Use Case | +|------|-------------|----------| +| **Temporary** | Delegate for specific time period | Vacation, OOO | +| **Permanent** | Assign backup approver | Team redundancy | +| **Categorical** | Delegate by approval category | Expertise-based routing | +| **Threshold** | Delegate below certain impact level | Routine automation | + +### 6.2 Delegation Configuration + +```python +@dataclass +class DelegationRule: + """Rule for delegating approval authority.""" + + delegator_id: UUID + delegate_id: UUID + + # Scope + project_ids: list[UUID] | None = None # None = all projects + categories: list[ApprovalCategory] | None = None + + # Constraints + max_impact_level: str | None = None # "low", "medium", "high" + requires_notification: bool = True + + # Time bounds + start_date: datetime | None = None + end_date: datetime | None = None + + # Status + is_active: bool = True +``` + +### 6.3 Delegation Resolution + +```python +async def resolve_approver( + approval: ApprovalRequest, + original_approver_id: UUID +) -> UUID: + """Resolve the actual approver considering delegation rules.""" + + # Check for active delegation rules + delegation = await get_active_delegation( + delegator_id=original_approver_id, + project_id=approval.project_id, + category=approval.category, + impact_level=approval.impact_level + ) + + if delegation: + # Verify delegate is available + if await is_user_available(delegation.delegate_id): + if delegation.requires_notification: + await notify_delegation_used(delegation, approval) + return delegation.delegate_id + + return original_approver_id +``` + +--- + +## 7. Batch vs Individual Approval + +### 7.1 Decision Framework + +| Factor | Batch Approval | Individual Approval | +|--------|---------------|---------------------| +| Category | ROUTINE | CRITICAL, UNCERTAINTY | +| Confidence | > 85% | < 85% | +| Similar context | Same sprint/feature | Different contexts | +| Risk level | Low | Medium/High | +| Time pressure | High volume | Single items | + +### 7.2 Batch Approval UI + +``` ++---------------------------------------------------------------+ +| BATCH APPROVAL: 8 Routine Code Reviews | ++---------------------------------------------------------------+ +| | +| SUMMARY: | +| - Sprint: Sprint 3 - Authentication | +| - Agent: Dave (Software Engineer) | +| - Average Confidence: 91% | +| - Test Coverage: All passing, avg 87% coverage | +| | +| ITEMS: | +| [x] PR #45: Add login form validation | 94% | Low risk | +| [x] PR #46: Add password reset endpoint | 92% | Low risk | +| [x] PR #47: Add email verification | 89% | Low risk | +| [x] PR #48: Add session management | 91% | Low risk | +| [ ] PR #49: Add OAuth integration | 76% | Med risk | <- Excluded +| [x] PR #50: Add logout functionality | 95% | Low risk | +| [x] PR #51: Add remember me feature | 88% | Low risk | +| [x] PR #52: Add login rate limiting | 93% | Low risk | +| | +| [APPROVE ALL (7)] [REJECT ALL] [REVIEW INDIVIDUALLY] | ++---------------------------------------------------------------+ +``` + +### 7.3 Batch Approval Rules + +```python +class BatchApprovalPolicy: + """Policy for batch approval eligibility.""" + + # Minimum items for batch + min_items: int = 3 + + # Maximum items per batch + max_items: int = 20 + + # Grouping criteria + group_by: list[str] = ["sprint_id", "agent_type", "category"] + + # Eligibility criteria + min_confidence: float = 0.85 + max_risk_level: str = "low" + excluded_categories: list[ApprovalCategory] = [ + 
ApprovalCategory.CRITICAL, + ApprovalCategory.UNCERTAINTY + ] + + # Require same context + require_same_sprint: bool = True + require_same_feature: bool = False +``` + +--- + +## 8. Mobile-Friendly Interface + +### 8.1 Mobile Design Principles + +1. **Progressive Disclosure**: Show minimal info, expand on tap +2. **Swipe Actions**: Swipe right to approve, left to reject +3. **Push Notifications**: Actionable notifications with quick responses +4. **Offline Queue**: Cache pending approvals for offline review +5. **Biometric Auth**: Face ID / fingerprint for high-stakes approvals + +### 8.2 Mobile Notification Design + +``` ++----------------------------------+ +| SYNDARIX now | +| Approval Required | +| | +| Sprint 3 Ready to Start | +| Project: E-Commerce Platform | +| Confidence: 92% | +| | +| [APPROVE] [REJECT] [VIEW] | ++----------------------------------+ +``` + +### 8.3 Mobile Approval Card + +``` ++----------------------------------+ +| < Back Approval #127 | ++----------------------------------+ +| | +| Sprint 3 Start Request | +| ============================ | +| | +| Requested by: PM Agent (Alex) | +| 2 hours ago | +| | +| [=========> ] 92% Confidence | +| | +| Sprint Goal: | +| Implement user authentication | +| and authorization system | +| | +| Stories: 8 | Points: 21 | +| Estimated: 2 weeks | +| | +| [v] Show Details | +| | ++----------------------------------+ +| | +| [ REJECT ] [ APPROVE ] | +| | ++----------------------------------+ +``` + +### 8.4 Responsive Breakpoints + +| Breakpoint | Layout | Features | +|------------|--------|----------| +| Mobile (< 640px) | Single column, cards | Swipe actions, minimal info | +| Tablet (640-1024px) | Two columns | Side-by-side compare | +| Desktop (> 1024px) | Full queue + detail panel | Batch operations, keyboard shortcuts | + +--- + +## 9. 
Audit Trail Design + +### 9.1 Audit Event Structure + +```python +@dataclass +class ApprovalAuditEvent: + """Immutable audit record for approval decisions.""" + + id: UUID + approval_id: UUID + + # Event info + event_type: str # "created", "viewed", "approved", "rejected", "escalated", "delegated" + timestamp: datetime + + # Actor info + actor_id: UUID + actor_type: str # "user", "system", "timeout" + actor_ip: str | None + actor_device: str | None + + # Decision info (for approve/reject) + decision: str | None + rationale: str | None + + # Context snapshot (immutable copy at decision time) + context_snapshot: dict + + # Metadata + duration_seconds: int | None # Time spent reviewing + confidence_at_decision: float | None +``` + +### 9.2 Audit Trail Query API + +```python +class ApprovalAuditService: + """Service for querying and analyzing approval audit trails.""" + + async def get_audit_trail( + self, + approval_id: UUID + ) -> list[ApprovalAuditEvent]: + """Get complete audit trail for an approval.""" + pass + + async def get_user_decisions( + self, + user_id: UUID, + start_date: datetime, + end_date: datetime + ) -> list[ApprovalAuditEvent]: + """Get all decisions made by a user in a time range.""" + pass + + async def get_approval_metrics( + self, + project_id: UUID | None = None, + start_date: datetime | None = None + ) -> ApprovalMetrics: + """Get approval workflow metrics.""" + pass +``` + +### 9.3 Audit Metrics Dashboard + +| Metric | Description | Target | +|--------|-------------|--------| +| Average Decision Time | Time from request to decision | < 4 hours (CRITICAL) | +| Approval Rate | % of approvals vs rejections | > 80% | +| Escalation Rate | % requiring escalation | < 10% | +| Timeout Rate | % hitting timeout | < 5% | +| Override Rate | Auto-decisions overridden | Track only | +| Batch Efficiency | Items approved in batch vs individual | > 60% for ROUTINE | + +--- + +## 10. Notification System Integration + +### 10.1 Notification Channels + +Leveraging Syndarix's existing SSE infrastructure (SPIKE-003): + +| Channel | Use Case | Priority | +|---------|----------|----------| +| In-App (SSE) | Real-time dashboard updates | All | +| Email | Digest and escalation alerts | MILESTONE, CRITICAL | +| Push (Mobile) | Urgent approvals | CRITICAL, ESCALATED | +| Slack/Teams | Team notifications | Configurable | +| SMS | Critical timeout warnings | CRITICAL (final escalation) | + +### 10.2 Notification Events + +```python +class ApprovalNotificationEvent(str, Enum): + """Events that trigger notifications.""" + + # New approval + APPROVAL_REQUESTED = "approval_requested" + + # Reminders + APPROVAL_REMINDER = "approval_reminder" + + # Escalation + APPROVAL_ESCALATED = "approval_escalated" + APPROVAL_ESCALATED_TO_YOU = "approval_escalated_to_you" + + # Resolution + APPROVAL_APPROVED = "approval_approved" + APPROVAL_REJECTED = "approval_rejected" + APPROVAL_AUTO_APPROVED = "approval_auto_approved" + + # Delegation + APPROVAL_DELEGATED = "approval_delegated" + + # Workflow impact + WORKFLOW_BLOCKED = "workflow_blocked" + WORKFLOW_RESUMED = "workflow_resumed" +``` + +### 10.3 SSE Integration + +```python +# Extend existing SSE event types from SPIKE-003 +class EventType(str, Enum): + # ... existing events ... 
+
+    # Approval Events
+    APPROVAL_REQUESTED = "approval_requested"
+    APPROVAL_UPDATED = "approval_updated"
+    APPROVAL_RESOLVED = "approval_resolved"
+    APPROVAL_ESCALATED = "approval_escalated"
+    APPROVAL_BATCH_READY = "approval_batch_ready"
+
+# Publishing approval events
+async def publish_approval_event(approval: ApprovalRequest, event_type: str):
+    """Publish approval event via existing EventBus."""
+
+    await event_bus.publish(
+        channel=f"project:{approval.project_id}",
+        event=Event(
+            type=event_type,
+            data={
+                "approval_id": str(approval.id),
+                "category": approval.category.value,
+                "summary": approval.summary,
+                "confidence": approval.confidence,
+                "deadline": approval.deadline.isoformat(),
+                "approver_id": str(approval.current_approver_id),
+            },
+            project_id=str(approval.project_id),
+            agent_id=str(approval.requesting_agent_id) if approval.requesting_agent_id else None
+        )
+    )
+```
+
+---
+
+## 11. Database Schema
+
+### 11.1 Core Tables
+
+```python
+# app/models/approval.py
+import enum
+from datetime import UTC, datetime
+
+from sqlalchemy import Column, String, Text, Float, Integer, DateTime, ForeignKey, Enum, Boolean
+from sqlalchemy.dialects.postgresql import UUID, JSONB
+from sqlalchemy.orm import relationship
+from app.models.base import Base, TimestampMixin, UUIDMixin
+
+
+# Python enums for category/status; sqlalchemy.Enum wraps them as column types below.
+class ApprovalCategory(str, enum.Enum):
+    CRITICAL = "critical"
+    MILESTONE = "milestone"
+    ROUTINE = "routine"
+    UNCERTAINTY = "uncertainty"
+    EXPERTISE = "expertise"
+
+
+class ApprovalStatus(str, enum.Enum):
+    PENDING = "pending"
+    APPROVED = "approved"
+    REJECTED = "rejected"
+    ESCALATED = "escalated"
+    EXPIRED = "expired"
+    DELEGATED = "delegated"
+
+
+class ApprovalRequest(Base, UUIDMixin, TimestampMixin):
+    """Approval request entity."""
+
+    __tablename__ = "approval_requests"
+
+    # Foreign keys
+    project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False, index=True)
+    sprint_id = Column(UUID(as_uuid=True), ForeignKey("sprints.id"), nullable=True)
+    requesting_agent_id = Column(UUID(as_uuid=True), ForeignKey("agent_instances.id"), nullable=True)
+    current_approver_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False, index=True)
+    original_approver_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False)
+
+    # Approval details
+    category = Column(Enum(ApprovalCategory), nullable=False, index=True)
+    status = Column(Enum(ApprovalStatus), default=ApprovalStatus.PENDING, index=True)
+
+    # Content
+    title = Column(String(255), nullable=False)
+    summary = Column(Text, nullable=False)
+    detailed_context = Column(JSONB, nullable=False, default=dict)
+
+    # AI decision metadata
+    confidence = Column(Float, nullable=True)
+    reasoning = Column(JSONB, nullable=True)  # Structured reasoning chain
+    alternatives_considered = Column(JSONB, nullable=True)
+    impact_assessment = Column(JSONB, nullable=True)
+
+    # Workflow reference
+    workflow_type = Column(String(50), nullable=False)  # "sprint_start", "pr_merge", etc.
+    workflow_reference_id = Column(UUID(as_uuid=True), nullable=True)  # PR ID, Sprint ID, etc.
+ workflow_reference_url = Column(String(500), nullable=True) # External link + + # Escalation tracking + escalation_level = Column(Integer, default=0) + escalated_at = Column(DateTime(timezone=True), nullable=True) + escalation_reason = Column(String(255), nullable=True) + + # Timing + deadline = Column(DateTime(timezone=True), nullable=False) + reminded_at = Column(DateTime(timezone=True), nullable=True) + resolved_at = Column(DateTime(timezone=True), nullable=True) + + # Resolution + resolution_decision = Column(String(50), nullable=True) # "approved", "rejected" + resolution_rationale = Column(Text, nullable=True) + resolved_by_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=True) + resolution_type = Column(String(50), nullable=True) # "manual", "auto", "timeout" + + # Batch tracking + batch_id = Column(UUID(as_uuid=True), nullable=True, index=True) + is_batch_eligible = Column(Boolean, default=True) + + # Relationships + project = relationship("Project", back_populates="approval_requests") + current_approver = relationship("User", foreign_keys=[current_approver_id]) + resolved_by = relationship("User", foreign_keys=[resolved_by_id]) + audit_events = relationship("ApprovalAuditEvent", back_populates="approval_request") + + +class ApprovalAuditEvent(Base, UUIDMixin): + """Immutable audit trail for approval decisions.""" + + __tablename__ = "approval_audit_events" + + approval_id = Column(UUID(as_uuid=True), ForeignKey("approval_requests.id"), nullable=False, index=True) + + # Event details + event_type = Column(String(50), nullable=False) # "created", "viewed", "approved", etc. + timestamp = Column(DateTime(timezone=True), nullable=False, default=lambda: datetime.now(UTC)) + + # Actor + actor_id = Column(UUID(as_uuid=True), nullable=False) + actor_type = Column(String(20), nullable=False) # "user", "system", "timeout" + actor_ip = Column(String(45), nullable=True) + actor_user_agent = Column(String(500), nullable=True) + + # Decision details (for resolution events) + decision = Column(String(50), nullable=True) + rationale = Column(Text, nullable=True) + + # Context snapshot (immutable) + context_snapshot = Column(JSONB, nullable=False) + + # Metrics + review_duration_seconds = Column(Integer, nullable=True) + + # Relationships + approval_request = relationship("ApprovalRequest", back_populates="audit_events") + + +class DelegationRule(Base, UUIDMixin, TimestampMixin): + """Delegation rules for approval authority.""" + + __tablename__ = "delegation_rules" + + delegator_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False, index=True) + delegate_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False) + + # Scope + project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=True) # None = all projects + categories = Column(JSONB, nullable=True) # None = all categories + + # Constraints + max_impact_level = Column(String(20), nullable=True) + requires_notification = Column(Boolean, default=True) + + # Time bounds + start_date = Column(DateTime(timezone=True), nullable=True) + end_date = Column(DateTime(timezone=True), nullable=True) + + # Status + is_active = Column(Boolean, default=True) + + # Relationships + delegator = relationship("User", foreign_keys=[delegator_id]) + delegate = relationship("User", foreign_keys=[delegate_id]) + + +class EscalationPolicy(Base, UUIDMixin, TimestampMixin): + """Project-level escalation policy configuration.""" + + __tablename__ = "escalation_policies" + + project_id = 
Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False, unique=True) + + # Default settings + default_timeout_hours = Column(Integer, default=24) + max_escalation_levels = Column(Integer, default=3) + + # Escalation path (list of role names) + escalation_path = Column(JSONB, default=["project_owner", "team_lead", "admin"]) + + # Reminder intervals (hours before timeout) + reminder_intervals = Column(JSONB, default=[4, 1]) + + # Final action + final_action = Column(String(20), default="block") # "block", "auto_approve", "auto_reject" + + # Category-specific overrides + category_overrides = Column(JSONB, default=dict) + + # Relationships + project = relationship("Project", back_populates="escalation_policy") +``` + +### 11.2 Database Indexes + +```sql +-- Performance indexes for approval queries +CREATE INDEX idx_approval_requests_pending ON approval_requests(project_id, status) + WHERE status = 'pending'; + +CREATE INDEX idx_approval_requests_approver ON approval_requests(current_approver_id, status) + WHERE status = 'pending'; + +CREATE INDEX idx_approval_requests_deadline ON approval_requests(deadline) + WHERE status = 'pending'; + +CREATE INDEX idx_approval_requests_batch ON approval_requests(batch_id) + WHERE batch_id IS NOT NULL; + +CREATE INDEX idx_audit_events_approval ON approval_audit_events(approval_id, timestamp); + +CREATE INDEX idx_delegation_active ON delegation_rules(delegator_id, is_active) + WHERE is_active = true; +``` + +--- + +## 12. Code Examples + +### 12.1 Approval Gate Decorator + +```python +# app/services/approval/decorators.py +from functools import wraps +from typing import Callable, Any +from app.models.approval import ApprovalCategory, ApprovalRequest +from app.services.approval.service import ApprovalService + + +def requires_approval( + category: ApprovalCategory, + title_template: str, + summary_template: str, + confidence_threshold: float = 0.85 +): + """ + Decorator to create an approval checkpoint before executing an action. + + Usage: + @requires_approval( + category=ApprovalCategory.MILESTONE, + title_template="Start Sprint {sprint_name}", + summary_template="Sprint goal: {sprint_goal}\nStories: {story_count}" + ) + async def start_sprint(self, sprint_id: UUID, **kwargs) -> Sprint: + # This only executes after approval + ... 
+ """ + def decorator(func: Callable) -> Callable: + @wraps(func) + async def wrapper(self, *args, **kwargs) -> Any: + # Get context from kwargs + project_id = kwargs.get('project_id') + agent_id = kwargs.get('agent_id') + confidence = kwargs.get('confidence', 1.0) + + # Check if approval is needed based on autonomy level + project = await get_project(project_id) + + if should_require_approval( + autonomy_level=project.autonomy_level, + category=category, + confidence=confidence, + threshold=confidence_threshold + ): + # Create approval request + approval_service = ApprovalService() + + approval = await approval_service.create_approval( + project_id=project_id, + requesting_agent_id=agent_id, + category=category, + title=title_template.format(**kwargs), + summary=summary_template.format(**kwargs), + confidence=confidence, + workflow_type=func.__name__, + workflow_context=kwargs + ) + + # Wait for approval (blocking) + result = await approval_service.wait_for_resolution(approval.id) + + if result.status == ApprovalStatus.REJECTED: + raise ApprovalRejectedError( + f"Approval rejected: {result.resolution_rationale}" + ) + + # Execute the action + return await func(self, *args, **kwargs) + + return wrapper + return decorator + + +def should_require_approval( + autonomy_level: str, + category: ApprovalCategory, + confidence: float, + threshold: float +) -> bool: + """Determine if approval is required based on autonomy and confidence.""" + + # Critical always requires approval + if category == ApprovalCategory.CRITICAL: + return True + + # Low confidence triggers approval regardless of autonomy + if confidence < threshold: + return True + + # Check autonomy level + if autonomy_level == "FULL_CONTROL": + return True + elif autonomy_level == "MILESTONE": + return category in [ApprovalCategory.MILESTONE, ApprovalCategory.CRITICAL] + else: # AUTONOMOUS + return category == ApprovalCategory.CRITICAL +``` + +### 12.2 Approval Service + +```python +# app/services/approval/service.py +from datetime import datetime, timedelta +from uuid import UUID +from sqlalchemy.ext.asyncio import AsyncSession +from app.models.approval import ApprovalRequest, ApprovalStatus, ApprovalAuditEvent +from app.services.events import EventBus + + +class ApprovalService: + """Service for managing approval workflows.""" + + def __init__(self, db: AsyncSession, event_bus: EventBus): + self.db = db + self.event_bus = event_bus + + async def create_approval( + self, + project_id: UUID, + requesting_agent_id: UUID | None, + category: ApprovalCategory, + title: str, + summary: str, + confidence: float | None = None, + workflow_type: str = "generic", + workflow_context: dict | None = None, + reasoning: dict | None = None, + alternatives: list[dict] | None = None, + impact_assessment: dict | None = None, + ) -> ApprovalRequest: + """Create a new approval request.""" + + # Get project and escalation policy + project = await self._get_project(project_id) + policy = await self._get_escalation_policy(project_id) + + # Determine approver (with delegation check) + approver_id = await self._resolve_approver( + project.owner_id, project_id, category + ) + + # Calculate deadline + timeout_hours = self._get_timeout_hours(policy, category) + deadline = datetime.utcnow() + timedelta(hours=timeout_hours) + + # Create approval request + approval = ApprovalRequest( + project_id=project_id, + sprint_id=workflow_context.get('sprint_id') if workflow_context else None, + requesting_agent_id=requesting_agent_id, + current_approver_id=approver_id, + 
original_approver_id=project.owner_id, + category=category, + title=title, + summary=summary, + detailed_context=workflow_context or {}, + confidence=confidence, + reasoning=reasoning, + alternatives_considered=alternatives, + impact_assessment=impact_assessment, + workflow_type=workflow_type, + deadline=deadline, + is_batch_eligible=self._is_batch_eligible(category, confidence), + ) + + self.db.add(approval) + await self.db.commit() + await self.db.refresh(approval) + + # Create audit event + await self._create_audit_event( + approval_id=approval.id, + event_type="created", + actor_id=requesting_agent_id or UUID('00000000-0000-0000-0000-000000000000'), + actor_type="agent" if requesting_agent_id else "system", + context_snapshot=workflow_context or {} + ) + + # Publish event + await self.event_bus.publish( + channel=f"project:{project_id}", + event={ + "type": "approval_requested", + "approval_id": str(approval.id), + "category": category.value, + "title": title, + "approver_id": str(approver_id), + } + ) + + # Send notification + await self._notify_approver(approval, approver_id) + + return approval + + async def approve( + self, + approval_id: UUID, + user_id: UUID, + rationale: str | None = None, + ip_address: str | None = None, + user_agent: str | None = None, + ) -> ApprovalRequest: + """Approve a pending request.""" + + approval = await self._get_pending_approval(approval_id) + + # Verify user can approve + await self._verify_approver(approval, user_id) + + # Update approval + approval.status = ApprovalStatus.APPROVED + approval.resolution_decision = "approved" + approval.resolution_rationale = rationale + approval.resolved_by_id = user_id + approval.resolved_at = datetime.utcnow() + approval.resolution_type = "manual" + + await self.db.commit() + + # Create audit event + await self._create_audit_event( + approval_id=approval.id, + event_type="approved", + actor_id=user_id, + actor_type="user", + context_snapshot={"rationale": rationale}, + actor_ip=ip_address, + actor_user_agent=user_agent, + decision="approved", + rationale=rationale, + ) + + # Publish resolution event + await self.event_bus.publish( + channel=f"project:{approval.project_id}", + event={ + "type": "approval_resolved", + "approval_id": str(approval.id), + "decision": "approved", + } + ) + + # Resume workflow + await self._resume_workflow(approval) + + return approval + + async def reject( + self, + approval_id: UUID, + user_id: UUID, + rationale: str, # Required for rejection + ip_address: str | None = None, + user_agent: str | None = None, + ) -> ApprovalRequest: + """Reject a pending request.""" + + approval = await self._get_pending_approval(approval_id) + + # Verify user can approve + await self._verify_approver(approval, user_id) + + # Update approval + approval.status = ApprovalStatus.REJECTED + approval.resolution_decision = "rejected" + approval.resolution_rationale = rationale + approval.resolved_by_id = user_id + approval.resolved_at = datetime.utcnow() + approval.resolution_type = "manual" + + await self.db.commit() + + # Create audit event + await self._create_audit_event( + approval_id=approval.id, + event_type="rejected", + actor_id=user_id, + actor_type="user", + context_snapshot={"rationale": rationale}, + actor_ip=ip_address, + actor_user_agent=user_agent, + decision="rejected", + rationale=rationale, + ) + + # Notify requesting agent + await self._notify_rejection(approval, rationale) + + return approval + + async def batch_approve( + self, + approval_ids: list[UUID], + user_id: UUID, + rationale: 
str | None = None,
+    ) -> list[ApprovalRequest]:
+        """Approve multiple requests in batch."""
+
+        results = []
+
+        for approval_id in approval_ids:
+            try:
+                approval = await self.approve(
+                    approval_id=approval_id,
+                    user_id=user_id,
+                    rationale=rationale or "Batch approved",
+                )
+                results.append(approval)
+            except Exception as e:
+                # Log error but continue with other approvals
+                logger.error(f"Failed to approve {approval_id}: {e}")
+
+        return results
+
+    async def wait_for_resolution(
+        self,
+        approval_id: UUID,
+        timeout_seconds: int = 86400  # 24 hours
+    ) -> ApprovalRequest:
+        """Wait for an approval to be resolved (blocking)."""
+
+        import asyncio
+
+        start_time = datetime.utcnow()
+        poll_interval = 5  # seconds
+
+        while True:
+            approval = await self._get_approval(approval_id)
+
+            if approval.status != ApprovalStatus.PENDING:
+                return approval
+
+            # Check timeout
+            elapsed = (datetime.utcnow() - start_time).total_seconds()
+            if elapsed > timeout_seconds:
+                raise TimeoutError(f"Approval {approval_id} not resolved within timeout")
+
+            await asyncio.sleep(poll_interval)
+```
+
+### 12.3 Celery Task for Timeout Processing
+
+```python
+# app/tasks/approval_tasks.py
+import asyncio
+from uuid import UUID
+
+from celery import shared_task
+from datetime import datetime, timedelta
+from app.services.approval.service import ApprovalService
+from app.core.database import get_db
+
+
+@shared_task
+def process_approval_timeouts():
+    """
+    Periodic task to process approval timeouts and escalations.
+    Run every 15 minutes.
+    """
+    with get_db() as db:
+        approval_service = ApprovalService(db)
+
+        # Celery tasks run synchronously, so async service calls are bridged with asyncio.run()
+        # Find approvals approaching deadline
+        approaching_deadline = asyncio.run(
+            approval_service.get_approvals_approaching_deadline(hours=4)  # Reminder threshold
+        )
+
+        for approval in approaching_deadline:
+            if not approval.reminded_at:
+                send_reminder.delay(str(approval.id))
+
+        # Find expired approvals
+        expired = asyncio.run(approval_service.get_expired_approvals())
+
+        for approval in expired:
+            process_escalation.delay(str(approval.id))
+
+
+@shared_task
+def send_reminder(approval_id: str):
+    """Send reminder notification for pending approval."""
+    with get_db() as db:
+        approval_service = ApprovalService(db)
+        asyncio.run(approval_service.send_reminder(UUID(approval_id)))
+
+
+@shared_task
+def process_escalation(approval_id: str):
+    """Process escalation for expired approval."""
+    with get_db() as db:
+        approval_service = ApprovalService(db)
+        asyncio.run(approval_service.escalate_or_timeout(UUID(approval_id)))
+```
+
+### 12.4 API Endpoints
+
+```python
+# app/api/v1/approvals.py
+from fastapi import APIRouter, Depends, HTTPException, Query, Request
+from typing import Optional
+from uuid import UUID
+from app.schemas.approval import (
+    ApprovalRequestCreate,
+    ApprovalRequestResponse,
+    ApprovalListResponse,
+    ApprovalDecision,
+    BatchApprovalRequest,
+)
+from app.services.approval.service import ApprovalService
+from app.api.deps import get_current_user, get_db
+
+
+router = APIRouter(prefix="/approvals", tags=["approvals"])
+
+
+@router.get("/pending", response_model=ApprovalListResponse)
+async def list_pending_approvals(
+    project_id: Optional[UUID] = None,
+    category: Optional[str] = None,
+    batch_eligible: bool = False,
+    page: int = Query(1, ge=1),
+    page_size: int = Query(20, ge=1, le=100),
+    current_user = Depends(get_current_user),
+    db = Depends(get_db),
+):
+    """List pending approvals for the current user."""
+    service = ApprovalService(db)
+
+    return await service.list_pending(
+        approver_id=current_user.id,
+        project_id=project_id,
+        category=category,
+        batch_eligible=batch_eligible,
+ page=page, + page_size=page_size, + ) + + +@router.get("/{approval_id}", response_model=ApprovalRequestResponse) +async def get_approval( + approval_id: UUID, + current_user = Depends(get_current_user), + db = Depends(get_db), +): + """Get approval details with full context.""" + service = ApprovalService(db) + + approval = await service.get_with_context(approval_id) + + # Record view event + await service.record_view(approval_id, current_user.id) + + return approval + + +@router.post("/{approval_id}/approve", response_model=ApprovalRequestResponse) +async def approve_request( + approval_id: UUID, + decision: ApprovalDecision, + current_user = Depends(get_current_user), + db = Depends(get_db), + request: Request = None, +): + """Approve a pending request.""" + service = ApprovalService(db) + + return await service.approve( + approval_id=approval_id, + user_id=current_user.id, + rationale=decision.rationale, + ip_address=request.client.host if request else None, + user_agent=request.headers.get("user-agent") if request else None, + ) + + +@router.post("/{approval_id}/reject", response_model=ApprovalRequestResponse) +async def reject_request( + approval_id: UUID, + decision: ApprovalDecision, + current_user = Depends(get_current_user), + db = Depends(get_db), + request: Request = None, +): + """Reject a pending request.""" + service = ApprovalService(db) + + if not decision.rationale: + raise HTTPException(400, "Rationale is required for rejection") + + return await service.reject( + approval_id=approval_id, + user_id=current_user.id, + rationale=decision.rationale, + ip_address=request.client.host if request else None, + user_agent=request.headers.get("user-agent") if request else None, + ) + + +@router.post("/batch/approve", response_model=list[ApprovalRequestResponse]) +async def batch_approve( + batch_request: BatchApprovalRequest, + current_user = Depends(get_current_user), + db = Depends(get_db), +): + """Approve multiple requests in batch.""" + service = ApprovalService(db) + + return await service.batch_approve( + approval_ids=batch_request.approval_ids, + user_id=current_user.id, + rationale=batch_request.rationale, + ) + + +@router.post("/{approval_id}/delegate") +async def delegate_approval( + approval_id: UUID, + delegate_to: UUID, + current_user = Depends(get_current_user), + db = Depends(get_db), +): + """Delegate an approval to another user.""" + service = ApprovalService(db) + + return await service.delegate( + approval_id=approval_id, + from_user_id=current_user.id, + to_user_id=delegate_to, + ) +``` + +--- + +## 13. UI Mockup Descriptions + +### 13.1 Approval Dashboard (Desktop) + +``` ++-----------------------------------------------------------------------+ +| SYNDARIX [User Menu] [Notifications]| ++-----------------------------------------------------------------------+ +| | +| +---------------+ +------------------------------------------------+ | +| | SIDEBAR | | APPROVAL QUEUE | | +| | | +------------------------------------------------+ | +| | Dashboard | | | | +| | Projects | | Filter: [All Categories v] [All Projects v] | | +| | Agents | | | | +| | > Approvals | | +----------------------------------------------+ | | +| | - Pending | | | CRITICAL (2) SLA: 4h remaining | | | +| | - History | | +----------------------------------------------+ | | +| | Settings | | | [!] 
Budget Overrun - Project Alpha | | | +| | | | | $12,500 over threshold ($10,000) | | | +| | | | | Requested by: PM Agent (Alex) | | | +| | | | | [APPROVE] [REJECT] [VIEW] | | | +| | | | +----------------------------------------------+ | | +| | | | | [!] Production Deploy - Project Beta | | | +| | | | | v2.3.1 ready for production | | | +| | | | | CI: All green | Tests: 98% pass | | | +| | | | | [APPROVE] [REJECT] [VIEW] | | | +| | | | +----------------------------------------------+ | | +| | | | | | +| | | | +----------------------------------------------+ | | +| | | | | MILESTONE (3) SLA: 24h remaining | | | +| | | | +----------------------------------------------+ | | +| | | | | Sprint 4 Start - Project Alpha | | | +| | | | | Sprint 2 Complete - Project Gamma | | | +| | | | | Architecture Review - Project Delta | | | +| | | | +----------------------------------------------+ | | +| | | | | | +| | | | +----------------------------------------------+ | | +| | | | | ROUTINE (12) [Batch Approve] | | | +| | | | +----------------------------------------------+ | | +| | | | | [x] PR #45: Login validation | | | +| | | | | [x] PR #46: Password reset | | | +| | | | | [x] PR #47: Email verification | | | +| | | | | [ ] ... 9 more items | | | +| | | | +----------------------------------------------+ | | +| +---------------+ +------------------------------------------------+ | +| | ++-----------------------------------------------------------------------+ +``` + +### 13.2 Approval Detail Panel + +``` ++-----------------------------------------------------------------------+ +| < Back to Queue Approval #APR-127 | ++-----------------------------------------------------------------------+ +| | +| +-------------------------------------------------------------------+ | +| | SPRINT 4 START REQUEST PENDING | | +| +-------------------------------------------------------------------+ | +| | | | +| | Project: E-Commerce Platform | | +| | Requested: 2 hours ago by PM Agent (Alex) | | +| | Deadline: 22 hours remaining | | +| | | | +| | +-----------------------------------------------------------------+ | | +| | | CONFIDENCE | | | +| | | [================> ] 87% | | | +| | +-----------------------------------------------------------------+ | | +| | | | +| | +-----------------------------------------------------------------+ | | +| | | SPRINT GOAL | | | +| | | Implement complete user authentication and authorization system | | | +| | | including OAuth2 integration and role-based access control. | | | +| | +-----------------------------------------------------------------+ | | +| | | | +| | +-----------------------------------------------------------------+ | | +| | | SPRINT DETAILS | | | +| | | Stories: 8 | Points: 21 | Duration: 2 weeks | | | +| | +-----------------------------------------------------------------+ | | +| | | | +| | [v] View Stories | | +| | [v] View Agent Reasoning | | +| | [v] View Risk Assessment | | +| | | | +| | +-----------------------------------------------------------------+ | | +| | | AGENT RECOMMENDATION | | | +| | | "This sprint has well-defined stories with clear acceptance | | | +| | | criteria. The team composition includes 3 engineers with | | | +| | | relevant auth experience. Risk is low." 
| | | +| | | - PM Agent (Alex) | | | +| | +-----------------------------------------------------------------+ | | +| | | | +| +-------------------------------------------------------------------+ | +| | +| +-------------------------------------------------------------------+ | +| | | | +| | Optional feedback: | | +| | +---------------------------------------------------------------+ | | +| | | | | | +| | +---------------------------------------------------------------+ | | +| | | | +| | [ REJECT ] [ APPROVE ] | | +| | | | +| +-------------------------------------------------------------------+ | +| | ++-----------------------------------------------------------------------+ +``` + +### 13.3 Mobile Approval List + +``` ++---------------------------+ +| SYNDARIX [=] [Bell] | ++---------------------------+ +| | +| Pending Approvals (5) | +| | ++---------------------------+ +| [!] CRITICAL | ++---------------------------+ +| | +| Budget Overrun | +| Project Alpha | +| $12,500 over threshold | +| | +| 4h remaining | +| | +| [APPROVE] [REJECT] | +| | ++---------------------------+ +| [!] CRITICAL | ++---------------------------+ +| | +| Production Deploy | +| Project Beta - v2.3.1 | +| | +| 6h remaining | +| | +| [APPROVE] [REJECT] | +| | ++---------------------------+ +| MILESTONE | ++---------------------------+ +| | +| Sprint 4 Start | +| E-Commerce Platform | +| 87% confidence | +| | +| 22h remaining | +| | +| Tap for details > | +| | ++---------------------------+ +``` + +--- + +## 14. Implementation Roadmap + +### Phase 1: Foundation (Week 1-2) + +| Task | Priority | Effort | +|------|----------|--------| +| Database schema implementation | High | 2 days | +| ApprovalService core CRUD | High | 2 days | +| API endpoints | High | 2 days | +| SSE event integration | High | 1 day | +| Basic approval decorator | High | 1 day | +| Unit tests | High | 2 days | + +### Phase 2: Workflow Integration (Week 3-4) + +| Task | Priority | Effort | +|------|----------|--------| +| Sprint workflow integration | High | 2 days | +| PR merge workflow integration | High | 2 days | +| Production deploy gate | High | 1 day | +| Budget threshold triggers | Medium | 1 day | +| Agent conflict resolution | Medium | 2 days | +| Confidence-based routing | Medium | 2 days | + +### Phase 3: UX & Notifications (Week 5-6) + +| Task | Priority | Effort | +|------|----------|--------| +| Desktop approval dashboard | High | 3 days | +| Approval detail panel | High | 2 days | +| Batch approval UI | Medium | 2 days | +| Email notifications | High | 1 day | +| Mobile responsive design | Medium | 2 days | +| Push notification integration | Low | 2 days | + +### Phase 4: Advanced Features (Week 7-8) + +| Task | Priority | Effort | +|------|----------|--------| +| Timeout/escalation processing | High | 2 days | +| Delegation rules | Medium | 2 days | +| Audit trail & reporting | Medium | 2 days | +| Metrics dashboard | Low | 2 days | +| Slack/Teams integration | Low | 2 days | +| E2E testing | High | 2 days | + +--- + +## References + +### Research Sources + +- [Human-in-the-Loop for AI Agents: Best Practices](https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo) +- [Human-in-the-Loop AI Review Queues: Workflow Patterns That Scale](https://alldaystech.com/guides/artificial-intelligence/human-in-the-loop-ai-review-queue-workflows) +- [Human-in-the-Loop AI in 2025: Proven Design Patterns](https://blog.ideafloats.com/human-in-the-loop-ai-in-2025/) +- [LangGraph: Human-in-the-Loop for 
Reliable AI Workflows](https://medium.com/@sitabjapal03/langgraph-part-4-human-in-the-loop-for-reliable-ai-workflows-aa4cc175bce4) +- [Improving Approval Request Process - UX Case Study](https://bootcamp.uxdesign.cc/improving-the-approval-request-process-on-an-enterprise-application-a-ux-case-study-12d2756af876) +- [Complex Approvals - How to Design an App to Streamline Approvals](https://www.uxpin.com/studio/blog/complex-approvals-app-design/) +- [Bulk Approval - Achieving 83% Efficiency](https://sreeja-ux-ui-portfolio.webflow.io/projects/bulk-approval) +- [Enterprise Manager's Guide to Mobile Approvals](https://propelapps.com/blog/mobile-approvals-solutions/enterprise-manager-guide-to-mobile-approvals/) +- [Notification System Design: Architecture & Best Practices](https://www.magicbell.com/blog/notification-system-design) +- [Explainable AI: Transparent Decisions for AI Agents](https://www.rapidinnovation.io/post/for-developers-implementing-explainable-ai-for-transparent-agent-decisions) +- [Human-Centered Explainable AI Interface Design](https://arxiv.org/html/2403.14496v1) + +### Internal References + +- [SPIKE-003: Real-time Updates Architecture](./SPIKE-003-realtime-updates.md) +- [SPIKE-001: MCP Integration Pattern](./SPIKE-001-mcp-integration-pattern.md) +- [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md) +- [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md) +- [Requirements: FR-203 Autonomy Level Configuration](../requirements/SYNDARIX_REQUIREMENTS.md) + +--- + +## Decision + +**Adopt a checkpoint-based approval system** with: + +1. **Queue-based UI pattern** with priority lanes and batch approval support +2. **Confidence-aware routing** that escalates low-confidence AI decisions +3. **Configurable escalation policies** per project with timeout handling +4. **Multi-channel notifications** leveraging existing SSE infrastructure +5. **Full audit trail** with immutable event logging +6. **Mobile-responsive design** with push notification support + +This design integrates seamlessly with Syndarix's existing autonomy level configuration (FR-203) and real-time event infrastructure (SPIKE-003). + +--- + +*Spike completed. Findings will inform implementation of client approval workflows.*