**File:** docs/adrs/ADR-001-mcp-integration-architecture.md

# ADR-001: MCP Integration Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-001

---

## Context

Syndarix requires integration with multiple external services (LLM providers, Git, issue trackers, file systems, CI/CD). The Model Context Protocol (MCP) was identified as the standard for tool integration in AI applications. We need to decide on:

1. The MCP framework to use
2. The server deployment pattern (singleton vs. per-project)
3. The scoping mechanism for multi-project/multi-agent access

## Decision Drivers

- **Simplicity:** Minimize operational complexity
- **Resource Efficiency:** Avoid spawning redundant processes
- **Consistency:** Unified interface across all integrations
- **Scalability:** Support 10+ concurrent projects
- **Maintainability:** Easy to add new MCP servers

## Considered Options

### Option 1: Per-Project MCP Servers

Spawn a dedicated set of MCP server instances for each project.

**Pros:**
- Complete isolation between projects
- Simple access control (each project owns its servers)

**Cons:**
- Resource heavy (7 servers × N projects)
- Complex orchestration
- Difficult to share cross-project resources

### Option 2: Unified Singleton MCP Servers (Selected)

Run a single instance of each MCP server type, with explicit project/agent scoping.

**Pros:**
- Resource efficient (7 servers in total)
- Simpler deployment
- Enables cross-project learning (if desired)
- Consistent management

**Cons:**
- Requires explicit scoping in every tool
- Shared state requires careful design

### Option 3: Hybrid (MCP Proxy)

Run a single proxy that routes to per-project backends.

**Pros:**
- Balances isolation and efficiency

**Cons:**
- Added complexity
- Routing overhead

## Decision

**Adopt Option 2: Unified Singleton MCP Servers with explicit scoping.**

All MCP servers are deployed as singletons. Every tool accepts `project_id` and `agent_id` parameters, which are used for:
- Access control validation
- Audit logging
- Context filtering

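A minimal sketch of this scoping contract, assuming an in-memory ACL; the store, agent IDs, and helper names are illustrative, not part of the ADR (a real deployment would back this with the project database):

```python
class AccessDenied(Exception):
    pass

# Hypothetical ACL: agent_id -> set of project_ids the agent may act on
ACL: dict[str, set[str]] = {
    "agent-po-1": {"proj-alpha"},
    "agent-eng-7": {"proj-alpha", "proj-beta"},
}

# Audit trail of (agent_id, project_id, tool) tuples
AUDIT_LOG: list[tuple[str, str, str]] = []


def validate_access(agent_id: str, project_id: str) -> None:
    """Reject any tool call whose agent is not scoped to the project."""
    if project_id not in ACL.get(agent_id, set()):
        raise AccessDenied(f"{agent_id} may not access {project_id}")


def log_tool_usage(agent_id: str, project_id: str, tool: str) -> None:
    """Record the call for auditing."""
    AUDIT_LOG.append((agent_id, project_id, tool))
```

Every singleton server would run these two checks at the top of each tool, so scoping is enforced uniformly rather than per-server.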
## Implementation

### MCP Server Registry

| Server | Port | Purpose |
|--------|------|---------|
| LLM Gateway | 9001 | Route LLM requests with failover |
| Git MCP | 9002 | Git operations across providers |
| Knowledge Base MCP | 9003 | RAG and document search |
| Issues MCP | 9004 | Issue tracking operations |
| File System MCP | 9005 | Workspace file operations |
| Code Analysis MCP | 9006 | Static analysis, linting |
| CI/CD MCP | 9007 | Pipeline operations |

### Framework Selection

Use **FastMCP 2.0** for all MCP server implementations:
- Decorator-based tool registration
- Built-in async support
- Compatible with SSE transport
- Type-safe with Pydantic

### Tool Signature Pattern

```python
@mcp.tool()
def tool_name(
    project_id: str,  # Required: project scope
    agent_id: str,    # Required: calling agent
    # ... tool-specific params
) -> Result:
    validate_access(agent_id, project_id)
    log_tool_usage(agent_id, project_id, "tool_name")
    # ... implementation
```

## Consequences

### Positive
- A single deployment per MCP type simplifies operations
- Consistent interface across all tools
- Monitoring and logging are easy to add centrally
- Cross-project analytics become possible

### Negative
- All tools must include scoping parameters
- Shared state requires careful design
- Single point of failure per MCP type (mitigated by running multiple instances)

### Neutral
- Requires an MCP client manager in the FastAPI backend
- Authentication handled internally (service tokens for v1)

## Compliance

This decision aligns with:
- FR-802: MCP-first architecture requirement
- NFR-201: Horizontal scalability requirement
- NFR-602: Centralized logging requirement

---

*This ADR supersedes any previous decisions regarding MCP architecture.*

**File:** docs/adrs/ADR-002-realtime-communication.md

# ADR-002: Real-time Communication Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-003

---

## Context

Syndarix requires real-time communication for:
- Agent activity streams
- Project progress updates
- Build/pipeline status
- Client approval requests
- Issue change notifications
- Interactive chat with agents

We need to decide between WebSocket and Server-Sent Events (SSE) for real-time data delivery.

## Decision Drivers

- **Simplicity:** Minimize implementation complexity
- **Reliability:** Built-in reconnection handling
- **Scalability:** Support 200+ concurrent connections
- **Compatibility:** Work through proxies and load balancers
- **Use Case Fit:** Match communication patterns

## Considered Options

### Option 1: WebSocket Only

Use WebSocket for all real-time communication.

**Pros:**
- Bidirectional communication
- Single protocol to manage
- Well supported in FastAPI

**Cons:**
- Manual reconnection logic required
- More complex through proxies
- Overkill for server-to-client streams

### Option 2: SSE Only

Use Server-Sent Events for all real-time communication.

**Pros:**
- Built-in automatic reconnection
- Native HTTP (proxy-friendly)
- Simpler implementation

**Cons:**
- Unidirectional only
- Browser connection limits per domain

### Option 3: SSE Primary + WebSocket for Chat (Selected)

Use SSE for server-to-client events and WebSocket for bidirectional chat.

**Pros:**
- The best tool for each use case
- SSE simplicity covers roughly 90% of needs
- WebSocket only where truly needed

**Cons:**
- Two protocols to manage

## Decision

**Adopt Option 3: SSE as the primary transport, WebSocket for interactive chat.**

### SSE Use Cases (~90%)
- Agent activity streams
- Project progress updates
- Build/pipeline status
- Approval request notifications
- Issue change notifications

### WebSocket Use Cases (~10%)
- Interactive chat with agents
- Real-time debugging sessions
- Future collaboration features

## Implementation

### Event Bus with Redis Pub/Sub

```
FastAPI Backend ──publish──> Redis Pub/Sub ──subscribe──> SSE Endpoints
                                  │
                                  └──> Other Backend Instances
```

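The fan-out above can be sketched with an in-memory stand-in for Redis Pub/Sub; the `EventBus` class, method names, and channel format are illustrative (a production bus would use `redis.asyncio` so events also reach other backend instances):

```python
import asyncio
from collections import defaultdict


class EventBus:
    """In-memory stand-in for the Redis Pub/Sub fan-out shown above."""

    def __init__(self) -> None:
        # channel name -> queues of all current subscribers
        self._channels: dict[str, list[asyncio.Queue]] = defaultdict(list)

    async def subscribe(self, channel: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._channels[channel].append(queue)
        return queue

    async def publish(self, channel: str, event: dict) -> None:
        # Every subscriber (SSE endpoint, other backend instance) gets a copy
        for queue in self._channels[channel]:
            await queue.put(event)


async def demo() -> dict:
    bus = EventBus()
    sub = await bus.subscribe("project:42")
    await bus.publish("project:42", {"type": "agent_started"})
    return await sub.get()
```

The key design point is that publishers never talk to SSE connections directly; they publish to a channel keyed by project, and each connection drains its own queue.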
### SSE Endpoint Pattern

```python
import asyncio

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()


@router.get("/projects/{project_id}/events")
async def project_events(project_id: str, request: Request):
    async def event_generator():
        # event_bus is the application-wide Redis-backed bus
        subscriber = await event_bus.subscribe(f"project:{project_id}")
        try:
            while not await request.is_disconnected():
                try:
                    event = await asyncio.wait_for(
                        subscriber.get_event(), timeout=30.0
                    )
                except asyncio.TimeoutError:
                    # No event within 30s: send an SSE comment as a keepalive
                    yield ": keepalive\n\n"
                    continue
                yield f"event: {event.type}\ndata: {event.json()}\n\n"
        finally:
            await subscriber.unsubscribe()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )
```

### Event Types

| Category | Event Types |
|----------|-------------|
| Agent | `agent_started`, `agent_activity`, `agent_completed`, `agent_error` |
| Project | `issue_created`, `issue_updated`, `issue_closed` |
| Git | `branch_created`, `commit_pushed`, `pr_created`, `pr_merged` |
| Workflow | `approval_required`, `sprint_started`, `sprint_completed` |
| Pipeline | `pipeline_started`, `pipeline_completed`, `pipeline_failed` |

### Client Implementation

- Single SSE connection per project
- Event multiplexing through event types
- Exponential backoff on reconnection
- Native `EventSource` API with automatic reconnect

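For the WebSocket side (SSE reconnection is handled by `EventSource` itself), the backoff policy might look like this; the base and cap values are illustrative, and a real client would typically add jitter:

```python
def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay in seconds before reconnection attempt `attempt` (0-based).

    Doubles on each attempt and is capped so a long outage does not
    push the wait time to minutes.
    """
    return min(cap, base * 2 ** attempt)
```

So attempts wait 1s, 2s, 4s, 8s, ... up to the 30s cap.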
## Consequences

### Positive
- Simpler implementation for server-to-client streams
- Automatic reconnection reduces client complexity
- Works through all HTTP proxies
- Reduced server resource usage compared to WebSocket

### Negative
- Two protocols to maintain
- WebSocket requires manual reconnect logic
- SSE is limited to ~6 connections per domain (HTTP/1.1)

### Mitigation
- Use HTTP/2 where possible (higher connection limits)
- Multiplex all project events on a single connection
- Use WebSocket only for interactive chat sessions

## Compliance

This decision aligns with:
- FR-105: Real-time agent activity monitoring
- NFR-102: 200+ concurrent connections requirement
- NFR-501: Responsive UI updates

---

*This ADR supersedes any previous decisions regarding real-time communication.*

**File:** docs/adrs/ADR-003-background-task-architecture.md

# ADR-003: Background Task Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-004

---

## Context

Syndarix requires background task processing for:
- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)

These tasks are too slow for synchronous API responses and need proper queuing, retries, and monitoring.

## Decision Drivers

- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI

## Considered Options

### Option 1: FastAPI BackgroundTasks

Use FastAPI's built-in background tasks.

**Pros:**
- Simple, no additional infrastructure
- Direct async integration

**Cons:**
- No persistence (tasks are lost on restart)
- No retry mechanism
- No distributed workers

### Option 2: Celery + Redis (Selected)

Use Celery as the task queue with Redis as broker and result backend.

**Pros:**
- Mature and battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower

**Cons:**
- Additional infrastructure
- Sync-only task execution (a bridge is needed for async code)

### Option 3: Dramatiq + Redis

Use Dramatiq as a simpler Celery alternative.

**Pros:**
- Simpler API than Celery
- Good async support

**Cons:**
- Less mature ecosystem
- Fewer monitoring tools

### Option 4: ARQ (Async Redis Queue)

Use ARQ for native async task processing.

**Pros:**
- Native async
- Simple API

**Cons:**
- Less feature-rich
- Smaller community

## Decision

**Adopt Option 2: Celery + Redis.**

Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.

## Implementation

### Queue Architecture

```
┌─────────────────────────────────────────────────┐
│           Redis (Broker + Backend)              │
├─────────────┬─────────────┬─────────────────────┤
│ agent_queue │  git_queue  │     sync_queue      │
│ (prefetch=1)│ (prefetch=4)│    (prefetch=4)     │
└──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │                 │
       ▼             ▼                 ▼
  ┌─────────┐   ┌─────────┐      ┌─────────┐
  │  Agent  │   │   Git   │      │  Sync   │
  │ Workers │   │ Workers │      │ Workers │
  └─────────┘   └─────────┘      └─────────┘
```

### Queue Configuration

| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |

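A hypothetical Celery routing configuration matching the table above; the task module paths and app name are illustrative, not taken from the ADR:

```python
# Route tasks to queues by module path (illustrative names).
task_routes = {
    "syndarix.tasks.agent.*": {"queue": "agent_queue"},
    "syndarix.tasks.git.*": {"queue": "git_queue"},
    "syndarix.tasks.sync.*": {"queue": "sync_queue"},
    "syndarix.tasks.cicd.*": {"queue": "cicd_queue"},
}

# Prefetch and concurrency from the table are per-worker settings, e.g.:
#   celery -A syndarix worker -Q agent_queue --prefetch-multiplier=1 --concurrency=4
#   celery -A syndarix worker -Q git_queue --prefetch-multiplier=4 --concurrency=8
```

Keeping `agent_queue` at prefetch 1 means each worker process reserves only one LLM task at a time, which is what makes the rate limiting enforceable.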
### Task Patterns

**Progress Reporting:**

```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    # `steps` is the list of implementation steps planned earlier
    # in the task (planning elided here)
    for i, step in enumerate(steps):
        self.update_state(
            state="PROGRESS",
            meta={"current": i + 1, "total": len(steps)},
        )
        # Publish an SSE event for a real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i + 1,
            "total": len(steps),
        })
        execute_step(step)
```

**Task Chaining:**

```python
from celery import chain

workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s(),
)
```

### Monitoring

- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Failed tasks retained for investigation

## Consequences

### Positive
- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools

### Negative
- Additional infrastructure (Redis, workers)
- Celery is synchronous (an event-loop bridge is needed for async calls)
- Learning curve for task patterns

### Mitigation
- Use the existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or an equivalent async-to-sync bridge
- Document common task patterns

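The event-loop bridge in the mitigation list can be sketched as follows; the gateway call is a stand-in for any async-only client code the task needs:

```python
import asyncio


async def call_llm_gateway(prompt: str) -> str:
    """Stand-in for an async-only client call (e.g. the LLM gateway)."""
    await asyncio.sleep(0)  # represents awaiting the real I/O
    return f"response to: {prompt}"


def agent_task(prompt: str) -> str:
    """Body of a synchronous Celery task bridging into async code.

    Each task run gets its own short-lived event loop, so the sync
    worker process can drive async client libraries safely.
    """
    return asyncio.run(call_llm_gateway(prompt))
```

One caveat worth documenting alongside the pattern: `asyncio.run()` creates and tears down a loop per call, so connection pools that must outlive one call need a longer-lived loop (e.g. one per worker process).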
## Compliance

This decision aligns with:
- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance

---

*This ADR supersedes any previous decisions regarding background task processing.*

**File:** docs/adrs/ADR-004-llm-provider-abstraction.md

# ADR-004: LLM Provider Abstraction

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-005

---

## Context

Syndarix agents require access to large language models (LLMs) from multiple providers:
- **Anthropic** (Claude): primary provider
- **OpenAI** (GPT-4): fallback provider
- **Local models** (Ollama/Llama): cost optimization, privacy

We need a unified abstraction layer that provides:
- A consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance

## Decision Drivers

- **Reliability:** Automatic failover on provider outages
- **Cost Control:** Track and limit API spending
- **Flexibility:** Easy to add or swap providers
- **Consistency:** A single interface for all agents
- **Async Support:** Compatible with async FastAPI

## Considered Options

### Option 1: Direct Provider SDKs

Use the Anthropic and OpenAI SDKs directly with a custom abstraction.

**Pros:**
- Full control over the implementation
- No external dependencies

**Cons:**
- Significant development effort
- Failover logic must be maintained in-house
- Token costs must be tracked manually

### Option 2: LiteLLM (Selected)

Use LiteLLM as the unified abstraction layer.

**Pros:**
- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Built-in cost tracking
- Redis caching support
- Active community

**Cons:**
- External dependency
- May lag behind provider SDK updates

### Option 3: LangChain

Use LangChain's LLM abstraction.

**Pros:**
- Large ecosystem
- Many integrations

**Cons:**
- Heavy dependency
- Overkill for LLM abstraction alone
- Complexity overhead

## Decision

**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**

LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.

## Implementation

### Model Groups

| Group Name | Use Case | Primary Model | Fallback |
|------------|----------|---------------|----------|
| `high-reasoning` | Complex analysis, architecture | Claude 3.5 Sonnet | GPT-4 Turbo |
| `fast-response` | Quick tasks, simple queries | Claude 3 Haiku | GPT-4o Mini |
| `cost-optimized` | High-volume, non-critical | Local Llama 3 | Claude 3 Haiku |

### Failover Chain

```
Claude 3.5 Sonnet (Anthropic)
        │
        ▼ (on failure)
GPT-4 Turbo (OpenAI)
        │
        ▼ (on failure)
Llama 3 (Ollama/Local)
        │
        ▼ (on failure)
Error with retry
```

### LLM Gateway Service

```python
from litellm import Router


class LLMGateway:
    def __init__(self):
        # model_list maps group names to provider deployments
        # (see the Model Groups table above)
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        await self._track_usage(agent_id, project_id, response)
        return response
```

### Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Ollama (local) | $0.00 | $0.00 |

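As a sketch, per-request cost follows directly from the table above; the prices are the table's values, while the dictionary keys and function name are illustrative:

```python
# Per-million-token prices (input, output) in USD, from the table above.
PRICES = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o-mini": (0.15, 0.60),
    "ollama-local": (0.00, 0.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single completion, per the pricing table."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

This is the calculation `_track_usage` would apply to each response's token counts before aggregating per agent and per project.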
### Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|------------|------------------|-----------|
| Product Owner | high-reasoning | Complex requirements analysis |
| Software Architect | high-reasoning | Architecture decisions |
| Software Engineer | high-reasoning | Code generation |
| QA Engineer | fast-response | Test case generation |
| DevOps Engineer | fast-response | Config generation |
| Project Manager | fast-response | Status updates |

### Caching Strategy

- **Redis-backed cache** for repeated queries
- **TTL:** 1 hour for general queries
- **Skip cache:** for context-dependent generation
- **Cache key:** hash of (model, messages, temperature)

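A minimal sketch of such a cache key, assuming the message list is JSON-serializable; the function name is illustrative:

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic cache key over (model, messages, temperature).

    sort_keys makes the JSON canonical, so logically equal requests
    hash to the same key regardless of dict insertion order.
    """
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because temperature is part of the key, a deterministic (temperature 0) request never collides with a sampled one for the same prompt.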
## Consequences

### Positive
- A single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs

### Negative
- Dependency on the LiteLLM library
- May lag behind provider SDK features
- An additional abstraction layer

### Mitigation
- Pin the LiteLLM version and test before upgrades
- Direct SDK access remains available if needed
- Monitor LiteLLM updates for breaking changes

## Compliance

This decision aligns with:
- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation

---

*This ADR supersedes any previous decisions regarding LLM integration.*

**File:** docs/adrs/ADR-005-tech-stack-selection.md

# ADR-005: Technology Stack Selection

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team

---

## Context

Syndarix needs a robust, modern technology stack that can support:
- Multi-agent orchestration with real-time communication
- A full-stack web application with an API backend
- Background task processing for long-running operations
- Vector search for RAG (Retrieval-Augmented Generation)
- Multiple external integrations via MCP

The decision was made to build upon **PragmaStack** as the foundation, extending it with Syndarix-specific components.

## Decision Drivers

- **Productivity:** Rapid development with modern frameworks
- **Type Safety:** Minimize runtime errors
- **Async Performance:** Handle concurrent agent operations
- **Ecosystem:** Rich library support
- **Familiarity:** Team expertise with the selected technologies
- **Production-Ready:** Proven technologies for production workloads

## Decision

**Adopt PragmaStack as the foundation, with Syndarix-specific extensions.**

### Core Stack (from PragmaStack)

| Layer | Technology | Version | Rationale |
|-------|------------|---------|-----------|
| **Backend** | FastAPI | 0.115+ | Async, OpenAPI, type hints |
| **Backend Language** | Python | 3.11+ | Type hints, async/await, ecosystem |
| **Frontend** | Next.js | 16 | React 19, server components, App Router |
| **Frontend Language** | TypeScript | 5.0+ | Type safety, IDE support |
| **Database** | PostgreSQL | 15+ | Robust, extensible, pgvector |
| **ORM** | SQLAlchemy | 2.0+ | Async support, type hints |
| **Validation** | Pydantic | 2.0+ | Data validation, serialization |
| **State Management** | Zustand | 4.0+ | Simple, performant |
| **Data Fetching** | TanStack Query | 5.0+ | Caching, invalidation |
| **UI Components** | shadcn/ui | Latest | Accessible, customizable |
| **CSS** | Tailwind CSS | 4.0+ | Utility-first, fast styling |
| **Auth** | JWT | - | Dual-token (access + refresh) |

### Syndarix Extensions

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Task Queue** | Celery | 5.3+ | Background job processing |
| **Message Broker** | Redis | 7.0+ | Celery broker, caching, pub/sub |
| **Vector Store** | pgvector | Latest | Embeddings for RAG |
| **MCP Framework** | FastMCP | 2.0+ | MCP server development |
| **LLM Abstraction** | LiteLLM | Latest | Multi-provider LLM access |
| **Real-time** | SSE + WebSocket | - | Event streaming, chat |

### Testing Stack

| Type | Technology | Version | Purpose |
|------|------------|---------|---------|
| **Backend Unit** | pytest | 8.0+ | Python testing |
| **Backend Async** | pytest-asyncio | - | Async test support |
| **Backend Coverage** | coverage.py | - | Code coverage |
| **Frontend Unit** | Jest | 29+ | React testing |
| **Frontend Components** | React Testing Library | - | Component testing |
| **E2E** | Playwright | 1.40+ | Browser automation |

### DevOps Stack

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Containerization** | Docker | 24+ | Application packaging |
| **Orchestration** | Docker Compose | - | Local development |
| **CI/CD** | Gitea Actions | - | Automated pipelines |
| **Database Migrations** | Alembic | - | Schema versioning |

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                      Frontend (Next.js 16)                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │    Pages    │  │ Components  │  │   Stores    │              │
│  │ (App Router)│  │ (shadcn/ui) │  │  (Zustand)  │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└────────────────────────────┬────────────────────────────────────┘
                             │ REST + SSE + WebSocket
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Backend (FastAPI 0.115+)                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │     API     │  │  Services   │  │    CRUD     │              │
│  │   Routes    │  │    Layer    │  │    Layer    │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ LLM Gateway │  │ MCP Client  │  │  Event Bus  │              │
│  │  (LiteLLM)  │  │   Manager   │  │   (Redis)   │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└────────────────────────────┬────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        ▼                    ▼                    ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────────────────┐
│  PostgreSQL   │   │     Redis     │   │        MCP Servers        │
│  + pgvector   │   │ (Cache/Queue) │   │ (LLM, Git, KB, Issues...) │
└───────────────┘   └───────────────┘   └───────────────────────────┘
                             │
                             ▼
                    ┌───────────────┐
                    │    Celery     │
                    │    Workers    │
                    └───────────────┘
```

## Consequences

### Positive
- A proven, production-ready stack
- Strong typing throughout (Python + TypeScript)
- Excellent async performance
- Rich ecosystem for extensions
- Team familiarity reduces the learning curve

### Negative
- The Python GIL limits CPU-bound concurrency (mitigated by Celery workers)
- Multiple languages (Python + TypeScript) to maintain
- PostgreSQL requires management (vs. serverless options)

### Neutral
- PragmaStack provides a solid foundation but may include unused features
- The stack is opinionated, limiting some technology choices

## Version Pinning Strategy

| Component | Strategy | Rationale |
|-----------|----------|-----------|
| Python | 3.11+ (specific minor) | Stability |
| Node.js | 20 LTS | Long-term support |
| FastAPI | 0.115+ | Latest stable |
| Next.js | 16 | Current major |
| PostgreSQL | 15+ | Required for features |

## Compliance

This decision aligns with:
- NFR-601: Code quality standards (TypeScript, type hints)
- NFR-603: Docker containerization requirement
- TC-001 through TC-006: Technical constraints

---

*This ADR establishes the foundational technology choices for Syndarix.*

260 docs/adrs/ADR-006-agent-orchestration.md Normal file
@@ -0,0 +1,260 @@
# ADR-006: Agent Orchestration Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002

---

## Context

Syndarix requires an agent orchestration system that can:

- Define reusable agent types with specific capabilities
- Spawn multiple instances of the same type with unique identities
- Manage agent state, context, and conversation history
- Route messages between agents
- Handle agent failover and recovery
- Track resource usage per agent

## Decision Drivers

- **Flexibility:** Support diverse agent roles and capabilities
- **Scalability:** Handle 50+ concurrent agent instances
- **Isolation:** Each instance maintains separate state
- **Observability:** Full visibility into agent activities
- **Reliability:** Graceful handling of failures

## Decision

**Adopt a Type-Instance pattern** where:

- **Agent Types** define templates (model, expertise, personality)
- **Agent Instances** are spawned from types with unique identities
- An **Agent Orchestrator** manages lifecycle and communication

## Architecture

### Agent Type Definition
```python
class AgentType(Base):
    id = Column(UUID, primary_key=True)
    name = Column(String(50), unique=True)     # "Software Engineer"
    role = Column(Enum(AgentRole))             # ENGINEER
    base_model = Column(String(100))           # "claude-3-5-sonnet-20241022"
    failover_model = Column(String(100))       # "gpt-4-turbo"
    expertise = Column(ARRAY(String))          # ["python", "fastapi", "testing"]
    personality = Column(JSONB)                # {"style": "detailed", "tone": "professional"}
    system_prompt = Column(Text)               # Base system prompt template
    capabilities = Column(ARRAY(String))       # ["code_generation", "code_review"]
    is_active = Column(Boolean, default=True)
```

### Agent Instance Definition

```python
class AgentInstance(Base):
    id = Column(UUID, primary_key=True)
    name = Column(String(50))                  # "Dave"
    agent_type_id = Column(UUID, ForeignKey("agent_types.id"))
    project_id = Column(UUID, ForeignKey("projects.id"))
    status = Column(Enum(InstanceStatus))      # ACTIVE, IDLE, TERMINATED
    context = Column(JSONB)                    # Current working context
    conversation_id = Column(UUID)             # Active conversation
    rag_collection_id = Column(String)         # Domain knowledge collection
    token_usage = Column(JSONB)                # {"prompt": 0, "completion": 0}
    last_active_at = Column(DateTime)
    created_at = Column(DateTime)
    terminated_at = Column(DateTime)
```

### Orchestrator Service

```python
class AgentOrchestrator:
    """Central service for agent lifecycle management."""

    async def spawn_agent(
        self,
        agent_type_id: UUID,
        project_id: UUID,
        name: str,
        domain_knowledge: list[str] | None = None,
    ) -> AgentInstance:
        """Spawn a new agent instance from a type definition."""
        agent_type = await self.get_agent_type(agent_type_id)

        instance = AgentInstance(
            name=name,
            agent_type_id=agent_type_id,
            project_id=project_id,
            status=InstanceStatus.ACTIVE,
            context={"initialized_at": datetime.utcnow().isoformat()},
        )

        # Initialize RAG collection if domain knowledge provided
        if domain_knowledge:
            instance.rag_collection_id = await self._init_rag_collection(
                instance.id, domain_knowledge
            )

        self.db.add(instance)  # Session.add is synchronous
        await self.db.commit()

        # Publish spawn event
        await self.event_bus.publish(f"project:{project_id}", {
            "type": "agent_spawned",
            "agent_id": str(instance.id),
            "name": name,
            "role": agent_type.role.value,
        })

        return instance

    async def terminate_agent(self, instance_id: UUID) -> None:
        """Terminate an agent instance and release resources."""
        instance = await self.get_instance(instance_id)
        instance.status = InstanceStatus.TERMINATED
        instance.terminated_at = datetime.utcnow()

        # Clean up the RAG collection
        if instance.rag_collection_id:
            await self._cleanup_rag_collection(instance.rag_collection_id)

        await self.db.commit()

    async def send_message(
        self,
        from_id: UUID,
        to_id: UUID,
        message: AgentMessage,
    ) -> None:
        """Route a message from one agent to another."""
        # Validate that both agents exist and are active
        sender = await self.get_instance(from_id)
        recipient = await self.get_instance(to_id)

        # Persist the message
        await self.message_store.save(message)

        # If the recipient is idle, trigger an action
        if recipient.status == InstanceStatus.IDLE:
            await self._trigger_agent_action(recipient.id, message)

        # Publish for real-time tracking
        await self.event_bus.publish(f"project:{sender.project_id}", {
            "type": "agent_message",
            "from": str(from_id),
            "to": str(to_id),
            "preview": message.content[:100],
        })

    async def broadcast(
        self,
        from_id: UUID,
        target_role: AgentRole,
        message: AgentMessage,
    ) -> None:
        """Broadcast a message to all agents of a specific role."""
        sender = await self.get_instance(from_id)
        recipients = await self.get_instances_by_role(
            sender.project_id, target_role
        )

        for recipient in recipients:
            await self.send_message(from_id, recipient.id, message)
```

### Agent Execution Pattern

```python
class AgentRunner:
    """Executes agent actions using LLM."""

    def __init__(self, instance: AgentInstance, llm_gateway: LLMGateway):
        self.instance = instance
        self.llm = llm_gateway

    async def execute(self, action: str, context: dict) -> dict:
        """Execute an action using the agent's configured model."""
        agent_type = await self.get_agent_type(self.instance.agent_type_id)

        # Build messages with system prompt and context
        messages = [
            {"role": "system", "content": self._build_system_prompt(agent_type)},
            *self._get_conversation_history(),
            {"role": "user", "content": self._build_action_prompt(action, context)},
        ]

        # Add RAG context if available
        if self.instance.rag_collection_id:
            rag_context = await self._query_rag(action, context)
            messages.insert(1, {
                "role": "system",
                "content": f"Relevant context:\n{rag_context}",
            })

        # Execute with failover
        response = await self.llm.complete(
            agent_id=str(self.instance.id),
            project_id=str(self.instance.project_id),
            messages=messages,
            model_preference=self._get_model_preference(agent_type),
        )

        # Update instance context
        self.instance.context = {
            **self.instance.context,
            "last_action": action,
            "last_response_at": datetime.utcnow().isoformat(),
        }

        return response
```

### Agent Roles

| Role | Instances | Primary Capabilities |
|------|-----------|----------------------|
| Product Owner | 1 | requirements, prioritization, client_communication |
| Project Manager | 1 | planning, tracking, coordination |
| Business Analyst | 1 | analysis, documentation, process_modeling |
| Software Architect | 1 | design, architecture_decisions, tech_selection |
| Software Engineer | 1-5 | code_generation, code_review, testing |
| UI/UX Designer | 1 | design, wireframes, accessibility |
| QA Engineer | 1-2 | test_planning, test_automation, bug_reporting |
| DevOps Engineer | 1 | cicd, infrastructure, deployment |
| AI/ML Engineer | 1 | ml_development, model_training, mlops |
| Security Expert | 1 | security_review, vulnerability_assessment |

## Consequences

### Positive

- Clear separation between type definition and instance runtime
- Multiple instances share type configuration (DRY)
- Easy to add new agent roles
- Full observability through events
- Graceful failure handling with model failover

### Negative

- Complexity in managing the instance lifecycle
- State synchronization across instances
- Memory overhead for context storage

### Mitigation

- Context archival for long-running instances
- Periodic cleanup of terminated instances
- State compression for large contexts

## Compliance

This decision aligns with:

- FR-101: Agent type configuration
- FR-102: Agent instance spawning
- FR-103: Agent domain knowledge (RAG)
- FR-104: Inter-agent communication
- FR-105: Agent activity monitoring

---

*This ADR establishes the agent orchestration architecture for Syndarix.*
487 docs/architecture/ARCHITECTURE_OVERVIEW.md Normal file
@@ -0,0 +1,487 @@
# Syndarix Architecture Overview

**Version:** 1.0
**Date:** 2025-12-29
**Status:** Draft

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [System Context](#2-system-context)
3. [High-Level Architecture](#3-high-level-architecture)
4. [Core Components](#4-core-components)
5. [Data Architecture](#5-data-architecture)
6. [Integration Architecture](#6-integration-architecture)
7. [Security Architecture](#7-security-architecture)
8. [Deployment Architecture](#8-deployment-architecture)
9. [Cross-Cutting Concerns](#9-cross-cutting-concerns)
10. [Architecture Decisions](#10-architecture-decisions)

---

## 1. Executive Summary

Syndarix is an AI-powered software consulting agency platform that orchestrates specialized AI agents to deliver complete software solutions autonomously. This document describes the technical architecture that enables:

- **Multi-Agent Orchestration:** 10 specialized agent roles collaborating on projects
- **MCP-First Integration:** All external tools via the Model Context Protocol
- **Real-time Visibility:** SSE-based event streaming for progress tracking
- **Autonomous Workflows:** Configurable autonomy levels, from full control to fully autonomous
- **Full Artifact Delivery:** Code, documentation, tests, and ADRs

### Architecture Principles

1. **MCP-First:** All integrations through unified MCP servers
2. **Event-Driven:** Async communication via Redis Pub/Sub
3. **Type-Safe:** Full typing in Python and TypeScript
4. **Stateless Services:** Horizontal scaling through stateless design
5. **Explicit Scoping:** All operations scoped to a project/agent

---

## 2. System Context

### Context Diagram
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               EXTERNAL ACTORS                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│   │   Client    │   │    Admin    │   │  LLM APIs   │   │  Git Hosts  │     │
│   │   (Human)   │   │   (Human)   │   │ (Anthropic) │   │   (Gitea)   │     │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘     │
│          │                 │                 │                 │            │
└──────────│─────────────────│─────────────────│─────────────────│────────────┘
           │                 │                 │                 │
           │ Web UI          │ Admin UI        │ API             │ API
           │ SSE             │                 │                 │
           ▼                 ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│                              SYNDARIX PLATFORM                              │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                         Agent Orchestration                         │   │
│   │   ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │   │
│   │   │   PO   │  │   PM   │  │  Arch  │  │  Eng   │  │   QA   │  ...   │   │
│   │   └────────┘  └────────┘  └────────┘  └────────┘  └────────┘        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
           │                 │                 │                 │
           │ Storage         │ Events          │ Tasks           │
           ▼                 ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               INFRASTRUCTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│   │ PostgreSQL  │   │    Redis    │   │   Celery    │   │ MCP Servers │     │
│   │ + pgvector  │   │   Pub/Sub   │   │   Workers   │   │  (7 types)  │     │
│   └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Actors

| Actor | Type | Interaction |
|-------|------|-------------|
| Client | Human | Web UI, approvals, feedback |
| Admin | Human | Configuration, monitoring |
| LLM Providers | External | Claude, GPT-4, local models |
| Git Hosts | External | Gitea, GitHub, GitLab |
| CI/CD Systems | External | Gitea Actions, etc. |

---
## 3. High-Level Architecture

### Layered Architecture

```
┌───────────────────────────────────────────────────────────────────┐
│                        PRESENTATION LAYER                         │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                     Next.js 16 Frontend                     │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │  │
│  │  │Dashboard │  │ Projects │  │  Agents  │  │  Issues  │     │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                │
                                │ REST + SSE + WebSocket
                                ▼
┌───────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                          │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                       FastAPI Backend                       │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │  │
│  │  │   Auth   │  │   API    │  │ Services │  │  Events  │     │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATION LAYER                        │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐    │  │
│  │  │     Agent     │  │   Workflow    │  │    Project    │    │  │
│  │  │ Orchestrator  │  │    Engine     │  │    Manager    │    │  │
│  │  └───────────────┘  └───────────────┘  └───────────────┘    │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────────┐
│                        INTEGRATION LAYER                          │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                     MCP Client Manager                      │  │
│  │   Connects to: LLM, Git, KB, Issues, FS, Code, CI/CD MCPs   │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────────┐
│                           DATA LAYER                              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐           │
│  │  PostgreSQL  │   │    Redis     │   │  File Store  │           │
│  │  + pgvector  │   │              │   │              │           │
│  └──────────────┘   └──────────────┘   └──────────────┘           │
└───────────────────────────────────────────────────────────────────┘
```

---

## 4. Core Components

### 4.1 Agent Orchestrator

**Purpose:** Manages the agent lifecycle: spawning, communication, and coordination.

**Responsibilities:**
- Spawn agent instances from type definitions
- Route messages between agents
- Manage agent context and memory
- Handle agent failover
- Track resource usage

**Key Patterns:**
- Type-Instance pattern (types define templates, instances are runtime)
- Message routing with priority queues
- Context compression for long-running agents

See: [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md)

### 4.2 Workflow Engine

**Purpose:** Orchestrates multi-step workflows and agent collaboration.

**Responsibilities:**
- Execute workflow templates (requirements discovery, sprint, etc.)
- Track workflow state and progress
- Handle branching and conditions
- Manage approval gates

**Workflow Types:**
- Requirements Discovery
- Architecture Spike
- Sprint Planning
- Implementation
- Sprint Demo

### 4.3 Project Manager (Component)

**Purpose:** Manages the project lifecycle, configuration, and state.

**Responsibilities:**
- Create and configure projects
- Manage complexity levels
- Track project status
- Generate reports
### 4.4 LLM Gateway

**Purpose:** Unified LLM access with failover and cost tracking.

**Implementation:** LiteLLM-based router with:
- Multiple model groups (high-reasoning, fast-response)
- An automatic failover chain
- Per-agent token tracking
- Redis-backed caching

See: [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md)

### 4.5 MCP Client Manager

**Purpose:** Connects to all MCP servers and routes tool calls.

**Implementation:**
- SSE connections to 7 MCP server types
- Automatic reconnection
- Request/response correlation
- Scoped tool calls with project_id/agent_id

See: [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md)
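A scoped tool call might be assembled as below. The envelope shape and field names are assumptions for illustration; the key point is that every call to a unified singleton server carries `project_id`/`agent_id` plus a request id for correlating the response on the SSE stream:

```python
import uuid


def build_scoped_tool_call(
    tool: str,
    arguments: dict,
    project_id: str,
    agent_id: str,
) -> dict:
    """Wrap a tool invocation with scoping fields and a request id
    used to correlate the asynchronous response."""
    return {
        "request_id": str(uuid.uuid4()),
        "tool": tool,
        "arguments": {
            **arguments,
            # Scoping: unified singleton servers filter on these.
            "project_id": project_id,
            "agent_id": agent_id,
        },
    }
```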

### 4.6 Event Bus

**Purpose:** Real-time event distribution using Redis Pub/Sub.

**Channels:**
- `project:{project_id}` - Project-scoped events
- `agent:{agent_id}` - Agent-specific events
- `system` - System-wide announcements

See: [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md)
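The channel scheme can be captured in a thin wrapper. For brevity this sketch is synchronous and takes an injectable publish function; in production the publisher would be `redis.asyncio.Redis(...).publish` and the calls would be awaited:

```python
import json
from typing import Any, Callable


def project_channel(project_id: str) -> str:
    return f"project:{project_id}"


def agent_channel(agent_id: str) -> str:
    return f"agent:{agent_id}"


class EventBus:
    """Thin wrapper over a publish function (Redis in production)."""

    def __init__(self, publish_fn: Callable[[str, str], Any]):
        self._publish = publish_fn

    def publish(self, channel: str, event: dict) -> None:
        # Events are serialized as JSON so SSE consumers can relay
        # them to the frontend unchanged.
        self._publish(channel, json.dumps(event))
```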

---

## 5. Data Architecture

### 5.1 Entity Model
```
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│    User     │──1:N──│   Project   │──1:N──│   Sprint    │
└─────────────┘       └──────┬──────┘       └─────────────┘
                             │ 1:N
         ┌───────────────────┼───────────────────┐
         │                   │                   │
┌────────┴────────┐   ┌──────┴──────┐     ┌──────┴──────┐
│  AgentInstance  │   │ Repository  │     │    Issue    │
└────────┬────────┘   └──────┬──────┘     └──────┬──────┘
         │ 1:N               │ 1:N               │ 1:N
┌────────┴────────┐   ┌──────┴──────┐     ┌──────┴──────┐
│     Message     │   │ PullRequest │     │IssueComment │
└─────────────────┘   └─────────────┘     └─────────────┘
```

### 5.2 Key Entities

| Entity | Purpose | Key Fields |
|--------|---------|------------|
| User | Human users | email, auth |
| Project | Work containers | name, complexity, autonomy_level |
| AgentType | Agent templates | base_model, expertise, system_prompt |
| AgentInstance | Running agents | name, project_id, context |
| Issue | Work items | type, status, external_tracker_fields |
| Sprint | Time-boxed iterations | goal, velocity |
| Repository | Git repos | provider, clone_url |
| KnowledgeDocument | RAG documents | content, embedding_id |
### 5.3 Vector Storage

The **pgvector** extension is used for:

- Document embeddings (RAG)
- Semantic search across the knowledge base
- Agent context similarity
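As a sketch, a project-scoped semantic search over document embeddings might look like the following. Table and column names are illustrative; `<=>` is pgvector's cosine-distance operator:

```sql
-- Nearest-neighbour search over document embeddings.
-- Assumes CREATE EXTENSION vector; and an index (ivfflat or hnsw)
-- on the embedding column.
SELECT id,
       content,
       1 - (embedding <=> :query_embedding) AS similarity
FROM knowledge_documents
WHERE project_id = :project_id          -- explicit project scoping
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```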

---

## 6. Integration Architecture

### 6.1 MCP Server Registry

| Server | Port | Purpose | Priority Providers |
|--------|------|---------|--------------------|
| LLM Gateway | 9001 | LLM routing | Anthropic, OpenAI, Ollama |
| Git MCP | 9002 | Git operations | Gitea, GitHub, GitLab |
| Knowledge Base | 9003 | RAG search | pgvector |
| Issues MCP | 9004 | Issue tracking | Gitea, GitHub, GitLab |
| File System | 9005 | Workspace files | Local FS |
| Code Analysis | 9006 | Static analysis | Ruff, ESLint |
| CI/CD MCP | 9007 | Pipelines | Gitea Actions |
### 6.2 External Integration Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                        Syndarix Backend                         │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    MCP Client Manager                    │   │
│  │                                                          │   │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐  │   │
│  │  │  LLM   │ │  Git   │ │   KB   │ │ Issues │ │ CI/CD  │  │   │
│  │  │ Client │ │ Client │ │ Client │ │ Client │ │ Client │  │   │
│  │  └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘  │   │
│  └──────│──────────│──────────│──────────│──────────│──────┘   │
└─────────│──────────│──────────│──────────│──────────│──────────┘
          │          │          │          │          │
          │ SSE      │ SSE      │ SSE      │ SSE      │ SSE
          ▼          ▼          ▼          ▼          ▼
     ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
     │  LLM   │ │  Git   │ │   KB   │ │ Issues │ │ CI/CD  │
     │  MCP   │ │  MCP   │ │  MCP   │ │  MCP   │ │  MCP   │
     │ Server │ │ Server │ │ Server │ │ Server │ │ Server │
     └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
         │          │          │          │          │
         ▼          ▼          ▼          ▼          ▼
    ┌─────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
    │Anthropic│ │ Gitea  │ │pgvector│ │ Gitea  │ │ Gitea  │
    │ OpenAI  │ │ GitHub │ │        │ │ Issues │ │Actions │
    │ Ollama  │ │ GitLab │ │        │ │        │ │        │
    └─────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```

---

## 7. Security Architecture

### 7.1 Authentication

- **JWT Dual-Token:** Access token (15 min) + refresh token (7 days)
- **OAuth 2.0 Provider:** For MCP client authentication
- **Service Tokens:** Internal service-to-service auth
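For illustration, the dual-token issuance can be sketched with the standard library alone (HS256 via `hmac`). Production code would use a maintained JWT library such as PyJWT; the helper names and claim set here are hypothetical:

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(sub: str, secret: bytes, ttl_seconds: int) -> str:
    """Create a signed HS256 JWT with an expiry claim."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(
        json.dumps({"sub": sub, "exp": int(time.time()) + ttl_seconds}).encode()
    )
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def issue_token_pair(sub: str, secret: bytes) -> dict:
    """Access token: 15 minutes; refresh token: 7 days."""
    return {
        "access": issue_token(sub, secret, ttl_seconds=15 * 60),
        "refresh": issue_token(sub, secret, ttl_seconds=7 * 24 * 3600),
    }
```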

### 7.2 Authorization

- **RBAC:** Role-based access control
- **Project Scoping:** All operations are scoped to projects
- **Agent Permissions:** Agents operate within project scope

### 7.3 Data Protection

- **TLS 1.3:** All external communications
- **Encryption at Rest:** Database encryption
- **Secrets Management:** Environment-based, never in code

---

## 8. Deployment Architecture

### 8.1 Container Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Docker Compose                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Frontend │  │ Backend  │  │ Workers  │  │  Flower  │         │
│  │ (Next.js)│  │ (FastAPI)│  │ (Celery) │  │(Monitor) │         │
│  │  :3000   │  │  :8000   │  │          │  │  :5555   │         │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ LLM MCP  │  │ Git MCP  │  │  KB MCP  │  │Issues MCP│         │
│  │  :9001   │  │  :9002   │  │  :9003   │  │  :9004   │         │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                       │
│  │  FS MCP  │  │ Code MCP │  │CI/CD MCP │                       │
│  │  :9005   │  │  :9006   │  │  :9007   │                       │
│  └──────────┘  └──────────┘  └──────────┘                       │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                      Infrastructure                      │   │
│  │        ┌──────────┐            ┌──────────┐              │   │
│  │        │PostgreSQL│            │  Redis   │              │   │
│  │        │  :5432   │            │  :6379   │              │   │
│  │        └──────────┘            └──────────┘              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 8.2 Scaling Strategy

| Component | Scaling | Strategy |
|-----------|---------|----------|
| Frontend | Horizontal | Stateless, behind LB |
| Backend | Horizontal | Stateless, behind LB |
| Celery Workers | Horizontal | Queue-based routing |
| MCP Servers | Horizontal | Stateless singletons |
| PostgreSQL | Vertical + read replicas | Primary/replica |
| Redis | Cluster | Sentinel or Cluster mode |

---

## 9. Cross-Cutting Concerns

### 9.1 Logging

- **Format:** Structured JSON
- **Correlation:** Request IDs across services
- **Levels:** DEBUG, INFO, WARNING, ERROR, CRITICAL
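A minimal sketch of structured JSON logging with a correlation id, using only the standard library (the field set is illustrative, not the platform's actual log schema):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the request id
    used to correlate events across services."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Attached via `extra={"request_id": ...}` by middleware.
            "request_id": getattr(record, "request_id", None),
        })


def make_logger(name: str) -> logging.Logger:
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```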

### 9.2 Monitoring

- **Metrics:** Prometheus-compatible export
- **Traces:** OpenTelemetry (future)
- **Dashboards:** Grafana (optional)

### 9.3 Error Handling

- **Agent Errors:** Logged and published via SSE
- **Task Failures:** Celery retry with exponential backoff
- **Integration Errors:** Circuit-breaker pattern
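The retry delay schedule can be expressed as a small helper. This mirrors the shape of Celery's `retry_backoff`/`retry_backoff_max` behaviour (exponential growth with a cap), without the optional jitter; the defaults are illustrative:

```python
def retry_delay(retries: int, base: float = 1.0, cap: float = 600.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap`
    seconds so repeated failures do not grow unbounded."""
    return min(base * (2 ** retries), cap)
```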

---

## 10. Architecture Decisions

### Summary of ADRs

| ADR | Title | Status |
|-----|-------|--------|
| [ADR-001](../adrs/ADR-001-mcp-integration-architecture.md) | MCP Integration Architecture | Accepted |
| [ADR-002](../adrs/ADR-002-realtime-communication.md) | Real-time Communication | Accepted |
| [ADR-003](../adrs/ADR-003-background-task-architecture.md) | Background Task Architecture | Accepted |
| [ADR-004](../adrs/ADR-004-llm-provider-abstraction.md) | LLM Provider Abstraction | Accepted |
| [ADR-005](../adrs/ADR-005-tech-stack-selection.md) | Tech Stack Selection | Accepted |
| [ADR-006](../adrs/ADR-006-agent-orchestration.md) | Agent Orchestration | Accepted |

### Key Decisions Summary

1. **Unified singleton MCP servers** with project/agent scoping
2. **SSE for real-time events**, WebSocket only for chat
3. **Celery + Redis** for background tasks
4. **LiteLLM** for unified LLM abstraction with failover
5. **PragmaStack** as the foundation, with Syndarix extensions
6. **Type-Instance pattern** for agent orchestration
---

## Appendix A: Technology Stack Quick Reference

| Layer | Technology |
|-------|------------|
| Frontend | Next.js 16, React 19, TypeScript, Tailwind, shadcn/ui |
| Backend | FastAPI, Python 3.11+, SQLAlchemy 2.0, Pydantic 2.0 |
| Database | PostgreSQL 15+ with pgvector |
| Cache/Queue | Redis 7.0+ |
| Task Queue | Celery 5.3+ |
| MCP | FastMCP 2.0 |
| LLM | LiteLLM (Claude, GPT-4, Ollama) |
| Testing | pytest, Jest, Playwright |
| Container | Docker, Docker Compose |
---

## Appendix B: Port Reference

| Service | Port |
|---------|------|
| Frontend | 3000 |
| Backend | 8000 |
| PostgreSQL | 5432 |
| Redis | 6379 |
| Flower | 5555 |
| LLM MCP | 9001 |
| Git MCP | 9002 |
| KB MCP | 9003 |
| Issues MCP | 9004 |
| FS MCP | 9005 |
| Code MCP | 9006 |
| CI/CD MCP | 9007 |
---

*This document provides the comprehensive architecture overview for Syndarix. For detailed decisions, see the individual ADRs.*
288
docs/spikes/SPIKE-001-mcp-integration-pattern.md
Normal file
@@ -0,0 +1,288 @@
# SPIKE-001: MCP Integration Pattern

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #1

---
## Objective

Research the optimal pattern for integrating Model Context Protocol (MCP) servers with the FastAPI backend, focusing on unified singleton servers with project/agent scoping.

## Research Questions

1. What is the recommended MCP SDK for Python/FastAPI?
2. How should we structure unified MCP servers vs per-project servers?
3. What is the best pattern for project/agent scoping in MCP tools?
4. How do we handle authentication between Syndarix and MCP servers?
## Findings

### 1. FastMCP 2.0 - Recommended Framework

**FastMCP** is a high-level, Pythonic framework for building MCP servers that significantly reduces boilerplate compared to the low-level MCP SDK.

**Key Features:**
- Decorator-based tool registration (`@mcp.tool()`)
- Built-in context management for resources and prompts
- Support for server-sent events (SSE) and stdio transports
- Type-safe with Pydantic model support
- Async-first design compatible with FastAPI

**Installation:**
```bash
pip install fastmcp
```
**Basic Example:**
```python
from fastmcp import FastMCP

mcp = FastMCP("syndarix-knowledge-base")

@mcp.tool()
def search_knowledge(
    project_id: str,
    query: str,
    scope: str = "project"
) -> list[dict]:
    """Search the knowledge base with project scoping."""
    # Implementation here
    return results

@mcp.resource("project://{project_id}/config")
def get_project_config(project_id: str) -> dict:
    """Get project configuration."""
    return config
```
### 2. Unified Singleton Pattern (Recommended)

**Decision:** Use unified singleton MCP servers instead of per-project servers.

**Architecture:**
```
┌─────────────────────────────────────────────────────────┐
│                   Syndarix Backend                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │   Agent 1   │  │   Agent 2   │  │   Agent 3   │      │
│  │ (project A) │  │ (project A) │  │ (project B) │      │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘      │
│         │                │                │             │
│         └────────────────┼────────────────┘             │
│                          │                              │
│                          ▼                              │
│  ┌─────────────────────────────────────────────────┐    │
│  │            MCP Client (Singleton)               │    │
│  │     Maintains connections to all MCP servers    │    │
│  └─────────────────────────────────────────────────┘    │
└──────────────────────────┬──────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌────────────┐  ┌────────────┐  ┌────────────┐
    │  Git MCP   │  │   KB MCP   │  │  LLM MCP   │
    │ (Singleton)│  │ (Singleton)│  │ (Singleton)│
    └────────────┘  └────────────┘  └────────────┘
```
**Why Singleton:**
- Resource efficiency (one process per MCP type)
- Shared connection pools
- Centralized logging and monitoring
- Simpler deployment (7 services vs N×7)
- Cross-project learning possible (if needed)
**Scoping Pattern:**
```python
from typing import Literal

@mcp.tool()
def search_knowledge(
    project_id: str,  # Required - scopes to project
    agent_id: str,    # Required - identifies calling agent
    query: str,
    scope: Literal["project", "global"] = "project"
) -> SearchResults:
    """
    All tools accept project_id and agent_id for:
    - Access control validation
    - Audit logging
    - Context filtering
    """
    # Validate agent has access to project
    validate_access(agent_id, project_id)

    # Log the access
    log_tool_usage(agent_id, project_id, "search_knowledge")

    # Perform scoped search
    if scope == "project":
        return search_project_kb(project_id, query)
    else:
        return search_global_kb(query)
```
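The helpers above (`validate_access`, `log_tool_usage`, the `search_*` functions) are left undefined in this spike. A minimal sketch of what `validate_access` could look like, assuming an in-memory agent-to-project grant map (the store, names, and error type are all illustrative):

```python
# Hypothetical grant store; in practice this would be backed by the database.
AGENT_GRANTS: dict[str, set[str]] = {
    "agent-1": {"project-a"},
    "agent-2": {"project-a", "project-b"},
}

def validate_access(agent_id: str, project_id: str) -> None:
    """Raise PermissionError unless the agent has been granted the project."""
    if project_id not in AGENT_GRANTS.get(agent_id, set()):
        raise PermissionError(f"{agent_id} may not access {project_id}")
```

Centralizing the check this way keeps every tool's first line identical, which is what makes the singleton-with-scoping pattern auditable.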
### 3. MCP Server Registry Architecture

```python
# mcp/registry.py
from dataclasses import dataclass
from typing import Dict

@dataclass
class MCPServerConfig:
    name: str
    port: int
    transport: str  # "sse" or "stdio"
    enabled: bool = True

MCP_SERVERS: Dict[str, MCPServerConfig] = {
    "llm_gateway": MCPServerConfig("llm-gateway", 9001, "sse"),
    "git": MCPServerConfig("git-mcp", 9002, "sse"),
    "knowledge_base": MCPServerConfig("kb-mcp", 9003, "sse"),
    "issues": MCPServerConfig("issues-mcp", 9004, "sse"),
    "file_system": MCPServerConfig("fs-mcp", 9005, "sse"),
    "code_analysis": MCPServerConfig("code-mcp", 9006, "sse"),
    "cicd": MCPServerConfig("cicd-mcp", 9007, "sse"),
}
```
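The registry is what the client manager iterates to build connection endpoints. A sketch of deriving an SSE URL from a config entry — the `/sse` path and `localhost` host are assumptions, not part of the registry:

```python
from dataclasses import dataclass

@dataclass
class MCPServerConfig:
    name: str
    port: int
    transport: str  # "sse" or "stdio"
    enabled: bool = True

def sse_url(config: MCPServerConfig, host: str = "localhost") -> str:
    """Build a hypothetical SSE endpoint URL for a configured server."""
    return f"http://{host}:{config.port}/sse"
```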
### 4. Authentication Pattern

**MCP OAuth 2.0 Integration:**
```python
from fastmcp import FastMCP
from fastmcp.auth import OAuth2Bearer

mcp = FastMCP(
    "syndarix-mcp",
    auth=OAuth2Bearer(
        token_url="https://syndarix.local/oauth/token",
        scopes=["mcp:read", "mcp:write"]
    )
)
```

**Internal Service Auth (Recommended for v1):**
```python
# For internal deployment, use service tokens
@mcp.tool()
def create_issue(
    service_token: str,  # Validated internally
    project_id: str,
    title: str,
    body: str
) -> Issue:
    validate_service_token(service_token)
    # ... implementation
```
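`validate_service_token` is not defined in the spike; a minimal sketch using a constant-time comparison (the token source and error type are assumptions — in practice the secret would come from settings, not a module constant):

```python
import hmac

EXPECTED_TOKEN = "internal-service-secret"  # illustrative; load from config in practice

def validate_service_token(service_token: str) -> None:
    """Reject the call unless the token matches, using a timing-safe comparison."""
    if not hmac.compare_digest(service_token, EXPECTED_TOKEN):
        raise PermissionError("invalid service token")
```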
### 5. FastAPI Integration Pattern

```python
# app/mcp/client.py
from typing import Any
from contextlib import asynccontextmanager

from mcp import ClientSession
from mcp.client.sse import sse_client

class MCPClientManager:
    def __init__(self):
        self._sessions: dict[str, ClientSession] = {}

    async def connect_all(self):
        """Connect to all configured MCP servers."""
        for name, config in MCP_SERVERS.items():
            if config.enabled:
                session = await self._connect_server(config)
                self._sessions[name] = session

    async def call_tool(
        self,
        server: str,
        tool_name: str,
        arguments: dict
    ) -> Any:
        """Call a tool on a specific MCP server."""
        session = self._sessions[server]
        result = await session.call_tool(tool_name, arguments)
        return result.content

# Usage in FastAPI
mcp_client = MCPClientManager()

@app.on_event("startup")
async def startup():
    await mcp_client.connect_all()

@app.post("/api/v1/knowledge/search")
async def search_knowledge(request: SearchRequest):
    result = await mcp_client.call_tool(
        "knowledge_base",
        "search_knowledge",
        {
            "project_id": request.project_id,
            "agent_id": request.agent_id,
            "query": request.query
        }
    )
    return result
```
## Recommendations

### Immediate Actions

1. **Use FastMCP 2.0** for all MCP server implementations
2. **Implement unified singleton pattern** with explicit scoping
3. **Use SSE transport** for MCP server connections
4. **Service tokens** for internal auth (v1), OAuth 2.0 for the future

### MCP Server Priority

1. **LLM Gateway** - Critical for agent operation
2. **Knowledge Base** - Required for RAG functionality
3. **Git MCP** - Required for code delivery
4. **Issues MCP** - Required for project management
5. **File System** - Required for workspace operations
6. **Code Analysis** - Enhances code quality
7. **CI/CD** - Automates deployments
### Code Organization

```
syndarix/
├── backend/
│   └── app/
│       └── mcp/
│           ├── __init__.py
│           ├── client.py      # MCP client manager
│           ├── registry.py    # Server configurations
│           └── schemas.py     # Tool argument schemas
└── mcp_servers/
    ├── llm_gateway/
    │   ├── __init__.py
    │   ├── server.py
    │   └── tools.py
    ├── knowledge_base/
    ├── git/
    ├── issues/
    ├── file_system/
    ├── code_analysis/
    └── cicd/
```
## References

- [FastMCP Documentation](https://gofastmcp.com)
- [MCP Protocol Specification](https://spec.modelcontextprotocol.io)
- [Anthropic MCP SDK](https://github.com/anthropics/anthropic-sdk-mcp)

## Decision

**Adopt FastMCP 2.0** with unified singleton servers and explicit project/agent scoping for all MCP integrations.

---

*Spike completed. Findings will inform ADR-001: MCP Integration Architecture.*
338
docs/spikes/SPIKE-003-realtime-updates.md
Normal file
@@ -0,0 +1,338 @@
# SPIKE-003: Real-time Updates Architecture

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #3

---
## Objective

Evaluate WebSocket vs Server-Sent Events (SSE) for real-time updates in Syndarix, focusing on agent activity streams, progress updates, and client notifications.

## Research Questions

1. What are the trade-offs between WebSocket and SSE?
2. Which pattern best fits Syndarix's use cases?
3. How do we handle reconnection and reliability?
4. What is the FastAPI implementation approach?
## Findings

### 1. Use Case Analysis

| Use Case | Direction | Frequency | Latency Req |
|----------|-----------|-----------|-------------|
| Agent activity feed | Server → Client | High | Low |
| Sprint progress | Server → Client | Medium | Low |
| Build status | Server → Client | Low | Medium |
| Client approval requests | Server → Client | Low | High |
| Client messages | Client → Server | Low | Medium |
| Issue updates | Server → Client | Medium | Low |

**Key Insight:** 90%+ of real-time communication is **server-to-client** (unidirectional).
### 2. Technology Comparison

| Feature | Server-Sent Events (SSE) | WebSocket |
|---------|-------------------------|-----------|
| Direction | Unidirectional (server → client) | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | Custom (ws://) |
| Reconnection | Built-in automatic | Manual implementation |
| Connection limits | Limited per domain | Similar limits |
| Browser support | Excellent | Excellent |
| Through proxies | Native HTTP | May require config |
| Complexity | Simple | More complex |
| FastAPI support | Native | Native |
### 3. Recommendation: SSE for Primary, WebSocket for Chat

**SSE (Recommended for 90% of use cases):**
- Agent activity streams
- Progress updates
- Build/pipeline status
- Issue change notifications
- Approval request alerts

**WebSocket (For bidirectional needs):**
- Live chat with agents
- Interactive debugging sessions
- Real-time collaboration (future)
### 4. FastAPI SSE Implementation

```python
# app/api/v1/events.py
import asyncio

from fastapi import APIRouter, Depends, Request
from fastapi.responses import StreamingResponse

from app.services.events import EventBus

router = APIRouter()

@router.get("/projects/{project_id}/events")
async def project_events(
    project_id: str,
    request: Request,
    # User / get_current_user are app-provided auth helpers
    current_user: User = Depends(get_current_user)
):
    """Stream real-time events for a project."""

    async def event_generator():
        event_bus = EventBus()
        subscriber = await event_bus.subscribe(
            channel=f"project:{project_id}",
            user_id=current_user.id
        )

        try:
            while True:
                # Check if client disconnected
                if await request.is_disconnected():
                    break

                # Wait for next event (with timeout for keepalive)
                try:
                    event = await asyncio.wait_for(
                        subscriber.get_event(),
                        timeout=30.0
                    )
                    yield f"event: {event.type}\ndata: {event.json()}\n\n"
                except asyncio.TimeoutError:
                    # Send keepalive comment
                    yield ": keepalive\n\n"
        finally:
            await subscriber.unsubscribe()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        }
    )
```
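The `event:`/`data:` strings yielded by the generator follow the SSE wire format: one `event:` line naming the event type, one `data:` line with the payload, and a blank-line terminator. A tiny formatter (the function name is illustrative) makes this testable without a running server:

```python
import json

def format_sse(event_type: str, data: dict) -> str:
    """Render one SSE frame: event line, data line, blank-line terminator."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"

frame = format_sse("agent_activity", {"agent_id": "a1"})
```

Keeping frame construction in one place also makes it easy to add SSE `id:` lines later for resumable streams.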
### 5. Event Bus Architecture with Redis

```python
# app/services/events.py
import json
from dataclasses import dataclass

import redis.asyncio as redis

@dataclass
class Event:
    type: str
    data: dict
    project_id: str
    agent_id: str | None = None
    timestamp: float | None = None

class EventBus:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.pubsub = self.redis.pubsub()

    async def publish(self, channel: str, event: Event):
        """Publish an event to a channel."""
        await self.redis.publish(
            channel,
            json.dumps(event.__dict__)
        )

    async def subscribe(self, channel: str) -> "Subscriber":
        """Subscribe to a channel."""
        await self.pubsub.subscribe(channel)
        return Subscriber(self.pubsub, channel)

class Subscriber:
    def __init__(self, pubsub, channel: str):
        self.pubsub = pubsub
        self.channel = channel

    async def get_event(self) -> Event:
        """Get the next event (blocking)."""
        while True:
            message = await self.pubsub.get_message(
                ignore_subscribe_messages=True,
                timeout=1.0
            )
            if message and message["type"] == "message":
                data = json.loads(message["data"])
                return Event(**data)

    async def unsubscribe(self):
        await self.pubsub.unsubscribe(self.channel)
```
### 6. Client-Side Implementation

```typescript
// frontend/lib/events.ts
// Named ProjectEventStream so it does not shadow the built-in EventSource.
class ProjectEventStream {
  private eventSource: EventSource | null = null;
  private reconnectDelay = 1000;
  private maxReconnectDelay = 30000;

  connect(projectId: string, onEvent: (event: ProjectEvent) => void) {
    const url = `/api/v1/projects/${projectId}/events`;

    this.eventSource = new EventSource(url, {
      withCredentials: true
    });

    this.eventSource.onopen = () => {
      console.log('SSE connected');
      this.reconnectDelay = 1000; // Reset on success
    };

    this.eventSource.addEventListener('agent_activity', (e) => {
      onEvent({ type: 'agent_activity', data: JSON.parse(e.data) });
    });

    this.eventSource.addEventListener('issue_update', (e) => {
      onEvent({ type: 'issue_update', data: JSON.parse(e.data) });
    });

    this.eventSource.addEventListener('approval_required', (e) => {
      onEvent({ type: 'approval_required', data: JSON.parse(e.data) });
    });

    this.eventSource.onerror = () => {
      this.eventSource?.close();
      // Exponential backoff reconnect
      setTimeout(() => this.connect(projectId, onEvent), this.reconnectDelay);
      this.reconnectDelay = Math.min(
        this.reconnectDelay * 2,
        this.maxReconnectDelay
      );
    };
  }

  disconnect() {
    this.eventSource?.close();
    this.eventSource = null;
  }
}
```
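The reconnect policy above doubles the delay on each failure and caps it at 30 s. The resulting delay sequence can be sketched (and unit-tested) as a pure function — names are illustrative:

```python
def backoff_delays(initial_ms: int = 1000, cap_ms: int = 30000, attempts: int = 6) -> list[int]:
    """Delays doubling per failed attempt, capped — mirrors the client's reconnect policy."""
    delays, delay = [], initial_ms
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap_ms)
    return delays
```

The cap prevents a long outage from pushing retry intervals into minutes; the client snippet also resets the delay to 1 s on a successful `onopen`.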
### 7. Event Types

```python
# app/schemas/events.py
from datetime import datetime
from enum import Enum

from pydantic import BaseModel

class EventType(str, Enum):
    # Agent Events
    AGENT_STARTED = "agent_started"
    AGENT_ACTIVITY = "agent_activity"
    AGENT_COMPLETED = "agent_completed"
    AGENT_ERROR = "agent_error"

    # Project Events
    ISSUE_CREATED = "issue_created"
    ISSUE_UPDATED = "issue_updated"
    ISSUE_CLOSED = "issue_closed"

    # Git Events
    BRANCH_CREATED = "branch_created"
    COMMIT_PUSHED = "commit_pushed"
    PR_CREATED = "pr_created"
    PR_MERGED = "pr_merged"

    # Workflow Events
    APPROVAL_REQUIRED = "approval_required"
    SPRINT_STARTED = "sprint_started"
    SPRINT_COMPLETED = "sprint_completed"

    # Pipeline Events
    PIPELINE_STARTED = "pipeline_started"
    PIPELINE_COMPLETED = "pipeline_completed"
    PIPELINE_FAILED = "pipeline_failed"

class ProjectEvent(BaseModel):
    id: str
    type: EventType
    project_id: str
    agent_id: str | None
    data: dict
    timestamp: datetime
```
### 8. WebSocket for Chat (Secondary)

```python
# app/api/v1/chat.py
from fastapi import WebSocket, WebSocketDisconnect

from app.services.agent_chat import AgentChatService

@router.websocket("/projects/{project_id}/agents/{agent_id}/chat")
async def agent_chat(
    websocket: WebSocket,
    project_id: str,
    agent_id: str
):
    """Bidirectional chat with an agent."""
    await websocket.accept()

    chat_service = AgentChatService(project_id, agent_id)

    try:
        while True:
            # Receive message from client
            message = await websocket.receive_json()

            # Stream response from agent
            async for chunk in chat_service.get_response(message):
                await websocket.send_json({
                    "type": "chunk",
                    "content": chunk
                })

            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```
## Performance Considerations

### Connection Limits
- Browser limit: ~6 connections per domain (HTTP/1.1)
- Recommendation: use a single SSE connection per project and multiplex events over it

### Scalability
- Redis Pub/Sub handles cross-instance event distribution
- Consider Redis Streams for message persistence (audit/replay)

### Keepalive
- Send a comment every 30 seconds to prevent timeout
- Client reconnects automatically on disconnect
## Recommendations

1. **Use SSE for all server-to-client events** (simpler, auto-reconnect)
2. **Use WebSocket only for interactive chat** with agents
3. **Redis Pub/Sub for event distribution** across instances
4. **Single SSE connection per project** with event multiplexing
5. **Exponential backoff** for client reconnection
## References

- [FastAPI SSE](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
- [MDN EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource)
- [Redis Pub/Sub](https://redis.io/topics/pubsub)

## Decision

**Adopt SSE as the primary real-time transport** with WebSocket reserved for bidirectional chat. Use Redis Pub/Sub for event distribution.

---

*Spike completed. Findings will inform ADR-002: Real-time Communication Architecture.*
420
docs/spikes/SPIKE-004-celery-redis-integration.md
Normal file
@@ -0,0 +1,420 @@
# SPIKE-004: Celery + Redis Integration

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #4

---
## Objective

Research best practices for integrating Celery with FastAPI for background task processing, focusing on agent orchestration, long-running workflows, and task monitoring.

## Research Questions

1. How do we properly integrate Celery with async FastAPI?
2. What is the optimal task queue architecture for Syndarix?
3. How do we handle long-running agent tasks?
4. What monitoring and visibility patterns should we use?
## Findings

### 1. Celery + FastAPI Integration Pattern

**Challenge:** Celery is synchronous, while FastAPI is async.

**Solution:** Use `celery.result.AsyncResult` with async polling or callbacks.

```python
# app/core/celery.py
from celery import Celery

from app.core.config import settings

celery_app = Celery(
    "syndarix",
    broker=settings.REDIS_URL,
    backend=settings.REDIS_URL,
    include=[
        "app.tasks.agent_tasks",
        "app.tasks.git_tasks",
        "app.tasks.sync_tasks",
    ]
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600,             # 1 hour max
    task_soft_time_limit=3300,        # 55 min soft limit
    worker_prefetch_multiplier=1,     # One task at a time for LLM tasks
    task_acks_late=True,              # Acknowledge after completion
    task_reject_on_worker_lost=True,  # Retry if worker dies
)
```
### 2. Task Queue Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI Backend                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │  API Layer  │  │  Services   │  │   Events    │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                          ▼                                      │
│         ┌────────────────────────────────┐                      │
│         │        Task Dispatcher         │                      │
│         │       (Celery send_task)       │                      │
│         └────────────────┬───────────────┘                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Redis (Broker + Backend)                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │
│  │ agent_queue  │  │  git_queue   │  │  sync_queue  │            │
│  │  (priority)  │  │              │  │              │            │
│  └──────────────┘  └──────────────┘  └──────────────┘            │
└──────────────────────────────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌────────────┐  ┌────────────┐  ┌────────────┐
    │   Worker   │  │   Worker   │  │   Worker   │
    │  (agents)  │  │   (git)    │  │   (sync)   │
    │ prefetch=1 │  │ prefetch=4 │  │ prefetch=4 │
    └────────────┘  └────────────┘  └────────────┘
```
### 3. Queue Configuration

```python
# app/core/celery.py
from kombu import Queue

celery_app.conf.task_queues = [
    Queue("agent_queue", routing_key="agent.#"),
    Queue("git_queue", routing_key="git.#"),
    Queue("sync_queue", routing_key="sync.#"),
    Queue("cicd_queue", routing_key="cicd.#"),
]

celery_app.conf.task_routes = {
    "app.tasks.agent_tasks.*": {"queue": "agent_queue"},
    "app.tasks.git_tasks.*": {"queue": "git_queue"},
    "app.tasks.sync_tasks.*": {"queue": "sync_queue"},
    "app.tasks.cicd_tasks.*": {"queue": "cicd_queue"},
}
```
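Celery matches the glob-style `task_routes` patterns against the dotted task name. A plain-Python stand-in for that resolution step (illustrative only — not Celery's actual router) shows how the patterns above select a queue:

```python
from fnmatch import fnmatch

TASK_ROUTES = {
    "app.tasks.agent_tasks.*": "agent_queue",
    "app.tasks.git_tasks.*": "git_queue",
}

def queue_for(task_name: str, default: str = "celery") -> str:
    """Return the first queue whose glob pattern matches the task name."""
    for pattern, queue in TASK_ROUTES.items():
        if fnmatch(task_name, pattern):
            return queue
    return default
```

Tasks that match no pattern fall through to Celery's default queue, which is why new task modules must be added to `task_routes` explicitly.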
### 4. Agent Task Implementation

```python
# app/tasks/agent_tasks.py
from celery import Task

from app.core.celery import celery_app
from app.services.agent_runner import AgentRunner
from app.services.events import EventBus

class AgentTask(Task):
    """Base class for agent tasks with retry and monitoring."""

    autoretry_for = (ConnectionError, TimeoutError)
    retry_backoff = True
    retry_backoff_max = 600
    retry_jitter = True
    max_retries = 3

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        """Handle task failure."""
        project_id = kwargs.get("project_id")
        agent_id = kwargs.get("agent_id")
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_error",
            "agent_id": agent_id,
            "error": str(exc)
        })

@celery_app.task(bind=True, base=AgentTask)
def run_agent_action(
    self,
    agent_id: str,
    project_id: str,
    action: str,
    context: dict
) -> dict:
    """
    Execute an agent action as a background task.

    Args:
        agent_id: The agent instance ID
        project_id: The project context
        action: The action to perform
        context: Action-specific context

    Returns:
        Action result dictionary
    """
    runner = AgentRunner(agent_id, project_id)

    # Update task state for monitoring
    self.update_state(
        state="RUNNING",
        meta={"agent_id": agent_id, "action": action}
    )

    # Publish start event
    EventBus().publish(f"project:{project_id}", {
        "type": "agent_started",
        "agent_id": agent_id,
        "action": action,
        "task_id": self.request.id
    })

    try:
        result = runner.execute(action, context)

        # Publish completion event
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_completed",
            "agent_id": agent_id,
            "action": action,
            "result_summary": result.get("summary")
        })

        return result
    except Exception:
        # Re-raise so on_failure fires
        raise
```
### 5. Long-Running Task Patterns

**Progress Reporting:**
```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    """Implement a user story with progress reporting."""

    steps = [
        ("analyzing", "Analyzing requirements"),
        ("designing", "Designing solution"),
        ("implementing", "Writing code"),
        ("testing", "Running tests"),
        ("documenting", "Updating documentation"),
    ]

    for i, (state, description) in enumerate(steps):
        self.update_state(
            state="PROGRESS",
            meta={
                "current": i + 1,
                "total": len(steps),
                "status": description
            }
        )

        # Do the actual work
        execute_step(state, story_id, agent_id)

        # Publish progress event
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_progress",
            "agent_id": agent_id,
            "step": i + 1,
            "total": len(steps),
            "description": description
        })

    return {"status": "completed", "story_id": story_id}
```
**Task Chaining:**
```python
from celery import chain, group

# Sequential workflow
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s()
)

# Parallel execution
parallel_tests = group(
    run_unit_tests.s(project_id),
    run_integration_tests.s(project_id),
    run_linting.s(project_id)
)
```
### 6. FastAPI Integration

```python
# app/api/v1/agents.py
from fastapi import APIRouter
from celery.result import AsyncResult

from app.schemas.agent import AgentActionRequest  # request schema (import path assumed)
from app.tasks.agent_tasks import run_agent_action

router = APIRouter()

@router.post("/agents/{agent_id}/actions")
async def trigger_agent_action(
    agent_id: str,
    action: AgentActionRequest,
):
    """Trigger an agent action as a background task."""
    # Dispatch to Celery (not FastAPI BackgroundTasks, which run in-process)
    task = run_agent_action.delay(
        agent_id=agent_id,
        project_id=action.project_id,
        action=action.action,
        context=action.context
    )

    return {
        "task_id": task.id,
        "status": "queued"
    }

@router.get("/tasks/{task_id}")
async def get_task_status(task_id: str):
    """Get the status of a background task."""
    result = AsyncResult(task_id)

    if result.state == "PENDING":
        return {"status": "pending"}
    elif result.state == "STARTED":
        return {"status": "running"}
    elif result.state == "PROGRESS":
        return {"status": "progress", **result.info}
    elif result.state == "SUCCESS":
        return {"status": "completed", "result": result.result}
    elif result.state == "FAILURE":
        return {"status": "failed", "error": str(result.result)}

    return {"status": result.state}
```

### 7. Worker Configuration

```bash
# Run different workers for different queues

# Agent worker (prefetch=1 so each process reserves one LLM task at a time)
celery -A app.core.celery worker \
  -Q agent_queue \
  -c 4 \
  --prefetch-multiplier=1 \
  -n agent_worker@%h

# Git worker (can handle multiple concurrent tasks)
celery -A app.core.celery worker \
  -Q git_queue \
  -c 8 \
  --prefetch-multiplier=4 \
  -n git_worker@%h

# Sync worker
celery -A app.core.celery worker \
  -Q sync_queue \
  -c 4 \
  --prefetch-multiplier=4 \
  -n sync_worker@%h
```

### 8. Monitoring with Flower

```yaml
# docker-compose.yml
services:
  flower:
    image: mher/flower:latest
    command: celery flower --broker=redis://redis:6379/0
    ports:
      - "5555:5555"
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - FLOWER_BASIC_AUTH=admin:password
```

### 9. Task Scheduling (Celery Beat)

```python
# app/core/celery.py
from celery.schedules import crontab

celery_app.conf.beat_schedule = {
    # Sync issues every minute
    "sync-external-issues": {
        "task": "app.tasks.sync_tasks.sync_all_issues",
        "schedule": 60.0,
    },
    # Health check every 5 minutes
    "agent-health-check": {
        "task": "app.tasks.agent_tasks.health_check_all_agents",
        "schedule": 300.0,
    },
    # Daily cleanup at midnight
    "cleanup-old-tasks": {
        "task": "app.tasks.maintenance.cleanup_old_tasks",
        "schedule": crontab(hour=0, minute=0),
    },
}
```

## Best Practices

1. **One task per LLM call** - Avoid rate limiting issues
2. **Progress reporting** - Update state for long-running tasks
3. **Idempotent tasks** - Handle retries gracefully
4. **Separate queues** - Isolate slow tasks from fast ones
5. **Task result expiry** - Set `result_expires` to avoid Redis bloat
6. **Soft time limits** - Allow graceful shutdown before hard kill

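Several of these practices map directly onto Celery configuration keys. A hedged sketch of the relevant settings (values are illustrative, not benchmarks; `celery_app` refers to the instance in `app/core/celery.py`):

```python
# Illustrative Celery settings covering result expiry, time limits,
# retry safety, and queue routing.
CELERY_SETTINGS = {
    "result_expires": 3600,          # drop task results after 1 hour (practice 5)
    "task_soft_time_limit": 540,     # SoftTimeLimitExceeded raised at 9 min (practice 6)
    "task_time_limit": 600,          # hard kill at 10 min
    "task_acks_late": True,          # redeliver if a worker dies mid-task (practice 3)
    "task_routes": {                 # practice 4: route by task module
        "app.tasks.agent_tasks.*": {"queue": "agent_queue"},
        "app.tasks.git_tasks.*": {"queue": "git_queue"},
        "app.tasks.sync_tasks.*": {"queue": "sync_queue"},
    },
}

# Applied with: celery_app.conf.update(**CELERY_SETTINGS)
```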
## Recommendations

1. **Use Celery for all long-running operations**
   - Agent actions
   - Git operations
   - External sync
   - CI/CD triggers

2. **Use Redis as both broker and backend**
   - Simplifies infrastructure
   - Fast enough for our scale

3. **Configure separate queues**
   - `agent_queue` with prefetch=1
   - `git_queue` with prefetch=4
   - `sync_queue` with prefetch=4

4. **Implement proper monitoring**
   - Flower for web UI
   - Prometheus metrics export
   - Dead letter queue for failed tasks

## References

- [Celery Documentation](https://docs.celeryq.dev/)
- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [Celery Best Practices](https://docs.celeryq.dev/en/stable/userguide/tasks.html#tips-and-best-practices)

## Decision

**Adopt Celery + Redis** for all background task processing with queue-based routing and progress reporting via Redis Pub/Sub events.

---

*Spike completed. Findings will inform ADR-003: Background Task Architecture.*

516
docs/spikes/SPIKE-005-llm-provider-abstraction.md
Normal file
@@ -0,0 +1,516 @@
# SPIKE-005: LLM Provider Abstraction

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #5

---

## Objective

Research the best approach for unified LLM provider abstraction with support for multiple providers, automatic failover, and cost tracking.

## Research Questions

1. What libraries exist for unified LLM access?
2. How to implement automatic failover between providers?
3. How to track token usage and costs per agent/project?
4. What caching strategies can reduce API costs?

## Findings

### 1. LiteLLM - Recommended Solution

**LiteLLM** provides a unified interface to 100+ LLM providers using the OpenAI SDK format.

**Key Features:**
- Unified API across providers (Anthropic, OpenAI, local, etc.)
- Built-in failover and load balancing
- Token counting and cost tracking
- Streaming support
- Async support
- Caching with Redis

**Installation:**
```bash
pip install litellm
```

### 2. Basic Usage

```python
import os

import litellm
from litellm import completion, acompletion

# Configure providers
litellm.api_key = os.getenv("ANTHROPIC_API_KEY")
litellm.set_verbose = True  # For debugging

# Synchronous call
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Async call (for FastAPI)
response = await acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### 3. Model Naming Convention

LiteLLM uses prefixed model names:

| Provider | Model Format |
|----------|--------------|
| Anthropic | `claude-3-5-sonnet-20241022` |
| OpenAI | `gpt-4-turbo` |
| Azure OpenAI | `azure/deployment-name` |
| Ollama | `ollama/llama3` |
| Together AI | `together_ai/togethercomputer/llama-2-70b` |

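Because the provider is encoded in the model string, switching providers is a configuration change rather than a code change. A hedged illustration of the convention (`provider_of` is a hypothetical helper, not part of LiteLLM):

```python
def provider_of(model: str) -> str:
    """Infer the provider from a LiteLLM-style model name."""
    if "/" in model:
        # Prefixed form, e.g. "ollama/llama3" or "azure/deployment-name"
        return model.split("/", 1)[0]
    # Unprefixed names are resolved by LiteLLM itself, e.g.
    # "claude-*" to Anthropic, "gpt-*" to OpenAI
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gpt"):
        return "openai"
    return "unknown"
```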
### 4. Failover Configuration

```python
import os

from litellm import Router

# Define model list with fallbacks
model_list = [
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
        },
        "model_info": {"id": 1}
    },
    {
        "model_name": "primary-agent",  # Same name = same deployment group
        "litellm_params": {
            "model": "gpt-4-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
        "model_info": {"id": 2}
    },
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "ollama/llama3",
            "api_base": "http://localhost:11434",
        },
        "model_info": {"id": 3}
    }
]

# Initialize router with failover
router = Router(
    model_list=model_list,
    fallbacks=[
        {"primary-agent": ["primary-agent"]}  # Try all deployments in the group
    ],
    routing_strategy="simple-shuffle",  # or "latency-based-routing"
    num_retries=3,
    retry_after=5,  # seconds
    timeout=60,
)

# Use router
response = await router.acompletion(
    model="primary-agent",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### 5. Syndarix LLM Gateway Architecture

```python
# app/services/llm_gateway.py
from litellm import Router, acompletion

from app.core.config import settings
from app.models.agent import AgentType
from app.services.cost_tracker import CostTracker
from app.services.events import EventBus

class LLMGateway:
    """Unified LLM gateway with failover and cost tracking."""

    def __init__(self):
        self.router = self._build_router()
        self.cost_tracker = CostTracker()
        self.event_bus = EventBus()

    def _build_router(self) -> Router:
        """Build LiteLLM router from configuration."""
        model_list = []

        # Add Anthropic models
        if settings.ANTHROPIC_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "claude-3-5-sonnet-20241022",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "claude-3-haiku-20240307",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                }
            ])

        # Add OpenAI fallbacks
        if settings.OPENAI_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "gpt-4-turbo",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "gpt-4o-mini",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                }
            ])

        # Add local models (Ollama)
        if settings.OLLAMA_URL:
            model_list.append({
                "model_name": "local-fallback",
                "litellm_params": {
                    "model": "ollama/llama3",
                    "api_base": settings.OLLAMA_URL,
                }
            })

        return Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
                {"fast-response": ["fast-response", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
            timeout=120,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
        stream: bool = False,
        **kwargs
    ) -> dict:
        """
        Generate a completion with automatic failover and cost tracking.

        Args:
            agent_id: The calling agent's ID
            project_id: The project context
            messages: Chat messages
            model_preference: "high-reasoning" or "fast-response"
            stream: Whether to stream the response
            **kwargs: Additional LiteLLM parameters

        Returns:
            Completion response dictionary
        """
        try:
            if stream:
                return self._stream_completion(
                    agent_id, project_id, messages, model_preference, **kwargs
                )

            response = await self.router.acompletion(
                model=model_preference,
                messages=messages,
                **kwargs
            )

            # Track usage
            await self._track_usage(
                agent_id=agent_id,
                project_id=project_id,
                model=response.model,
                usage=response.usage,
            )

            return {
                "content": response.choices[0].message.content,
                "model": response.model,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens,
                }
            }

        except Exception as e:
            # Publish error event
            await self.event_bus.publish(f"project:{project_id}", {
                "type": "llm_error",
                "agent_id": agent_id,
                "error": str(e)
            })
            raise

    async def _stream_completion(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str,
        **kwargs
    ):
        """Stream a completion response."""
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
            stream=True,
            **kwargs
        )

        async for chunk in response:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    async def _track_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        usage: dict
    ):
        """Track token usage and costs."""
        await self.cost_tracker.record_usage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
        )
```

### 6. Cost Tracking

```python
# app/services/cost_tracker.py
from datetime import datetime

from sqlalchemy.ext.asyncio import AsyncSession

from app.models.usage import TokenUsage

# Cost per 1M tokens (approximate)
MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "ollama/llama3": {"input": 0.00, "output": 0.00},  # Local
}

class CostTracker:
    def __init__(self, db: AsyncSession):
        self.db = db

    async def record_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        prompt_tokens: int,
        completion_tokens: int,
    ):
        """Record token usage and calculate cost."""
        costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})

        input_cost = (prompt_tokens / 1_000_000) * costs["input"]
        output_cost = (completion_tokens / 1_000_000) * costs["output"]
        total_cost = input_cost + output_cost

        usage = TokenUsage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
            cost_usd=total_cost,
            timestamp=datetime.utcnow(),
        )

        self.db.add(usage)
        await self.db.commit()

    async def get_project_usage(
        self,
        project_id: str,
        start_date: datetime = None,
        end_date: datetime = None,
    ) -> dict:
        """Get usage summary for a project."""
        # Query aggregated usage
        ...

    async def check_budget(
        self,
        project_id: str,
        budget_limit: float,
    ) -> bool:
        """Check if project is within budget."""
        usage = await self.get_project_usage(project_id)
        return usage["total_cost_usd"] < budget_limit
```

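To make the per-1M-token arithmetic concrete, a standalone worked example using the claude-3-5-sonnet rates from `MODEL_COSTS` ($3/M input, $15/M output); `estimate_cost` is an illustrative helper, not part of the service:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Cost in USD for one call, given per-1M-token rates."""
    input_cost = (prompt_tokens / 1_000_000) * input_rate
    output_cost = (completion_tokens / 1_000_000) * output_rate
    return input_cost + output_cost

# A call with 10,000 prompt tokens and 2,000 completion tokens on
# claude-3-5-sonnet: 0.03 input + 0.03 output = 0.06 USD
cost = estimate_cost(10_000, 2_000, 3.00, 15.00)
```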
### 7. Caching with Redis

```python
import litellm
from litellm import Cache

from app.core.config import settings

# Configure Redis cache
litellm.cache = Cache(
    type="redis",
    host=settings.REDIS_HOST,
    port=settings.REDIS_PORT,
    password=settings.REDIS_PASSWORD,
)

# Enable caching
litellm.enable_cache()

# Cached completions (same input = cached response)
response = await litellm.acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    cache={"ttl": 3600}  # Cache for 1 hour
)
```

### 8. Agent Type Model Mapping

```python
# app/models/agent_type.py
from enum import Enum

from sqlalchemy import Column, Enum as SQLEnum, Float, Integer, String, Text
from sqlalchemy.dialects.postgresql import UUID

from app.db.base import Base

class ModelPreference(str, Enum):
    HIGH_REASONING = "high-reasoning"
    FAST_RESPONSE = "fast-response"
    COST_OPTIMIZED = "cost-optimized"

class AgentType(Base):
    __tablename__ = "agent_types"

    id = Column(UUID, primary_key=True)
    name = Column(String(50), unique=True)
    role = Column(String(50))

    # LLM configuration
    model_preference = Column(
        SQLEnum(ModelPreference),
        default=ModelPreference.HIGH_REASONING
    )
    max_tokens = Column(Integer, default=4096)
    temperature = Column(Float, default=0.7)

    # System prompt
    system_prompt = Column(Text)

# Mapping agent types to models
AGENT_MODEL_MAPPING = {
    "Product Owner": ModelPreference.HIGH_REASONING,
    "Project Manager": ModelPreference.FAST_RESPONSE,
    "Business Analyst": ModelPreference.HIGH_REASONING,
    "Software Architect": ModelPreference.HIGH_REASONING,
    "Software Engineer": ModelPreference.HIGH_REASONING,
    "UI/UX Designer": ModelPreference.HIGH_REASONING,
    "QA Engineer": ModelPreference.FAST_RESPONSE,
    "DevOps Engineer": ModelPreference.FAST_RESPONSE,
    "AI/ML Engineer": ModelPreference.HIGH_REASONING,
    "Security Expert": ModelPreference.HIGH_REASONING,
}
```

## Rate Limiting Strategy

```python
import asyncio

from litellm import Router

# Configure rate limits per model
router = Router(
    model_list=model_list,
    redis_host=settings.REDIS_HOST,
    redis_port=settings.REDIS_PORT,
    routing_strategy="usage-based-routing",  # Route based on rate limits
)

# Custom rate limiter (sliding one-minute window)
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.semaphore = asyncio.Semaphore(requests_per_minute)

    async def acquire(self):
        await self.semaphore.acquire()
        # Release the permit after 60 seconds
        asyncio.create_task(self._release_after(60))

    async def _release_after(self, seconds: int):
        await asyncio.sleep(seconds)
        self.semaphore.release()
```

## Recommendations

1. **Use LiteLLM as the unified abstraction layer**
   - Simplifies multi-provider support
   - Built-in failover and retry
   - Consistent API across providers

2. **Configure model groups by use case**
   - `high-reasoning`: Complex analysis, architecture decisions
   - `fast-response`: Quick tasks, simple queries
   - `cost-optimized`: Non-critical, high-volume tasks

3. **Implement automatic failover chain**
   - Primary: Claude 3.5 Sonnet
   - Fallback 1: GPT-4 Turbo
   - Fallback 2: Local Llama 3 (if available)

4. **Track all usage and costs**
   - Per agent, per project
   - Set budget alerts
   - Generate usage reports

5. **Cache frequently repeated queries**
   - Use Redis-backed cache
   - Cache embeddings for RAG
   - Cache deterministic transformations

## References

- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LiteLLM Router](https://docs.litellm.ai/docs/routing)
- [Anthropic Rate Limits](https://docs.anthropic.com/en/api/rate-limits)

## Decision

**Adopt LiteLLM** as the unified LLM abstraction layer with automatic failover, usage-based routing, and Redis-backed caching.

---

*Spike completed. Findings will inform ADR-004: LLM Provider Integration Architecture.*