docs: add spike findings for LLM abstraction, MCP integration, and real-time updates
Added research findings and recommendations as separate SPIKE documents in `docs/spikes/`:

- `SPIKE-005-llm-provider-abstraction.md`: Research on a unified abstraction for LLM providers with failover, cost tracking, and caching strategies.
- `SPIKE-001-mcp-integration-pattern.md`: Optimal pattern for integrating MCP with project/agent scoping and authentication strategies.
- `SPIKE-003-realtime-updates.md`: Evaluation of SSE vs WebSocket for real-time updates, aligned with use-case needs.

Focused on aligning implementation architectures with scalability, efficiency, and user needs. Documentation intended to inform upcoming ADRs.

docs/spikes/SPIKE-001-mcp-integration-pattern.md (new file, 288 lines)

# SPIKE-001: MCP Integration Pattern

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #1

---

## Objective

Research the optimal pattern for integrating Model Context Protocol (MCP) servers with the FastAPI backend, focusing on unified singleton servers with project/agent scoping.

## Research Questions

1. What is the recommended MCP SDK for Python/FastAPI?
2. Should we run unified MCP servers or one server per project?
3. What is the best pattern for project/agent scoping in MCP tools?
4. How do we handle authentication between Syndarix and MCP servers?

## Findings

### 1. FastMCP 2.0 - Recommended Framework

**FastMCP** is a high-level, Pythonic framework for building MCP servers that significantly reduces boilerplate compared to the low-level MCP SDK.

**Key Features:**
- Decorator-based tool registration (`@mcp.tool()`)
- Built-in context management for resources and prompts
- Support for server-sent events (SSE) and stdio transports
- Type-safe with Pydantic model support
- Async-first design compatible with FastAPI

**Installation:**
```bash
pip install fastmcp
```

**Basic Example:**
```python
from fastmcp import FastMCP

mcp = FastMCP("syndarix-knowledge-base")


@mcp.tool()
def search_knowledge(
    project_id: str,
    query: str,
    scope: str = "project"
) -> list[dict]:
    """Search the knowledge base with project scoping."""
    # Implementation here
    return results


@mcp.resource("project://{project_id}/config")
def get_project_config(project_id: str) -> dict:
    """Get project configuration."""
    return config
```

### 2. Unified Singleton Pattern (Recommended)

**Decision:** Use unified singleton MCP servers instead of per-project servers.

**Architecture:**
```
┌─────────────────────────────────────────────────────────┐
│                    Syndarix Backend                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │   Agent 1   │  │   Agent 2   │  │   Agent 3   │      │
│  │ (project A) │  │ (project A) │  │ (project B) │      │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘      │
│         │                │                │              │
│         └────────────────┼────────────────┘              │
│                          │                               │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────┐    │
│  │            MCP Client (Singleton)               │    │
│  │    Maintains connections to all MCP servers     │    │
│  └─────────────────────────────────────────────────┘    │
└──────────────────────────┬──────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌────────────┐  ┌────────────┐  ┌────────────┐
    │  Git MCP   │  │   KB MCP   │  │  LLM MCP   │
    │ (Singleton)│  │ (Singleton)│  │ (Singleton)│
    └────────────┘  └────────────┘  └────────────┘
```

**Why Singleton:**
- Resource efficiency (one process per MCP type)
- Shared connection pools
- Centralized logging and monitoring
- Simpler deployment (7 services vs N×7)
- Cross-project learning possible (if needed)

**Scoping Pattern:**
```python
from typing import Literal


@mcp.tool()
def search_knowledge(
    project_id: str,  # Required - scopes to project
    agent_id: str,    # Required - identifies calling agent
    query: str,
    scope: Literal["project", "global"] = "project"
) -> SearchResults:
    """
    All tools accept project_id and agent_id for:
    - Access control validation
    - Audit logging
    - Context filtering
    """
    # Validate agent has access to project
    validate_access(agent_id, project_id)

    # Log the access
    log_tool_usage(agent_id, project_id, "search_knowledge")

    # Perform scoped search
    if scope == "project":
        return search_project_kb(project_id, query)
    else:
        return search_global_kb(query)
```
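
The access-control and audit helpers used above are not specified further in this spike; a minimal sketch of what they could look like (the names, in-memory storage, and `AccessDenied` exception are assumptions, not existing Syndarix code):

```python
# Hypothetical backing for validate_access / log_tool_usage; illustrative only.
from datetime import datetime, timezone


class AccessDenied(Exception):
    """Raised when an agent calls a tool for a project it is not assigned to."""


# agent_id -> set of project_ids; in practice this would be a DB or cache lookup
_AGENT_PROJECTS: dict[str, set[str]] = {}


def validate_access(agent_id: str, project_id: str) -> None:
    if project_id not in _AGENT_PROJECTS.get(agent_id, set()):
        raise AccessDenied(f"agent {agent_id} has no access to project {project_id}")


def log_tool_usage(agent_id: str, project_id: str, tool_name: str) -> None:
    # A real implementation would write to structured logs or an audit table.
    print(f"{datetime.now(timezone.utc).isoformat()} {agent_id} {project_id} {tool_name}")
```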

### 3. MCP Server Registry Architecture

```python
# mcp/registry.py
from dataclasses import dataclass
from typing import Dict


@dataclass
class MCPServerConfig:
    name: str
    port: int
    transport: str  # "sse" or "stdio"
    enabled: bool = True


MCP_SERVERS: Dict[str, MCPServerConfig] = {
    "llm_gateway": MCPServerConfig("llm-gateway", 9001, "sse"),
    "git": MCPServerConfig("git-mcp", 9002, "sse"),
    "knowledge_base": MCPServerConfig("kb-mcp", 9003, "sse"),
    "issues": MCPServerConfig("issues-mcp", 9004, "sse"),
    "file_system": MCPServerConfig("fs-mcp", 9005, "sse"),
    "code_analysis": MCPServerConfig("code-mcp", 9006, "sse"),
    "cicd": MCPServerConfig("cicd-mcp", 9007, "sse"),
}
```
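
A small helper can turn a registry entry into the URL the MCP client connects to; this is a sketch that assumes each server is reachable under its `name` (e.g. as a Docker Compose service) and exposes a default `/sse` endpoint, neither of which is fixed by this spike:

```python
# Hypothetical helper; hostname and path conventions are assumptions.
def sse_url(config: MCPServerConfig) -> str:
    if config.transport != "sse":
        raise ValueError(f"{config.name} does not use the SSE transport")
    return f"http://{config.name}:{config.port}/sse"


# e.g. sse_url(MCP_SERVERS["knowledge_base"]) -> "http://kb-mcp:9003/sse"
```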

### 4. Authentication Pattern

**MCP OAuth 2.0 Integration:**
```python
from fastmcp import FastMCP
from fastmcp.auth import OAuth2Bearer

mcp = FastMCP(
    "syndarix-mcp",
    auth=OAuth2Bearer(
        token_url="https://syndarix.local/oauth/token",
        scopes=["mcp:read", "mcp:write"]
    )
)
```

**Internal Service Auth (Recommended for v1):**
```python
# For internal deployment, use service tokens
@mcp.tool()
def create_issue(
    service_token: str,  # Validated internally
    project_id: str,
    title: str,
    body: str
) -> Issue:
    validate_service_token(service_token)
    # ... implementation
```
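
`validate_service_token` is left abstract above; a minimal sketch, assuming a single shared secret distributed to the backend (the `MCP_SERVICE_TOKEN` setting name is illustrative, not an existing setting):

```python
# Minimal service-token check; the settings attribute is an assumed name.
import hmac

from app.core.config import settings


def validate_service_token(token: str) -> None:
    # Constant-time comparison to avoid timing side channels.
    if not hmac.compare_digest(token, settings.MCP_SERVICE_TOKEN):
        raise PermissionError("invalid MCP service token")
```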

### 5. FastAPI Integration Pattern

```python
# app/mcp/client.py
from contextlib import asynccontextmanager
from typing import Any

from mcp import ClientSession
from mcp.client.sse import sse_client

from app.mcp.registry import MCP_SERVERS


class MCPClientManager:
    def __init__(self):
        self._sessions: dict[str, ClientSession] = {}

    async def connect_all(self):
        """Connect to all configured MCP servers."""
        for name, config in MCP_SERVERS.items():
            if config.enabled:
                # _connect_server opens the SSE transport and initializes a
                # ClientSession (omitted in this sketch)
                session = await self._connect_server(config)
                self._sessions[name] = session

    async def call_tool(
        self,
        server: str,
        tool_name: str,
        arguments: dict
    ) -> Any:
        """Call a tool on a specific MCP server."""
        session = self._sessions[server]
        result = await session.call_tool(tool_name, arguments)
        return result.content


# Usage in FastAPI
mcp_client = MCPClientManager()


@app.on_event("startup")
async def startup():
    await mcp_client.connect_all()


@app.post("/api/v1/knowledge/search")
async def search_knowledge(request: SearchRequest):
    result = await mcp_client.call_tool(
        "knowledge_base",
        "search_knowledge",
        {
            "project_id": request.project_id,
            "agent_id": request.agent_id,
            "query": request.query
        }
    )
    return result
```

## Recommendations

### Immediate Actions

1. **Use FastMCP 2.0** for all MCP server implementations
2. **Implement unified singleton pattern** with explicit scoping
3. **Use SSE transport** for MCP server connections
4. **Service tokens** for internal auth in v1; migrate to OAuth 2.0 later

### MCP Server Priority

1. **LLM Gateway** - Critical for agent operation
2. **Knowledge Base** - Required for RAG functionality
3. **Git MCP** - Required for code delivery
4. **Issues MCP** - Required for project management
5. **File System** - Required for workspace operations
6. **Code Analysis** - Enhances code quality
7. **CI/CD** - Automates deployments

### Code Organization

```
syndarix/
├── backend/
│   └── app/
│       └── mcp/
│           ├── __init__.py
│           ├── client.py      # MCP client manager
│           ├── registry.py    # Server configurations
│           └── schemas.py     # Tool argument schemas
└── mcp_servers/
    ├── llm_gateway/
    │   ├── __init__.py
    │   ├── server.py
    │   └── tools.py
    ├── knowledge_base/
    ├── git/
    ├── issues/
    ├── file_system/
    ├── code_analysis/
    └── cicd/
```

## References

- [FastMCP Documentation](https://gofastmcp.com)
- [MCP Protocol Specification](https://spec.modelcontextprotocol.io)
- [Anthropic MCP SDK](https://github.com/anthropics/anthropic-sdk-mcp)

## Decision

**Adopt FastMCP 2.0** with unified singleton servers and explicit project/agent scoping for all MCP integrations.

---

*Spike completed. Findings will inform ADR-001: MCP Integration Architecture.*

docs/spikes/SPIKE-003-realtime-updates.md (new file, 338 lines)

# SPIKE-003: Real-time Updates Architecture

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #3

---

## Objective

Evaluate WebSocket vs Server-Sent Events (SSE) for real-time updates in Syndarix, focusing on agent activity streams, progress updates, and client notifications.

## Research Questions

1. What are the trade-offs between WebSocket and SSE?
2. Which pattern best fits Syndarix's use cases?
3. How do we handle reconnection and reliability?
4. What is the FastAPI implementation approach?

## Findings

### 1. Use Case Analysis

| Use Case | Direction | Frequency | Latency Requirement |
|----------|-----------|-----------|---------------------|
| Agent activity feed | Server → Client | High | Low |
| Sprint progress | Server → Client | Medium | Low |
| Build status | Server → Client | Low | Medium |
| Client approval requests | Server → Client | Low | High |
| Client messages | Client → Server | Low | Medium |
| Issue updates | Server → Client | Medium | Low |

**Key Insight:** 90%+ of real-time communication is **server-to-client** (unidirectional).

### 2. Technology Comparison

| Feature | Server-Sent Events (SSE) | WebSocket |
|---------|-------------------------|-----------|
| Direction | Unidirectional (server → client) | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | WebSocket protocol (ws://, upgraded from HTTP) |
| Reconnection | Built-in automatic | Manual implementation |
| Connection limits | Limited per domain | Similar limits |
| Browser support | Excellent | Excellent |
| Through proxies | Native HTTP | May require config |
| Complexity | Simple | More complex |
| FastAPI support | Native | Native |

### 3. Recommendation: SSE as Primary, WebSocket for Chat

**SSE (Recommended for 90% of use cases):**
- Agent activity streams
- Progress updates
- Build/pipeline status
- Issue change notifications
- Approval request alerts

**WebSocket (For bidirectional needs):**
- Live chat with agents
- Interactive debugging sessions
- Real-time collaboration (future)

### 4. FastAPI SSE Implementation

```python
# app/api/v1/events.py
import asyncio

from fastapi import APIRouter, Depends, Request
from fastapi.responses import StreamingResponse

from app.services.events import EventBus

# User and get_current_user come from the project's existing auth dependencies

router = APIRouter()


@router.get("/projects/{project_id}/events")
async def project_events(
    project_id: str,
    request: Request,
    current_user: User = Depends(get_current_user)
):
    """Stream real-time events for a project."""

    async def event_generator():
        event_bus = EventBus()
        subscriber = await event_bus.subscribe(
            channel=f"project:{project_id}",
            user_id=current_user.id
        )

        try:
            while True:
                # Check if client disconnected
                if await request.is_disconnected():
                    break

                # Wait for next event (with timeout for keepalive)
                try:
                    event = await asyncio.wait_for(
                        subscriber.get_event(),
                        timeout=30.0
                    )
                    yield f"event: {event.type}\ndata: {event.json()}\n\n"
                except asyncio.TimeoutError:
                    # Send keepalive comment
                    yield ": keepalive\n\n"
        finally:
            await subscriber.unsubscribe()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        }
    )
```
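
The endpoint can be exercised from a plain script before any frontend exists; a sketch using `httpx` (the base URL and the absence of auth handling are assumptions):

```python
# Tail the project event stream from the command line; illustrative only.
import asyncio

import httpx


async def tail_events(project_id: str) -> None:
    url = f"http://localhost:8000/api/v1/projects/{project_id}/events"
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url) as response:
            async for line in response.aiter_lines():
                if line and not line.startswith(":"):  # skip keepalive comments
                    print(line)


# asyncio.run(tail_events("demo-project"))
```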

### 5. Event Bus Architecture with Redis

```python
# app/services/events.py
import json
from dataclasses import dataclass
from typing import AsyncIterator

import redis.asyncio as redis


@dataclass
class Event:
    type: str
    data: dict
    project_id: str
    agent_id: str | None = None
    timestamp: float | None = None


class EventBus:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.pubsub = self.redis.pubsub()

    async def publish(self, channel: str, event: Event):
        """Publish an event to a channel."""
        await self.redis.publish(
            channel,
            json.dumps(event.__dict__)
        )

    async def subscribe(self, channel: str) -> "Subscriber":
        """Subscribe to a channel."""
        await self.pubsub.subscribe(channel)
        return Subscriber(self.pubsub, channel)


class Subscriber:
    def __init__(self, pubsub, channel: str):
        self.pubsub = pubsub
        self.channel = channel

    async def get_event(self) -> Event:
        """Get the next event (blocking)."""
        while True:
            message = await self.pubsub.get_message(
                ignore_subscribe_messages=True,
                timeout=1.0
            )
            if message and message["type"] == "message":
                data = json.loads(message["data"])
                return Event(**data)

    async def unsubscribe(self):
        await self.pubsub.unsubscribe(self.channel)
```
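
Publishing then looks like this from any backend code path; a short usage sketch (the `settings.REDIS_URL` name and the `notify_issue_created` helper are assumptions):

```python
# Example publisher: an issue-created notification fanned out to SSE subscribers.
import time

from app.core.config import settings
from app.services.events import Event, EventBus


async def notify_issue_created(project_id: str, issue_id: str, agent_id: str) -> None:
    bus = EventBus(settings.REDIS_URL)
    await bus.publish(
        f"project:{project_id}",
        Event(
            type="issue_created",
            data={"issue_id": issue_id},
            project_id=project_id,
            agent_id=agent_id,
            timestamp=time.time(),
        ),
    )
```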

### 6. Client-Side Implementation

```typescript
// frontend/lib/events.ts
class ProjectEventStream {
  private eventSource: EventSource | null = null;
  private reconnectDelay = 1000;
  private maxReconnectDelay = 30000;

  connect(projectId: string, onEvent: (event: ProjectEvent) => void) {
    const url = `/api/v1/projects/${projectId}/events`;

    this.eventSource = new EventSource(url, {
      withCredentials: true
    });

    this.eventSource.onopen = () => {
      console.log('SSE connected');
      this.reconnectDelay = 1000; // Reset on success
    };

    this.eventSource.addEventListener('agent_activity', (e) => {
      onEvent({ type: 'agent_activity', data: JSON.parse(e.data) });
    });

    this.eventSource.addEventListener('issue_updated', (e) => {
      onEvent({ type: 'issue_updated', data: JSON.parse(e.data) });
    });

    this.eventSource.addEventListener('approval_required', (e) => {
      onEvent({ type: 'approval_required', data: JSON.parse(e.data) });
    });

    this.eventSource.onerror = () => {
      this.eventSource?.close();
      // Exponential backoff reconnect
      setTimeout(() => this.connect(projectId, onEvent), this.reconnectDelay);
      this.reconnectDelay = Math.min(
        this.reconnectDelay * 2,
        this.maxReconnectDelay
      );
    };
  }

  disconnect() {
    this.eventSource?.close();
    this.eventSource = null;
  }
}
```

### 7. Event Types

```python
# app/schemas/events.py
from datetime import datetime
from enum import Enum

from pydantic import BaseModel


class EventType(str, Enum):
    # Agent Events
    AGENT_STARTED = "agent_started"
    AGENT_ACTIVITY = "agent_activity"
    AGENT_COMPLETED = "agent_completed"
    AGENT_ERROR = "agent_error"

    # Project Events
    ISSUE_CREATED = "issue_created"
    ISSUE_UPDATED = "issue_updated"
    ISSUE_CLOSED = "issue_closed"

    # Git Events
    BRANCH_CREATED = "branch_created"
    COMMIT_PUSHED = "commit_pushed"
    PR_CREATED = "pr_created"
    PR_MERGED = "pr_merged"

    # Workflow Events
    APPROVAL_REQUIRED = "approval_required"
    SPRINT_STARTED = "sprint_started"
    SPRINT_COMPLETED = "sprint_completed"

    # Pipeline Events
    PIPELINE_STARTED = "pipeline_started"
    PIPELINE_COMPLETED = "pipeline_completed"
    PIPELINE_FAILED = "pipeline_failed"


class ProjectEvent(BaseModel):
    id: str
    type: EventType
    project_id: str
    agent_id: str | None
    data: dict
    timestamp: datetime
```

### 8. WebSocket for Chat (Secondary)

```python
# app/api/v1/chat.py
from fastapi import WebSocket, WebSocketDisconnect

from app.services.agent_chat import AgentChatService


@router.websocket("/projects/{project_id}/agents/{agent_id}/chat")
async def agent_chat(
    websocket: WebSocket,
    project_id: str,
    agent_id: str
):
    """Bidirectional chat with an agent."""
    await websocket.accept()

    chat_service = AgentChatService(project_id, agent_id)

    try:
        while True:
            # Receive message from client
            message = await websocket.receive_json()

            # Stream response from agent
            async for chunk in chat_service.get_response(message):
                await websocket.send_json({
                    "type": "chunk",
                    "content": chunk
                })

            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```

## Performance Considerations

### Connection Limits
- Browser limit: ~6 connections per domain (HTTP/1.1)
- Recommendation: Use a single SSE connection per project and multiplex event types over it

### Scalability
- Redis Pub/Sub handles cross-instance event distribution
- Consider Redis Streams for message persistence (audit/replay)

### Keepalive
- Send a comment line every 30 seconds to prevent idle/proxy timeouts
- Client reconnects automatically on disconnect

## Recommendations

1. **Use SSE for all server-to-client events** (simpler, auto-reconnect)
2. **Use WebSocket only for interactive chat** with agents
3. **Redis Pub/Sub for event distribution** across instances
4. **Single SSE connection per project** with event multiplexing
5. **Exponential backoff** for client reconnection

## References

- [FastAPI SSE](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
- [MDN EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource)
- [Redis Pub/Sub](https://redis.io/topics/pubsub)

## Decision

**Adopt SSE as the primary real-time transport** with WebSocket reserved for bidirectional chat. Use Redis Pub/Sub for event distribution.

---

*Spike completed. Findings will inform ADR-002: Real-time Communication Architecture.*

docs/spikes/SPIKE-004-celery-redis-integration.md (new file, 420 lines)

# SPIKE-004: Celery + Redis Integration

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #4

---

## Objective

Research best practices for integrating Celery with FastAPI for background task processing, focusing on agent orchestration, long-running workflows, and task monitoring.

## Research Questions

1. How do we properly integrate Celery with async FastAPI?
2. What is the optimal task queue architecture for Syndarix?
3. How do we handle long-running agent tasks?
4. What monitoring and visibility patterns should we use?

## Findings

### 1. Celery + FastAPI Integration Pattern

**Challenge:** Celery workers are synchronous, while FastAPI is async.

**Solution:** Dispatch tasks from async endpoints and track them via `celery.result.AsyncResult` with async polling or callbacks.

```python
# app/core/celery.py
from celery import Celery

from app.core.config import settings

celery_app = Celery(
    "syndarix",
    broker=settings.REDIS_URL,
    backend=settings.REDIS_URL,
    include=[
        "app.tasks.agent_tasks",
        "app.tasks.git_tasks",
        "app.tasks.sync_tasks",
    ]
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600,  # 1 hour max
    task_soft_time_limit=3300,  # 55 min soft limit
    worker_prefetch_multiplier=1,  # One task at a time for LLM tasks
    task_acks_late=True,  # Acknowledge after completion
    task_reject_on_worker_lost=True,  # Retry if worker dies
)
```

### 2. Task Queue Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         FastAPI Backend                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │  API Layer  │  │  Services   │  │   Events    │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                          ▼                                      │
│         ┌────────────────────────────────┐                      │
│         │        Task Dispatcher         │                      │
│         │      (Celery send_task)        │                      │
│         └────────────────┬───────────────┘                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                     Redis (Broker + Backend)                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │
│  │ agent_queue  │  │  git_queue   │  │  sync_queue  │             │
│  │  (priority)  │  │              │  │              │             │
│  └──────────────┘  └──────────────┘  └──────────────┘             │
└──────────────────────────────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌────────────┐  ┌────────────┐  ┌────────────┐
    │   Worker   │  │   Worker   │  │   Worker   │
    │  (agents)  │  │   (git)    │  │   (sync)   │
    │ prefetch=1 │  │ prefetch=4 │  │ prefetch=4 │
    └────────────┘  └────────────┘  └────────────┘
```

### 3. Queue Configuration

```python
# app/core/celery.py
from kombu import Queue

celery_app.conf.task_queues = [
    Queue("agent_queue", routing_key="agent.#"),
    Queue("git_queue", routing_key="git.#"),
    Queue("sync_queue", routing_key="sync.#"),
    Queue("cicd_queue", routing_key="cicd.#"),
]

celery_app.conf.task_routes = {
    "app.tasks.agent_tasks.*": {"queue": "agent_queue"},
    "app.tasks.git_tasks.*": {"queue": "git_queue"},
    "app.tasks.sync_tasks.*": {"queue": "sync_queue"},
    "app.tasks.cicd_tasks.*": {"queue": "cicd_queue"},
}
```

### 4. Agent Task Implementation

```python
# app/tasks/agent_tasks.py
from celery import Task

from app.core.celery import celery_app
from app.services.agent_runner import AgentRunner
from app.services.events import EventBus


class AgentTask(Task):
    """Base class for agent tasks with retry and monitoring."""

    autoretry_for = (ConnectionError, TimeoutError)
    retry_backoff = True
    retry_backoff_max = 600
    retry_jitter = True
    max_retries = 3

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        """Handle task failure."""
        project_id = kwargs.get("project_id")
        agent_id = kwargs.get("agent_id")
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_error",
            "agent_id": agent_id,
            "error": str(exc)
        })


@celery_app.task(bind=True, base=AgentTask)
def run_agent_action(
    self,
    agent_id: str,
    project_id: str,
    action: str,
    context: dict
) -> dict:
    """
    Execute an agent action as a background task.

    Args:
        agent_id: The agent instance ID
        project_id: The project context
        action: The action to perform
        context: Action-specific context

    Returns:
        Action result dictionary
    """
    runner = AgentRunner(agent_id, project_id)

    # Update task state for monitoring
    self.update_state(
        state="RUNNING",
        meta={"agent_id": agent_id, "action": action}
    )

    # Publish start event
    EventBus().publish(f"project:{project_id}", {
        "type": "agent_started",
        "agent_id": agent_id,
        "action": action,
        "task_id": self.request.id
    })

    try:
        result = runner.execute(action, context)

        # Publish completion event
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_completed",
            "agent_id": agent_id,
            "action": action,
            "result_summary": result.get("summary")
        })

        return result
    except Exception:
        # Re-raise; on_failure publishes the error event
        raise
```

### 5. Long-Running Task Patterns

**Progress Reporting:**
```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    """Implement a user story with progress reporting."""

    steps = [
        ("analyzing", "Analyzing requirements"),
        ("designing", "Designing solution"),
        ("implementing", "Writing code"),
        ("testing", "Running tests"),
        ("documenting", "Updating documentation"),
    ]

    for i, (state, description) in enumerate(steps):
        self.update_state(
            state="PROGRESS",
            meta={
                "current": i + 1,
                "total": len(steps),
                "status": description
            }
        )

        # Do the actual work
        execute_step(state, story_id, agent_id)

        # Publish progress event
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_progress",
            "agent_id": agent_id,
            "step": i + 1,
            "total": len(steps),
            "description": description
        })

    return {"status": "completed", "story_id": story_id}
```

**Task Chaining:**
```python
from celery import chain, group

# Sequential workflow
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s()
)

# Parallel execution
parallel_tests = group(
    run_unit_tests.s(project_id),
    run_integration_tests.s(project_id),
    run_linting.s(project_id)
)
```
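
Nothing runs until the composed signature is applied; a short usage sketch (the task names reuse the illustrative ones above):

```python
# Dispatch the composed workflows.
result = workflow.apply_async()              # chain: each task feeds the next
test_results = parallel_tests.apply_async()  # group: tasks run concurrently

# Both return result handles that can be polled like any other Celery result.
print(result.id, test_results.id)
```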

### 6. FastAPI Integration

```python
# app/api/v1/agents.py
from celery.result import AsyncResult
from fastapi import APIRouter

from app.schemas.agents import AgentActionRequest  # request schema (path illustrative)
from app.tasks.agent_tasks import run_agent_action

router = APIRouter()


@router.post("/agents/{agent_id}/actions")
async def trigger_agent_action(
    agent_id: str,
    action: AgentActionRequest
):
    """Trigger an agent action as a background task."""

    # Dispatch to Celery
    task = run_agent_action.delay(
        agent_id=agent_id,
        project_id=action.project_id,
        action=action.action,
        context=action.context
    )

    return {
        "task_id": task.id,
        "status": "queued"
    }


@router.get("/tasks/{task_id}")
async def get_task_status(task_id: str):
    """Get the status of a background task."""

    result = AsyncResult(task_id)

    if result.state == "PENDING":
        return {"status": "pending"}
    elif result.state == "RUNNING":
        return {"status": "running", **result.info}
    elif result.state == "PROGRESS":
        return {"status": "progress", **result.info}
    elif result.state == "SUCCESS":
        return {"status": "completed", "result": result.result}
    elif result.state == "FAILURE":
        return {"status": "failed", "error": str(result.result)}

    return {"status": result.state}
```

### 7. Worker Configuration

```bash
# Run different workers for different queues

# Agent worker (prefetch=1 to avoid LLM rate-limit pileups)
celery -A app.core.celery worker \
  -Q agent_queue \
  -c 4 \
  --prefetch-multiplier=1 \
  -n agent_worker@%h

# Git worker (can handle multiple concurrent tasks)
celery -A app.core.celery worker \
  -Q git_queue \
  -c 8 \
  --prefetch-multiplier=4 \
  -n git_worker@%h

# Sync worker
celery -A app.core.celery worker \
  -Q sync_queue \
  -c 4 \
  --prefetch-multiplier=4 \
  -n sync_worker@%h
```

### 8. Monitoring with Flower

```yaml
# docker-compose.yml
services:
  flower:
    image: mher/flower:latest
    command: celery flower --broker=redis://redis:6379/0
    ports:
      - "5555:5555"
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - FLOWER_BASIC_AUTH=admin:password
```

### 9. Task Scheduling (Celery Beat)

```python
# app/core/celery.py
from celery.schedules import crontab

celery_app.conf.beat_schedule = {
    # Sync issues every minute
    "sync-external-issues": {
        "task": "app.tasks.sync_tasks.sync_all_issues",
        "schedule": 60.0,
    },
    # Health check every 5 minutes
    "agent-health-check": {
        "task": "app.tasks.agent_tasks.health_check_all_agents",
        "schedule": 300.0,
    },
    # Daily cleanup at midnight
    "cleanup-old-tasks": {
        "task": "app.tasks.maintenance.cleanup_old_tasks",
        "schedule": crontab(hour=0, minute=0),
    },
}
```

## Best Practices

1. **One task per LLM call** - Avoids rate-limit issues
2. **Progress reporting** - Update state for long-running tasks
3. **Idempotent tasks** - Handle retries gracefully
4. **Separate queues** - Isolate slow tasks from fast ones
5. **Task result expiry** - Set `result_expires` to avoid Redis bloat
6. **Soft time limits** - Allow graceful cleanup before the hard kill (see the sketch after this list)

## Recommendations

1. **Use Celery for all long-running operations**
   - Agent actions
   - Git operations
   - External sync
   - CI/CD triggers

2. **Use Redis as both broker and backend**
   - Simplifies infrastructure
   - Fast enough for our scale

3. **Configure separate queues**
   - `agent_queue` with prefetch=1
   - `git_queue` with prefetch=4
   - `sync_queue` with prefetch=4

4. **Implement proper monitoring**
   - Flower for web UI
   - Prometheus metrics export
   - Dead letter queue for failed tasks

## References

- [Celery Documentation](https://docs.celeryq.dev/)
- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [Celery Best Practices](https://docs.celeryq.dev/en/stable/userguide/tasks.html#tips-and-best-practices)

## Decision

**Adopt Celery + Redis** for all background task processing with queue-based routing and progress reporting via Redis Pub/Sub events.

---

*Spike completed. Findings will inform ADR-003: Background Task Architecture.*

docs/spikes/SPIKE-005-llm-provider-abstraction.md (new file, 516 lines)

# SPIKE-005: LLM Provider Abstraction

**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #5

---

## Objective

Research the best approach for a unified LLM provider abstraction with support for multiple providers, automatic failover, and cost tracking.

## Research Questions

1. What libraries exist for unified LLM access?
2. How do we implement automatic failover between providers?
3. How do we track token usage and costs per agent/project?
4. What caching strategies can reduce API costs?

## Findings

### 1. LiteLLM - Recommended Solution

**LiteLLM** provides a unified interface to 100+ LLM providers using the OpenAI SDK format.

**Key Features:**
- Unified API across providers (Anthropic, OpenAI, local, etc.)
- Built-in failover and load balancing
- Token counting and cost tracking
- Streaming support
- Async support
- Caching with Redis

**Installation:**
```bash
pip install litellm
```

### 2. Basic Usage

```python
import os

import litellm
from litellm import completion, acompletion

# Configure providers
litellm.api_key = os.getenv("ANTHROPIC_API_KEY")
litellm.set_verbose = True  # For debugging

# Synchronous call
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Async call (for FastAPI)
response = await acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### 3. Model Naming Convention

LiteLLM uses prefixed model names:

| Provider | Model Format |
|----------|--------------|
| Anthropic | `claude-3-5-sonnet-20241022` |
| OpenAI | `gpt-4-turbo` |
| Azure OpenAI | `azure/deployment-name` |
| Ollama | `ollama/llama3` |
| Together AI | `together_ai/togethercomputer/llama-2-70b` |
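
Switching providers is only a model-string change; self-hosted backends additionally take an explicit API base (a sketch assuming a local Ollama instance on its default port):

```python
# Same call shape as before, routed to a local Ollama server.
from litellm import completion

response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:11434",  # where the Ollama server listens
)
print(response.choices[0].message.content)
```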

### 4. Failover Configuration

```python
import os

from litellm import Router

# Define model list with fallbacks
model_list = [
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
        },
        "model_info": {"id": 1}
    },
    {
        "model_name": "primary-agent",  # Same name = fallback
        "litellm_params": {
            "model": "gpt-4-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
        "model_info": {"id": 2}
    },
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "ollama/llama3",
            "api_base": "http://localhost:11434",
        },
        "model_info": {"id": 3}
    }
]

# Initialize router with failover
router = Router(
    model_list=model_list,
    fallbacks=[
        {"primary-agent": ["primary-agent"]}  # Try all models with same name
    ],
    routing_strategy="simple-shuffle",  # or "latency-based-routing"
    num_retries=3,
    retry_after=5,  # seconds
    timeout=60,
)

# Use router
response = await router.acompletion(
    model="primary-agent",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### 5. Syndarix LLM Gateway Architecture

```python
# app/services/llm_gateway.py
from litellm import Router, acompletion

from app.core.config import settings
from app.models.agent import AgentType
from app.services.cost_tracker import CostTracker
from app.services.events import EventBus


class LLMGateway:
    """Unified LLM gateway with failover and cost tracking."""

    def __init__(self):
        self.router = self._build_router()
        self.cost_tracker = CostTracker()
        self.event_bus = EventBus()

    def _build_router(self) -> Router:
        """Build LiteLLM router from configuration."""
        model_list = []

        # Add Anthropic models
        if settings.ANTHROPIC_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "claude-3-5-sonnet-20241022",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "claude-3-haiku-20240307",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                }
            ])

        # Add OpenAI fallbacks
        if settings.OPENAI_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "gpt-4-turbo",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "gpt-4o-mini",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                }
            ])

        # Add local models (Ollama)
        if settings.OLLAMA_URL:
            model_list.append({
                "model_name": "local-fallback",
                "litellm_params": {
                    "model": "ollama/llama3",
                    "api_base": settings.OLLAMA_URL,
                }
            })

        return Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
                {"fast-response": ["fast-response", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
            timeout=120,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
        stream: bool = False,
        **kwargs
    ) -> dict:
        """
        Generate a completion with automatic failover and cost tracking.

        Args:
            agent_id: The calling agent's ID
            project_id: The project context
            messages: Chat messages
            model_preference: "high-reasoning" or "fast-response"
            stream: Whether to stream the response
            **kwargs: Additional LiteLLM parameters

        Returns:
            Completion response dictionary
        """
        try:
            if stream:
                return self._stream_completion(
                    agent_id, project_id, messages, model_preference, **kwargs
                )

            response = await self.router.acompletion(
                model=model_preference,
                messages=messages,
                **kwargs
            )

            # Track usage
            await self._track_usage(
                agent_id=agent_id,
                project_id=project_id,
                model=response.model,
                usage=response.usage,
            )

            return {
                "content": response.choices[0].message.content,
                "model": response.model,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens,
                }
            }

        except Exception as e:
            # Publish error event
            await self.event_bus.publish(f"project:{project_id}", {
                "type": "llm_error",
                "agent_id": agent_id,
                "error": str(e)
            })
            raise

    async def _stream_completion(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str,
        **kwargs
    ):
        """Stream a completion response."""
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
            stream=True,
            **kwargs
        )

        async for chunk in response:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    async def _track_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        usage: dict
    ):
        """Track token usage and costs."""
        await self.cost_tracker.record_usage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
        )
```

### 6. Cost Tracking

```python
# app/services/cost_tracker.py
from datetime import datetime

from sqlalchemy.ext.asyncio import AsyncSession

from app.models.usage import TokenUsage

# Cost in USD per 1M tokens (approximate)
MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "ollama/llama3": {"input": 0.00, "output": 0.00},  # Local
}


class CostTracker:
    def __init__(self, db: AsyncSession):
        self.db = db

    async def record_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        prompt_tokens: int,
        completion_tokens: int,
    ):
        """Record token usage and calculate cost."""
        costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})

        input_cost = (prompt_tokens / 1_000_000) * costs["input"]
        output_cost = (completion_tokens / 1_000_000) * costs["output"]
        total_cost = input_cost + output_cost

        usage = TokenUsage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
            cost_usd=total_cost,
            timestamp=datetime.utcnow(),
        )

        self.db.add(usage)
        await self.db.commit()

    async def get_project_usage(
        self,
        project_id: str,
        start_date: datetime | None = None,
        end_date: datetime | None = None,
    ) -> dict:
        """Get usage summary for a project."""
        # Query aggregated usage
        ...

    async def check_budget(
        self,
        project_id: str,
        budget_limit: float,
    ) -> bool:
        """Check if project is within budget."""
        usage = await self.get_project_usage(project_id)
        return usage["total_cost_usd"] < budget_limit
```
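
To make the arithmetic concrete, a worked example against the approximate rates above:

```python
# One claude-3-5-sonnet call with 2,000 prompt tokens and 800 completion tokens:
input_cost = (2_000 / 1_000_000) * 3.00   # = 0.006 USD
output_cost = (800 / 1_000_000) * 15.00   # = 0.012 USD
total_cost = input_cost + output_cost     # = 0.018 USD for the call
```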

### 7. Caching with Redis

```python
import litellm
from litellm import Cache

from app.core.config import settings

# Configure Redis cache
litellm.cache = Cache(
    type="redis",
    host=settings.REDIS_HOST,
    port=settings.REDIS_PORT,
    password=settings.REDIS_PASSWORD,
)

# Enable caching
litellm.enable_cache()

# Cached completions (same input = cached response)
response = await litellm.acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    cache={"ttl": 3600}  # Cache for 1 hour
)
```

### 8. Agent Type Model Mapping

```python
# app/models/agent_type.py
from enum import Enum

from sqlalchemy import Column, Enum as SQLEnum, Float, Integer, String, Text
from sqlalchemy.dialects.postgresql import UUID  # assumes a PostgreSQL UUID column

from app.db.base import Base


class ModelPreference(str, Enum):
    HIGH_REASONING = "high-reasoning"
    FAST_RESPONSE = "fast-response"
    COST_OPTIMIZED = "cost-optimized"


class AgentType(Base):
    __tablename__ = "agent_types"

    id = Column(UUID, primary_key=True)
    name = Column(String(50), unique=True)
    role = Column(String(50))

    # LLM configuration
    model_preference = Column(
        SQLEnum(ModelPreference),
        default=ModelPreference.HIGH_REASONING
    )
    max_tokens = Column(Integer, default=4096)
    temperature = Column(Float, default=0.7)

    # System prompt
    system_prompt = Column(Text)


# Mapping agent types to models
AGENT_MODEL_MAPPING = {
    "Product Owner": ModelPreference.HIGH_REASONING,
    "Project Manager": ModelPreference.FAST_RESPONSE,
    "Business Analyst": ModelPreference.HIGH_REASONING,
    "Software Architect": ModelPreference.HIGH_REASONING,
    "Software Engineer": ModelPreference.HIGH_REASONING,
    "UI/UX Designer": ModelPreference.HIGH_REASONING,
    "QA Engineer": ModelPreference.FAST_RESPONSE,
    "DevOps Engineer": ModelPreference.FAST_RESPONSE,
    "AI/ML Engineer": ModelPreference.HIGH_REASONING,
    "Security Expert": ModelPreference.HIGH_REASONING,
}
```
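
How the mapping feeds the gateway is not spelled out in this spike; a hypothetical glue function reusing `LLMGateway` and `AGENT_MODEL_MAPPING` from above (the `ask_as` name and argument shapes are assumptions):

```python
# Resolve an agent type's model preference and route the call through the gateway.
async def ask_as(agent: AgentType, project_id: str, prompt: str, gateway: LLMGateway) -> str:
    preference = AGENT_MODEL_MAPPING.get(agent.role, ModelPreference.FAST_RESPONSE)
    result = await gateway.complete(
        agent_id=str(agent.id),
        project_id=project_id,
        messages=[
            {"role": "system", "content": agent.system_prompt or ""},
            {"role": "user", "content": prompt},
        ],
        model_preference=preference.value,
        max_tokens=agent.max_tokens,
        temperature=agent.temperature,
    )
    return result["content"]
```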

## Rate Limiting Strategy

```python
import asyncio

from litellm import Router

# Configure rate limits per model (model_list and settings as defined above)
router = Router(
    model_list=model_list,
    redis_host=settings.REDIS_HOST,
    redis_port=settings.REDIS_PORT,
    routing_strategy="usage-based-routing",  # Route based on rate limits
)


# Custom rate limiter
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.semaphore = asyncio.Semaphore(requests_per_minute)

    async def acquire(self):
        await self.semaphore.acquire()
        # Release after 60 seconds
        asyncio.create_task(self._release_after(60))

    async def _release_after(self, seconds: int):
        await asyncio.sleep(seconds)
        self.semaphore.release()
```

## Recommendations

1. **Use LiteLLM as the unified abstraction layer**
   - Simplifies multi-provider support
   - Built-in failover and retry
   - Consistent API across providers

2. **Configure model groups by use case**
   - `high-reasoning`: Complex analysis, architecture decisions
   - `fast-response`: Quick tasks, simple queries
   - `cost-optimized`: Non-critical, high-volume tasks

3. **Implement automatic failover chain**
   - Primary: Claude 3.5 Sonnet
   - Fallback 1: GPT-4 Turbo
   - Fallback 2: Local Llama 3 (if available)

4. **Track all usage and costs**
   - Per agent, per project
   - Set budget alerts
   - Generate usage reports

5. **Cache frequently repeated queries**
   - Use Redis-backed cache
   - Cache embeddings for RAG
   - Cache deterministic transformations

## References

- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LiteLLM Router](https://docs.litellm.ai/docs/routing)
- [Anthropic Rate Limits](https://docs.anthropic.com/en/api/rate-limits)

## Decision

**Adopt LiteLLM** as the unified LLM abstraction layer with automatic failover, usage-based routing, and Redis-backed caching.

---

*Spike completed. Findings will inform ADR-004: LLM Provider Integration Architecture.*