diff --git a/docs/adrs/ADR-001-mcp-integration-architecture.md b/docs/adrs/ADR-001-mcp-integration-architecture.md new file mode 100644 index 0000000..e4d52f9 --- /dev/null +++ b/docs/adrs/ADR-001-mcp-integration-architecture.md @@ -0,0 +1,134 @@ +# ADR-001: MCP Integration Architecture + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team +**Related Spikes:** SPIKE-001 + +--- + +## Context + +Syndarix requires integration with multiple external services (LLM providers, Git, issue trackers, file systems, CI/CD). The Model Context Protocol (MCP) was identified as the standard for tool integration in AI applications. We need to decide on: + +1. The MCP framework to use +2. Server deployment pattern (singleton vs per-project) +3. Scoping mechanism for multi-project/multi-agent access + +## Decision Drivers + +- **Simplicity:** Minimize operational complexity +- **Resource Efficiency:** Avoid spawning redundant processes +- **Consistency:** Unified interface across all integrations +- **Scalability:** Support 10+ concurrent projects +- **Maintainability:** Easy to add new MCP servers + +## Considered Options + +### Option 1: Per-Project MCP Servers +Spawn dedicated MCP server instances for each project. + +**Pros:** +- Complete isolation between projects +- Simple access control (project owns server) + +**Cons:** +- Resource heavy (7 servers × N projects) +- Complex orchestration +- Difficult to share cross-project resources + +### Option 2: Unified Singleton MCP Servers (Selected) +Single instance of each MCP server type, with explicit project/agent scoping. + +**Pros:** +- Resource efficient (7 total servers) +- Simpler deployment +- Enables cross-project learning (if desired) +- Consistent management + +**Cons:** +- Requires explicit scoping in all tools +- Shared state requires careful design + +### Option 3: Hybrid (MCP Proxy) +Single proxy that routes to per-project backends. 
+ +**Pros:** +- Balance of isolation and efficiency + +**Cons:** +- Added complexity +- Routing overhead + +## Decision + +**Adopt Option 2: Unified Singleton MCP Servers with explicit scoping.** + +All MCP servers are deployed as singletons. Every tool accepts `project_id` and `agent_id` parameters for: +- Access control validation +- Audit logging +- Context filtering + +## Implementation + +### MCP Server Registry + +| Server | Port | Purpose | +|--------|------|---------| +| LLM Gateway | 9001 | Route LLM requests with failover | +| Git MCP | 9002 | Git operations across providers | +| Knowledge Base MCP | 9003 | RAG and document search | +| Issues MCP | 9004 | Issue tracking operations | +| File System MCP | 9005 | Workspace file operations | +| Code Analysis MCP | 9006 | Static analysis, linting | +| CI/CD MCP | 9007 | Pipeline operations | + +### Framework Selection + +Use **FastMCP 2.0** for all MCP server implementations: +- Decorator-based tool registration +- Built-in async support +- Compatible with SSE transport +- Type-safe with Pydantic + +### Tool Signature Pattern + +```python +@mcp.tool() +def tool_name( + project_id: str, # Required: project scope + agent_id: str, # Required: calling agent + # ... tool-specific params +) -> Result: + validate_access(agent_id, project_id) + log_tool_usage(agent_id, project_id, "tool_name") + # ... 
implementation +``` + +## Consequences + +### Positive +- Single deployment per MCP type simplifies operations +- Consistent interface across all tools +- Easy to add monitoring/logging centrally +- Cross-project analytics possible + +### Negative +- All tools must include scoping parameters +- Shared state requires careful design +- Single point of failure per MCP type (mitigated by multiple instances) + +### Neutral +- Requires MCP client manager in FastAPI backend +- Authentication handled internally (service tokens for v1) + +## Compliance + +This decision aligns with: +- FR-802: MCP-first architecture requirement +- NFR-201: Horizontal scalability requirement +- NFR-602: Centralized logging requirement + +--- + +*This ADR supersedes any previous decisions regarding MCP architecture.* diff --git a/docs/adrs/ADR-002-realtime-communication.md b/docs/adrs/ADR-002-realtime-communication.md new file mode 100644 index 0000000..bb49364 --- /dev/null +++ b/docs/adrs/ADR-002-realtime-communication.md @@ -0,0 +1,160 @@ +# ADR-002: Real-time Communication Architecture + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team +**Related Spikes:** SPIKE-003 + +--- + +## Context + +Syndarix requires real-time communication for: +- Agent activity streams +- Project progress updates +- Build/pipeline status +- Client approval requests +- Issue change notifications +- Interactive chat with agents + +We need to decide between WebSocket and Server-Sent Events (SSE) for real-time data delivery. + +## Decision Drivers + +- **Simplicity:** Minimize implementation complexity +- **Reliability:** Built-in reconnection handling +- **Scalability:** Support 200+ concurrent connections +- **Compatibility:** Work through proxies and load balancers +- **Use Case Fit:** Match communication patterns + +## Considered Options + +### Option 1: WebSocket Only +Use WebSocket for all real-time communication. 
+ +**Pros:** +- Bidirectional communication +- Single protocol to manage +- Well-supported in FastAPI + +**Cons:** +- Manual reconnection logic required +- More complex through proxies +- Overkill for server-to-client streams + +### Option 2: SSE Only +Use Server-Sent Events for all real-time communication. + +**Pros:** +- Built-in automatic reconnection +- Native HTTP (proxy-friendly) +- Simpler implementation + +**Cons:** +- Unidirectional only +- Browser connection limits per domain + +### Option 3: SSE Primary + WebSocket for Chat (Selected) +Use SSE for server-to-client events, WebSocket for bidirectional chat. + +**Pros:** +- Best tool for each use case +- SSE simplicity for 90% of needs +- WebSocket only where truly needed + +**Cons:** +- Two protocols to manage + +## Decision + +**Adopt Option 3: SSE as primary transport, WebSocket for interactive chat.** + +### SSE Use Cases (90%) +- Agent activity streams +- Project progress updates +- Build/pipeline status +- Approval request notifications +- Issue change notifications + +### WebSocket Use Cases (10%) +- Interactive chat with agents +- Real-time debugging sessions +- Future collaboration features + +## Implementation + +### Event Bus with Redis Pub/Sub + +``` +FastAPI Backend ──publish──> Redis Pub/Sub ──subscribe──> SSE Endpoints + │ + └──> Other Backend Instances +``` + +### SSE Endpoint Pattern + +```python +@router.get("/projects/{project_id}/events") +async def project_events(project_id: str, request: Request): + async def event_generator(): + subscriber = await event_bus.subscribe(f"project:{project_id}") + try: + while not await request.is_disconnected(): + try: + event = await asyncio.wait_for( + subscriber.get_event(), timeout=30.0 + ) + except asyncio.TimeoutError: + # No event within 30s: send an SSE comment line as a keepalive + yield ": keepalive\n\n" + continue + yield f"event: {event.type}\ndata: {event.json()}\n\n" + finally: + await subscriber.unsubscribe() + + return StreamingResponse( + event_generator(), + media_type="text/event-stream" + ) +``` + +### Event Types + +| Category | Event Types | +|----------|-------------| 
+| Agent | `agent_started`, `agent_activity`, `agent_completed`, `agent_error` | +| Project | `issue_created`, `issue_updated`, `issue_closed` | +| Git | `branch_created`, `commit_pushed`, `pr_created`, `pr_merged` | +| Workflow | `approval_required`, `sprint_started`, `sprint_completed` | +| Pipeline | `pipeline_started`, `pipeline_completed`, `pipeline_failed` | + +### Client Implementation + +- Single SSE connection per project +- Event multiplexing through event types +- Exponential backoff on reconnection +- Native `EventSource` API with automatic reconnect + +## Consequences + +### Positive +- Simpler implementation for server-to-client streams +- Automatic reconnection reduces client complexity +- Works through all HTTP proxies +- Reduced server resource usage vs WebSocket + +### Negative +- Two protocols to maintain +- WebSocket requires manual reconnect logic +- SSE limited to ~6 connections per domain (HTTP/1.1) + +### Mitigation +- Use HTTP/2 where possible (higher connection limits) +- Multiplex all project events on single connection +- WebSocket only for interactive chat sessions + +## Compliance + +This decision aligns with: +- FR-105: Real-time agent activity monitoring +- NFR-102: 200+ concurrent connections requirement +- NFR-501: Responsive UI updates + +--- + +*This ADR supersedes any previous decisions regarding real-time communication.* diff --git a/docs/adrs/ADR-003-background-task-architecture.md b/docs/adrs/ADR-003-background-task-architecture.md new file mode 100644 index 0000000..36c3ddb --- /dev/null +++ b/docs/adrs/ADR-003-background-task-architecture.md @@ -0,0 +1,179 @@ +# ADR-003: Background Task Architecture + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team +**Related Spikes:** SPIKE-004 + +--- + +## Context + +Syndarix requires background task processing for: +- Agent actions (LLM calls, code generation) +- Git operations (clone, commit, push, PR creation) +- External synchronization (issue sync with 
Gitea/GitHub/GitLab) +- CI/CD pipeline triggers +- Long-running workflows (sprints, story implementation) + +These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring. + +## Decision Drivers + +- **Reliability:** Tasks must complete even if workers restart +- **Visibility:** Progress tracking for long-running operations +- **Scalability:** Handle concurrent agent operations +- **Rate Limiting:** Respect LLM API rate limits +- **Async Compatibility:** Work with async FastAPI + +## Considered Options + +### Option 1: FastAPI BackgroundTasks +Use FastAPI's built-in background tasks. + +**Pros:** +- Simple, no additional infrastructure +- Direct async integration + +**Cons:** +- No persistence (lost on restart) +- No retry mechanism +- No distributed workers + +### Option 2: Celery + Redis (Selected) +Use Celery for task queue with Redis as broker/backend. + +**Pros:** +- Mature, battle-tested +- Persistent task queue +- Built-in retry with backoff +- Distributed workers +- Task chaining and workflows +- Monitoring with Flower + +**Cons:** +- Additional infrastructure +- Sync-only task execution (bridge needed for async) + +### Option 3: Dramatiq + Redis +Use Dramatiq as a simpler Celery alternative. + +**Pros:** +- Simpler API than Celery +- Good async support + +**Cons:** +- Less mature ecosystem +- Fewer monitoring tools + +### Option 4: ARQ (Async Redis Queue) +Use ARQ for native async task processing. + +**Pros:** +- Native async +- Simple API + +**Cons:** +- Less feature-rich +- Smaller community + +## Decision + +**Adopt Option 2: Celery + Redis.** + +Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend. 
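The "bridge needed for async" drawback listed under Option 2 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: `call_llm` and `run_agent_step` are hypothetical names, and in a real worker `run_agent_step` would be the body of a Celery task.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Hypothetical async client call; stands in for a real LLM request."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"echo: {prompt}"


def run_agent_step(prompt: str) -> str:
    """Synchronous entry point, shaped like the body of a Celery task.

    asyncio.run() creates a fresh event loop, runs the coroutine to
    completion, and closes the loop. It must not be called from code
    that is already running inside an event loop.
    """
    return asyncio.run(call_llm(prompt))
```

Creating one event loop per task call keeps the bridge simple; whether that cost is acceptable for short, high-volume tasks would need to be measured.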
+ +## Implementation + +### Queue Architecture + +``` +┌─────────────────────────────────────────────────┐ +│ Redis (Broker + Backend) │ +├─────────────┬─────────────┬─────────────────────┤ +│ agent_queue │ git_queue │ sync_queue │ +│ (prefetch=1)│ (prefetch=4)│ (prefetch=4) │ +└──────┬──────┴──────┬──────┴──────────┬──────────┘ + │ │ │ + ▼ ▼ ▼ + ┌─────────┐ ┌─────────┐ ┌─────────┐ + │ Agent │ │ Git │ │ Sync │ + │ Workers │ │ Workers │ │ Workers │ + └─────────┘ └─────────┘ └─────────┘ +``` + +### Queue Configuration + +| Queue | Prefetch | Concurrency | Purpose | +|-------|----------|-------------|---------| +| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) | +| `git_queue` | 4 | 8 | Git operations | +| `sync_queue` | 4 | 4 | External sync | +| `cicd_queue` | 4 | 4 | Pipeline operations | + +### Task Patterns + +**Progress Reporting:** +```python +@celery_app.task(bind=True) +def implement_story(self, story_id: str, agent_id: str, project_id: str): + for i, step in enumerate(steps): + self.update_state( + state="PROGRESS", + meta={"current": i + 1, "total": len(steps)} + ) + # Publish SSE event for real-time UI update + event_bus.publish(f"project:{project_id}", { + "type": "agent_progress", + "step": i + 1, + "total": len(steps) + }) + execute_step(step) +``` + +**Task Chaining:** +```python +workflow = chain( + analyze_requirements.s(story_id), + design_solution.s(), + implement_code.s(), + run_tests.s(), + create_pr.s() +) +``` + +### Monitoring + +- **Flower:** Web UI for task monitoring (port 5555) +- **Prometheus:** Metrics export for alerting +- **Dead Letter Queue:** Failed tasks for investigation + +## Consequences + +### Positive +- Reliable task execution with persistence +- Automatic retry with exponential backoff +- Progress tracking for long operations +- Distributed workers for scalability +- Rich monitoring and debugging tools + +### Negative +- Additional infrastructure (Redis, workers) +- Celery is synchronous (event_loop bridge for async 
calls) +- Learning curve for task patterns + +### Mitigation +- Use existing Redis instance (already needed for SSE) +- Wrap async calls with `asyncio.run()` or asgiref's `async_to_sync` +- Document common task patterns + +## Compliance + +This decision aligns with: +- FR-304: Long-running implementation workflow +- NFR-102: 500+ background jobs per minute +- NFR-402: Task reliability and fault tolerance + +--- + +*This ADR supersedes any previous decisions regarding background task processing.* diff --git a/docs/adrs/ADR-004-llm-provider-abstraction.md b/docs/adrs/ADR-004-llm-provider-abstraction.md new file mode 100644 index 0000000..2b5a3b6 --- /dev/null +++ b/docs/adrs/ADR-004-llm-provider-abstraction.md @@ -0,0 +1,189 @@ +# ADR-004: LLM Provider Abstraction + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team +**Related Spikes:** SPIKE-005 + +--- + +## Context + +Syndarix agents require access to large language models (LLMs) from multiple providers: +- **Anthropic** (Claude) - Primary provider +- **OpenAI** (GPT-4) - Fallback provider +- **Local models** (Ollama/Llama) - Cost optimization, privacy + +We need a unified abstraction layer that provides: +- Consistent API across providers +- Automatic failover on errors +- Usage tracking and cost management +- Rate limiting compliance + +## Decision Drivers + +- **Reliability:** Automatic failover on provider outages +- **Cost Control:** Track and limit API spending +- **Flexibility:** Easy to add/swap providers +- **Consistency:** Single interface for all agents +- **Async Support:** Compatible with async FastAPI + +## Considered Options + +### Option 1: Direct Provider SDKs +Use Anthropic and OpenAI SDKs directly with custom abstraction. 
+ +**Pros:** +- Full control over implementation +- No external dependencies + +**Cons:** +- Significant development effort +- Must maintain failover logic +- Must track token costs manually + +### Option 2: LiteLLM (Selected) +Use LiteLLM as unified abstraction layer. + +**Pros:** +- Unified API for 100+ providers +- Built-in failover and routing +- Automatic token counting +- Cost tracking built-in +- Redis caching support +- Active community + +**Cons:** +- External dependency +- May lag behind provider SDK updates + +### Option 3: LangChain +Use LangChain's LLM abstraction. + +**Pros:** +- Large ecosystem +- Many integrations + +**Cons:** +- Heavy dependency +- Overkill for just LLM abstraction +- Complexity overhead + +## Decision + +**Adopt Option 2: LiteLLM for unified LLM provider abstraction.** + +LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead. + +## Implementation + +### Model Groups + +| Group Name | Use Case | Primary Model | Fallback | +|------------|----------|---------------|----------| +| `high-reasoning` | Complex analysis, architecture | Claude 3.5 Sonnet | GPT-4 Turbo | +| `fast-response` | Quick tasks, simple queries | Claude 3 Haiku | GPT-4o Mini | +| `cost-optimized` | High-volume, non-critical | Local Llama 3 | Claude 3 Haiku | + +### Failover Chain + +``` +Claude 3.5 Sonnet (Anthropic) + │ + ▼ (on failure) + GPT-4 Turbo (OpenAI) + │ + ▼ (on failure) + Llama 3 (Ollama/Local) + │ + ▼ (on failure) + Error with retry +``` + +### LLM Gateway Service + +```python +class LLMGateway: + def __init__(self): + self.router = Router( + model_list=model_list, + fallbacks=[ + {"high-reasoning": ["high-reasoning", "local-fallback"]}, + ], + routing_strategy="latency-based-routing", + num_retries=3, + ) + + async def complete( + self, + agent_id: str, + project_id: str, + messages: list[dict], + model_preference: str = "high-reasoning", + ) -> dict: + response = await self.router.acompletion( + 
model=model_preference, + messages=messages, + ) + await self._track_usage(agent_id, project_id, response) + return response +``` + +### Cost Tracking + +| Model | Input (per 1M tokens) | Output (per 1M tokens) | +|-------|----------------------|------------------------| +| Claude 3.5 Sonnet | $3.00 | $15.00 | +| Claude 3 Haiku | $0.25 | $1.25 | +| GPT-4 Turbo | $10.00 | $30.00 | +| GPT-4o Mini | $0.15 | $0.60 | +| Ollama (local) | $0.00 | $0.00 | + +### Agent Type Mapping + +| Agent Type | Model Preference | Rationale | +|------------|------------------|-----------| +| Product Owner | high-reasoning | Complex requirements analysis | +| Software Architect | high-reasoning | Architecture decisions | +| Software Engineer | high-reasoning | Code generation | +| QA Engineer | fast-response | Test case generation | +| DevOps Engineer | fast-response | Config generation | +| Project Manager | fast-response | Status updates | + +### Caching Strategy + +- **Redis-backed cache** for repeated queries +- **TTL:** 1 hour for general queries +- **Skip cache:** For context-dependent generation +- **Cache key:** Hash of (model, messages, temperature) + +## Consequences + +### Positive +- Single interface for all LLM operations +- Automatic failover improves reliability +- Built-in cost tracking and budgeting +- Easy to add new providers +- Caching reduces API costs + +### Negative +- Dependency on LiteLLM library +- May lag behind provider SDK features +- Additional abstraction layer + +### Mitigation +- Pin LiteLLM version, test before upgrades +- Direct SDK access available if needed +- Monitor LiteLLM updates for breaking changes + +## Compliance + +This decision aligns with: +- FR-101: Agent type model configuration +- NFR-103: Agent response time targets +- NFR-402: Failover requirements +- TR-001: LLM API unavailability mitigation + +--- + +*This ADR supersedes any previous decisions regarding LLM integration.* diff --git a/docs/adrs/ADR-005-tech-stack-selection.md 
b/docs/adrs/ADR-005-tech-stack-selection.md new file mode 100644 index 0000000..c0f6063 --- /dev/null +++ b/docs/adrs/ADR-005-tech-stack-selection.md @@ -0,0 +1,156 @@ +# ADR-005: Technology Stack Selection + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team + +--- + +## Context + +Syndarix needs a robust, modern technology stack that can support: +- Multi-agent orchestration with real-time communication +- Full-stack web application with API backend +- Background task processing for long-running operations +- Vector search for RAG (Retrieval-Augmented Generation) +- Multiple external integrations via MCP + +The decision was made to build upon **PragmaStack** as the foundation, extending it with Syndarix-specific components. + +## Decision Drivers + +- **Productivity:** Rapid development with modern frameworks +- **Type Safety:** Minimize runtime errors +- **Async Performance:** Handle concurrent agent operations +- **Ecosystem:** Rich library support +- **Familiarity:** Team expertise with selected technologies +- **Production-Ready:** Proven technologies for production workloads + +## Decision + +**Adopt PragmaStack as foundation with Syndarix-specific extensions.** + +### Core Stack (from PragmaStack) + +| Layer | Technology | Version | Rationale | +|-------|------------|---------|-----------| +| **Backend** | FastAPI | 0.115+ | Async, OpenAPI, type hints | +| **Backend Language** | Python | 3.11+ | Type hints, async/await, ecosystem | +| **Frontend** | Next.js | 16 | React 19, server components, App Router | +| **Frontend Language** | TypeScript | 5.0+ | Type safety, IDE support | +| **Database** | PostgreSQL | 15+ | Robust, extensible, pgvector | +| **ORM** | SQLAlchemy | 2.0+ | Async support, type hints | +| **Validation** | Pydantic | 2.0+ | Data validation, serialization | +| **State Management** | Zustand | 4.0+ | Simple, performant | +| **Data Fetching** | TanStack Query | 5.0+ | Caching, invalidation | +| **UI Components** | 
shadcn/ui | Latest | Accessible, customizable | +| **CSS** | Tailwind CSS | 4.0+ | Utility-first, fast styling | +| **Auth** | JWT | - | Dual-token (access + refresh) | + +### Syndarix Extensions + +| Component | Technology | Version | Purpose | +|-----------|------------|---------|---------| +| **Task Queue** | Celery | 5.3+ | Background job processing | +| **Message Broker** | Redis | 7.0+ | Celery broker, caching, pub/sub | +| **Vector Store** | pgvector | Latest | Embeddings for RAG | +| **MCP Framework** | FastMCP | 2.0+ | MCP server development | +| **LLM Abstraction** | LiteLLM | Latest | Multi-provider LLM access | +| **Real-time** | SSE + WebSocket | - | Event streaming, chat | + +### Testing Stack + +| Type | Technology | Version | Purpose | +|------|------------|---------|---------| +| **Backend Unit** | pytest | 8.0+ | Python testing | +| **Backend Async** | pytest-asyncio | - | Async test support | +| **Backend Coverage** | coverage.py | - | Code coverage | +| **Frontend Unit** | Jest | 29+ | React testing | +| **Frontend Components** | React Testing Library | - | Component testing | +| **E2E** | Playwright | 1.40+ | Browser automation | + +### DevOps Stack + +| Component | Technology | Version | Purpose | +|-----------|------------|---------|---------| +| **Containerization** | Docker | 24+ | Application packaging | +| **Orchestration** | Docker Compose | - | Local development | +| **CI/CD** | Gitea Actions | - | Automated pipelines | +| **Database Migrations** | Alembic | - | Schema versioning | + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Frontend (Next.js 16) │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Pages │ │ Components │ │ Stores │ │ +│ │ (App Router)│ │ (shadcn/ui) │ │ (Zustand) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +└────────────────────────────┬────────────────────────────────────┘ + │ REST + SSE + WebSocket + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Backend (FastAPI 0.115+) │ 
+│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ API │ │ Services │ │ CRUD │ │ +│ │ Routes │ │ Layer │ │ Layer │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ LLM Gateway │ │ MCP Client │ │ Event Bus │ │ +│ │ (LiteLLM) │ │ Manager │ │ (Redis) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ┌────────────────────┼────────────────────┐ + ▼ ▼ ▼ +┌───────────────┐ ┌───────────────┐ ┌───────────────────────────┐ +│ PostgreSQL │ │ Redis │ │ MCP Servers │ +│ + pgvector │ │ (Cache/Queue) │ │ (LLM, Git, KB, Issues...) │ +└───────────────┘ └───────────────┘ └───────────────────────────┘ + │ + ▼ + ┌───────────────┐ + │ Celery │ + │ Workers │ + └───────────────┘ +``` + +## Consequences + +### Positive +- Proven, production-ready stack +- Strong typing throughout (Python + TypeScript) +- Excellent async performance +- Rich ecosystem for extensions +- Team familiarity reduces learning curve + +### Negative +- Python GIL limits CPU-bound concurrency (mitigated by Celery) +- Multiple languages (Python + TypeScript) to maintain +- PostgreSQL requires management (vs serverless options) + +### Neutral +- PragmaStack provides solid foundation but may include unused features +- Stack is opinionated, limiting some technology choices + +## Version Pinning Strategy + +| Component | Strategy | Rationale | +|-----------|----------|-----------| +| Python | 3.11+ (specific minor) | Stability | +| Node.js | 20 LTS | Long-term support | +| FastAPI | 0.115+ | Latest stable | +| Next.js | 16 | Current major | +| PostgreSQL | 15+ | Required for features | + +## Compliance + +This decision aligns with: +- NFR-601: Code quality standards (TypeScript, type hints) +- NFR-603: Docker containerization requirement +- TC-001 through TC-006: Technical constraints + +--- + +*This ADR establishes the foundational technology choices for Syndarix.* diff 
--git a/docs/adrs/ADR-006-agent-orchestration.md b/docs/adrs/ADR-006-agent-orchestration.md new file mode 100644 index 0000000..6a3683d --- /dev/null +++ b/docs/adrs/ADR-006-agent-orchestration.md @@ -0,0 +1,260 @@ +# ADR-006: Agent Orchestration Architecture + +**Status:** Accepted +**Date:** 2025-12-29 +**Deciders:** Architecture Team +**Related Spikes:** SPIKE-002 + +--- + +## Context + +Syndarix requires an agent orchestration system that can: +- Define reusable agent types with specific capabilities +- Spawn multiple instances of the same type with unique identities +- Manage agent state, context, and conversation history +- Route messages between agents +- Handle agent failover and recovery +- Track resource usage per agent + +## Decision Drivers + +- **Flexibility:** Support diverse agent roles and capabilities +- **Scalability:** Handle 50+ concurrent agent instances +- **Isolation:** Each instance maintains separate state +- **Observability:** Full visibility into agent activities +- **Reliability:** Graceful handling of failures + +## Decision + +**Adopt a Type-Instance pattern** where: +- **Agent Types** define templates (model, expertise, personality) +- **Agent Instances** are spawned from types with unique identities +- **Agent Orchestrator** manages lifecycle and communication + +## Architecture + +### Agent Type Definition + +```python +class AgentType(Base): + id = Column(UUID, primary_key=True) + name = Column(String(50), unique=True) # "Software Engineer" + role = Column(Enum(AgentRole)) # ENGINEER + base_model = Column(String(100)) # "claude-3-5-sonnet-20241022" + failover_model = Column(String(100)) # "gpt-4-turbo" + expertise = Column(ARRAY(String)) # ["python", "fastapi", "testing"] + personality = Column(JSONB) # {"style": "detailed", "tone": "professional"} + system_prompt = Column(Text) # Base system prompt template + capabilities = Column(ARRAY(String)) # ["code_generation", "code_review"] + is_active = Column(Boolean, default=True) +``` 
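The Type-Instance split described in the Decision can be illustrated without a database. The sketch below uses plain dataclasses; `AgentTypeSpec`, `AgentInstanceSpec`, `spawn`, and the instance name "Erin" are hypothetical and chosen only for illustration. They are not part of the schema in this ADR.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class AgentTypeSpec:
    """Shared template: every instance of a type reads the same config."""
    name: str
    base_model: str
    capabilities: tuple[str, ...] = ()


@dataclass
class AgentInstanceSpec:
    """Per-instance state: a unique identity plus its own mutable context."""
    name: str
    agent_type: AgentTypeSpec
    project_id: str
    id: UUID = field(default_factory=uuid4)
    context: dict = field(default_factory=dict)


def spawn(agent_type: AgentTypeSpec, project_id: str, name: str) -> AgentInstanceSpec:
    """Create a new instance that references, rather than copies, its type."""
    return AgentInstanceSpec(name=name, agent_type=agent_type, project_id=project_id)


engineer = AgentTypeSpec(
    name="Software Engineer",
    base_model="claude-3-5-sonnet-20241022",
    capabilities=("code_generation", "code_review"),
)
dave = spawn(engineer, project_id="project-1", name="Dave")
erin = spawn(engineer, project_id="project-1", name="Erin")
```

Both instances point at the same immutable type object, while each gets its own identifier and context dictionary, which is the isolation property the decision drivers call for.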
+ +### Agent Instance Definition + +```python +class AgentInstance(Base): + id = Column(UUID, primary_key=True) + name = Column(String(50)) # "Dave" + agent_type_id = Column(UUID, ForeignKey) + project_id = Column(UUID, ForeignKey) + status = Column(Enum(InstanceStatus)) # ACTIVE, IDLE, TERMINATED + context = Column(JSONB) # Current working context + conversation_id = Column(UUID) # Active conversation + rag_collection_id = Column(String) # Domain knowledge collection + token_usage = Column(JSONB) # {"prompt": 0, "completion": 0} + last_active_at = Column(DateTime) + created_at = Column(DateTime) + terminated_at = Column(DateTime) +``` + +### Orchestrator Service + +```python +class AgentOrchestrator: + """Central service for agent lifecycle management.""" + + async def spawn_agent( + self, + agent_type_id: UUID, + project_id: UUID, + name: str, + domain_knowledge: list[str] | None = None + ) -> AgentInstance: + """Spawn a new agent instance from a type definition.""" + agent_type = await self.get_agent_type(agent_type_id) + + instance = AgentInstance( + name=name, + agent_type_id=agent_type_id, + project_id=project_id, + status=InstanceStatus.ACTIVE, + context={"initialized_at": datetime.utcnow().isoformat()}, + ) + + # Initialize RAG collection if domain knowledge provided + if domain_knowledge: + instance.rag_collection_id = await self._init_rag_collection( + instance.id, domain_knowledge + ) + + self.db.add(instance) # assuming self.db is a SQLAlchemy AsyncSession, add() is synchronous + await self.db.commit() + + # Publish spawn event + await self.event_bus.publish(f"project:{project_id}", { + "type": "agent_spawned", + "agent_id": str(instance.id), + "name": name, + "role": agent_type.role.value + }) + + return instance + + async def terminate_agent(self, instance_id: UUID) -> None: + """Terminate an agent instance and release resources.""" + instance = await self.get_instance(instance_id) + instance.status = InstanceStatus.TERMINATED + instance.terminated_at = datetime.utcnow() + + # Cleanup RAG collection + if 
instance.rag_collection_id: + await self._cleanup_rag_collection(instance.rag_collection_id) + + await self.db.commit() + + async def send_message( + self, + from_id: UUID, + to_id: UUID, + message: AgentMessage + ) -> None: + """Route a message from one agent to another.""" + # Validate both agents exist and are active + sender = await self.get_instance(from_id) + recipient = await self.get_instance(to_id) + + # Persist message + await self.message_store.save(message) + + # If recipient is idle, trigger action + if recipient.status == InstanceStatus.IDLE: + await self._trigger_agent_action(recipient.id, message) + + # Publish for real-time tracking + await self.event_bus.publish(f"project:{sender.project_id}", { + "type": "agent_message", + "from": str(from_id), + "to": str(to_id), + "preview": message.content[:100] + }) + + async def broadcast( + self, + from_id: UUID, + target_role: AgentRole, + message: AgentMessage + ) -> None: + """Broadcast a message to all agents of a specific role.""" + sender = await self.get_instance(from_id) + recipients = await self.get_instances_by_role( + sender.project_id, target_role + ) + + for recipient in recipients: + await self.send_message(from_id, recipient.id, message) +``` + +### Agent Execution Pattern + +```python +class AgentRunner: + """Executes agent actions using LLM.""" + + def __init__(self, instance: AgentInstance, llm_gateway: LLMGateway): + self.instance = instance + self.llm = llm_gateway + + async def execute(self, action: str, context: dict) -> dict: + """Execute an action using the agent's configured model.""" + agent_type = await self.get_agent_type(self.instance.agent_type_id) + + # Build messages with system prompt and context + messages = [ + {"role": "system", "content": self._build_system_prompt(agent_type)}, + *self._get_conversation_history(), + {"role": "user", "content": self._build_action_prompt(action, context)} + ] + + # Add RAG context if available + if self.instance.rag_collection_id: + 
rag_context = await self._query_rag(action, context) + messages.insert(1, { + "role": "system", + "content": f"Relevant context:\n{rag_context}" + }) + + # Execute with failover + response = await self.llm.complete( + agent_id=str(self.instance.id), + project_id=str(self.instance.project_id), + messages=messages, + model_preference=self._get_model_preference(agent_type) + ) + + # Update instance context + self.instance.context = { + **self.instance.context, + "last_action": action, + "last_response_at": datetime.utcnow().isoformat() + } + + return response +``` + +### Agent Roles + +| Role | Instances | Primary Capabilities | +|------|-----------|---------------------| +| Product Owner | 1 | requirements, prioritization, client_communication | +| Project Manager | 1 | planning, tracking, coordination | +| Business Analyst | 1 | analysis, documentation, process_modeling | +| Software Architect | 1 | design, architecture_decisions, tech_selection | +| Software Engineer | 1-5 | code_generation, code_review, testing | +| UI/UX Designer | 1 | design, wireframes, accessibility | +| QA Engineer | 1-2 | test_planning, test_automation, bug_reporting | +| DevOps Engineer | 1 | cicd, infrastructure, deployment | +| AI/ML Engineer | 1 | ml_development, model_training, mlops | +| Security Expert | 1 | security_review, vulnerability_assessment | + +## Consequences + +### Positive +- Clear separation between type definition and instance runtime +- Multiple instances share type configuration (DRY) +- Easy to add new agent roles +- Full observability through events +- Graceful failure handling with model failover + +### Negative +- Complexity in managing instance lifecycle +- State synchronization across instances +- Memory overhead for context storage + +### Mitigation +- Context archival for long-running instances +- Periodic cleanup of terminated instances +- State compression for large contexts + +## Compliance + +This decision aligns with: +- FR-101: Agent type configuration 
+- FR-102: Agent instance spawning +- FR-103: Agent domain knowledge (RAG) +- FR-104: Inter-agent communication +- FR-105: Agent activity monitoring + +--- + +*This ADR establishes the agent orchestration architecture for Syndarix.* diff --git a/docs/architecture/ARCHITECTURE_OVERVIEW.md b/docs/architecture/ARCHITECTURE_OVERVIEW.md new file mode 100644 index 0000000..750a35e --- /dev/null +++ b/docs/architecture/ARCHITECTURE_OVERVIEW.md @@ -0,0 +1,487 @@ +# Syndarix Architecture Overview + +**Version:** 1.0 +**Date:** 2025-12-29 +**Status:** Draft + +--- + +## Table of Contents + +1. [Executive Summary](#1-executive-summary) +2. [System Context](#2-system-context) +3. [High-Level Architecture](#3-high-level-architecture) +4. [Core Components](#4-core-components) +5. [Data Architecture](#5-data-architecture) +6. [Integration Architecture](#6-integration-architecture) +7. [Security Architecture](#7-security-architecture) +8. [Deployment Architecture](#8-deployment-architecture) +9. [Cross-Cutting Concerns](#9-cross-cutting-concerns) +10. [Architecture Decisions](#10-architecture-decisions) + +--- + +## 1. Executive Summary + +Syndarix is an AI-powered software consulting agency platform that orchestrates specialized AI agents to deliver complete software solutions autonomously. This document describes the technical architecture that enables: + +- **Multi-Agent Orchestration:** 10 specialized agent roles collaborating on projects +- **MCP-First Integration:** All external tools via Model Context Protocol +- **Real-time Visibility:** SSE-based event streaming for progress tracking +- **Autonomous Workflows:** Configurable autonomy levels from full control to autonomous +- **Full Artifact Delivery:** Code, documentation, tests, and ADRs + +### Architecture Principles + +1. **MCP-First:** All integrations through unified MCP servers +2. **Event-Driven:** Async communication via Redis Pub/Sub +3. **Type-Safe:** Full typing in Python and TypeScript +4. 
**Stateless Services:** Horizontal scaling through stateless design +5. **Explicit Scoping:** All operations scoped to project/agent + +--- + +## 2. System Context + +### Context Diagram + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ EXTERNAL ACTORS │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Client │ │ Admin │ │ LLM APIs │ │ Git Hosts │ │ +│ │ (Human) │ │ (Human) │ │ (Anthropic) │ │ (Gitea) │ │ +│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ +│ │ │ │ │ │ +└─────────│──────────────────│──────────────────│──────────────────│──────────┘ + │ │ │ │ + │ Web UI │ Admin UI │ API │ API + │ SSE │ │ │ + ▼ ▼ ▼ ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ │ +│ SYNDARIX PLATFORM │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ Agent Orchestration │ │ +│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ +│ │ │ PO │ │ PM │ │ Arch │ │ Eng │ │ QA │ ... 
│ │ +│ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ │ │ │ + │ Storage │ Events │ Tasks │ + ▼ ▼ ▼ ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ INFRASTRUCTURE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ PostgreSQL │ │ Redis │ │ Celery │ │MCP Servers │ │ +│ │ + pgvector │ │ Pub/Sub │ │ Workers │ │ (7 types) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Key Actors + +| Actor | Type | Interaction | +|-------|------|-------------| +| Client | Human | Web UI, approvals, feedback | +| Admin | Human | Configuration, monitoring | +| LLM Providers | External | Claude, GPT-4, local models | +| Git Hosts | External | Gitea, GitHub, GitLab | +| CI/CD Systems | External | Gitea Actions, etc. | + +--- + +## 3. 
High-Level Architecture + +### Layered Architecture + +``` +┌───────────────────────────────────────────────────────────────────┐ +│ PRESENTATION LAYER │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Next.js 16 Frontend │ │ +│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ +│ │ │Dashboard │ │ Projects │ │ Agents │ │ Issues │ │ │ +│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────┘ + │ + │ REST + SSE + WebSocket + ▼ +┌───────────────────────────────────────────────────────────────────┐ +│ APPLICATION LAYER │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ FastAPI Backend │ │ +│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ +│ │ │ Auth │ │ API │ │ Services │ │ Events │ │ │ +│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────┐ +│ ORCHESTRATION LAYER │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ Agent │ │ Workflow │ │ Project │ │ │ +│ │ │ Orchestrator │ │ Engine │ │ Manager │ │ │ +│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────┐ +│ INTEGRATION LAYER │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ MCP Client Manager │ │ +│ │ Connects to: LLM, Git, KB, Issues, FS, Code, CI/CD MCPs │ │ +│ └─────────────────────────────────────────────────────────────┘ │ 
+└───────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────┐ +│ DATA LAYER │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ PostgreSQL │ │ Redis │ │ File Store │ │ +│ │ + pgvector │ │ │ │ │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└───────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 4. Core Components + +### 4.1 Agent Orchestrator + +**Purpose:** Manages agent lifecycle, spawning, communication, and coordination. + +**Responsibilities:** +- Spawn agent instances from type definitions +- Route messages between agents +- Manage agent context and memory +- Handle agent failover +- Track resource usage + +**Key Patterns:** +- Type-Instance pattern (types define templates, instances are runtime) +- Message routing with priority queues +- Context compression for long-running agents + +See: [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md) + +### 4.2 Workflow Engine + +**Purpose:** Orchestrates multi-step workflows and agent collaboration. + +**Responsibilities:** +- Execute workflow templates (requirements discovery, sprint, etc.) +- Track workflow state and progress +- Handle branching and conditions +- Manage approval gates + +**Workflow Types:** +- Requirements Discovery +- Architecture Spike +- Sprint Planning +- Implementation +- Sprint Demo + +### 4.3 Project Manager (Component) + +**Purpose:** Manages project lifecycle, configuration, and state. + +**Responsibilities:** +- Create and configure projects +- Manage complexity levels +- Track project status +- Generate reports + +### 4.4 LLM Gateway + +**Purpose:** Unified LLM access with failover and cost tracking. 
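The failover and cost-tracking behaviour named in the purpose statement can be illustrated with a minimal sketch. It is not the gateway's actual implementation: `FailoverGateway`, `ProviderError`, the `Provider` callable signature, and the two stub providers are all hypothetical names introduced here, and the code is synchronous and free of any LLM library so that it runs on its own. It shows only the ordering logic: try each model in turn, skip the ones that fail, and charge the tokens used to the calling agent.

```python
from dataclasses import dataclass, field
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider callable when a completion attempt fails."""


# Hypothetical provider signature: takes chat messages, returns (text, tokens_used).
Provider = Callable[[list[dict]], tuple[str, int]]


@dataclass
class FailoverGateway:
    # Ordered (model_name, provider) pairs; earlier entries are preferred.
    chain: list[tuple[str, Provider]]
    # Running token totals keyed by agent_id, a stand-in for cost tracking.
    tokens_by_agent: dict[str, int] = field(default_factory=dict)

    def complete(self, agent_id: str, messages: list[dict]) -> tuple[str, str]:
        """Try each model in order; return (model_name, text) from the first success."""
        errors: list[str] = []
        for model_name, provider in self.chain:
            try:
                text, tokens = provider(messages)
            except ProviderError as exc:
                # Record the failure and fall through to the next model.
                errors.append(f"{model_name}: {exc}")
                continue
            used = self.tokens_by_agent.get(agent_id, 0)
            self.tokens_by_agent[agent_id] = used + tokens
            return model_name, text
        raise ProviderError("all models failed: " + "; ".join(errors))


def flaky_primary(messages: list[dict]) -> tuple[str, int]:
    # Stub that always fails, to exercise the failover path.
    raise ProviderError("rate limited")


def stable_fallback(messages: list[dict]) -> tuple[str, int]:
    # Stub that always succeeds and reports 42 tokens used.
    return "ok", 42


gateway = FailoverGateway(
    chain=[("primary", flaky_primary), ("fallback", stable_fallback)]
)
model_used, reply = gateway.complete(
    "agent-1", [{"role": "user", "content": "ping"}]
)
```

In this run the first provider raises, so the reply comes from the second entry in the chain and its tokens are recorded against `agent-1`. A real gateway would presumably be async and would also need timeouts and a retry policy, which this sketch leaves out.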
+ +**Implementation:** LiteLLM-based router with: +- Multiple model groups (high-reasoning, fast-response) +- Automatic failover chain +- Per-agent token tracking +- Redis-backed caching + +See: [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md) + +### 4.5 MCP Client Manager + +**Purpose:** Connects to all MCP servers and routes tool calls. + +**Implementation:** +- SSE connections to 7 MCP server types +- Automatic reconnection +- Request/response correlation +- Scoped tool calls with project_id/agent_id + +See: [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md) + +### 4.6 Event Bus + +**Purpose:** Real-time event distribution using Redis Pub/Sub. + +**Channels:** +- `project:{project_id}` - Project-scoped events +- `agent:{agent_id}` - Agent-specific events +- `system` - System-wide announcements + +See: [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md) + +--- + +## 5. Data Architecture + +### 5.1 Entity Model + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ User │───1:N─│ Project │───1:N─│ Sprint │ +└─────────────┘ └─────────────┘ └─────────────┘ + │ 1:N │ 1:N + │ │ + ┌──────┴──────┐ ┌──────┴──────┐ + │ │ │ │ + ┌──────┴──────┐ ┌────┴────┐ │ ┌─────┴─────┐ + │ AgentInstance│ │Repository│ │ │ Issue │ + └─────────────┘ └─────────┘ │ └───────────┘ + │ │ │ │ + │ 1:N │ 1:N │ │ 1:N + ┌──────┴──────┐ ┌──────┴────┐│ ┌──────┴──────┐ + │ Message │ │PullRequest│└───────│IssueComment │ + └─────────────┘ └───────────┘ └─────────────┘ +``` + +### 5.2 Key Entities + +| Entity | Purpose | Key Fields | +|--------|---------|------------| +| User | Human users | email, auth | +| Project | Work containers | name, complexity, autonomy_level | +| AgentType | Agent templates | base_model, expertise, system_prompt | +| AgentInstance | Running agents | name, project_id, context | +| Issue | Work items | type, status, external_tracker_fields | +| Sprint | Time-boxed iterations | goal, 
velocity | +| Repository | Git repos | provider, clone_url | +| KnowledgeDocument | RAG documents | content, embedding_id | + +### 5.3 Vector Storage + +**pgvector** extension for: +- Document embeddings (RAG) +- Semantic search across knowledge base +- Agent context similarity + +--- + +## 6. Integration Architecture + +### 6.1 MCP Server Registry + +| Server | Port | Purpose | Priority Providers | +|--------|------|---------|-------------------| +| LLM Gateway | 9001 | LLM routing | Anthropic, OpenAI, Ollama | +| Git MCP | 9002 | Git operations | Gitea, GitHub, GitLab | +| Knowledge Base | 9003 | RAG search | pgvector | +| Issues MCP | 9004 | Issue tracking | Gitea, GitHub, GitLab | +| File System | 9005 | Workspace files | Local FS | +| Code Analysis | 9006 | Static analysis | Ruff, ESLint | +| CI/CD MCP | 9007 | Pipelines | Gitea Actions | + +### 6.2 External Integration Diagram + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Syndarix Backend │ +│ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ MCP Client Manager │ │ +│ │ │ │ +│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ +│ │ │ LLM │ │ Git │ │ KB │ │ Issues │ │ CI/CD │ │ │ +│ │ │ Client │ │ Client │ │ Client │ │ Client │ │ Client │ │ │ +│ │ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │ │ +│ └──────│──────────│──────────│──────────│──────────│──────┘ │ +└─────────│──────────│──────────│──────────│──────────│──────────┘ + │ │ │ │ │ + │ SSE │ SSE │ SSE │ SSE │ SSE + ▼ ▼ ▼ ▼ ▼ + ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ + │ LLM │ │ Git │ │ KB │ │ Issues │ │ CI/CD │ + │ MCP │ │ MCP │ │ MCP │ │ MCP │ │ MCP │ + │ Server │ │ Server │ │ Server │ │ Server │ │ Server │ + └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ + │ │ │ │ │ + ▼ ▼ ▼ ▼ ▼ + ┌─────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ + │Anthropic│ │ Gitea │ │pgvector│ │ Gitea │ │ Gitea │ + │ OpenAI │ │ GitHub │ │ │ │ Issues │ │Actions │ + │ Ollama │ │ 
GitLab │ │ │ │ │ │ │ + └─────────┘ └────────┘ └────────┘ └────────┘ └────────┘ +``` + +--- + +## 7. Security Architecture + +### 7.1 Authentication + +- **JWT Dual-Token:** Access token (15 min) + Refresh token (7 days) +- **OAuth 2.0 Provider:** For MCP client authentication +- **Service Tokens:** Internal service-to-service auth + +### 7.2 Authorization + +- **RBAC:** Role-based access control +- **Project Scoping:** All operations scoped to projects +- **Agent Permissions:** Agents operate within project scope + +### 7.3 Data Protection + +- **TLS 1.3:** All external communications +- **Encryption at Rest:** Database encryption +- **Secrets Management:** Environment-based, never in code + +--- + +## 8. Deployment Architecture + +### 8.1 Container Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Docker Compose │ +├─────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Frontend │ │ Backend │ │ Workers │ │ Flower │ │ +│ │ (Next.js)│ │ (FastAPI)│ │ (Celery) │ │(Monitor) │ │ +│ │ :3000 │ │ :8000 │ │ │ │ :5555 │ │ +│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ LLM MCP │ │ Git MCP │ │ KB MCP │ │Issues MCP│ │ +│ │ :9001 │ │ :9002 │ │ :9003 │ │ :9004 │ │ +│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ FS MCP │ │ Code MCP │ │CI/CD MCP │ │ +│ │ :9005 │ │ :9006 │ │ :9007 │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Infrastructure │ │ +│ │ ┌──────────┐ ┌──────────┐ │ │ +│ │ │PostgreSQL│ │ Redis │ │ │ +│ │ │ :5432 │ │ :6379 │ │ │ +│ │ └──────────┘ └──────────┘ │ │ +│ └──────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 8.2 Scaling Strategy + +| 
Component | Scaling | Strategy | +|-----------|---------|----------| +| Frontend | Horizontal | Stateless, behind LB | +| Backend | Horizontal | Stateless, behind LB | +| Celery Workers | Horizontal | Queue-based routing | +| MCP Servers | Horizontal | Stateless singletons | +| PostgreSQL | Vertical + Read Replicas | Primary/replica | +| Redis | Cluster | Sentinel or Cluster mode | + +--- + +## 9. Cross-Cutting Concerns + +### 9.1 Logging + +- **Format:** Structured JSON +- **Correlation:** Request IDs across services +- **Levels:** DEBUG, INFO, WARNING, ERROR, CRITICAL + +### 9.2 Monitoring + +- **Metrics:** Prometheus-compatible export +- **Traces:** OpenTelemetry (future) +- **Dashboards:** Grafana (optional) + +### 9.3 Error Handling + +- **Agent Errors:** Logged, published via SSE +- **Task Failures:** Celery retry with backoff +- **Integration Errors:** Circuit breaker pattern + +--- + +## 10. Architecture Decisions + +### Summary of ADRs + +| ADR | Title | Status | +|-----|-------|--------| +| [ADR-001](../adrs/ADR-001-mcp-integration-architecture.md) | MCP Integration Architecture | Accepted | +| [ADR-002](../adrs/ADR-002-realtime-communication.md) | Real-time Communication | Accepted | +| [ADR-003](../adrs/ADR-003-background-task-architecture.md) | Background Task Architecture | Accepted | +| [ADR-004](../adrs/ADR-004-llm-provider-abstraction.md) | LLM Provider Abstraction | Accepted | +| [ADR-005](../adrs/ADR-005-tech-stack-selection.md) | Tech Stack Selection | Accepted | +| [ADR-006](../adrs/ADR-006-agent-orchestration.md) | Agent Orchestration | Accepted | + +### Key Decisions Summary + +1. **Unified Singleton MCP Servers** with project/agent scoping +2. **SSE for real-time events**, WebSocket only for chat +3. **Celery + Redis** for background tasks +4. **LiteLLM** for unified LLM abstraction with failover +5. **PragmaStack** as foundation with Syndarix extensions +6. 
**Type-Instance pattern** for agent orchestration + +--- + +## Appendix A: Technology Stack Quick Reference + +| Layer | Technology | +|-------|------------| +| Frontend | Next.js 16, React 19, TypeScript, Tailwind, shadcn/ui | +| Backend | FastAPI, Python 3.11+, SQLAlchemy 2.0, Pydantic 2.0 | +| Database | PostgreSQL 15+ with pgvector | +| Cache/Queue | Redis 7.0+ | +| Task Queue | Celery 5.3+ | +| MCP | FastMCP 2.0 | +| LLM | LiteLLM (Claude, GPT-4, Ollama) | +| Testing | pytest, Jest, Playwright | +| Container | Docker, Docker Compose | + +--- + +## Appendix B: Port Reference + +| Service | Port | +|---------|------| +| Frontend | 3000 | +| Backend | 8000 | +| PostgreSQL | 5432 | +| Redis | 6379 | +| Flower | 5555 | +| LLM MCP | 9001 | +| Git MCP | 9002 | +| KB MCP | 9003 | +| Issues MCP | 9004 | +| FS MCP | 9005 | +| Code MCP | 9006 | +| CI/CD MCP | 9007 | + +--- + +*This document provides the comprehensive architecture overview for Syndarix. For detailed decisions, see the individual ADRs.*