
# Syndarix Architecture

**Version:** 1.0
**Date:** 2025-12-29
**Status:** Approved

---

## Executive Summary

Syndarix is an autonomous AI-powered software consulting platform that orchestrates specialized AI agents to deliver complete software solutions. This document describes the chosen architecture, key decisions, and component interactions.

### Core Principles

1. **Self-Hostable First:** All components are fully self-hostable under permissive licenses (MIT, BSD, Apache 2.0, PostgreSQL License)
2. **Production-Ready:** Use battle-tested technologies, not experimental frameworks
3. **Hybrid Architecture:** Combine best-in-class tools rather than adopting a single monolithic framework
4. **Auditability:** Every agent action is logged and traceable
5. **Human-in-the-Loop:** Configurable autonomy with approval checkpoints

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              SYNDARIX PLATFORM                                   │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                         FRONTEND (Next.js 16)                             │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │   │
│  │  │ Dashboard  │  │  Project   │  │  Agent     │  │  Approval  │         │   │
│  │  │   Pages    │  │   Views    │  │  Monitor   │  │   Queue    │         │   │
│  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │   │
│  └──────────────────────────────────────────────────────────────────────────┘   │
│                                       │                                          │
│                          REST + SSE + WebSocket                                  │
│                                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                         BACKEND (FastAPI)                                 │   │
│  │                                                                           │   │
│  │  ┌─────────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    ORCHESTRATION LAYER                               │ │   │
│  │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌───────────┐  │ │   │
│  │  │  │   Agent     │  │  Workflow   │  │  Approval   │  │ LangGraph │  │ │   │
│  │  │  │ Orchestrator│  │   Engine    │  │   Service   │  │  Runtime  │  │ │   │
│  │  │  │(Type-Inst.) │  │(transitions)│  │             │  │           │  │ │   │
│  │  │  └─────────────┘  └─────────────┘  └─────────────┘  └───────────┘  │ │   │
│  │  └─────────────────────────────────────────────────────────────────────┘ │   │
│  │                                                                           │   │
│  │  ┌─────────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    INTEGRATION LAYER                                 │ │   │
│  │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │ │   │
│  │  │  │ LLM Gateway │  │  MCP Client │  │   Event     │                  │ │   │
│  │  │  │  (LiteLLM)  │  │   Manager   │  │    Bus      │                  │ │   │
│  │  │  └─────────────┘  └─────────────┘  └─────────────┘                  │ │   │
│  │  └─────────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────────┘   │
│                                       │                                          │
│           ┌───────────────────────────┼───────────────────────────┐             │
│           ▼                           ▼                           ▼             │
│  ┌────────────────┐          ┌────────────────┐          ┌────────────────┐    │
│  │   PostgreSQL   │          │     Redis      │          │  Celery Workers│    │
│  │   + pgvector   │          │  (Cache/Queue) │          │  (Background)  │    │
│  └────────────────┘          └────────────────┘          └────────────────┘    │
│                                       │                                          │
│                                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                         MCP SERVERS                                       │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │   │
│  │  │   LLM    │  │Knowledge │  │   Git    │  │  Issues  │  │   File   │   │   │
│  │  │ Gateway  │  │   Base   │  │   MCP    │  │   MCP    │  │  System  │   │   │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │   │
│  └──────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
```

---

## Key Architecture Decisions

### ADR Summary Matrix

| ADR | Decision | Key Technology |
|-----|----------|----------------|
| ADR-001 | MCP Integration | FastMCP 2.0, Unified Singletons |
| ADR-002 | Real-time Communication | SSE primary, WebSocket for chat |
| ADR-003 | Background Tasks | Celery + Redis |
| ADR-004 | LLM Provider | LiteLLM with failover |
| ADR-005 | Tech Stack | PragmaStack + extensions |
| ADR-006 | Agent Orchestration | Type-Instance pattern |
| ADR-007 | Framework Selection | Hybrid (LangGraph + transitions + Celery) |
| ADR-008 | Knowledge Base | pgvector for RAG |
| ADR-009 | Agent Communication | Structured messages + Redis Streams |
| ADR-010 | Workflows | transitions + PostgreSQL + Celery |
| ADR-011 | Issue Sync | Webhook-first + polling fallback |
| ADR-012 | Cost Tracking | LiteLLM callbacks + Redis budgets |
| ADR-013 | Audit Logging | Structlog + hash chaining |
| ADR-014 | Client Approval | Checkpoint-based + notifications |

---

## Component Deep Dives

### 1. Agent Orchestration

**Pattern:** Type-Instance

- **Agent Types:** Templates defining model, expertise, personality, capabilities
- **Agent Instances:** Runtime instances spawned from types, assigned to projects
- **Orchestrator:** Manages lifecycle, routing, and resource tracking

```
Agent Type (Template)              Agent Instance (Runtime)
┌─────────────────────┐            ┌─────────────────────┐
│ name: "Engineer"    │───spawn───▶│ id: "eng-001"       │
│ model: "sonnet"     │            │ name: "Dave"        │
│ expertise: [py, js] │            │ project: "proj-123" │
│ capabilities: [...]  │            │ context: {...}      │
└─────────────────────┘            │ status: ACTIVE      │
                                   └─────────────────────┘
```
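
The Type-Instance split can be sketched with two dataclasses and a small orchestrator. This is a minimal illustration, not the Syndarix implementation: the class names, the `spawn` signature, and the ID scheme (type prefix plus counter) are assumptions chosen to match the diagram above.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class AgentStatus(Enum):
    ACTIVE = "active"
    IDLE = "idle"


@dataclass(frozen=True)
class AgentType:
    """Immutable template shared by every instance spawned from it."""
    name: str
    model: str
    expertise: tuple[str, ...]
    capabilities: tuple[str, ...] = ()


@dataclass
class AgentInstance:
    """Runtime agent bound to exactly one project."""
    id: str
    name: str
    agent_type: AgentType
    project: str
    context: dict = field(default_factory=dict)
    status: AgentStatus = AgentStatus.ACTIVE


class Orchestrator:
    """Tracks live instances; lifecycle and routing are omitted here."""

    def __init__(self) -> None:
        self._instances: dict[str, AgentInstance] = {}
        self._seq = count(1)

    def spawn(self, agent_type: AgentType, name: str, project: str) -> AgentInstance:
        # Hypothetical ID scheme: first three letters of the type + counter.
        instance_id = f"{agent_type.name[:3].lower()}-{next(self._seq):03d}"
        instance = AgentInstance(instance_id, name, agent_type, project)
        self._instances[instance_id] = instance
        return instance

    def instances_for(self, project: str) -> list[AgentInstance]:
        return [i for i in self._instances.values() if i.project == project]


engineer = AgentType("Engineer", "sonnet", ("py", "js"))
orchestrator = Orchestrator()
dave = orchestrator.spawn(engineer, name="Dave", project="proj-123")
```

Keeping `AgentType` frozen means a template cannot be mutated by one of its instances, while per-project state lives only on `AgentInstance`.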

### 2. LLM Gateway (LiteLLM)

**Failover Chain:**
```
Claude Opus 4.5 (Primary)
         │
         ▼ (on failure/rate limit)
    GPT 5.1 Codex max (Code specialist)
         │
         ▼ (on failure/rate limit)
    Gemini 3 Pro (Multimodal)
         │
         ▼ (on failure)
    Qwen3-235B / DeepSeek V3.2 (Self-hosted)
```

**Model Groups:**
| Group | Use Case | Primary Model | Fallback |
|-------|----------|---------------|----------|
| high-reasoning | Architecture, complex analysis | Claude Opus 4.5 | GPT 5.1 Codex max |
| code-generation | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 |
| fast-response | Quick tasks, status updates | Gemini 3 Flash | Qwen3-235B |
| cost-optimized | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 |
| self-hosted | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |
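
The group-based failover above can be illustrated in plain Python. This sketch deliberately does not use the LiteLLM API: `complete_with_failover`, `ProviderError`, and `flaky_provider` are hypothetical names, and the provider call is injected as a function so the example runs without network access.

```python
from typing import Callable

# Group -> ordered candidates, copied from the Model Groups table.
MODEL_GROUPS: dict[str, list[str]] = {
    "high-reasoning": ["Claude Opus 4.5", "GPT 5.1 Codex max"],
    "code-generation": ["GPT 5.1 Codex max", "Claude Opus 4.5"],
    "fast-response": ["Gemini 3 Flash", "Qwen3-235B"],
    "cost-optimized": ["Qwen3-235B", "DeepSeek V3.2"],
    "self-hosted": ["DeepSeek V3.2", "Qwen3-235B"],
}


class ProviderError(Exception):
    """Hypothetical stand-in for a provider failure or rate limit."""


def complete_with_failover(
    group: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in the group in order; return (model_used, reply)."""
    errors = []
    for model in MODEL_GROUPS[group]:
        try:
            return model, call_model(model, prompt)
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")
    raise ProviderError("all models failed: " + "; ".join(errors))


def flaky_provider(model: str, prompt: str) -> str:
    # Fake provider for the demo: the primary model is always rate limited.
    if model == "Claude Opus 4.5":
        raise ProviderError("rate limited")
    return f"[{model}] {prompt}"


used_model, reply = complete_with_failover(
    "high-reasoning", "Review this design", flaky_provider
)
```

In the demo the primary model fails, so the request falls through to the group's fallback and the caller learns which model actually answered.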

### 3. Knowledge Base (RAG)

**Stack:** pgvector + LiteLLM embeddings

**Chunking Strategy:**
| Content | Strategy | Model |
|---------|----------|-------|
| Code | AST-based (function/class) | voyage-code-3 |
| Docs | Heading-based | text-embedding-3-small |
| Conversations | Turn-based | text-embedding-3-small |

**Search:** Hybrid (70% vector + 30% keyword)
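
One plausible reading of the 70/30 split is a weighted sum of two normalized scores. The sketch below assumes both the vector similarity and the keyword score are already scaled to [0, 1]; the function names and the tuple layout are illustrative, and the actual fusion method is defined in ADR-008.

```python
def hybrid_score(vector_sim: float, keyword_score: float,
                 vector_weight: float = 0.7) -> float:
    """Weighted sum of two scores that are assumed to lie in [0, 1]."""
    return vector_weight * vector_sim + (1.0 - vector_weight) * keyword_score


def rank(candidates: list[tuple[str, float, float]]) -> list[str]:
    """Sort (doc_id, vector_sim, keyword_score) tuples by hybrid score, best first."""
    ordered = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)
    return [doc_id for doc_id, _, _ in ordered]


ranking = rank([
    ("c", 0.6, 0.0),   # decent vector match, no keyword hit
    ("a", 0.9, 0.2),   # strong vector match
    ("b", 0.4, 1.0),   # weak vector match, exact keyword hit
])
```

With these inputs an exact keyword hit lifts `b` above `c` even though `c` has the better vector similarity, which is the behavior hybrid search is meant to provide.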

### 4. Workflow Engine

**Stack:** transitions library + PostgreSQL + Celery

**Core Workflows:**
- **Sprint Workflow:** planning → active → review → done
- **Story Workflow:** analysis → design → implementation → review → testing → done
- **PR Workflow:** submitted → reviewing → changes_requested → approved → merged

**Durability:** Event sourcing with state persistence to PostgreSQL
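
The durability idea (persist every transition, rebuild state by replaying them) can be shown with a stdlib-only stand-in. This sketch does not use the `transitions` library or PostgreSQL: the `StoryWorkflow` class is hypothetical, and an in-memory list plays the role of the event table.

```python
from dataclasses import dataclass, field

# Story Workflow states, in the order listed above.
STORY_STATES = ["analysis", "design", "implementation", "review", "testing", "done"]


class InvalidTransition(Exception):
    pass


@dataclass
class StoryWorkflow:
    state: str = STORY_STATES[0]
    # In-memory stand-in for a PostgreSQL event table: (source, dest) pairs.
    events: list[tuple[str, str]] = field(default_factory=list)

    def advance(self) -> str:
        """Move to the next state and record the transition as an event."""
        idx = STORY_STATES.index(self.state)
        if idx == len(STORY_STATES) - 1:
            raise InvalidTransition("story is already done")
        dest = STORY_STATES[idx + 1]
        self.events.append((self.state, dest))
        self.state = dest
        return dest

    @classmethod
    def replay(cls, events: list[tuple[str, str]]) -> "StoryWorkflow":
        """Rebuild a workflow from persisted events, validating each step."""
        wf = cls()
        for source, dest in events:
            if source != wf.state or wf.advance() != dest:
                raise InvalidTransition(f"event {source}->{dest} does not fit the log")
        return wf


story = StoryWorkflow()
story.advance()  # analysis -> design
story.advance()  # design -> implementation
restored = StoryWorkflow.replay(story.events)
```

After a worker restart, replaying the stored events yields the same state the workflow had before the crash, which is what makes long-running workflows safe to resume.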

### 5. Real-time Communication

**SSE (90% of use cases):**
- Agent activity streams
- Project progress updates
- Approval notifications
- Issue change notifications

**WebSocket (10% of use cases, bidirectional):**
- Interactive chat with agents
- Real-time debugging

**Event Bus:** Redis Pub/Sub for cross-instance distribution
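
For reference, an SSE message on the wire is a few `field: value` lines terminated by a blank line. The helper below is a minimal sketch of that framing; `format_sse` and the `agent.activity` event name are assumptions, not the platform's actual API.

```python
import json
from typing import Any, Optional


def format_sse(event: str, payload: Any, event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        # Lets a reconnecting client resume via the Last-Event-ID header.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps keeps the payload on one line; a raw newline would
    # otherwise terminate the data field early.
    lines.append(f"data: {json.dumps(payload)}")
    # A blank line marks the end of the frame.
    return "\n".join(lines) + "\n\n"


frame = format_sse("agent.activity", {"agent_id": "eng-001"}, event_id="42")
```

A streaming endpoint would yield such frames one at a time as events arrive from the event bus.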

### 6. Issue Synchronization

**Architecture:** Webhook-first + polling fallback

**Supported Providers:**
- Gitea (primary)
- GitHub
- GitLab

**Conflict Resolution:** Last-Writer-Wins with version vectors
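
One way to combine the two ideas: version vectors detect whether one copy causally supersedes the other, and Last-Writer-Wins by timestamp breaks the tie only when the edits are concurrent. This interpretation, the record layout, and the names `compare` and `resolve` are assumptions; ADR-011 is the authority on the actual rules.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two version vectors: 'a', 'b', 'equal', or 'concurrent'."""
    keys = a.keys() | b.keys()
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "a"
    if b_ahead:
        return "b"
    return "equal"


def resolve(local: dict, remote: dict) -> dict:
    """Pick the surviving copy of an issue."""
    order = compare(local["version"], remote["version"])
    if order in ("a", "equal"):
        return local
    if order == "b":
        return remote
    # Concurrent edits: fall back to Last-Writer-Wins on the timestamp.
    return local if local["updated_at"] >= remote["updated_at"] else remote


local = {"title": "Fix login", "version": {"syndarix": 2, "gitea": 1}, "updated_at": 100}
remote = {"title": "Fix login bug", "version": {"syndarix": 1, "gitea": 2}, "updated_at": 120}
winner = resolve(local, remote)
```

Here each side has an update the other has not seen, so the vectors are concurrent and the later timestamp decides.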

### 7. Cost Tracking

**Real-time Pipeline:**
```
LLM Request → LiteLLM Callback → Redis INCR → Budget Check
                    │
              Async Queue → PostgreSQL → SSE Dashboard Update
```

**Budget Enforcement:**
- Soft limits: Alerts + model downgrade
- Hard limits: Block requests
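
The soft/hard limit check can be sketched as follows. The `ProjectBudget` class is hypothetical, an in-memory integer stands in for the Redis counter, and amounts are assumed to be integer cost units (for example micro-USD) so that they fit an integer increment.

```python
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"   # soft limit reached: alert + cheaper model
    BLOCK = "block"           # hard limit reached: reject the request


class ProjectBudget:
    def __init__(self, soft_limit: int, hard_limit: int) -> None:
        if soft_limit > hard_limit:
            raise ValueError("soft_limit must not exceed hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.spent = 0  # in-memory stand-in for the Redis counter

    def record(self, cost: int) -> None:
        """Add the cost of a finished LLM call (the callback step)."""
        self.spent += cost

    def check(self) -> BudgetAction:
        """Decide what to do with the next request."""
        if self.spent >= self.hard_limit:
            return BudgetAction.BLOCK
        if self.spent >= self.soft_limit:
            return BudgetAction.DOWNGRADE
        return BudgetAction.ALLOW


budget = ProjectBudget(soft_limit=80, hard_limit=100)
before = budget.check()
budget.record(85)
after_soft = budget.check()
budget.record(20)
after_hard = budget.check()
```

Checking the hard limit first ensures a project that has blown through both thresholds is blocked rather than merely downgraded.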

### 8. Audit Logging

**Tamper Evidence:** SHA-256 hash chaining (each entry's hash covers the previous entry's hash)

**Storage Tiers:**
| Tier | Storage | Retention |
|------|---------|-----------|
| Hot | PostgreSQL | 0-90 days |
| Cold | S3/MinIO | 90+ days |
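
Hash chaining can be demonstrated with the standard library alone. In this sketch the entry layout, the all-zero genesis value, and the function names are assumptions; the point is only that each hash covers the previous one, so editing any entry invalidates everything after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list[dict], event: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev, "hash": entry_hash(prev, event)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain from the start; any edit breaks a link."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


audit_log: list[dict] = []
append(audit_log, {"agent": "eng-001", "action": "create_branch"})
append(audit_log, {"agent": "eng-001", "action": "open_pr"})
intact = verify(audit_log)

audit_log[0]["event"]["action"] = "delete_branch"  # simulate tampering
tampered = verify(audit_log)
```

Sorting keys before hashing keeps the digest stable regardless of dictionary insertion order.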

### 9. Client Approval Flow

**Autonomy Levels:**
| Level | Description |
|-------|-------------|
| FULL_CONTROL | Approve every action |
| MILESTONE | Approve at sprint boundaries |
| AUTONOMOUS | Approve only critical decisions |

**Notifications:** SSE + Email + Mobile Push
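
The autonomy levels map naturally onto a small decision function. The sketch below is illustrative: `ActionKind` and its three categories are hypothetical, and it assumes the levels nest (MILESTONE also gates critical decisions), which the table does not state explicitly.

```python
from enum import Enum


class AutonomyLevel(Enum):
    FULL_CONTROL = "full_control"
    MILESTONE = "milestone"
    AUTONOMOUS = "autonomous"


class ActionKind(Enum):
    """Hypothetical classification of agent actions."""
    ROUTINE = "routine"
    SPRINT_BOUNDARY = "sprint_boundary"
    CRITICAL = "critical"


def requires_approval(level: AutonomyLevel, kind: ActionKind) -> bool:
    if level is AutonomyLevel.FULL_CONTROL:
        return True  # every action is gated
    if level is AutonomyLevel.MILESTONE:
        # Assumption: critical decisions are gated here as well.
        return kind in (ActionKind.SPRINT_BOUNDARY, ActionKind.CRITICAL)
    return kind is ActionKind.CRITICAL  # AUTONOMOUS


needs_signoff = requires_approval(AutonomyLevel.MILESTONE, ActionKind.SPRINT_BOUNDARY)
```

How a concrete action such as a PR merge is classified is a policy decision that belongs in ADR-014, not in this helper.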

---

## Technology Stack

### Core Technologies

| Layer | Technology | Version | License |
|-------|------------|---------|---------|
| Backend | FastAPI | 0.115+ | MIT |
| Frontend | Next.js | 16 | MIT |
| Database | PostgreSQL + pgvector | 15+ | PostgreSQL |
| Cache/Queue | Redis | 7.0+ | BSD-3 through 7.2; RSALv2/SSPLv1 from 7.4 |
| Task Queue | Celery | 5.3+ | BSD-3 |
| LLM Gateway | LiteLLM | Latest | MIT |
| MCP Framework | FastMCP | 2.0+ | MIT |

### Self-Hostability Guarantee

**All components are fully self-hostable with no mandatory subscriptions:**

| Component | License | Self-Hosted | Managed Alternative (Optional) |
|-----------|---------|-------------|--------------------------------|
| PostgreSQL | PostgreSQL | Yes | RDS, Neon, Supabase |
| Redis | BSD-3 (through 7.2) | Yes | Redis Cloud |
| LiteLLM | MIT | Yes | LiteLLM Enterprise |
| Celery | BSD-3 | Yes | - |
| FastMCP | MIT | Yes | - |
| LangGraph | MIT | Yes | LangSmith (observability only) |
| transitions | MIT | Yes | - |
| DeepSeek V3.2 | MIT | Yes | API available |
| Qwen3-235B | Apache 2.0 | Yes | Alibaba Cloud |

---

## Data Flow Diagrams

### Agent Task Execution

```
1. Client creates story in Syndarix
         │
         ▼
2. Story workflow transitions to "implementation"
         │
         ▼
3. Agent Orchestrator spawns Engineer instance
         │
         ▼
4. Engineer queries Knowledge Base (RAG)
         │
         ▼
5. Engineer calls LLM Gateway for code generation
         │
         ▼
6. Engineer calls Git MCP to create branch & commit
         │
         ▼
7. Engineer creates PR via Git MCP
         │
         ▼
8. Workflow transitions to "review"
         │
         ▼
9. If autonomy_level != AUTONOMOUS:
   └── Approval request created
   └── Client notified via SSE + email
         │
         ▼
10. Client approves → PR merged → Workflow to "testing"
```

### Real-time Event Flow

```
Agent Action
     │
     ▼
Event Bus (Redis Pub/Sub)
     │
     ├──▶ SSE Endpoint ──▶ Frontend Dashboard
     │
     ├──▶ Audit Logger ──▶ PostgreSQL
     │
     └──▶ Other Backend Instances (horizontal scaling)
```

---

## Security Architecture

### Authentication Flow

- **Users:** JWT dual-token (access + refresh) via PragmaStack
- **Agents:** Service tokens for MCP communication
- **MCP Servers:** Internal network only, validated service tokens

### Multi-Tenancy

- **Project Isolation:** All queries scoped by project_id
- **Row-Level Security:** PostgreSQL RLS for knowledge base
- **Agent Scoping:** Every MCP tool requires project_id + agent_id
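
Agent scoping can be enforced at the tool boundary so that no handler runs without both identifiers. The decorator below is a sketch under assumptions: `require_scope`, `ScopeError`, and the `search_knowledge` tool are hypothetical, and it is independent of how FastMCP registers tools.

```python
from functools import wraps
from typing import Any, Callable, Optional


class ScopeError(Exception):
    """Raised when a tool call lacks its tenant scope."""


def require_scope(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Reject calls that do not carry both project_id and agent_id."""

    @wraps(tool)
    def wrapper(*, project_id: Optional[str] = None,
                agent_id: Optional[str] = None, **kwargs: Any) -> Any:
        if not project_id or not agent_id:
            raise ScopeError(f"{tool.__name__} requires project_id and agent_id")
        return tool(project_id=project_id, agent_id=agent_id, **kwargs)

    return wrapper


@require_scope
def search_knowledge(*, project_id: str, agent_id: str, query: str) -> dict:
    # A real handler would add "WHERE project_id = ..." to every query.
    return {"project_id": project_id, "agent_id": agent_id, "query": query}


result = search_knowledge(project_id="proj-123", agent_id="eng-001", query="auth flow")
```

Keyword-only parameters make it impossible to pass the scope positionally by accident, and the check fails closed when either value is missing or empty.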

### Audit Trail

- **Hash Chaining:** Tamper-evident event log
- **Complete Coverage:** All agent actions, LLM calls, MCP tool invocations

---

## Scalability Considerations

### Horizontal Scaling

| Component | Scaling Strategy |
|-----------|-----------------|
| FastAPI | Multiple instances behind load balancer |
| Celery Workers | Add workers per queue as needed |
| PostgreSQL | Read replicas, connection pooling |
| Redis | Cluster mode for high availability |

### Expected Scale

| Metric | Target |
|--------|--------|
| Concurrent Projects | 50+ |
| Concurrent Agent Instances | 200+ |
| Background Jobs/minute | 500+ |
| SSE Connections | 200+ |

---

## Deployment Architecture

### Local Development

```
docker-compose up
├── PostgreSQL (+ pgvector)
├── Redis
├── FastAPI Backend
├── Next.js Frontend
├── Celery Workers (agent, git, sync queues)
├── Celery Beat (scheduler)
├── Flower (monitoring)
└── MCP Servers (7 containers)
```

### Production

```
┌─────────────────────────────────────────────────────────────────┐
│                        Load Balancer                             │
└─────────────────────────────┬───────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  API Instance 1 │  │  API Instance 2 │  │  API Instance N │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   PostgreSQL    │  │  Redis Cluster  │  │  Celery Workers │
│   (Primary +    │  │                 │  │  (Auto-scaled)  │
│    Replicas)    │  │                 │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

---

## Related Documents

- [Implementation Roadmap](./IMPLEMENTATION_ROADMAP.md)
- [Architecture Deep Analysis](./ARCHITECTURE_DEEP_ANALYSIS.md)
- [ADRs](../adrs/) - All architecture decision records
- [Spikes](../spikes/) - Research documents

---

## Appendix: Full ADR List

1. [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md)
2. [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md)
3. [ADR-003: Background Task Architecture](../adrs/ADR-003-background-task-architecture.md)
4. [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md)
5. [ADR-005: Technology Stack Selection](../adrs/ADR-005-tech-stack-selection.md)
6. [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md)
7. [ADR-007: Agentic Framework Selection](../adrs/ADR-007-agentic-framework-selection.md)
8. [ADR-008: Knowledge Base and RAG](../adrs/ADR-008-knowledge-base-rag.md)
9. [ADR-009: Agent Communication Protocol](../adrs/ADR-009-agent-communication-protocol.md)
10. [ADR-010: Workflow State Machine](../adrs/ADR-010-workflow-state-machine.md)
11. [ADR-011: Issue Synchronization](../adrs/ADR-011-issue-synchronization.md)
12. [ADR-012: Cost Tracking](../adrs/ADR-012-cost-tracking.md)
13. [ADR-013: Audit Logging](../adrs/ADR-013-audit-logging.md)
14. [ADR-014: Client Approval Flow](../adrs/ADR-014-client-approval-flow.md)

---

*This document serves as the authoritative architecture reference for Syndarix.*