Felipe Cardoso 88cf4e0abc feat: Update to production model stack and fix remaining inconsistencies
## Model Stack Updates (User's Actual Models)

Updated all documentation to reflect production models:
- Claude Opus 4.5 (primary reasoning)
- GPT 5.1 Codex max (code generation specialist)
- Gemini 3 Pro/Flash (multimodal, fast inference)
- Qwen3-235B (cost-effective, self-hostable)
- DeepSeek V3.2 (self-hosted, open weights)

### Files Updated:
- ADR-004: Full model groups, failover chains, cost tables
- ADR-007: Code example with correct model identifiers
- ADR-012: Cost tracking with new model prices
- ARCHITECTURE.md: Model groups, failover diagram
- IMPLEMENTATION_ROADMAP.md: External services list

## Architecture Diagram Updates

- Added LangGraph Runtime to orchestration layer
- Added technology labels (Type-Instance, transitions)

## Self-Hostability Table Expanded

Added entries for:
- LangGraph (MIT)
- transitions (MIT)
- DeepSeek V3.2 (MIT)
- Qwen3-235B (Apache 2.0)

## Metric Alignments

- Response time: Split into API (<200ms) and Agent (<10s/<60s)
- Cost per project: Adjusted to $100/sprint for Opus 4.5 pricing
- Added concurrent projects (10+) and agents (50+) metrics

## Infrastructure Updates

- Celery workers: 4-8 instances (was 2-4) across 4 queues
- MCP servers: Clarified Phase 2 + Phase 5 deployment
- Sync interval: Clarified 60s fallback + 15min reconciliation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-29 23:35:51 +01:00


# Syndarix Architecture
**Version:** 1.0
**Date:** 2025-12-29
**Status:** Approved
---
## Executive Summary
Syndarix is an autonomous AI-powered software consulting platform that orchestrates specialized AI agents to deliver complete software solutions. This document describes the chosen architecture, key decisions, and component interactions.
### Core Principles
1. **Self-Hostable First:** All components are fully self-hostable under permissive licenses (MIT, BSD-3, Apache 2.0, PostgreSQL)
2. **Production-Ready:** Use battle-tested technologies, not experimental frameworks
3. **Hybrid Architecture:** Combine best-in-class tools rather than monolithic frameworks
4. **Auditability:** Every agent action is logged and traceable
5. **Human-in-the-Loop:** Configurable autonomy with approval checkpoints
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ SYNDARIX PLATFORM │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ FRONTEND (Next.js 16) │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Dashboard │ │ Project │ │ Agent │ │ Approval │ │ │
│ │ │ Pages │ │ Views │ │ Monitor │ │ Queue │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ REST + SSE + WebSocket │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ BACKEND (FastAPI) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ ORCHESTRATION LAYER │ │ │
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │ │ │
│ │ │ │ Agent │ │ Workflow │ │ Approval │ │ LangGraph │ │ │ │
│ │ │ │ Orchestrator│ │ Engine │ │ Service │ │ Runtime │ │ │ │
│ │ │ │(Type-Inst.) │ │(transitions)│ │ │ │ │ │ │ │
│ │ │ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ INTEGRATION LAYER │ │ │
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │
│ │ │ │ LLM Gateway │ │ MCP Client │ │ Event │ │ │ │
│ │ │ │ (LiteLLM) │ │ Manager │ │ Bus │ │ │ │
│ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Celery Workers│ │
│ │ + pgvector │ │ (Cache/Queue) │ │ (Background) │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ MCP SERVERS │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ LLM │ │Knowledge │ │ Git │ │ Issues │ │ File │ │ │
│ │ │ Gateway │ │ Base │ │ MCP │ │ MCP │ │ System │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
```
---
## Key Architecture Decisions
### ADR Summary Matrix
| ADR | Decision | Key Technology |
|-----|----------|----------------|
| ADR-001 | MCP Integration | FastMCP 2.0, Unified Singletons |
| ADR-002 | Real-time Communication | SSE primary, WebSocket for chat |
| ADR-003 | Background Tasks | Celery + Redis |
| ADR-004 | LLM Provider | LiteLLM with failover |
| ADR-005 | Tech Stack | PragmaStack + extensions |
| ADR-006 | Agent Orchestration | Type-Instance pattern |
| ADR-007 | Framework Selection | Hybrid (LangGraph + transitions + Celery) |
| ADR-008 | Knowledge Base | pgvector for RAG |
| ADR-009 | Agent Communication | Structured messages + Redis Streams |
| ADR-010 | Workflows | transitions + PostgreSQL + Celery |
| ADR-011 | Issue Sync | Webhook-first + polling fallback |
| ADR-012 | Cost Tracking | LiteLLM callbacks + Redis budgets |
| ADR-013 | Audit Logging | Structlog + hash chaining |
| ADR-014 | Client Approval | Checkpoint-based + notifications |
---
## Component Deep Dives
### 1. Agent Orchestration
**Pattern:** Type-Instance
- **Agent Types:** Templates defining model, expertise, personality, capabilities
- **Agent Instances:** Runtime instances spawned from types, assigned to projects
- **Orchestrator:** Manages lifecycle, routing, and resource tracking
```
Agent Type (Template) Agent Instance (Runtime)
┌─────────────────────┐ ┌─────────────────────┐
│ name: "Engineer" │───spawn───▶│ id: "eng-001" │
│ model: "sonnet" │ │ name: "Dave" │
│ expertise: [py, js] │ │ project: "proj-123" │
│ capabilities: [...] │ │ context: {...} │
└─────────────────────┘ │ status: ACTIVE │
└─────────────────────┘
```
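
The Type-Instance split above can be sketched with two dataclasses and a small orchestrator. This is a minimal illustration, not the Syndarix implementation: the `Orchestrator.spawn` signature, the `AgentStatus` values other than `ACTIVE`, and the id scheme (`<type prefix>-<sequence>`) are assumptions chosen to reproduce the values in the diagram.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class AgentStatus(Enum):
    ACTIVE = "active"
    IDLE = "idle"  # hypothetical extra state, not named in this document


@dataclass(frozen=True)
class AgentType:
    """Template shared by every instance spawned from it."""
    name: str
    model: str
    expertise: tuple[str, ...]
    capabilities: tuple[str, ...] = ()


@dataclass
class AgentInstance:
    """Runtime agent bound to exactly one project."""
    id: str
    name: str
    agent_type: AgentType
    project: str
    context: dict = field(default_factory=dict)
    status: AgentStatus = AgentStatus.ACTIVE


class Orchestrator:
    """Hypothetical orchestrator: spawns instances and tracks them by id."""

    def __init__(self) -> None:
        self._instances: dict[str, AgentInstance] = {}
        self._seq = count(1)

    def spawn(self, agent_type: AgentType, name: str, project: str) -> AgentInstance:
        # Illustrative id scheme: first three letters of the type plus a counter.
        instance_id = f"{agent_type.name[:3].lower()}-{next(self._seq):03d}"
        instance = AgentInstance(instance_id, name, agent_type, project)
        self._instances[instance_id] = instance
        return instance


engineer = AgentType(name="Engineer", model="sonnet", expertise=("py", "js"))
dave = Orchestrator().spawn(engineer, name="Dave", project="proj-123")
```

Keeping `AgentType` frozen means a template cannot be mutated through one of its instances, so per-project state lives only on `AgentInstance`.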
### 2. LLM Gateway (LiteLLM)
**Failover Chain:**
```
Claude Opus 4.5 (Primary)
▼ (on failure/rate limit)
GPT 5.1 Codex max (Code specialist)
▼ (on failure/rate limit)
Gemini 3 Pro (Multimodal)
▼ (on failure)
Qwen3-235B / DeepSeek V3.2 (Self-hosted)
```
**Model Groups:**
| Group | Use Case | Primary Model | Fallback |
|-------|----------|---------------|----------|
| high-reasoning | Architecture, complex analysis | Claude Opus 4.5 | GPT 5.1 Codex max |
| code-generation | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 |
| fast-response | Quick tasks, status updates | Gemini 3 Flash | Qwen3-235B |
| cost-optimized | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 |
| self-hosted | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |
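
The model groups above amount to an ordered candidate list per group. The sketch below shows that routing logic in plain Python so it runs without any provider credentials. It does not use the LiteLLM API. `ModelUnavailable`, `complete_with_failover`, and `fake_call` are hypothetical names introduced for illustration.

```python
from typing import Callable

# Group -> ordered candidates (primary first), copied from the table above.
MODEL_GROUPS: dict[str, list[str]] = {
    "high-reasoning": ["Claude Opus 4.5", "GPT 5.1 Codex max"],
    "code-generation": ["GPT 5.1 Codex max", "Claude Opus 4.5"],
    "fast-response": ["Gemini 3 Flash", "Qwen3-235B"],
    "cost-optimized": ["Qwen3-235B", "DeepSeek V3.2"],
    "self-hosted": ["DeepSeek V3.2", "Qwen3-235B"],
}


class ModelUnavailable(Exception):
    """Hypothetical error a provider call raises on failure or rate limiting."""


def complete_with_failover(
    group: str, prompt: str, call_model: Callable[[str, str], str]
) -> tuple[str, str]:
    """Try each model in the group in order; return (model_used, response)."""
    errors = []
    for model in MODEL_GROUPS[group]:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError(f"all models in group {group!r} failed: {errors}")


def fake_call(model: str, prompt: str) -> str:
    # Stand-in provider: the primary is rate limited, the fallback answers.
    if model == "Claude Opus 4.5":
        raise ModelUnavailable("rate limited")
    return f"[{model}] {prompt}"


used, reply = complete_with_failover("high-reasoning", "Review this design", fake_call)
```

Returning the model that actually answered lets the caller attribute cost to the right model when a fallback was used.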
### 3. Knowledge Base (RAG)
**Stack:** pgvector + LiteLLM embeddings
**Chunking Strategy:**
| Content | Strategy | Model |
|---------|----------|-------|
| Code | AST-based (function/class) | voyage-code-3 |
| Docs | Heading-based | text-embedding-3-small |
| Conversations | Turn-based | text-embedding-3-small |
**Search:** Hybrid (70% vector + 30% keyword)
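
One plausible reading of the 70/30 split is a weighted sum of a vector-similarity score and a keyword score, both normalized to [0, 1]. The sketch below assumes that reading. The function names and the sample chunk ids are illustrative, and the actual scoring functions (for example cosine similarity and a full-text rank) are not specified here.

```python
def hybrid_score(vector_score: float, keyword_score: float,
                 vector_weight: float = 0.7) -> float:
    """Blend two scores that are already normalized to [0, 1]."""
    return vector_weight * vector_score + (1.0 - vector_weight) * keyword_score


def rank(candidates: dict[str, tuple[float, float]]) -> list[str]:
    """Sort chunk ids by blended score, best first."""
    return sorted(candidates, key=lambda cid: hybrid_score(*candidates[cid]),
                  reverse=True)


# chunk id -> (vector score, keyword score); values are made up for the example.
chunks = {
    "auth.py::login": (0.90, 0.20),  # strong semantic match, weak keyword match
    "README#setup": (0.40, 1.00),    # weak semantic match, exact keyword match
    "chat-turn-17": (0.30, 0.10),
}
ranking = rank(chunks)
```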
### 4. Workflow Engine
**Stack:** transitions library + PostgreSQL + Celery
**Core Workflows:**
- **Sprint Workflow:** planning → active → review → done
- **Story Workflow:** analysis → design → implementation → review → testing → done
- **PR Workflow:** submitted → reviewing → changes_requested → approved → merged
**Durability:** Event sourcing with state persistence to PostgreSQL
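
To make the state machine and event-sourcing idea concrete, here is a dependency-free sketch of the PR workflow. It stands in for the `transitions` library and PostgreSQL with a plain transition table and an in-memory event list. The trigger names and the `resubmit` edge from `changes_requested` back to `reviewing` are assumptions, since the workflow is listed above only as a linear sequence of states.

```python
class InvalidTransition(Exception):
    """Raised when a trigger is not allowed from the current state."""


# (state, trigger) -> next state. Trigger names are illustrative.
PR_TRANSITIONS = {
    ("submitted", "start_review"): "reviewing",
    ("reviewing", "request_changes"): "changes_requested",
    ("changes_requested", "resubmit"): "reviewing",  # assumed loop back
    ("reviewing", "approve"): "approved",
    ("approved", "merge"): "merged",
}


class PRWorkflow:
    def __init__(self, state: str = "submitted") -> None:
        self.state = state
        self.events: list[tuple[str, str, str]] = []  # (from, trigger, to)

    def fire(self, trigger: str) -> str:
        try:
            target = PR_TRANSITIONS[(self.state, trigger)]
        except KeyError:
            raise InvalidTransition(
                f"{trigger!r} not allowed from {self.state!r}") from None
        # Append-only log: in production this row would go to PostgreSQL.
        self.events.append((self.state, trigger, target))
        self.state = target
        return target

    @classmethod
    def replay(cls, events: list[tuple[str, str, str]]) -> "PRWorkflow":
        """Rebuild a workflow by re-applying its recorded triggers."""
        workflow = cls()
        for _, trigger, _ in events:
            workflow.fire(trigger)
        return workflow


pr = PRWorkflow()
for step in ["start_review", "request_changes", "resubmit", "approve", "merge"]:
    pr.fire(step)
restored = PRWorkflow.replay(pr.events)
```

Because the current state is derived from the event list, a worker that crashes mid-workflow can recover by replaying the stored events.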
### 5. Real-time Communication
**SSE (90% of use cases):**
- Agent activity streams
- Project progress updates
- Approval notifications
- Issue change notifications
**WebSocket (10% - bidirectional):**
- Interactive chat with agents
- Real-time debugging
**Event Bus:** Redis Pub/Sub for cross-instance distribution
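
For readers unfamiliar with SSE, each message on the wire is a small text frame terminated by a blank line. The helper below sketches how an event taken off the bus could be serialized. `format_sse`, the `agent.activity` event name, and the payload fields are hypothetical. Only the `id:`, `event:`, and `data:` field names come from the SSE format itself.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Serialize one event as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        # Clients send the last seen id back on reconnect (Last-Event-ID).
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # The format requires one "data:" line per line of payload.
    for line in json.dumps(data).splitlines():
        lines.append(f"data: {line}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


frame = format_sse(
    "agent.activity",
    {"agent": "eng-001", "status": "ACTIVE"},
    event_id="42",
)
```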
### 6. Issue Synchronization
**Architecture:** Webhook-first + polling fallback
**Supported Providers:**
- Gitea (primary)
- GitHub
- GitLab
**Conflict Resolution:** Last-Writer-Wins with version vectors
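
A sketch of how Last-Writer-Wins and version vectors can fit together: the vectors decide whether one side has already seen the other's changes, and the timestamp is consulted only for truly concurrent edits. The `IssueVersion` shape, the replica names, and the tie-breaking rule are assumptions. ADR-011 is the place to check the real design.

```python
from dataclasses import dataclass


@dataclass
class IssueVersion:
    title: str
    vector: dict[str, int]  # replica name -> number of updates seen
    updated_at: float       # wall-clock seconds, used only as a tie-breaker


def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if a has seen every update that b has seen."""
    return all(a.get(replica, 0) >= n for replica, n in b.items())


def resolve(local: IssueVersion, remote: IssueVersion) -> IssueVersion:
    if dominates(local.vector, remote.vector):
        winner = local       # remote is stale (or identical)
    elif dominates(remote.vector, local.vector):
        winner = remote      # local is stale
    else:
        # Concurrent edits: fall back to last writer wins on the timestamp.
        winner = local if local.updated_at >= remote.updated_at else remote
    merged = {
        replica: max(local.vector.get(replica, 0), remote.vector.get(replica, 0))
        for replica in local.vector.keys() | remote.vector.keys()
    }
    return IssueVersion(winner.title, merged, winner.updated_at)


# Both sides edited since the last sync, so the vectors are incomparable.
ours = IssueVersion("Fix login bug", {"syndarix": 2, "gitea": 1}, updated_at=100.0)
theirs = IssueVersion("Fix login redirect", {"syndarix": 1, "gitea": 2}, updated_at=105.0)
result = resolve(ours, theirs)
```

The vector check prevents a stale update with a skewed clock from overwriting newer data, which a purely timestamp-based rule would allow.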
### 7. Cost Tracking
**Real-time Pipeline:**
```
LLM Request → LiteLLM Callback → Redis INCR → Budget Check
Async Queue → PostgreSQL → SSE Dashboard Update
```
**Budget Enforcement:**
- Soft limits: Alerts + model downgrade
- Hard limits: Block requests
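
The soft and hard limits can be read as a three-way decision made after each spend update. The sketch below keeps the running total in a dictionary as a stand-in for the Redis counter. `BudgetTracker`, `BudgetDecision`, and the dollar thresholds are hypothetical.

```python
from enum import Enum


class BudgetDecision(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"  # soft limit reached: alert and use a cheaper model
    BLOCK = "block"          # hard limit reached: reject the request


class BudgetTracker:
    """In-memory stand-in for the per-project Redis counter."""

    def __init__(self, soft_limit_usd: float, hard_limit_usd: float) -> None:
        self.soft = soft_limit_usd
        self.hard = hard_limit_usd
        self.spent: dict[str, float] = {}

    def record(self, project_id: str, cost_usd: float) -> float:
        # Mirrors an atomic increment keyed by project; returns the new total.
        self.spent[project_id] = self.spent.get(project_id, 0.0) + cost_usd
        return self.spent[project_id]

    def check(self, project_id: str) -> BudgetDecision:
        total = self.spent.get(project_id, 0.0)
        if total >= self.hard:
            return BudgetDecision.BLOCK
        if total >= self.soft:
            return BudgetDecision.DOWNGRADE
        return BudgetDecision.ALLOW


tracker = BudgetTracker(soft_limit_usd=80.0, hard_limit_usd=100.0)
decisions = []
for cost in (50.0, 35.0, 20.0):
    tracker.record("proj-123", cost)
    decisions.append(tracker.check("proj-123"))
```

A `DOWNGRADE` result would typically route the next request to the `cost-optimized` model group rather than fail it.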
### 8. Audit Logging
**Immutability:** SHA-256 hash chaining
**Storage Tiers:**
| Tier | Storage | Retention |
|------|---------|-----------|
| Hot | PostgreSQL | 0-90 days |
| Cold | S3/MinIO | 90+ days |
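
SHA-256 hash chaining means each log entry stores the hash of its predecessor, so editing any entry invalidates every hash after it. The following is a minimal sketch using only the standard library. The entry layout, the all-zero genesis value, and the canonical JSON encoding are assumptions, not the format defined in ADR-013.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys, no spaces) so equal payloads hash equally.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev_hash": prev,
                "hash": entry_hash(prev, payload)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain from the start and compare every link."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


audit_log: list[dict] = []
append(audit_log, {"agent": "eng-001", "action": "create_branch"})
append(audit_log, {"agent": "eng-001", "action": "commit"})
append(audit_log, {"agent": "eng-001", "action": "open_pr"})
intact = verify(audit_log)

# Tampering with a stored payload breaks verification.
audit_log[1]["payload"]["action"] = "force_push"
tampered_ok = verify(audit_log)
```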
### 9. Client Approval Flow
**Autonomy Levels:**
| Level | Description |
|-------|-------------|
| FULL_CONTROL | Approve every action |
| MILESTONE | Approve sprint boundaries |
| AUTONOMOUS | Only critical decisions |
**Notifications:** SSE + Email + Mobile Push
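
The autonomy levels can be expressed as a single predicate that decides whether an action needs client approval. The `ActionKind` classification is hypothetical, and the sketch assumes that `MILESTONE` gates critical decisions as well as sprint boundaries, which the table does not state explicitly.

```python
from enum import Enum


class AutonomyLevel(Enum):
    FULL_CONTROL = "full_control"
    MILESTONE = "milestone"
    AUTONOMOUS = "autonomous"


class ActionKind(Enum):
    """Hypothetical classification of agent actions."""
    ROUTINE = "routine"
    SPRINT_BOUNDARY = "sprint_boundary"
    CRITICAL = "critical"


def requires_approval(level: AutonomyLevel, action: ActionKind) -> bool:
    if level is AutonomyLevel.FULL_CONTROL:
        return True  # every action is approved by the client
    if level is AutonomyLevel.MILESTONE:
        # Assumption: critical decisions are gated here too.
        return action in (ActionKind.SPRINT_BOUNDARY, ActionKind.CRITICAL)
    return action is ActionKind.CRITICAL  # AUTONOMOUS


routine_under_milestone = requires_approval(AutonomyLevel.MILESTONE, ActionKind.ROUTINE)
critical_when_autonomous = requires_approval(AutonomyLevel.AUTONOMOUS, ActionKind.CRITICAL)
```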
---
## Technology Stack
### Core Technologies
| Layer | Technology | Version | License |
|-------|------------|---------|---------|
| Backend | FastAPI | 0.115+ | MIT |
| Frontend | Next.js | 16 | MIT |
| Database | PostgreSQL + pgvector | 15+ | PostgreSQL |
| Cache/Queue | Redis | 7.0+ | BSD-3 (through 7.2; later releases use different licenses) |
| Task Queue | Celery | 5.3+ | BSD-3 |
| LLM Gateway | LiteLLM | Latest | MIT |
| MCP Framework | FastMCP | 2.0+ | MIT |
### Self-Hostability Guarantee
**All components are fully self-hostable with no mandatory subscriptions:**
| Component | License | Self-Hosted | Managed Alternative (Optional) |
|-----------|---------|-------------|--------------------------------|
| PostgreSQL | PostgreSQL | Yes | RDS, Neon, Supabase |
| Redis | BSD-3 (through 7.2) | Yes | Redis Cloud |
| LiteLLM | MIT | Yes | LiteLLM Enterprise |
| Celery | BSD-3 | Yes | - |
| FastMCP | MIT | Yes | - |
| LangGraph | MIT | Yes | LangSmith (observability only) |
| transitions | MIT | Yes | - |
| DeepSeek V3.2 | MIT | Yes | API available |
| Qwen3-235B | Apache 2.0 | Yes | Alibaba Cloud |
---
## Data Flow Diagrams
### Agent Task Execution
```
1. Client creates story in Syndarix
2. Story workflow transitions to "implementation"
3. Agent Orchestrator spawns Engineer instance
4. Engineer queries Knowledge Base (RAG)
5. Engineer calls LLM Gateway for code generation
6. Engineer calls Git MCP to create branch & commit
7. Engineer creates PR via Git MCP
8. Workflow transitions to "review"
9. If autonomy_level != AUTONOMOUS:
   ├── Approval request created
   └── Client notified via SSE + email
10. Client approves → PR merged → Workflow to "testing"
```
### Real-time Event Flow
```
Agent Action
    │
    ▼
Event Bus (Redis Pub/Sub)
    ├──▶ SSE Endpoint ──▶ Frontend Dashboard
    ├──▶ Audit Logger ──▶ PostgreSQL
    └──▶ Other Backend Instances (horizontal scaling)
```
---
## Security Architecture
### Authentication Flow
- **Users:** JWT dual-token (access + refresh) via PragmaStack
- **Agents:** Service tokens for MCP communication
- **MCP Servers:** Internal network only, validated service tokens
### Multi-Tenancy
- **Project Isolation:** All queries scoped by project_id
- **Row-Level Security:** PostgreSQL RLS for knowledge base
- **Agent Scoping:** Every MCP tool requires project_id + agent_id
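
One way to enforce the agent-scoping rule is a decorator that rejects any tool call missing `project_id` or `agent_id` before the tool body runs. This is a sketch of the idea only. `scoped_tool`, `ScopeError`, and `search_knowledge` are hypothetical names and do not reflect the FastMCP API.

```python
from functools import wraps


class ScopeError(Exception):
    """Raised when a tool call is missing its tenant scope."""


def scoped_tool(func):
    """Require project_id and agent_id as keyword arguments on every call."""
    @wraps(func)
    def wrapper(*args, project_id=None, agent_id=None, **kwargs):
        if not project_id or not agent_id:
            raise ScopeError(f"{func.__name__} requires project_id and agent_id")
        return func(*args, project_id=project_id, agent_id=agent_id, **kwargs)
    return wrapper


@scoped_tool
def search_knowledge(query: str, *, project_id: str, agent_id: str) -> dict:
    # A real tool would filter rows by project_id; this one echoes the scope.
    return {"query": query, "project_id": project_id, "requested_by": agent_id}


scoped_result = search_knowledge("login flow", project_id="proj-123", agent_id="eng-001")
```

Checking the scope in one decorator keeps the rule out of individual tool bodies, where it would be easy to forget.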
### Audit Trail
- **Hash Chaining:** Tamper-evident event log
- **Complete Coverage:** All agent actions, LLM calls, MCP tool invocations
---
## Scalability Considerations
### Horizontal Scaling
| Component | Scaling Strategy |
|-----------|-----------------|
| FastAPI | Multiple instances behind load balancer |
| Celery Workers | Add workers per queue as needed |
| PostgreSQL | Read replicas, connection pooling |
| Redis | Cluster mode for high availability |
### Expected Scale
| Metric | Target |
|--------|--------|
| Concurrent Projects | 50+ |
| Concurrent Agent Instances | 200+ |
| Background Jobs/minute | 500+ |
| SSE Connections | 200+ |
---
## Deployment Architecture
### Local Development
```
docker-compose up
├── PostgreSQL (+ pgvector)
├── Redis
├── FastAPI Backend
├── Next.js Frontend
├── Celery Workers (agent, git, sync queues)
├── Celery Beat (scheduler)
├── Flower (monitoring)
└── MCP Servers (7 containers)
```
### Production
```
┌─────────────────────────────────────────────────────────────────┐
│ Load Balancer │
└─────────────────────────────┬───────────────────────────────────┘
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ API Instance 1 │ │ API Instance 2 │ │ API Instance N │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└────────────────────┼────────────────────┘
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ PostgreSQL │ │ Redis Cluster │ │ Celery Workers │
│ (Primary + │ │ │ │ (Auto-scaled) │
│ Replicas) │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
---
## Related Documents
- [Implementation Roadmap](./IMPLEMENTATION_ROADMAP.md)
- [Architecture Deep Analysis](./ARCHITECTURE_DEEP_ANALYSIS.md)
- [ADRs](../adrs/) - All architecture decision records
- [Spikes](../spikes/) - Research documents
---
## Appendix: Full ADR List
1. [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md)
2. [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md)
3. [ADR-003: Background Task Architecture](../adrs/ADR-003-background-task-architecture.md)
4. [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md)
5. [ADR-005: Technology Stack Selection](../adrs/ADR-005-tech-stack-selection.md)
6. [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md)
7. [ADR-007: Agentic Framework Selection](../adrs/ADR-007-agentic-framework-selection.md)
8. [ADR-008: Knowledge Base and RAG](../adrs/ADR-008-knowledge-base-rag.md)
9. [ADR-009: Agent Communication Protocol](../adrs/ADR-009-agent-communication-protocol.md)
10. [ADR-010: Workflow State Machine](../adrs/ADR-010-workflow-state-machine.md)
11. [ADR-011: Issue Synchronization](../adrs/ADR-011-issue-synchronization.md)
12. [ADR-012: Cost Tracking](../adrs/ADR-012-cost-tracking.md)
13. [ADR-013: Audit Logging](../adrs/ADR-013-audit-logging.md)
14. [ADR-014: Client Approval Flow](../adrs/ADR-014-client-approval-flow.md)
---
*This document serves as the authoritative architecture reference for Syndarix.*