# ADR-007: Agentic Framework Selection

- **Status:** Accepted
- **Date:** 2025-12-29
- **Deciders:** Architecture Team
- **Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007

## Context
Syndarix requires a robust multi-agent orchestration system capable of:
- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
## Decision Drivers
- Production Readiness: Must be battle-tested, not experimental
- Self-Hostability: All components must be self-hostable with no mandatory subscriptions
- Flexibility: Must support Syndarix-specific patterns (autonomy levels, client approvals)
- Durability: Workflows must survive failures, restarts, and deployments
- Observability: Full visibility into agent activities and costs
- Scalability: Handle 50+ concurrent agents without architectural changes
## Considered Options

### Option 1: CrewAI (Full Framework)

**Pros:**
- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A; used by 60% of the Fortune 500)
- LLM-agnostic design
**Cons:**
- Teams report hitting complexity walls after 6-12 months
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- The Flows architecture adds a learning curve without solving durability
**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements

### Option 2: AutoGen 0.4 (Full Framework)

**Pros:**
- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration
**Cons:**
- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows
**Verdict:** Rejected - too opinionated, insufficient durability

### Option 3: LangGraph + Custom Infrastructure (Hybrid)

**Pros:**
- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
**Cons:**
- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling
**Verdict:** Selected as foundation

### Option 4: Fully Custom Solution

**Pros:**
- Complete control
- No external dependencies
- Tailored to exact requirements
**Cons:**
- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path
**Verdict:** Rejected - unnecessary when proven components exist

## Decision

Adopt a hybrid architecture with LangGraph as the core agent framework:
- LangGraph - Agent state machines and logic
- transitions + PostgreSQL + Celery - Durable workflow state machines
- Redis Streams - Agent-to-agent communication
- LiteLLM - Unified LLM access with failover
- PostgreSQL + pgvector - State persistence and RAG
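
As a sketch of how the failover chain could be wired up with LiteLLM's `Router` (the model names mirror those used later in this ADR, but the provider prefixes and deployment parameters here are illustrative assumptions, not the project's actual config):

```python
from litellm import Router

# Sketch only: model identifiers match this ADR's examples; provider
# prefixes and deployment parameters are assumptions.
router = Router(
    model_list=[
        {"model_name": "claude-opus-4-5",
         "litellm_params": {"model": "anthropic/claude-opus-4-5"}},
        {"model_name": "gpt-5.1-codex-max",
         "litellm_params": {"model": "openai/gpt-5.1-codex-max"}},
    ],
    # If the primary model errors out, LiteLLM retries with the fallback.
    fallbacks=[{"claude-opus-4-5": ["gpt-5.1-codex-max"]}],
)

async def call_with_failover(messages: list[dict]):
    return await router.acompletion(model="claude-opus-4-5", messages=messages)
```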
### Why Not Temporal?
After evaluating both approaches, we chose the simpler transitions + PostgreSQL + Celery stack over Temporal:
| Factor | Temporal | transitions + PostgreSQL |
|---|---|---|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
Temporal is overkill for our scale (10-50 concurrent agents). The simpler approach provides:
- Full durability via PostgreSQL state persistence
- Event sourcing via transition history table
- Background execution via Celery workers
- Simpler debugging with standard tools
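
A minimal sketch of the two tables this durability story rests on follows; the column names are assumptions for illustration, not the actual Syndarix schema:

```python
# Illustrative DDL for the durability tables referenced above.
# Column names are assumptions; the real schema may differ.
WORKFLOW_DDL = """
CREATE TABLE workflow_instances (
    id            UUID PRIMARY KEY,
    workflow_type TEXT        NOT NULL,              -- 'sprint', 'story', 'pr'
    current_state TEXT        NOT NULL,              -- mirrors the transitions state
    context       JSONB       NOT NULL DEFAULT '{}',
    status        TEXT        NOT NULL DEFAULT 'in_progress',
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE workflow_transitions (                  -- event-sourced history
    id          BIGSERIAL   PRIMARY KEY,
    workflow_id UUID        NOT NULL REFERENCES workflow_instances(id),
    from_state  TEXT        NOT NULL,
    to_state    TEXT        NOT NULL,
    trigger     TEXT        NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
"""
```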
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Syndarix Agentic Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Workflow Engine (transitions + PostgreSQL) │ │
│ │ │ │
│ │ • State persistence to PostgreSQL (survives restarts) │ │
│ │ • Event sourcing via workflow_transitions table │ │
│ │ • Human approval checkpoints (pause workflow, await signal) │ │
│ │ • Background execution via Celery workers │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LangGraph Agent Runtime │ │
│ │ │ │
│ │ • Graph-based state machines for agent logic │ │
│ │ • Persistent checkpoints to PostgreSQL │ │
│ │ • Cycles, conditionals, parallel execution │ │
│ │ • Human-in-the-loop first-class support │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Agent State Graph │ │ │
│ │ │ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING] │ │ │
│ │ │ ▲ │ │ │ │ │ │
│ │ │ └─────────────┴──────────────┴──────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Redis Streams Communication Layer │ │
│ │ │ │
│ │ • Agent-to-Agent messaging (A2A protocol concepts) │ │
│ │ • Event-driven architecture │ │
│ │ • Real-time activity streaming to UI │ │
│ │ • Project-scoped message channels │ │
│ │ │ │
│ │ License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Gateway │ │
│ │ │ │
│ │ • Unified API for 100+ LLM providers │ │
│ │ • Automatic failover chains (Claude → GPT-4 → Ollama) │ │
│ │ • Token counting and cost calculation │ │
│ │ • Rate limiting and load balancing │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```

## Component Responsibilities
| Component | Responsibility | Why This Choice |
|---|---|---|
| LangGraph | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| transitions | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| Celery + Redis | Background task execution, async workflows | Already in stack, battle-tested |
| PostgreSQL | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| Redis Streams | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| LiteLLM | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
## Reboot Survival (Durability)

The architecture fully supports system reboots and crashes:
- Workflow State: Persisted to the PostgreSQL `workflow_instances` table
- Transition History: Event-sourced in the `workflow_transitions` table
- Agent Checkpoints: LangGraph persists to PostgreSQL
- Pending Tasks: Celery tasks in Redis (configured with persistence)
Recovery Process:

```
System Restart
      │
      ▼
Load workflow_instances WHERE status = 'in_progress'
      │
      ▼
For each workflow:
  ├── Restore state from context JSONB
  ├── Identify current_state
  ├── Resume from last checkpoint
  └── Continue execution
```
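
A hedged sketch of the startup recovery task this flow implies (the helpers `db.fetch_all` and `resume_workflow` are hypothetical names, not actual Syndarix APIs):

```python
@celery_app.task
def recover_in_progress_workflows():
    """Run once at startup: re-enqueue every workflow left 'in_progress'."""
    rows = db.fetch_all(
        "SELECT id FROM workflow_instances WHERE status = 'in_progress'"
    )
    for row in rows:
        # Each workflow reloads current_state + context from PostgreSQL,
        # then resumes from its last persisted checkpoint.
        resume_workflow.delay(str(row["id"]))
```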
## Self-Hostability Guarantee
All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|---|---|---|---|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
No mandatory subscriptions. All paid alternatives are optional cloud-managed offerings.
## What We Build vs. What We Use
| Concern | Approach | Rationale |
|---|---|---|
| Agent Logic | USE LangGraph | Don't reinvent state machines |
| LLM Access | USE LiteLLM | Don't reinvent provider abstraction |
| Workflow State | USE transitions + PostgreSQL | Simple, durable, debuggable |
| Background Tasks | USE Celery | Already in stack, proven |
| Messaging | USE Redis Streams | Don't reinvent pub/sub |
| Orchestration | BUILD thin layer | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | BUILD thin layer | Type-Instance pattern specific to Syndarix |
| Cost Attribution | BUILD thin layer | Per-agent, per-project tracking specific to Syndarix |
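
For the cost-attribution thin layer, one possible shape is a LiteLLM success callback that writes per-agent cost rows; this is a sketch, and the `agent_costs` table, `db` helper, and metadata keys are assumptions:

```python
import litellm

def track_cost(kwargs, completion_response, start_time, end_time):
    """LiteLLM success callback: attribute spend to an agent instance."""
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
    db.execute(  # 'db' and the agent_costs table are assumed helpers
        "INSERT INTO agent_costs (agent_id, model, cost_usd) VALUES ($1, $2, $3)",
        metadata.get("agent_id"),
        kwargs.get("model"),
        kwargs.get("response_cost", 0.0),  # cost computed by LiteLLM
    )

litellm.success_callback = [track_cost]
```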
## Integration Pattern

```python
# Example: How the layers integrate.
# Note: db, celery_app, AgentState, AgentMessage, and message_bus are
# Syndarix helpers defined elsewhere.
from transitions import Machine
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
import litellm

# 1. Workflow state machine (transitions library)
class SprintWorkflow(Machine):
    states = ['planning', 'active', 'review', 'done']

    def __init__(self, sprint_id: str):
        self.sprint_id = sprint_id
        Machine.__init__(
            self,
            states=self.states,
            initial='planning',
            after_state_change='persist_state'
        )
        self.add_transition('start', 'planning', 'active', before='spawn_agents')
        self.add_transition('complete_work', 'active', 'review')
        self.add_transition('approve', 'review', 'done', conditions='has_approval')

    async def persist_state(self):
        """Save state to PostgreSQL (survives restarts)."""
        await db.execute("""
            UPDATE workflow_instances
            SET current_state = $1, context = $2, updated_at = NOW()
            WHERE id = $3
        """, self.state, self.context, self.sprint_id)

# 2. Background execution via Celery
@celery_app.task(bind=True, max_retries=3)
def run_sprint_workflow(self, sprint_id: str):
    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
    workflow.start()  # Triggers agent spawning
    # Workflow persists state, can resume after restart

# 3. LangGraph handles individual agent logic
def create_agent_graph():
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))

# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-opus-4-5",  # Claude Opus 4.5 (primary)
        messages=state["messages"],
        fallbacks=["gpt-5.1-codex-max", "gemini-3-pro", "qwen3-235b", "deepseek-v3.2"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```
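
The `message_bus` used in `handoff_node` is one of the thin Syndarix layers; a minimal sketch over Redis Streams might look like this (the stream-naming convention and `AgentMessage` fields are assumptions):

```python
import json
import redis.asyncio as redis

class MessageBus:
    """Thin Redis Streams wrapper; one inbox stream per agent instance."""

    def __init__(self, url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(url)

    async def publish(self, message) -> None:
        # Stream name per target agent is an assumed convention.
        stream = f"agent:{message.target_agent_id}:inbox"
        await self.redis.xadd(stream, {
            "source": message.source_agent_id,
            "type": message.message_type,
            "payload": json.dumps(message.payload),
        })
```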
## Human Approval Checkpoints

For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):

```python
class StoryWorkflow(Machine):
    async def request_approval_and_wait(self, action: str):
        """Pause workflow and await human decision."""
        # 1. Create approval request
        request = await approval_service.create(
            workflow_id=self.id,
            action=action,
            context=self.context
        )

        # 2. Transition to waiting state (persisted)
        self.state = 'awaiting_approval'
        await self.persist_state()

        # 3. Workflow is paused - Celery task completes.
        #    When user approves, a new task resumes the workflow.

    @classmethod
    async def resume_on_approval(cls, workflow_id: str, approved: bool):
        """Called when user makes a decision."""
        workflow = await cls.load(workflow_id)
        if approved:
            workflow.trigger('approved')
        else:
            workflow.trigger('rejected')
```
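
The resume path could be wired through a small Celery task invoked by the approval endpoint; this is a sketch, and the task name and the `asyncio.run` bridging are assumptions:

```python
import asyncio

@celery_app.task
def apply_approval_decision(workflow_id: str, approved: bool):
    # Bridges the sync Celery worker to the async workflow API; the
    # workflow reloads its persisted state and fires the matching trigger.
    asyncio.run(StoryWorkflow.resume_on_approval(workflow_id, approved))
```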
## Consequences

### Positive
- Production-tested foundations - LangGraph, Celery, LiteLLM are battle-tested
- No subscription lock-in - All components self-hostable under permissive licenses
- Right tool for each job - Specialized components for state, communication, background processing
- Escape hatches - Can replace any component without full rewrite
- Simpler operations - Uses existing PostgreSQL + Redis infrastructure, no new services
- Reboot survival - Full durability via PostgreSQL persistence
### Negative
- Multiple technologies to learn - Team needs LangGraph, transitions, Redis Streams knowledge
- Integration work - Thin glue layers needed between components
- Manual recovery logic - Must implement workflow recovery on startup
### Mitigation
- Learning curve - Start with simple 2-3 agent workflows, expand gradually
- Integration - Create clear abstractions; each layer only knows its immediate neighbors
- Recovery - Implement startup recovery task that scans for in-progress workflows
## Compliance
This decision aligns with:
- FR-101-105: Agent management requirements (Type-Instance pattern)
- FR-301-305: Workflow execution requirements
- NFR-402: Fault tolerance (workflow durability, crash recovery)
- TC-001: PostgreSQL as primary database
- Core Principle: Self-hostability (all components MIT/BSD licensed)
## Alternatives Not Chosen

### LangSmith for Observability
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use LangFuse (open source, self-hostable) for LLM observability
- Use standard logging + PostgreSQL queries for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
### Temporal for Durable Workflows
Temporal was initially considered but rejected for this project:
- Overkill for scale - Syndarix targets 10-50 concurrent agents, not thousands
- Operational overhead - Requires separate cluster, workers, SDK learning curve
- Simpler alternative available - transitions + PostgreSQL provides equivalent durability
- Migration path - If scale demands grow, Temporal can be introduced later
## References

- LangGraph Documentation
- transitions Library
- LiteLLM Documentation
- LangFuse (Open Source LLM Observability)
- SPIKE-002: Agent Orchestration Pattern
- SPIKE-005: LLM Provider Abstraction
- SPIKE-008: Workflow State Machine
- ADR-010: Workflow State Machine
This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.