# ADR-007: Agentic Framework Selection

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007

---

## Context

Syndarix requires a robust multi-agent orchestration system capable of:

- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover

We evaluated whether to adopt an existing framework wholesale or build a custom solution.

## Decision Drivers

- **Production Readiness:** Must be battle-tested, not experimental
- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
- **Durability:** Workflows must survive failures, restarts, and deployments
- **Observability:** Full visibility into agent activities and costs
- **Scalability:** Handle 50+ concurrent agents without architectural changes

## Considered Options

### Option 1: CrewAI (Full Framework)

**Pros:**

- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A, 60% of the Fortune 500)
- LLM-agnostic design

**Cons:**

- Teams report hitting complexity walls after 6-12 months
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- Flows architecture adds a learning curve without solving durability

**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements

### Option 2: AutoGen 0.4 (Full Framework)

**Pros:**

- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration

**Cons:**

- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows

**Verdict:** Rejected - too opinionated, insufficient durability

### Option 3: LangGraph + Custom Infrastructure (Hybrid)

**Pros:**

- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)

**Cons:**

- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling

**Verdict:** Selected as foundation

### Option 4: Fully Custom Solution

**Pros:**

- Complete control
- No external dependencies
- Tailored to exact requirements

**Cons:**

- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in the critical path

**Verdict:** Rejected - unnecessary when proven components exist

## Decision

**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:

1. **LangGraph** - Agent state machines and logic
2. **transitions + PostgreSQL + Celery** - Durable workflow state machines
3. **Redis Streams** - Agent-to-agent communication
4. **LiteLLM** - Unified LLM access with failover
5. **PostgreSQL + pgvector** - State persistence and RAG
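To make the durability choice concrete: the workflow layer (item 2 above) hinges on two small PostgreSQL tables that this ADR references throughout, `workflow_instances` and `workflow_transitions`. The DDL below is an illustrative sketch only; column names and types are assumptions made here, and the authoritative schema belongs to ADR-010.

```sql
-- Illustrative sketch only: the real schema is defined in ADR-010.
CREATE TABLE workflow_instances (
    id            UUID PRIMARY KEY,
    workflow_type TEXT        NOT NULL,              -- e.g. 'sprint', 'story', 'pr'
    current_state TEXT        NOT NULL,              -- state machine position
    status        TEXT        NOT NULL,              -- 'in_progress', 'completed', ...
    context       JSONB       NOT NULL DEFAULT '{}', -- restored on recovery
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Event sourcing: transition rows are append-only, never updated.
CREATE TABLE workflow_transitions (
    id          BIGSERIAL PRIMARY KEY,
    workflow_id UUID        NOT NULL REFERENCES workflow_instances (id),
    from_state  TEXT        NOT NULL,
    to_state    TEXT        NOT NULL,
    trigger     TEXT        NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

With this shape, crash recovery reduces to selecting rows `WHERE status = 'in_progress'` and restoring each machine from `current_state` and `context`, as described under Reboot Survival below.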
### Why Not Temporal?

After evaluating both approaches, we chose the simpler **transitions + PostgreSQL + Celery** stack over Temporal:

| Factor | Temporal | transitions + PostgreSQL |
|--------|----------|--------------------------|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |

**Temporal is overkill for our scale** (10-50 concurrent agents). The simpler approach provides:

- Full durability via PostgreSQL state persistence
- Event sourcing via the transition history table
- Background execution via Celery workers
- Simpler debugging with standard tools

### Architecture Overview

```
                   Syndarix Agentic Architecture

┌──────────────────────────────────────────────────────────────────┐
│            Workflow Engine (transitions + PostgreSQL)            │
│                                                                  │
│  • State persistence to PostgreSQL (survives restarts)           │
│  • Event sourcing via workflow_transitions table                 │
│  • Human approval checkpoints (pause workflow, await signal)     │
│  • Background execution via Celery workers                       │
│                                                                  │
│  License: MIT | Self-Hosted: Yes | Subscription: None Required   │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                     LangGraph Agent Runtime                      │
│                                                                  │
│  • Graph-based state machines for agent logic                    │
│  • Persistent checkpoints to PostgreSQL                          │
│  • Cycles, conditionals, parallel execution                      │
│  • Human-in-the-loop first-class support                         │
│                                                                  │
│    Agent State Graph:                                            │
│    [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]           │
│                   ▲               │              │               │
│                   └───────────────┴──────────────┘               │
│                                                                  │
│  License: MIT | Self-Hosted: Yes | Subscription: None Required   │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                Redis Streams Communication Layer                 │
│                                                                  │
│  • Agent-to-Agent messaging (A2A protocol concepts)              │
│  • Event-driven architecture                                     │
│  • Real-time activity streaming to UI                            │
│  • Project-scoped message channels                               │
│                                                                  │
│  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                         LiteLLM Gateway                          │
│                                                                  │
│  • Unified API for 100+ LLM providers                            │
│  • Automatic failover chains (Claude → GPT-4 → Ollama)           │
│  • Token counting and cost calculation                           │
│  • Rate limiting and load balancing                              │
│                                                                  │
│  License: MIT | Self-Hosted: Yes | Subscription: None Required   │
└──────────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | Why This Choice |
|-----------|----------------|-----------------|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| **transitions** | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| **Celery + Redis** | Background task execution, async workflows | Already in stack, battle-tested |
| **PostgreSQL** | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |

### Reboot Survival (Durability)

The architecture **fully supports system reboots and crashes**:

1. **Workflow State**: Persisted to the PostgreSQL `workflow_instances` table
2. **Transition History**: Event-sourced in the `workflow_transitions` table
3. **Agent Checkpoints**: LangGraph persists to PostgreSQL
4. **Pending Tasks**: Celery tasks in Redis (configured with persistence)

**Recovery Process:**

```
System Restart
      │
      ▼
Load workflow_instances WHERE status = 'in_progress'
      │
      ▼
For each workflow:
  ├── Restore state from context JSONB
  ├── Identify current_state
  ├── Resume from last checkpoint
  └── Continue execution
```

### Self-Hostability Guarantee

All components are fully self-hostable with permissive open-source licenses:

| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|------------------------|------------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |

**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.
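The recovery process sketched above can be implemented as a small startup task. The version below is a self-contained illustration, not the real Syndarix code: `FakeDB` stands in for the asyncpg pool, and the `resume` callback stands in for re-enqueuing one Celery task per interrupted workflow.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeDB:
    """Stand-in for the asyncpg pool; filters in Python instead of SQL."""
    rows: list = field(default_factory=list)

    async def fetch(self, query: str):
        return [r for r in self.rows if r["status"] == "in_progress"]


async def recover_workflows(db, resume) -> int:
    """Scan for workflows interrupted by a crash and hand each to `resume`.

    In production `resume` would re-enqueue a Celery task that restores the
    state machine from current_state and the context JSONB.
    """
    rows = await db.fetch(
        "SELECT id, current_state, context FROM workflow_instances "
        "WHERE status = 'in_progress'"
    )
    for row in rows:
        resume(row["id"], row["current_state"], row["context"])
    return len(rows)


if __name__ == "__main__":
    db = FakeDB(rows=[
        {"id": "w1", "status": "in_progress", "current_state": "active", "context": {}},
        {"id": "w2", "status": "done", "current_state": "done", "context": {}},
    ])
    resumed = []
    count = asyncio.run(recover_workflows(db, lambda *args: resumed.append(args)))
    print(count)  # prints 1: only w1 was mid-flight
```

Running this task once on process startup (before accepting new work) is what makes the "Reboot Survival" guarantee hold in practice.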
### What We Build vs. What We Use

| Concern | Approach | Rationale |
|---------|----------|-----------|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
| Workflow State | **USE transitions + PostgreSQL** | Simple, durable, debuggable |
| Background Tasks | **USE Celery** | Already in stack, proven |
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |

### Integration Pattern

```python
# Example: how the layers integrate. Helpers such as `db`, `celery_app`,
# `approval_service`, and `message_bus` are Syndarix glue code.
import asyncio

# Async callbacks (persist_state) require the asyncio extension of transitions.
from transitions.extensions.asyncio import AsyncMachine

# 1. Workflow state machine (transitions library)
class SprintWorkflow(AsyncMachine):

    states = ['planning', 'active', 'review', 'done']

    def __init__(self, sprint_id: str):
        self.sprint_id = sprint_id
        self.context: dict = {}  # serialized to the context JSONB column
        AsyncMachine.__init__(
            self,
            states=self.states,
            initial='planning',
            after_state_change='persist_state'
        )
        self.add_transition('start', 'planning', 'active', before='spawn_agents')
        self.add_transition('complete_work', 'active', 'review')
        self.add_transition('approve', 'review', 'done', conditions='has_approval')

    async def persist_state(self):
        """Save state to PostgreSQL (survives restarts)."""
        await db.execute("""
            UPDATE workflow_instances
            SET current_state = $1, context = $2, updated_at = NOW()
            WHERE id = $3
        """, self.state, self.context, self.sprint_id)

# 2. Background execution via Celery
@celery_app.task(bind=True, max_retries=3)
def run_sprint_workflow(self, sprint_id: str):
    async def _run():
        workflow = await SprintWorkflow.load(sprint_id)  # Restore from DB
        await workflow.start()  # Triggers agent spawning (async trigger)
    asyncio.run(_run())
    # Workflow persists state, can resume after restart

# 3. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... edges and entry point
    return graph.compile(checkpointer=PostgresSaver(...))

# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-opus-4-5",  # Claude Opus 4.5 (primary)
        messages=state["messages"],
        fallbacks=["gpt-5.1-codex-max", "gemini-3-pro", "qwen3-235b", "deepseek-v3.2"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```

### Human Approval Checkpoints

For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):

```python
class StoryWorkflow(AsyncMachine):

    async def request_approval_and_wait(self, action: str):
        """Pause the workflow and await a human decision."""
        # 1. Create approval request
        await approval_service.create(
            workflow_id=self.id,
            action=action,
            context=self.context
        )
        # 2. Move to the waiting state via an auto-transition
        #    (never assign self.state directly), then persist it
        await self.to_awaiting_approval()
        await self.persist_state()
        # 3. Workflow is paused - the Celery task completes.
        # When the user approves, a new task resumes the workflow.

    @classmethod
    async def resume_on_approval(cls, workflow_id: str, approved: bool):
        """Called when the user makes a decision."""
        workflow = await cls.load(workflow_id)
        if approved:
            await workflow.trigger('approved')
        else:
            await workflow.trigger('rejected')
```

## Consequences

### Positive

- **Production-tested foundations** - LangGraph, Celery, and LiteLLM are battle-tested
- **No subscription lock-in** - All components self-hostable under permissive licenses
- **Right tool for each job** - Specialized components for state, communication, background processing
- **Escape hatches** - Can replace any component without a full rewrite
- **Simpler operations** - Uses existing PostgreSQL + Redis infrastructure, no new services
- **Reboot survival** - Full durability via PostgreSQL persistence

### Negative

- **Multiple technologies to learn** - Team needs LangGraph, transitions, and Redis Streams knowledge
- **Integration work** - Thin glue layers needed between components
- **Manual recovery logic** - Must implement workflow recovery on startup

### Mitigation

- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
- **Recovery** - Implement a startup recovery task that scans for in-progress workflows

## Compliance

This decision aligns with:

- **FR-101-105**: Agent management requirements (Type-Instance pattern)
- **FR-301-305**: Workflow execution requirements
- **NFR-402**: Fault tolerance (workflow durability, crash recovery)
- **TC-001**: PostgreSQL as primary database
- **Core Principle**: Self-hostability (all components MIT/BSD licensed)

## Alternatives Not Chosen

### LangSmith for Observability

LangSmith is LangChain's paid observability platform.
Instead, we will:

- Use **LangFuse** (open source, self-hostable) for LLM observability
- Use standard logging + PostgreSQL queries for workflow visibility
- Build custom dashboards for Syndarix-specific metrics

### Temporal for Durable Workflows

Temporal was initially considered but rejected for this project:

- **Overkill for scale** - Syndarix targets 10-50 concurrent agents, not thousands
- **Operational overhead** - Requires a separate cluster, workers, and an SDK learning curve
- **Simpler alternative available** - transitions + PostgreSQL provides equivalent durability
- **Migration path** - If scale demands grow, Temporal can be introduced later

## References

- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [transitions Library](https://github.com/pytransitions/transitions)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
- [SPIKE-008: Workflow State Machine](../spikes/SPIKE-008-workflow-state-machine.md)
- [ADR-010: Workflow State Machine](./ADR-010-workflow-state-machine.md)

---

*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*