# ADR-007: Agentic Framework Selection

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007

---

## Context

Syndarix requires a robust multi-agent orchestration system capable of:

- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover

We evaluated whether to adopt an existing framework wholesale or build a custom solution.

## Decision Drivers

- **Production Readiness:** Must be battle-tested, not experimental
- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
- **Durability:** Workflows must survive failures, restarts, and deployments
- **Observability:** Full visibility into agent activities and costs
- **Scalability:** Handle 50+ concurrent agents without architectural changes

## Considered Options

### Option 1: CrewAI (Full Framework)

**Pros:**

- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A, 60% of the Fortune 500)
- LLM-agnostic design

**Cons:**

- Teams report hitting complexity walls after 6-12 months
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- Flows architecture adds a learning curve without solving durability

**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements

### Option 2: AutoGen 0.4 (Full Framework)

**Pros:**

- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration

**Cons:**

- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows

**Verdict:** Rejected - too opinionated, insufficient durability

### Option 3: LangGraph + Custom Infrastructure (Hybrid)

**Pros:**

- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)

**Cons:**

- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling

**Verdict:** Selected as foundation

### Option 4: Fully Custom Solution

**Pros:**

- Complete control
- No external dependencies
- Tailored to exact requirements

**Cons:**

- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path

**Verdict:** Rejected - unnecessary when proven components exist

## Decision

**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:

1. **LangGraph** - Agent state machines and logic
2. **Temporal** - Durable workflow execution
3. **Redis Streams** - Agent-to-agent communication
4. **LiteLLM** - Unified LLM access with failover
5. **PostgreSQL + pgvector** - State persistence and RAG

### Architecture Overview

```
┌───────────────────────────────────────────────────────────────────────┐
│                     Syndarix Agentic Architecture                     │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                    Temporal Workflow Engine                     │  │
│  │                                                                 │  │
│  │  • Durable execution (survives crashes, restarts, deployments)  │  │
│  │  • Human approval checkpoints (wait indefinitely for client)    │  │
│  │  • Long-running workflows (projects spanning weeks/months)      │  │
│  │  • Built-in retry policies and timeouts                         │  │
│  │                                                                 │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                     LangGraph Agent Runtime                     │  │
│  │                                                                 │  │
│  │  • Graph-based state machines for agent logic                   │  │
│  │  • Persistent checkpoints to PostgreSQL                         │  │
│  │  • Cycles, conditionals, parallel execution                     │  │
│  │  • Human-in-the-loop first-class support                        │  │
│  │                                                                 │  │
│  │    ┌───────────────────────────────────────────────────────┐    │  │
│  │    │                   Agent State Graph                   │    │  │
│  │    │  [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]  │    │  │
│  │    │    ▲             │              │              │      │    │  │
│  │    │    └─────────────┴──────────────┴──────────────┘      │    │  │
│  │    └───────────────────────────────────────────────────────┘    │  │
│  │                                                                 │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │               Redis Streams Communication Layer                 │  │
│  │                                                                 │  │
│  │  • Agent-to-Agent messaging (A2A protocol concepts)             │  │
│  │  • Event-driven architecture                                    │  │
│  │  • Real-time activity streaming to UI                           │  │
│  │  • Project-scoped message channels                              │  │
│  │                                                                 │  │
│  │  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required│  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│                                   ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                         LiteLLM Gateway                         │  │
│  │                                                                 │  │
│  │  • Unified API for 100+ LLM providers                           │  │
│  │  • Automatic failover chains (Claude → GPT-4 → Ollama)          │  │
│  │  • Token counting and cost calculation                          │  │
│  │  • Rate limiting and load balancing                             │  │
│  │                                                                 │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

### Component Responsibilities

| Component | Responsibility | Why This Choice |
|-----------|----------------|-----------------|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |

### Self-Hostability Guarantee

All components are fully self-hostable under permissive open-source licenses:

| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|------------------------|------------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| Temporal | MIT | Temporal Cloud | No - self-host server |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |

**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.

### What We Build vs. What We Use

| Concern | Approach | Rationale |
|---------|----------|-----------|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
| Durability | **USE Temporal** | Don't reinvent durable execution |
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |

### Integration Pattern

```python
# Example: How the layers integrate
from datetime import timedelta

import litellm
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
from temporalio import workflow


# 1. Temporal orchestrates the high-level workflow
@workflow.defn
class SprintWorkflow:
    @workflow.run
    async def run(self, sprint: SprintConfig) -> SprintResult:
        # Spawns agents and waits for completion
        agents = await workflow.execute_activity(spawn_agent_team, sprint)

        # Each agent runs a LangGraph state machine
        results = await workflow.execute_activity(
            run_agent_tasks,
            agents,
            start_to_close_timeout=timedelta(days=7),
        )

        # Human checkpoint (waits indefinitely)
        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
            await workflow.wait_condition(lambda: self._approved)

        return results


# 2. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))


# 3. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=state["messages"],
        fallbacks=["gpt-4-turbo", "ollama/llama3"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}


# 4. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```

## Consequences

### Positive

- **Production-tested foundations** - LangGraph, Temporal, and LiteLLM are battle-tested
- **No subscription lock-in** - All components self-hostable under permissive licenses
- **Right tool for each job** - Specialized components for durability, state, communication
- **Escape hatches** - Can replace any component without a full rewrite
- **Enterprise patterns** - Temporal is used by Netflix, Uber, and Stripe for similar problems

### Negative

- **Multiple technologies to learn** - Team needs LangGraph, Temporal, and Redis Streams knowledge
- **Operational complexity** - More services to deploy and monitor
- **Integration work** - Thin glue layers needed between components

### Mitigation

- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
- **Operational complexity** - Use Docker Compose locally; consider managed services for production if needed
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors

## Compliance

This decision aligns with:

- **FR-101-105**: Agent orchestration requirements
- **FR-301-305**: Workflow execution requirements
- **NFR-501**: Self-hosting requirement (all components MIT/BSD licensed)
- **TC-001**: PostgreSQL as primary database
- **TC-002**: Redis for caching and messaging

## Alternatives Not Chosen

### LangSmith for Observability

LangSmith is LangChain's paid observability platform. Instead, we will:

- Use **LangFuse** (open source, self-hostable) for LLM observability
- Use **Temporal UI** (built-in) for workflow visibility
- Build custom dashboards for Syndarix-specific metrics

### Temporal Cloud

Temporal offers a managed cloud service. Instead, we will:

- Self-host the Temporal server (single node to start, cluster at scale)
- Use PostgreSQL as Temporal's persistence backend (already in stack)

## References

- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [Temporal.io Documentation](https://docs.temporal.io/)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)

---

*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*
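
## Appendix: Message Bus Sketch

The integration pattern's `handoff_node` publishes through a `message_bus` object that this ADR does not define. A minimal sketch of what that thin layer might look like over Redis Streams; the class name, key scheme, and field encoding here are illustrative assumptions, not the actual Syndarix implementation. It assumes a `redis.asyncio.Redis` client and uses the standard `XADD` command:

```python
import json
from dataclasses import asdict, dataclass


# Assumed message shape, mirroring the fields used by handoff_node
@dataclass
class AgentMessage:
    source_agent_id: str
    target_agent_id: str
    message_type: str
    payload: dict


def stream_key(project_id: str) -> str:
    # Hypothetical project-scoped channel naming
    return f"syndarix:project:{project_id}:agents"


def encode(msg: AgentMessage) -> dict:
    # Redis stream fields must be flat strings, so the payload nests as JSON
    fields = asdict(msg)
    fields["payload"] = json.dumps(msg.payload)
    return fields


class MessageBus:
    """Thin wrapper over an async Redis client for one project's stream."""

    def __init__(self, redis, project_id: str):
        self.redis = redis
        self.key = stream_key(project_id)

    async def publish(self, msg: AgentMessage) -> str:
        # XADD appends the entry to the project stream and returns its id
        return await self.redis.xadd(self.key, encode(msg))
```

A consumer group per agent instance (`XGROUP CREATE` / `XREADGROUP`) would give each agent its own cursor into the stream and at-least-once delivery, which is what makes streams preferable to plain pub/sub here.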
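
## Appendix: Cost Attribution Sketch

The build-vs-use table marks cost attribution as a thin layer to build. A minimal sketch of per-agent, per-project aggregation under stated assumptions: the field names and per-1K-token rates are illustrative, and real token counts would come from the `usage` object on each LiteLLM response rather than being passed by hand.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Aggregates LLM spend per agent instance and per project."""
    by_agent: defaultdict = field(default_factory=lambda: defaultdict(float))
    by_project: defaultdict = field(default_factory=lambda: defaultdict(float))

    def record(self, agent_id: str, project_id: str,
               prompt_tokens: int, completion_tokens: int,
               prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
        # Rates are cost per 1K tokens and vary by model/provider
        cost = (prompt_tokens / 1000) * prompt_rate_per_1k \
             + (completion_tokens / 1000) * completion_rate_per_1k
        self.by_agent[agent_id] += cost
        self.by_project[project_id] += cost
        return cost
```

Because `litellm.acompletion` calls already carry `metadata={"agent_id": ...}` in the integration pattern, wiring this tracker into a LiteLLM callback would give per-call attribution without touching agent logic.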