From bd702734c2e5db40db18c7cb94f550012da12122 Mon Sep 17 00:00:00 2001
From: Felipe Cardoso
Date: Mon, 29 Dec 2025 13:42:33 +0100
Subject: [PATCH] docs: add ADR-007 for agentic framework selection
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Establishes the hybrid architecture decision:
- LangGraph for agent state machines (MIT, self-hostable)
- Temporal for durable workflow execution (MIT, self-hostable)
- Redis Streams for agent communication (BSD-3, self-hostable)
- LiteLLM for unified LLM access (MIT, self-hostable)

Key decision: Use production-tested open-source components rather than
reinventing the wheel, while maintaining 100% self-hostability with no
mandatory subscriptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 .../ADR-007-agentic-framework-selection.md | 320 ++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 docs/adrs/ADR-007-agentic-framework-selection.md

diff --git a/docs/adrs/ADR-007-agentic-framework-selection.md b/docs/adrs/ADR-007-agentic-framework-selection.md
new file mode 100644
index 0000000..e88a985
--- /dev/null
+++ b/docs/adrs/ADR-007-agentic-framework-selection.md
@@ -0,0 +1,320 @@
+# ADR-007: Agentic Framework Selection
+
+**Status:** Accepted
+**Date:** 2025-12-29
+**Deciders:** Architecture Team
+**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007
+
+---
+
+## Context
+
+Syndarix requires a robust multi-agent orchestration system capable of:
+- Managing 50+ concurrent agent instances
+- Supporting long-running workflows (sprints spanning days/weeks)
+- Providing durable execution that survives crashes/restarts
+- Enabling human-in-the-loop at configurable autonomy levels
+- Tracking token usage and costs per agent instance
+- Supporting multi-provider LLM failover
+
+We evaluated whether to adopt an existing framework wholesale or build a custom solution.
+
+## Decision Drivers
+
+- **Production Readiness:** Must be battle-tested, not experimental
+- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
+- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
+- **Durability:** Workflows must survive failures, restarts, and deployments
+- **Observability:** Full visibility into agent activities and costs
+- **Scalability:** Handle 50+ concurrent agents without architectural changes
+
+## Considered Options
+
+### Option 1: CrewAI (Full Framework)
+
+**Pros:**
+- Easy to get started (role-based agents)
+- Good for sequential/hierarchical workflows
+- Strong enterprise traction ($18M Series A, 60% Fortune 500)
+- LLM-agnostic design
+
+**Cons:**
+- Teams report hitting walls at 6-12 months of complexity
+- Multi-agent coordination can cause infinite loops
+- Limited ceiling for complex custom patterns
+- Flows architecture adds learning curve without solving durability
+
+**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements
+
+### Option 2: AutoGen 0.4 (Full Framework)
+
+**Pros:**
+- Event-driven, async-first architecture
+- Cross-language support (.NET, Python)
+- Built-in observability (OpenTelemetry)
+- Microsoft ecosystem integration
+
+**Cons:**
+- Tied to Microsoft patterns
+- Less flexible for custom orchestration
+- Newer 0.4 version still maturing
+- No built-in durability for week-long workflows
+
+**Verdict:** Rejected - too opinionated, insufficient durability
+
+### Option 3: LangGraph + Custom Infrastructure (Hybrid)
+
+**Pros:**
+- Fine-grained control over agent flow
+- Excellent state management with PostgreSQL persistence
+- Built-in human-in-the-loop support
+- Production-proven (Klarna, Replit, Elastic)
+- Fully open source (MIT license)
+- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
+
+**Cons:**
+- Steep learning curve (graph theory, state machines)
+- Needs additional infrastructure for durability (Temporal)
+- Observability requires additional tooling
+
+**Verdict:** Selected as foundation
+
+### Option 4: Fully Custom Solution
+
+**Pros:**
+- Complete control
+- No external dependencies
+- Tailored to exact requirements
+
+**Cons:**
+- Reinvents production-tested solutions
+- Higher development and maintenance cost
+- Longer time to market
+- More bugs in critical path
+
+**Verdict:** Rejected - unnecessary when proven components exist
+
+## Decision
+
+**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
+
+1. **LangGraph** - Agent state machines and logic
+2. **Temporal** - Durable workflow execution
+3. **Redis Streams** - Agent-to-agent communication
+4. **LiteLLM** - Unified LLM access with failover
+5. **PostgreSQL + pgvector** - State persistence and RAG
+
+### Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                      Syndarix Agentic Architecture                      │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                         │
+│  ┌───────────────────────────────────────────────────────────────────┐  │
+│  │                     Temporal Workflow Engine                      │  │
+│  │                                                                   │  │
+│  │  • Durable execution (survives crashes, restarts, deployments)    │  │
+│  │  • Human approval checkpoints (wait indefinitely for client)      │  │
+│  │  • Long-running workflows (projects spanning weeks/months)        │  │
+│  │  • Built-in retry policies and timeouts                           │  │
+│  │                                                                   │  │
+│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
+│  └───────────────────────────────────────────────────────────────────┘  │
+│                                    │                                    │
+│                                    ▼                                    │
+│  ┌───────────────────────────────────────────────────────────────────┐  │
+│  │                      LangGraph Agent Runtime                      │  │
+│  │                                                                   │  │
+│  │  • Graph-based state machines for agent logic                     │  │
+│  │  • Persistent checkpoints to PostgreSQL                           │  │
+│  │  • Cycles, conditionals, parallel execution                       │  │
+│  │  • Human-in-the-loop first-class support                          │  │
+│  │                                                                   │  │
+│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
+│  │  │                      Agent State Graph                      │  │  │
+│  │  │   [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]       │  │  │
+│  │  │      ▲             │              │              │          │  │  │
+│  │  │      └─────────────┴──────────────┴──────────────┘          │  │  │
+│  │  └─────────────────────────────────────────────────────────────┘  │  │
+│  │                                                                   │  │
+│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
+│  └───────────────────────────────────────────────────────────────────┘  │
+│                                    │                                    │
+│                                    ▼                                    │
+│  ┌───────────────────────────────────────────────────────────────────┐  │
+│  │                 Redis Streams Communication Layer                 │  │
+│  │                                                                   │  │
+│  │  • Agent-to-Agent messaging (A2A protocol concepts)               │  │
+│  │  • Event-driven architecture                                      │  │
+│  │  • Real-time activity streaming to UI                             │  │
+│  │  • Project-scoped message channels                                │  │
+│  │                                                                   │  │
+│  │  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required  │  │
+│  └───────────────────────────────────────────────────────────────────┘  │
+│                                    │                                    │
+│                                    ▼                                    │
+│  ┌───────────────────────────────────────────────────────────────────┐  │
+│  │                          LiteLLM Gateway                          │  │
+│  │                                                                   │  │
+│  │  • Unified API for 100+ LLM providers                             │  │
+│  │  • Automatic failover chains (Claude → GPT-4 → Ollama)            │  │
+│  │  • Token counting and cost calculation                            │  │
+│  │  • Rate limiting and load balancing                               │  │
+│  │                                                                   │  │
+│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
+│  └───────────────────────────────────────────────────────────────────┘  │
+│                                                                         │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+### Component Responsibilities
+
+| Component | Responsibility | Why This Choice |
+|-----------|---------------|-----------------|
+| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
+| **Temporal** | Durable workflows, human approvals, long-running orchestration | Purpose-built for week-long workflows that survive failures |
+| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
+| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
+| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
+
+### Self-Hostability Guarantee
+
+All components are fully self-hostable. Licenses are permissive open source, provided Redis is pinned to a BSD-licensed release (see table):
+
+| Component | License | Paid Cloud Alternative | Required for Syndarix? |
+|-----------|---------|----------------------|----------------------|
+| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
+| Temporal | MIT | Temporal Cloud | No - self-host server |
+| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
+| Redis | BSD-3 (releases up to 7.2; 7.4+ is RSALv2/SSPLv1, 8.0+ adds AGPLv3) | Redis Cloud | No - self-host |
+| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
+
+**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.
+
+### What We Build vs. What We Use
+
+| Concern | Approach | Rationale |
+|---------|----------|-----------|
+| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
+| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
+| Durability | **USE Temporal** | Don't reinvent durable execution |
+| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
+| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
+| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
+| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |
+
+### Integration Pattern
+
+```python
+# Example: How the layers integrate
+# Illustrative only: SprintConfig, SprintResult, AutonomyLevel, AgentState,
+# AgentMessage, message_bus, execute_node, and the activity functions are
+# Syndarix-specific and defined elsewhere.
+from datetime import timedelta
+
+import litellm
+from langgraph.checkpoint.postgres import PostgresSaver
+from langgraph.graph import StateGraph
+from temporalio import workflow
+
+# 1. Temporal orchestrates the high-level workflow
+@workflow.defn
+class SprintWorkflow:
+    def __init__(self) -> None:
+        self._approved = False
+
+    @workflow.signal
+    def approve(self) -> None:
+        # Sent by the approval UI/API once the client signs off
+        self._approved = True
+
+    @workflow.run
+    async def run(self, sprint: SprintConfig) -> SprintResult:
+        # Spawns agents and waits for completion
+        # (Temporal requires an activity timeout; 10 minutes is an illustrative value)
+        agents = await workflow.execute_activity(
+            spawn_agent_team,
+            sprint,
+            start_to_close_timeout=timedelta(minutes=10),
+        )
+
+        # Each agent runs a LangGraph state machine
+        results = await workflow.execute_activity(
+            run_agent_tasks,
+            agents,
+            start_to_close_timeout=timedelta(days=7),
+        )
+
+        # Human checkpoint (waits indefinitely until the approve signal arrives)
+        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
+            await workflow.wait_condition(lambda: self._approved)
+
+        return results
+
+# 2. LangGraph handles individual agent logic
+def create_agent_graph() -> StateGraph:
+    graph = StateGraph(AgentState)
+    graph.add_node("think", think_node)      # LLM reasoning
+    graph.add_node("execute", execute_node)  # Tool calls via MCP
+    graph.add_node("handoff", handoff_node)  # Message to other agent
+    # ... state transitions
+    return graph.compile(checkpointer=PostgresSaver(...))
+
+# 3. LiteLLM handles LLM calls with failover
+async def think_node(state: AgentState) -> AgentState:
+    response = await litellm.acompletion(
+        model="claude-sonnet-4-20250514",
+        messages=state["messages"],
+        fallbacks=["gpt-4-turbo", "ollama/llama3"],
+        metadata={"agent_id": state["agent_id"]},
+    )
+    return {"messages": [response.choices[0].message]}
+
+# 4. Redis Streams handles agent communication
+async def handoff_node(state: AgentState) -> AgentState:
+    await message_bus.publish(AgentMessage(
+        source_agent_id=state["agent_id"],
+        target_agent_id=state["handoff_target"],
+        message_type="TASK_HANDOFF",
+        payload=state["handoff_context"],
+    ))
+    return state
+```
+
+## Consequences
+
+### Positive
+
+- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
+- **No subscription lock-in** - All components self-hostable under permissive licenses
+- **Right tool for each job** - Specialized components for durability, state, communication
+- **Escape hatches** - Can replace any component without full rewrite
+- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
+
+### Negative
+
+- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
+- **Operational complexity** - More services to deploy and monitor
+- **Integration work** - Thin glue layers needed between components
+
+### Mitigation
+
+- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
+- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
+- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
+
+## Compliance
+
+This decision aligns with:
+- **FR-101-105**: Agent orchestration requirements
+- **FR-301-305**: Workflow execution requirements
+- **NFR-501**: Self-hosting requirement (all components under MIT, BSD, or PostgreSQL licenses at the versions listed above)
+- **TC-001**: PostgreSQL as primary database
+- **TC-002**: Redis for caching and messaging
+
+## Alternatives Not Chosen
+
+### LangSmith for Observability
+
+LangSmith is LangChain's paid observability platform. Instead, we will:
+- Use **LangFuse** (open source, self-hostable) for LLM observability
+- Use **Temporal UI** (built-in) for workflow visibility
+- Build custom dashboards for Syndarix-specific metrics
+
+### Temporal Cloud
+
+Temporal offers a managed cloud service. Instead, we will:
+- Self-host Temporal server (single node to start, cluster for scale)
+- Use PostgreSQL as Temporal's persistence backend (already in stack)
+
+## References
+
+- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
+- [Temporal.io Documentation](https://docs.temporal.io/)
+- [LiteLLM Documentation](https://docs.litellm.ai/)
+- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
+- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
+- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
+
+---
+
+*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*