docs: add ADR-007 for agentic framework selection

Establishes the hybrid architecture decision:
- LangGraph for agent state machines (MIT, self-hostable)
- Temporal for durable workflow execution (MIT, self-hostable)
- Redis Streams for agent communication (BSD-3, self-hostable)
- LiteLLM for unified LLM access (MIT, self-hostable)

Key decision: Use production-tested open-source components rather than
reinventing the wheel, while maintaining 100% self-hostability with
no mandatory subscriptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# ADR-007: Agentic Framework Selection
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007
---
## Context
Syndarix requires a robust multi-agent orchestration system capable of:
- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
## Decision Drivers
- **Production Readiness:** Must be battle-tested, not experimental
- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
- **Durability:** Workflows must survive failures, restarts, and deployments
- **Observability:** Full visibility into agent activities and costs
- **Scalability:** Handle 50+ concurrent agents without architectural changes
## Considered Options
### Option 1: CrewAI (Full Framework)
**Pros:**
- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A, 60% Fortune 500)
- LLM-agnostic design
**Cons:**
- Teams report hitting complexity walls after 6-12 months of real-world use
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- Flows architecture adds learning curve without solving durability
**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements
### Option 2: AutoGen 0.4 (Full Framework)
**Pros:**
- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration
**Cons:**
- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows
**Verdict:** Rejected - too opinionated, insufficient durability
### Option 3: LangGraph + Custom Infrastructure (Hybrid)
**Pros:**
- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
**Cons:**
- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling
**Verdict:** Selected as foundation
### Option 4: Fully Custom Solution
**Pros:**
- Complete control
- No external dependencies
- Tailored to exact requirements
**Cons:**
- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path
**Verdict:** Rejected - unnecessary when proven components exist
## Decision
**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
1. **LangGraph** - Agent state machines and logic
2. **Temporal** - Durable workflow execution
3. **Redis Streams** - Agent-to-agent communication
4. **LiteLLM** - Unified LLM access with failover
5. **PostgreSQL + pgvector** - State persistence and RAG
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│ Syndarix Agentic Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Temporal Workflow Engine │ │
│ │ │ │
│ │ • Durable execution (survives crashes, restarts, deployments) │ │
│ │ • Human approval checkpoints (wait indefinitely for client) │ │
│ │ • Long-running workflows (projects spanning weeks/months) │ │
│ │ • Built-in retry policies and timeouts │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LangGraph Agent Runtime │ │
│ │ │ │
│ │ • Graph-based state machines for agent logic │ │
│ │ • Persistent checkpoints to PostgreSQL │ │
│ │ • Cycles, conditionals, parallel execution │ │
│ │ • Human-in-the-loop first-class support │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Agent State Graph │ │ │
│ │ │ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING] │ │ │
│ │ │ ▲ │ │ │ │ │ │
│ │ │ └─────────────┴──────────────┴──────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Redis Streams Communication Layer │ │
│ │ │ │
│ │ • Agent-to-Agent messaging (A2A protocol concepts) │ │
│ │ • Event-driven architecture │ │
│ │ • Real-time activity streaming to UI │ │
│ │ • Project-scoped message channels │ │
│ │ │ │
│ │ License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Gateway │ │
│ │ │ │
│ │ • Unified API for 100+ LLM providers │ │
│ │ • Automatic failover chains (Claude → GPT-4 → Ollama) │ │
│ │ • Token counting and cost calculation │ │
│ │ • Rate limiting and load balancing │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
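The agent state graph in the diagram can be sketched as a plain-Python transition table (illustrative only; this is not the LangGraph API, and the name `AgentPhase` is chosen here to avoid clashing with the `AgentState` graph state used elsewhere):

```python
from enum import Enum

class AgentPhase(Enum):
    IDLE = "idle"
    THINKING = "thinking"
    EXECUTING = "executing"
    WAITING = "waiting"

# Forward edges plus the back-edges to IDLE shown in the diagram
TRANSITIONS = {
    AgentPhase.IDLE: {AgentPhase.THINKING},
    AgentPhase.THINKING: {AgentPhase.EXECUTING, AgentPhase.IDLE},
    AgentPhase.EXECUTING: {AgentPhase.WAITING, AgentPhase.IDLE},
    AgentPhase.WAITING: {AgentPhase.IDLE},
}

def transition(current: AgentPhase, target: AgentPhase) -> AgentPhase:
    """Move to `target` if the diagram has that edge; raise otherwise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

In the real system LangGraph enforces these edges via graph topology; the table form just makes the allowed moves explicit.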
### Component Responsibilities
| Component | Responsibility | Why This Choice |
|-----------|---------------|-----------------|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
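As a sketch of the messaging responsibility: Redis Streams (`XADD`) takes a flat string-to-string field map, so an agent message has to be flattened before publishing. The schema and channel naming below are assumptions, not a fixed Syndarix wire format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentMessage:
    source_agent_id: str
    target_agent_id: str
    message_type: str
    payload: dict

def to_stream_fields(msg: AgentMessage) -> dict[str, str]:
    """Flatten a message into the str->str field map XADD expects."""
    fields = asdict(msg)
    fields["payload"] = json.dumps(fields["payload"])  # nested data as JSON
    return fields

def channel_for(project_id: str) -> str:
    """Project-scoped message channel (naming is illustrative)."""
    return f"project:{project_id}:agent-messages"
```

A consumer group per agent type can then read from the project channel, giving at-least-once delivery without extra infrastructure.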
### Self-Hostability Guarantee
All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|----------------------|----------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| Temporal | MIT | Temporal Cloud | No - self-host server |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.
### What We Build vs. What We Use
| Concern | Approach | Rationale |
|---------|----------|-----------|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
| Durability | **USE Temporal** | Don't reinvent durable execution |
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |
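The thin orchestration layer's core job is deciding when an autonomy level requires a human approval gate. A minimal sketch, assuming three hypothetical levels (Syndarix's actual levels and names may differ):

```python
from enum import Enum

class AutonomyLevel(Enum):
    SUPERVISED = "supervised"      # every action needs client approval
    CHECKPOINTED = "checkpointed"  # approval at sprint boundaries only
    AUTONOMOUS = "autonomous"      # no approval gates

def requires_approval(level: AutonomyLevel, at_checkpoint: bool) -> bool:
    """Decide whether an approval gate blocks progress at this point."""
    if level is AutonomyLevel.AUTONOMOUS:
        return False
    if level is AutonomyLevel.CHECKPOINTED:
        return at_checkpoint
    return True  # SUPERVISED: always gate
```

When this returns `True`, the Temporal workflow parks on `wait_condition` until the client signals approval, which is exactly the durability case Temporal exists to handle.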
### Integration Pattern
```python
# Example: How the layers integrate
from datetime import timedelta

import litellm
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from temporalio import workflow

# 1. Temporal orchestrates the high-level workflow
@workflow.defn
class SprintWorkflow:
    def __init__(self) -> None:
        self._approved = False

    @workflow.signal
    def approve(self) -> None:
        # Sent by the API layer when the client approves the results
        self._approved = True

    @workflow.run
    async def run(self, sprint: SprintConfig) -> SprintResult:
        # Spawns agents and waits for completion
        agents = await workflow.execute_activity(
            spawn_agent_team,
            sprint,
            start_to_close_timeout=timedelta(minutes=5),
        )
        # Each agent runs a LangGraph state machine
        results = await workflow.execute_activity(
            run_agent_tasks,
            agents,
            start_to_close_timeout=timedelta(days=7),
        )
        # Human checkpoint (waits indefinitely for the approve signal)
        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
            await workflow.wait_condition(lambda: self._approved)
        return results

# 2. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))

# 3. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=state["messages"],
        fallbacks=["gpt-4-turbo", "ollama/llama3"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 4. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```
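The cost-attribution layer from the "What We Build" table might aggregate the token counts LiteLLM reports like this. The per-token prices and class shape below are placeholder assumptions for illustration, not real provider rates:

```python
from collections import defaultdict

# Placeholder (input, output) USD prices per 1K tokens -- NOT real rates
PRICES = {
    "claude-sonnet-4-20250514": (0.003, 0.015),
    "gpt-4-turbo": (0.010, 0.030),
}

class CostTracker:
    """Accumulate USD cost per (agent_id, project_id) pair."""

    def __init__(self) -> None:
        self._totals: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, agent_id: str, project_id: str,
               model: str, prompt_tokens: int, completion_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        cost = (prompt_tokens / 1000) * in_price + (completion_tokens / 1000) * out_price
        self._totals[(agent_id, project_id)] += cost
        return cost

    def total(self, agent_id: str, project_id: str) -> float:
        return self._totals[(agent_id, project_id)]
```

In practice the token counts come from the `usage` field of each LiteLLM response, and the totals land in PostgreSQL for per-project reporting.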
## Consequences
### Positive
- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
- **No subscription lock-in** - All components self-hostable under permissive licenses
- **Right tool for each job** - Specialized components for durability, state, communication
- **Escape hatches** - Can replace any component without full rewrite
- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
### Negative
- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
- **Operational complexity** - More services to deploy and monitor
- **Integration work** - Thin glue layers needed between components
### Mitigation
- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
## Compliance
This decision aligns with:
- **FR-101-105**: Agent orchestration requirements
- **FR-301-305**: Workflow execution requirements
- **NFR-501**: Self-hosting requirement (all components under permissive MIT/BSD/PostgreSQL licenses)
- **TC-001**: PostgreSQL as primary database
- **TC-002**: Redis for caching and messaging
## Alternatives Not Chosen
### LangSmith for Observability
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use **LangFuse** (open source, self-hostable) for LLM observability
- Use **Temporal UI** (built-in) for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
### Temporal Cloud
Temporal offers a managed cloud service. Instead, we will:
- Self-host Temporal server (single-node for start, cluster for scale)
- Use PostgreSQL as Temporal's persistence backend (already in stack)
## References
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [Temporal.io Documentation](https://docs.temporal.io/)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
---
*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*