forked from cardosofelipe/fast-next-template
docs: add ADR-007 for agentic framework selection
Establishes the hybrid architecture decision: - LangGraph for agent state machines (MIT, self-hostable) - Temporal for durable workflow execution (MIT, self-hostable) - Redis Streams for agent communication (BSD-3, self-hostable) - LiteLLM for unified LLM access (MIT, self-hostable) Key decision: Use production-tested open-source components rather than reinventing the wheel, while maintaining 100% self-hostability with no mandatory subscriptions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
320
docs/adrs/ADR-007-agentic-framework-selection.md
Normal file
320
docs/adrs/ADR-007-agentic-framework-selection.md
Normal file
@@ -0,0 +1,320 @@
|
|||||||
|
# ADR-007: Agentic Framework Selection
|
||||||
|
|
||||||
|
**Status:** Accepted
|
||||||
|
**Date:** 2025-12-29
|
||||||
|
**Deciders:** Architecture Team
|
||||||
|
**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Syndarix requires a robust multi-agent orchestration system capable of:
|
||||||
|
- Managing 50+ concurrent agent instances
|
||||||
|
- Supporting long-running workflows (sprints spanning days/weeks)
|
||||||
|
- Providing durable execution that survives crashes/restarts
|
||||||
|
- Enabling human-in-the-loop at configurable autonomy levels
|
||||||
|
- Tracking token usage and costs per agent instance
|
||||||
|
- Supporting multi-provider LLM failover
|
||||||
|
|
||||||
|
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
|
||||||
|
|
||||||
|
## Decision Drivers
|
||||||
|
|
||||||
|
- **Production Readiness:** Must be battle-tested, not experimental
|
||||||
|
- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
|
||||||
|
- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
|
||||||
|
- **Durability:** Workflows must survive failures, restarts, and deployments
|
||||||
|
- **Observability:** Full visibility into agent activities and costs
|
||||||
|
- **Scalability:** Handle 50+ concurrent agents without architectural changes
|
||||||
|
|
||||||
|
## Considered Options
|
||||||
|
|
||||||
|
### Option 1: CrewAI (Full Framework)
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Easy to get started (role-based agents)
|
||||||
|
- Good for sequential/hierarchical workflows
|
||||||
|
- Strong enterprise traction ($18M Series A, 60% Fortune 500)
|
||||||
|
- LLM-agnostic design
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Teams report hitting walls at 6-12 months of complexity
|
||||||
|
- Multi-agent coordination can cause infinite loops
|
||||||
|
- Limited ceiling for complex custom patterns
|
||||||
|
- Flows architecture adds learning curve without solving durability
|
||||||
|
|
||||||
|
**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements
|
||||||
|
|
||||||
|
### Option 2: AutoGen 0.4 (Full Framework)
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Event-driven, async-first architecture
|
||||||
|
- Cross-language support (.NET, Python)
|
||||||
|
- Built-in observability (OpenTelemetry)
|
||||||
|
- Microsoft ecosystem integration
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Tied to Microsoft patterns
|
||||||
|
- Less flexible for custom orchestration
|
||||||
|
- Newer 0.4 version still maturing
|
||||||
|
- No built-in durability for week-long workflows
|
||||||
|
|
||||||
|
**Verdict:** Rejected - too opinionated, insufficient durability
|
||||||
|
|
||||||
|
### Option 3: LangGraph + Custom Infrastructure (Hybrid)
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Fine-grained control over agent flow
|
||||||
|
- Excellent state management with PostgreSQL persistence
|
||||||
|
- Human-in-the-loop built-in
|
||||||
|
- Production-proven (Klarna, Replit, Elastic)
|
||||||
|
- Fully open source (MIT license)
|
||||||
|
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Steep learning curve (graph theory, state machines)
|
||||||
|
- Needs additional infrastructure for durability (Temporal)
|
||||||
|
- Observability requires additional tooling
|
||||||
|
|
||||||
|
**Verdict:** Selected as foundation
|
||||||
|
|
||||||
|
### Option 4: Fully Custom Solution
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Complete control
|
||||||
|
- No external dependencies
|
||||||
|
- Tailored to exact requirements
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Reinvents production-tested solutions
|
||||||
|
- Higher development and maintenance cost
|
||||||
|
- Longer time to market
|
||||||
|
- More bugs in critical path
|
||||||
|
|
||||||
|
**Verdict:** Rejected - unnecessary when proven components exist
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
|
||||||
|
|
||||||
|
1. **LangGraph** - Agent state machines and logic
|
||||||
|
2. **Temporal** - Durable workflow execution
|
||||||
|
3. **Redis Streams** - Agent-to-agent communication
|
||||||
|
4. **LiteLLM** - Unified LLM access with failover
|
||||||
|
5. **PostgreSQL + pgvector** - State persistence and RAG
|
||||||
|
|
||||||
|
### Architecture Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────────────┐
|
||||||
|
│ Syndarix Agentic Architecture │
|
||||||
|
├─────────────────────────────────────────────────────────────────────────┤
|
||||||
|
│ │
|
||||||
|
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Temporal Workflow Engine │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ • Durable execution (survives crashes, restarts, deployments) │ │
|
||||||
|
│ │ • Human approval checkpoints (wait indefinitely for client) │ │
|
||||||
|
│ │ • Long-running workflows (projects spanning weeks/months) │ │
|
||||||
|
│ │ • Built-in retry policies and timeouts │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
|
||||||
|
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ LangGraph Agent Runtime │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ • Graph-based state machines for agent logic │ │
|
||||||
|
│ │ • Persistent checkpoints to PostgreSQL │ │
|
||||||
|
│ │ • Cycles, conditionals, parallel execution │ │
|
||||||
|
│ │ • Human-in-the-loop first-class support │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
|
||||||
|
│ │ │ Agent State Graph │ │ │
|
||||||
|
│ │ │ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING] │ │ │
|
||||||
|
│ │ │ ▲ │ │ │ │ │ │
|
||||||
|
│ │ │ └─────────────┴──────────────┴──────────────┘ │ │ │
|
||||||
|
│ │ └─────────────────────────────────────────────────────────────┘ │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
|
||||||
|
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Redis Streams Communication Layer │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ • Agent-to-Agent messaging (A2A protocol concepts) │ │
|
||||||
|
│ │ • Event-driven architecture │ │
|
||||||
|
│ │ • Real-time activity streaming to UI │ │
|
||||||
|
│ │ • Project-scoped message channels │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │ │
|
||||||
|
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ LiteLLM Gateway │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ • Unified API for 100+ LLM providers │ │
|
||||||
|
│ │ • Automatic failover chains (Claude → GPT-4 → Ollama) │ │
|
||||||
|
│ │ • Token counting and cost calculation │ │
|
||||||
|
│ │ • Rate limiting and load balancing │ │
|
||||||
|
│ │ │ │
|
||||||
|
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
|
||||||
|
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||||
|
│ │
|
||||||
|
└─────────────────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### Component Responsibilities
|
||||||
|
|
||||||
|
| Component | Responsibility | Why This Choice |
|
||||||
|
|-----------|---------------|-----------------|
|
||||||
|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
|
||||||
|
| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
|
||||||
|
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
|
||||||
|
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
|
||||||
|
| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
|
||||||
|
|
||||||
|
### Self-Hostability Guarantee
|
||||||
|
|
||||||
|
All components are fully self-hostable with permissive open-source licenses:
|
||||||
|
|
||||||
|
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|
||||||
|
|-----------|---------|----------------------|----------------------|
|
||||||
|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
|
||||||
|
| Temporal | MIT | Temporal Cloud | No - self-host server |
|
||||||
|
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
|
||||||
|
| Redis | BSD-3 | Redis Cloud | No - self-host |
|
||||||
|
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
|
||||||
|
|
||||||
|
**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.
|
||||||
|
|
||||||
|
### What We Build vs. What We Use
|
||||||
|
|
||||||
|
| Concern | Approach | Rationale |
|
||||||
|
|---------|----------|-----------|
|
||||||
|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
|
||||||
|
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
|
||||||
|
| Durability | **USE Temporal** | Don't reinvent durable execution |
|
||||||
|
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
|
||||||
|
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
|
||||||
|
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
|
||||||
|
| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |
|
||||||
|
|
||||||
|
### Integration Pattern
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Example: How the layers integrate
|
||||||
|
|
||||||
|
# 1. Temporal orchestrates the high-level workflow
|
||||||
|
@workflow.defn
|
||||||
|
class SprintWorkflow:
|
||||||
|
@workflow.run
|
||||||
|
async def run(self, sprint: SprintConfig) -> SprintResult:
|
||||||
|
# Spawns agents and waits for completion
|
||||||
|
agents = await workflow.execute_activity(spawn_agent_team, sprint)
|
||||||
|
|
||||||
|
# Each agent runs a LangGraph state machine
|
||||||
|
results = await workflow.execute_activity(
|
||||||
|
run_agent_tasks,
|
||||||
|
agents,
|
||||||
|
start_to_close_timeout=timedelta(days=7),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Human checkpoint (waits indefinitely)
|
||||||
|
if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
|
||||||
|
await workflow.wait_condition(lambda: self._approved)
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
# 2. LangGraph handles individual agent logic
|
||||||
|
def create_agent_graph() -> StateGraph:
|
||||||
|
graph = StateGraph(AgentState)
|
||||||
|
graph.add_node("think", think_node) # LLM reasoning
|
||||||
|
graph.add_node("execute", execute_node) # Tool calls via MCP
|
||||||
|
graph.add_node("handoff", handoff_node) # Message to other agent
|
||||||
|
# ... state transitions
|
||||||
|
return graph.compile(checkpointer=PostgresSaver(...))
|
||||||
|
|
||||||
|
# 3. LiteLLM handles LLM calls with failover
|
||||||
|
async def think_node(state: AgentState) -> AgentState:
|
||||||
|
response = await litellm.acompletion(
|
||||||
|
model="claude-sonnet-4-20250514",
|
||||||
|
messages=state["messages"],
|
||||||
|
fallbacks=["gpt-4-turbo", "ollama/llama3"],
|
||||||
|
metadata={"agent_id": state["agent_id"]},
|
||||||
|
)
|
||||||
|
return {"messages": [response.choices[0].message]}
|
||||||
|
|
||||||
|
# 4. Redis Streams handles agent communication
|
||||||
|
async def handoff_node(state: AgentState) -> AgentState:
|
||||||
|
await message_bus.publish(AgentMessage(
|
||||||
|
source_agent_id=state["agent_id"],
|
||||||
|
target_agent_id=state["handoff_target"],
|
||||||
|
message_type="TASK_HANDOFF",
|
||||||
|
payload=state["handoff_context"],
|
||||||
|
))
|
||||||
|
return state
|
||||||
|
```
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
|
||||||
|
- **No subscription lock-in** - All components self-hostable under permissive licenses
|
||||||
|
- **Right tool for each job** - Specialized components for durability, state, communication
|
||||||
|
- **Escape hatches** - Can replace any component without full rewrite
|
||||||
|
- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
|
||||||
|
- **Operational complexity** - More services to deploy and monitor
|
||||||
|
- **Integration work** - Thin glue layers needed between components
|
||||||
|
|
||||||
|
### Mitigation
|
||||||
|
|
||||||
|
- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
|
||||||
|
- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
|
||||||
|
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
|
||||||
|
|
||||||
|
## Compliance
|
||||||
|
|
||||||
|
This decision aligns with:
|
||||||
|
- **FR-101-105**: Agent orchestration requirements
|
||||||
|
- **FR-301-305**: Workflow execution requirements
|
||||||
|
- **NFR-501**: Self-hosting requirement (all components MIT/BSD licensed)
|
||||||
|
- **TC-001**: PostgreSQL as primary database
|
||||||
|
- **TC-002**: Redis for caching and messaging
|
||||||
|
|
||||||
|
## Alternatives Not Chosen
|
||||||
|
|
||||||
|
### LangSmith for Observability
|
||||||
|
|
||||||
|
LangSmith is LangChain's paid observability platform. Instead, we will:
|
||||||
|
- Use **LangFuse** (open source, self-hostable) for LLM observability
|
||||||
|
- Use **Temporal UI** (built-in) for workflow visibility
|
||||||
|
- Build custom dashboards for Syndarix-specific metrics
|
||||||
|
|
||||||
|
### Temporal Cloud
|
||||||
|
|
||||||
|
Temporal offers a managed cloud service. Instead, we will:
|
||||||
|
- Self-host Temporal server (single-node for start, cluster for scale)
|
||||||
|
- Use PostgreSQL as Temporal's persistence backend (already in stack)
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
|
||||||
|
- [Temporal.io Documentation](https://docs.temporal.io/)
|
||||||
|
- [LiteLLM Documentation](https://docs.litellm.ai/)
|
||||||
|
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
|
||||||
|
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
|
||||||
|
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*
|
||||||
Reference in New Issue
Block a user