
413 lines
20 KiB
Markdown

# ADR-007: Agentic Framework Selection
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007
---
## Context
Syndarix requires a robust multi-agent orchestration system capable of:
- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
## Decision Drivers
- **Production Readiness:** Must be battle-tested, not experimental
- **Self-Hostability:** All components must be self-hostable with no mandatory subscriptions
- **Flexibility:** Must support Syndarix-specific patterns (autonomy levels, client approvals)
- **Durability:** Workflows must survive failures, restarts, and deployments
- **Observability:** Full visibility into agent activities and costs
- **Scalability:** Handle 50+ concurrent agents without architectural changes
## Considered Options
### Option 1: CrewAI (Full Framework)
**Pros:**
- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A, 60% Fortune 500)
- LLM-agnostic design
**Cons:**
- Teams report hitting a complexity ceiling after 6-12 months
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- Flows architecture adds learning curve without solving durability
**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements
### Option 2: AutoGen 0.4 (Full Framework)
**Pros:**
- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration
**Cons:**
- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows
**Verdict:** Rejected - too opinionated, insufficient durability
### Option 3: LangGraph + Custom Infrastructure (Hybrid)
**Pros:**
- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
**Cons:**
- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling
**Verdict:** Selected as foundation
### Option 4: Fully Custom Solution
**Pros:**
- Complete control
- No external dependencies
- Tailored to exact requirements
**Cons:**
- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path
**Verdict:** Rejected - unnecessary when proven components exist
## Decision
**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
1. **LangGraph** - Agent state machines and logic
2. **transitions + PostgreSQL + Celery** - Durable workflow state machines
3. **Redis Streams** - Agent-to-agent communication
4. **LiteLLM** - Unified LLM access with failover
5. **PostgreSQL + pgvector** - State persistence and RAG
### Why Not Temporal?
After evaluating both approaches, we chose the simpler **transitions + PostgreSQL + Celery** stack over Temporal:
| Factor | Temporal | transitions + PostgreSQL |
|--------|----------|-------------------------|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
**Temporal is overkill for our scale** (10-50 concurrent agents). The simpler approach provides:
- Full durability via PostgreSQL state persistence
- Event sourcing via transition history table
- Background execution via Celery workers
- Simpler debugging with standard tools
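As a concrete illustration of the event-sourcing claim above, the sketch below replays a transition history to recover a workflow's current state. The `TransitionLog` class and its in-memory list are hypothetical stand-ins for the `workflow_transitions` table; in production, `record` would be an `INSERT` and `replay` a `SELECT ... ORDER BY created_at`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TransitionEvent:
    """One row of the (hypothetical) workflow_transitions event table."""
    workflow_id: str
    source: str
    dest: str
    trigger: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class TransitionLog:
    """In-memory stand-in for the PostgreSQL transition history."""
    def __init__(self):
        self._events: list[TransitionEvent] = []

    def record(self, workflow_id: str, source: str, dest: str, trigger: str):
        self._events.append(TransitionEvent(workflow_id, source, dest, trigger))

    def replay(self, workflow_id: str, initial: str = "planning") -> str:
        """Rebuild current state by replaying the event history in order."""
        events = [e for e in self._events if e.workflow_id == workflow_id]
        return events[-1].dest if events else initial

log = TransitionLog()
log.record("sprint-1", "planning", "active", "start")
log.record("sprint-1", "active", "review", "complete_work")
print(log.replay("sprint-1"))  # review
```

Because state is derived from the append-only log, a crash between transitions loses nothing: the next replay lands on the last recorded `dest`.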
### Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│ Syndarix Agentic Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Workflow Engine (transitions + PostgreSQL) │ │
│ │ │ │
│ │ • State persistence to PostgreSQL (survives restarts) │ │
│ │ • Event sourcing via workflow_transitions table │ │
│ │ • Human approval checkpoints (pause workflow, await signal) │ │
│ │ • Background execution via Celery workers │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LangGraph Agent Runtime │ │
│ │ │ │
│ │ • Graph-based state machines for agent logic │ │
│ │ • Persistent checkpoints to PostgreSQL │ │
│ │ • Cycles, conditionals, parallel execution │ │
│ │ • Human-in-the-loop first-class support │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Agent State Graph │ │ │
│ │ │ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING] │ │ │
│ │ │ ▲ │ │ │ │ │ │
│ │ │ └─────────────┴──────────────┴──────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Redis Streams Communication Layer │ │
│ │ │ │
│ │ • Agent-to-Agent messaging (A2A protocol concepts) │ │
│ │ • Event-driven architecture │ │
│ │ • Real-time activity streaming to UI │ │
│ │ • Project-scoped message channels │ │
│ │ │ │
│ │ License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Gateway │ │
│ │ │ │
│ │ • Unified API for 100+ LLM providers │ │
│ │ • Automatic failover chains (Opus → Codex → DeepSeek) │ │
│ │ • Token counting and cost calculation │ │
│ │ • Rate limiting and load balancing │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | Why This Choice |
|-----------|---------------|-----------------|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| **transitions** | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| **Celery + Redis** | Background task execution, async workflows | Already in stack, battle-tested |
| **PostgreSQL** | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
### Reboot Survival (Durability)
The architecture **fully supports system reboots and crashes**:
1. **Workflow State**: Persisted to PostgreSQL `workflow_instances` table
2. **Transition History**: Event-sourced in `workflow_transitions` table
3. **Agent Checkpoints**: LangGraph persists to PostgreSQL
4. **Pending Tasks**: Celery tasks in Redis (configured with persistence)
**Recovery Process:**
```
System Restart
Load workflow_instances WHERE status = 'in_progress'
For each workflow:
├── Restore state from context JSONB
├── Identify current_state
├── Resume from last checkpoint
└── Continue execution
```
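The recovery process above can be sketched as a small scan-and-resume routine. The dict-based `instances` list stands in for rows of `workflow_instances`, and `resume` is a hypothetical callback that would re-enqueue the workflow on Celery.

```python
def recover_in_progress(instances, resume):
    """Scan persisted workflows and resume each one still in progress.

    instances: iterable of dicts mirroring workflow_instances rows
    resume:    callback(workflow_id, current_state, context) that
               re-enqueues the workflow (e.g. a Celery task) -- hypothetical
    """
    resumed = []
    for wf in instances:
        if wf["status"] != "in_progress":
            continue  # done/failed workflows need no recovery
        resume(wf["id"], wf["current_state"], wf["context"])
        resumed.append(wf["id"])
    return resumed

instances = [
    {"id": "sprint-1", "status": "in_progress", "current_state": "active", "context": {}},
    {"id": "sprint-2", "status": "done", "current_state": "done", "context": {}},
]
resumed = recover_in_progress(instances, lambda wid, state, ctx: None)
print(resumed)  # ['sprint-1']
```

Running this once in a startup task (before Celery workers accept new work) gives the "resume from last checkpoint" behavior with no dedicated workflow cluster.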
### Self-Hostability Guarantee
All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|----------------------|----------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
**No mandatory subscriptions.** All paid alternatives are optional cloud-managed offerings.
### What We Build vs. What We Use
| Concern | Approach | Rationale |
|---------|----------|-----------|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
| Workflow State | **USE transitions + PostgreSQL** | Simple, durable, debuggable |
| Background Tasks | **USE Celery** | Already in stack, proven |
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
| Cost Attribution | **BUILD thin layer** | Per-agent, per-project tracking specific to Syndarix |
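A minimal sketch of the cost-attribution thin layer, assuming usage is keyed by `(agent_id, project_id)`. The `CostTracker` class and the per-1K-token prices below are illustrative placeholders, not the actual rates from the ADR-012 cost tables.

```python
from collections import defaultdict

# Placeholder per-1K-token prices; real values live in ADR-012.
PRICE_PER_1K = {"claude-opus-4-5": 0.015, "qwen3-235b": 0.0009}

class CostTracker:
    """Thin per-agent, per-project cost attribution layer (sketch)."""
    def __init__(self):
        # (agent_id, project_id) -> accumulated tokens and cost
        self.usage = defaultdict(lambda: {"tokens": 0, "cost": 0.0})

    def record(self, agent_id: str, project_id: str, model: str, tokens: int):
        entry = self.usage[(agent_id, project_id)]
        entry["tokens"] += tokens
        entry["cost"] += tokens / 1000 * PRICE_PER_1K[model]

    def project_cost(self, project_id: str) -> float:
        """Total cost across all agents working on one project."""
        return sum(v["cost"] for (a, p), v in self.usage.items() if p == project_id)

tracker = CostTracker()
tracker.record("agent-1", "proj-A", "claude-opus-4-5", 2000)
tracker.record("agent-2", "proj-A", "qwen3-235b", 10000)
print(round(tracker.project_cost("proj-A"), 4))  # 0.039
```

In production the `record` call would be driven by LiteLLM's per-request token counts (via the `metadata` field shown in the integration example), with totals persisted to PostgreSQL rather than held in memory.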
### Integration Pattern
```python
# Example: How the layers integrate
# (db, celery_app, message_bus, AgentMessage, AgentState are provided elsewhere)
import asyncio
import litellm
from transitions.extensions.asyncio import AsyncMachine
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# 1. Workflow state machine (transitions library; AsyncMachine because
#    the persistence callback is async)
class SprintWorkflow(AsyncMachine):
    states = ['planning', 'active', 'review', 'done']

    def __init__(self, sprint_id: str):
        self.sprint_id = sprint_id
        super().__init__(
            states=self.states,
            initial='planning',
            after_state_change='persist_state',
        )
        self.add_transition('start', 'planning', 'active', before='spawn_agents')
        self.add_transition('complete_work', 'active', 'review')
        self.add_transition('approve', 'review', 'done', conditions='has_approval')

    async def persist_state(self):
        """Save state to PostgreSQL (survives restarts)"""
        await db.execute("""
            UPDATE workflow_instances
            SET current_state = $1, context = $2, updated_at = NOW()
            WHERE id = $3
        """, self.state, self.context, self.sprint_id)

# 2. Background execution via Celery
@celery_app.task(bind=True, max_retries=3)
def run_sprint_workflow(self, sprint_id: str):
    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
    asyncio.run(workflow.start())  # Triggers agent spawning (async trigger)
    # Workflow persists state, can resume after restart

# 3. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))

# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-opus-4-5",  # Claude Opus 4.5 (primary)
        messages=state["messages"],
        fallbacks=["gpt-5.1-codex-max", "gemini-3-pro", "qwen3-235b", "deepseek-v3.2"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```
### Human Approval Checkpoints
For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):
```python
# AsyncMachine (transitions.extensions.asyncio) because callbacks are async
class StoryWorkflow(AsyncMachine):
    async def request_approval_and_wait(self, action: str):
        """Pause workflow and await human decision."""
        # 1. Create approval request
        request = await approval_service.create(
            workflow_id=self.id,
            action=action,
            context=self.context,
        )

        # 2. Transition to waiting state (persisted)
        self.state = 'awaiting_approval'
        await self.persist_state()

        # 3. Workflow is paused - Celery task completes.
        #    When the user approves, a new task resumes the workflow.

    @classmethod
    async def resume_on_approval(cls, workflow_id: str, approved: bool):
        """Called when the user makes a decision."""
        workflow = await cls.load(workflow_id)
        if approved:
            await workflow.trigger('approved')
        else:
            await workflow.trigger('rejected')
```
## Consequences
### Positive
- **Production-tested foundations** - LangGraph, Celery, LiteLLM are battle-tested
- **No subscription lock-in** - All components self-hostable under permissive licenses
- **Right tool for each job** - Specialized components for state, communication, background processing
- **Escape hatches** - Can replace any component without full rewrite
- **Simpler operations** - Uses existing PostgreSQL + Redis infrastructure, no new services
- **Reboot survival** - Full durability via PostgreSQL persistence
### Negative
- **Multiple technologies to learn** - Team needs LangGraph, transitions, Redis Streams knowledge
- **Integration work** - Thin glue layers needed between components
- **Manual recovery logic** - Must implement workflow recovery on startup
### Mitigation
- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
- **Recovery** - Implement startup recovery task that scans for in-progress workflows
## Compliance
This decision aligns with:
- **FR-101-105**: Agent management requirements (Type-Instance pattern)
- **FR-301-305**: Workflow execution requirements
- **NFR-402**: Fault tolerance (workflow durability, crash recovery)
- **TC-001**: PostgreSQL as primary database
- **Core Principle**: Self-hostability (all components MIT/BSD licensed)
## Alternatives Not Chosen
### LangSmith for Observability
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use **LangFuse** (open source, self-hostable) for LLM observability
- Use standard logging + PostgreSQL queries for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
### Temporal for Durable Workflows
Temporal was initially considered but rejected for this project:
- **Overkill for scale** - Syndarix targets 10-50 concurrent agents, not thousands
- **Operational overhead** - Requires separate cluster, workers, SDK learning curve
- **Simpler alternative available** - transitions + PostgreSQL provides equivalent durability
- **Migration path** - If scale demands grow, Temporal can be introduced later
## References
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [transitions Library](https://github.com/pytransitions/transitions)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
- [SPIKE-008: Workflow State Machine](../spikes/SPIKE-008-workflow-state-machine.md)
- [ADR-010: Workflow State Machine](./ADR-010-workflow-state-machine.md)
---
*This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.*