ADR-007: Agentic Framework Selection

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-002, SPIKE-005, SPIKE-007


Context

Syndarix requires a robust multi-agent orchestration system capable of:

  • Managing 50+ concurrent agent instances
  • Supporting long-running workflows (sprints spanning days/weeks)
  • Providing durable execution that survives crashes/restarts
  • Enabling human-in-the-loop at configurable autonomy levels
  • Tracking token usage and costs per agent instance
  • Supporting multi-provider LLM failover

We evaluated whether to adopt an existing framework wholesale or build a custom solution.

Decision Drivers

  • Production Readiness: Must be battle-tested, not experimental
  • Self-Hostability: All components must be self-hostable with no mandatory subscriptions
  • Flexibility: Must support Syndarix-specific patterns (autonomy levels, client approvals)
  • Durability: Workflows must survive failures, restarts, and deployments
  • Observability: Full visibility into agent activities and costs
  • Scalability: Handle 50+ concurrent agents without architectural changes

Considered Options

Option 1: CrewAI (Full Framework)

Pros:

  • Easy to get started (role-based agents)
  • Good for sequential/hierarchical workflows
  • Strong enterprise traction ($18M Series A, 60% Fortune 500)
  • LLM-agnostic design

Cons:

  • Teams report hitting a complexity wall after 6-12 months of use
  • Multi-agent coordination can cause infinite loops
  • Limited ceiling for complex custom patterns
  • The Flows architecture adds a learning curve without solving durability

Verdict: Rejected - insufficient flexibility for Syndarix's complex requirements

Option 2: AutoGen 0.4 (Full Framework)

Pros:

  • Event-driven, async-first architecture
  • Cross-language support (.NET, Python)
  • Built-in observability (OpenTelemetry)
  • Microsoft ecosystem integration

Cons:

  • Tied to Microsoft patterns
  • Less flexible for custom orchestration
  • Newer 0.4 version still maturing
  • No built-in durability for week-long workflows

Verdict: Rejected - too opinionated, insufficient durability

Option 3: LangGraph + Custom Infrastructure (Hybrid)

Pros:

  • Fine-grained control over agent flow
  • Excellent state management with PostgreSQL persistence
  • Human-in-the-loop built-in
  • Production-proven (Klarna, Replit, Elastic)
  • Fully open source (MIT license)
  • Can implement any pattern (supervisor, hierarchical, peer-to-peer)

Cons:

  • Steep learning curve (graph theory, state machines)
  • Needs additional infrastructure for durability (Temporal)
  • Observability requires additional tooling

Verdict: Selected as foundation

Option 4: Fully Custom Solution

Pros:

  • Complete control
  • No external dependencies
  • Tailored to exact requirements

Cons:

  • Reinvents production-tested solutions
  • Higher development and maintenance cost
  • Longer time to market
  • More bugs in critical path

Verdict: Rejected - unnecessary when proven components exist

Decision

Adopt a hybrid architecture using LangGraph as the core agent framework, complemented by:

  1. LangGraph - Agent state machines and logic
  2. Temporal - Durable workflow execution
  3. Redis Streams - Agent-to-agent communication
  4. LiteLLM - Unified LLM access with failover
  5. PostgreSQL + pgvector - State persistence and RAG

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                    Syndarix Agentic Architecture                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                 Temporal Workflow Engine                           │  │
│  │                                                                    │  │
│  │  • Durable execution (survives crashes, restarts, deployments)    │  │
│  │  • Human approval checkpoints (wait indefinitely for client)      │  │
│  │  • Long-running workflows (projects spanning weeks/months)        │  │
│  │  • Built-in retry policies and timeouts                           │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                 LangGraph Agent Runtime                            │  │
│  │                                                                    │  │
│  │  • Graph-based state machines for agent logic                     │  │
│  │  • Persistent checkpoints to PostgreSQL                           │  │
│  │  • Cycles, conditionals, parallel execution                       │  │
│  │  • Human-in-the-loop first-class support                          │  │
│  │                                                                    │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │              Agent State Graph                               │  │  │
│  │  │  [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]        │  │  │
│  │  │    ▲             │              │              │             │  │  │
│  │  │    └─────────────┴──────────────┴──────────────┘             │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │              Redis Streams Communication Layer                     │  │
│  │                                                                    │  │
│  │  • Agent-to-Agent messaging (A2A protocol concepts)               │  │
│  │  • Event-driven architecture                                      │  │
│  │  • Real-time activity streaming to UI                             │  │
│  │  • Project-scoped message channels                                │  │
│  │                                                                    │  │
│  │  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                    LiteLLM Gateway                                 │  │
│  │                                                                    │  │
│  │  • Unified API for 100+ LLM providers                             │  │
│  │  • Automatic failover chains (Claude → GPT-4 → Ollama)            │  │
│  │  • Token counting and cost calculation                            │  │
│  │  • Rate limiting and load balancing                               │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Component Responsibilities

| Component | Responsibility | Why This Choice |
|-----------|----------------|-----------------|
| LangGraph | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| Temporal | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
| Redis Streams | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| LiteLLM | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
| PostgreSQL | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |

Self-Hostability Guarantee

All components are fully self-hostable with permissive open-source licenses:

| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|------------------------|------------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| Temporal | MIT | Temporal Cloud | No - self-host server |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |

No mandatory subscriptions. All paid alternatives are optional cloud-managed offerings.

What We Build vs. What We Use

| Concern | Approach | Rationale |
|---------|----------|-----------|
| Agent Logic | USE LangGraph | Don't reinvent state machines |
| LLM Access | USE LiteLLM | Don't reinvent provider abstraction |
| Durability | USE Temporal | Don't reinvent durable execution |
| Messaging | USE Redis Streams | Don't reinvent pub/sub |
| Orchestration | BUILD thin layer | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | BUILD thin layer | Type-Instance pattern specific to Syndarix |
| Cost Attribution | BUILD thin layer | Per-agent, per-project tracking specific to Syndarix |
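
As an example of the "build thin layer" approach, per-agent cost attribution can hang off LiteLLM's success-callback hook. The sketch below is illustrative only (the logger name and metadata keys are assumptions, not existing Syndarix code); in practice the record would be persisted to PostgreSQL rather than logged.

# Hypothetical cost-attribution hook (illustrative sketch, not Syndarix code)
import logging

import litellm

logger = logging.getLogger("syndarix.cost")  # assumed logger name


def record_agent_cost(kwargs, completion_response, start_time, end_time):
    """LiteLLM success callback: attribute token usage and cost to an agent."""
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
    usage = completion_response.usage
    cost_usd = litellm.completion_cost(completion_response=completion_response)
    # In Syndarix this record would be written to PostgreSQL per agent/project.
    logger.info(
        "agent=%s project=%s prompt_tokens=%s completion_tokens=%s cost_usd=%.6f",
        metadata.get("agent_id"),
        metadata.get("project_id"),
        usage.prompt_tokens,
        usage.completion_tokens,
        cost_usd,
    )


litellm.success_callback = [record_agent_cost]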

Integration Pattern

# Example: How the layers integrate (illustrative sketch; Syndarix-specific
# types such as SprintConfig, SprintResult, AutonomyLevel, AgentState and the
# message_bus helper are defined elsewhere)
from datetime import timedelta

import litellm
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
from temporalio import workflow

# 1. Temporal orchestrates the high-level workflow
@workflow.defn
class SprintWorkflow:
    @workflow.run
    async def run(self, sprint: SprintConfig) -> SprintResult:
        # Spawns agents and waits for completion
        agents = await workflow.execute_activity(spawn_agent_team, sprint)

        # Each agent runs a LangGraph state machine
        results = await workflow.execute_activity(
            run_agent_tasks,
            agents,
            start_to_close_timeout=timedelta(days=7),
        )

        # Human checkpoint (waits indefinitely)
        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
            await workflow.wait_condition(lambda: self._approved)

        return results
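
    # Illustrative addition, not part of the original ADR snippet: the approval
    # wait above is released by a Temporal signal sent from the client-facing
    # UI. The signal name is an assumption; self._approved would be
    # initialised to False in the workflow's __init__.
    @workflow.signal
    async def approve_sprint(self) -> None:
        self._approved = True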

# 2. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
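    # Illustrative note, not from the ADR: transitions would be wired with
    # graph.add_edge / graph.add_conditional_edges, e.g. routing from "think"
    # to either "execute" or "handoff" depending on the action the LLM chose.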
    return graph.compile(checkpointer=PostgresSaver(...))

# 3. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=state["messages"],
        fallbacks=["gpt-4-turbo", "ollama/llama3"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 4. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
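
The message_bus used in handoff_node above is one of the thin glue layers Syndarix builds itself. The sketch below shows what it could look like on top of Redis Streams; the MessageBus class, the AgentMessage fields, and the project-scoped stream naming are illustrative assumptions, not an existing Syndarix API.

# Hypothetical MessageBus sketch on top of Redis Streams (illustrative only)
import json
from dataclasses import asdict, dataclass

import redis.asyncio as redis
from redis.exceptions import ResponseError


@dataclass
class AgentMessage:
    source_agent_id: str
    target_agent_id: str
    message_type: str
    payload: dict


class MessageBus:
    def __init__(self, client: redis.Redis, project_id: str):
        self._redis = client
        self._stream = f"project:{project_id}:agent-messages"  # assumed naming

    async def publish(self, message: AgentMessage):
        # XADD appends the message to the project's stream; Redis returns its ID.
        return await self._redis.xadd(
            self._stream, {"data": json.dumps(asdict(message))}
        )

    async def consume(self, group: str, consumer: str, block_ms: int = 5000):
        # Consumer groups let each agent instance read and acknowledge its own share.
        try:
            await self._redis.xgroup_create(self._stream, group, id="0", mkstream=True)
        except ResponseError:
            pass  # group already exists
        entries = await self._redis.xreadgroup(
            group, consumer, {self._stream: ">"}, count=10, block=block_ms
        )
        for _stream_name, messages in entries or []:
            for message_id, fields in messages:
                yield AgentMessage(**json.loads(fields[b"data"]))
                await self._redis.xack(self._stream, group, message_id)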

Consequences

Positive

  • Production-tested foundations - LangGraph, Temporal, LiteLLM are battle-tested
  • No subscription lock-in - All components self-hostable under permissive licenses
  • Right tool for each job - Specialized components for durability, state, communication
  • Escape hatches - Can replace any component without full rewrite
  • Enterprise patterns - Temporal used by Netflix, Uber, Stripe for similar problems

Negative

  • Multiple technologies to learn - Team needs LangGraph, Temporal, Redis Streams knowledge
  • Operational complexity - More services to deploy and monitor
  • Integration work - Thin glue layers needed between components

Mitigation

  • Learning curve - Start with simple 2-3 agent workflows, expand gradually
  • Operational complexity - Use Docker Compose locally, consider managed services for production if needed
  • Integration - Create clear abstractions; each layer only knows its immediate neighbors

Compliance

This decision aligns with:

  • FR-101-105: Agent orchestration requirements
  • FR-301-305: Workflow execution requirements
  • NFR-501: Self-hosting requirement (all components MIT/BSD licensed)
  • TC-001: PostgreSQL as primary database
  • TC-002: Redis for caching and messaging

Alternatives Not Chosen

LangSmith for Observability

LangSmith is LangChain's paid observability platform. Instead, we will:

  • Use LangFuse (open source, self-hostable) for LLM observability
  • Use Temporal UI (built-in) for workflow visibility
  • Build custom dashboards for Syndarix-specific metrics
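
A minimal sketch of the LangFuse wiring via LiteLLM's built-in Langfuse callback, assuming a self-hosted LangFuse deployment (the environment variable names should be verified against the LiteLLM/Langfuse docs for the versions in use):

# Hypothetical observability wiring (placeholders, not production config)
import os

import litellm

os.environ["LANGFUSE_HOST"] = "https://langfuse.internal.example"  # self-hosted URL placeholder
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."  # placeholder

# Every completion routed through LiteLLM is then traced to LangFuse; custom
# callables (such as a cost hook) can be combined in the same list.
litellm.success_callback = ["langfuse"]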

Temporal Cloud

Temporal offers a managed cloud service. Instead, we will:

  • Self-host Temporal server (single-node for start, cluster for scale)
  • Use PostgreSQL as Temporal's persistence backend (already in stack)

References


This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.