ADR-007: Agentic Framework Selection

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-002, SPIKE-005, SPIKE-007


Context

Syndarix requires a robust multi-agent orchestration system capable of:

  • Managing 50+ concurrent agent instances
  • Supporting long-running workflows (sprints spanning days/weeks)
  • Providing durable execution that survives crashes/restarts
  • Enabling human-in-the-loop at configurable autonomy levels
  • Tracking token usage and costs per agent instance
  • Supporting multi-provider LLM failover

We evaluated whether to adopt an existing framework wholesale or build a custom solution.

Decision Drivers

  • Production Readiness: Must be battle-tested, not experimental
  • Self-Hostability: All components must be self-hostable with no mandatory subscriptions
  • Flexibility: Must support Syndarix-specific patterns (autonomy levels, client approvals)
  • Durability: Workflows must survive failures, restarts, and deployments
  • Observability: Full visibility into agent activities and costs
  • Scalability: Handle 50+ concurrent agents without architectural changes

Considered Options

Option 1: CrewAI (Full Framework)

Pros:

  • Easy to get started (role-based agents)
  • Good for sequential/hierarchical workflows
  • Strong enterprise traction ($18M Series A, 60% Fortune 500)
  • LLM-agnostic design

Cons:

  • Teams report hitting walls at 6-12 months of complexity
  • Multi-agent coordination can cause infinite loops
  • Limited ceiling for complex custom patterns
  • Flows architecture adds learning curve without solving durability

Verdict: Rejected - insufficient flexibility for Syndarix's complex requirements

Option 2: AutoGen 0.4 (Full Framework)

Pros:

  • Event-driven, async-first architecture
  • Cross-language support (.NET, Python)
  • Built-in observability (OpenTelemetry)
  • Microsoft ecosystem integration

Cons:

  • Tied to Microsoft patterns
  • Less flexible for custom orchestration
  • Newer 0.4 version still maturing
  • No built-in durability for week-long workflows

Verdict: Rejected - too opinionated, insufficient durability

Option 3: LangGraph + Custom Infrastructure (Hybrid)

Pros:

  • Fine-grained control over agent flow
  • Excellent state management with PostgreSQL persistence
  • Human-in-the-loop built-in
  • Production-proven (Klarna, Replit, Elastic)
  • Fully open source (MIT license)
  • Can implement any pattern (supervisor, hierarchical, peer-to-peer)

Cons:

  • Steep learning curve (graph theory, state machines)
  • Needs additional infrastructure for durability (Temporal)
  • Observability requires additional tooling

Verdict: Selected as foundation

Option 4: Fully Custom Solution

Pros:

  • Complete control
  • No external dependencies
  • Tailored to exact requirements

Cons:

  • Reinvents production-tested solutions
  • Higher development and maintenance cost
  • Longer time to market
  • More bugs in critical path

Verdict: Rejected - unnecessary when proven components exist

Decision

Adopt a hybrid architecture using LangGraph as the core agent framework, complemented by:

  1. LangGraph - Agent state machines and logic
  2. transitions + PostgreSQL + Celery - Durable workflow state machines
  3. Redis Streams - Agent-to-agent communication
  4. LiteLLM - Unified LLM access with failover
  5. PostgreSQL + pgvector - State persistence and RAG

Why Not Temporal?

After evaluating both approaches, we chose the simpler transitions + PostgreSQL + Celery stack over Temporal:

| Factor | Temporal | transitions + PostgreSQL |
|---|---|---|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |

Temporal is overkill for our scale (10-50 concurrent agents). The simpler approach provides:

  • Full durability via PostgreSQL state persistence
  • Event sourcing via transition history table
  • Background execution via Celery workers
  • Simpler debugging with standard tools
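The durability and event-sourcing claims above can be made concrete. Below is a minimal sketch of the write path, using stdlib sqlite3 as a stand-in for PostgreSQL; the table and column names follow this ADR (`workflow_instances`, `workflow_transitions`, `current_state`, `context`), but the exact schema is an assumption, not the production DDL:

```python
import json
import sqlite3

# sqlite3 stands in for PostgreSQL here; production would use asyncpg/psycopg.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow_instances (
    id            TEXT PRIMARY KEY,
    current_state TEXT NOT NULL,
    context       TEXT NOT NULL DEFAULT '{}'   -- JSONB in PostgreSQL
);
CREATE TABLE workflow_transitions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_id TEXT NOT NULL REFERENCES workflow_instances(id),
    from_state  TEXT NOT NULL,
    to_state    TEXT NOT NULL,
    trigger     TEXT NOT NULL
);
""")

def record_transition(workflow_id: str, trigger: str, to_state: str) -> None:
    """Event-source a transition: append to history first, then update current state."""
    (from_state,) = conn.execute(
        "SELECT current_state FROM workflow_instances WHERE id = ?",
        (workflow_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO workflow_transitions (workflow_id, from_state, to_state, trigger) "
        "VALUES (?, ?, ?, ?)",
        (workflow_id, from_state, to_state, trigger),
    )
    conn.execute(
        "UPDATE workflow_instances SET current_state = ? WHERE id = ?",
        (to_state, workflow_id),
    )
    conn.commit()

conn.execute(
    "INSERT INTO workflow_instances (id, current_state, context) VALUES (?, ?, ?)",
    ("sprint-1", "planning", json.dumps({"stories": 5})),
)
record_transition("sprint-1", "start", "active")
```

Because every transition is appended to `workflow_transitions` before the current state is overwritten, the full history survives restarts and can be replayed for debugging with plain SQL.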

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                    Syndarix Agentic Architecture                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │           Workflow Engine (transitions + PostgreSQL)               │  │
│  │                                                                    │  │
│  │  • State persistence to PostgreSQL (survives restarts)            │  │
│  │  • Event sourcing via workflow_transitions table                  │  │
│  │  • Human approval checkpoints (pause workflow, await signal)      │  │
│  │  • Background execution via Celery workers                        │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                 LangGraph Agent Runtime                            │  │
│  │                                                                    │  │
│  │  • Graph-based state machines for agent logic                     │  │
│  │  • Persistent checkpoints to PostgreSQL                           │  │
│  │  • Cycles, conditionals, parallel execution                       │  │
│  │  • Human-in-the-loop first-class support                          │  │
│  │                                                                    │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │              Agent State Graph                               │  │  │
│  │  │  [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]        │  │  │
│  │  │    ▲             │              │              │             │  │  │
│  │  │    └─────────────┴──────────────┴──────────────┘             │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │              Redis Streams Communication Layer                     │  │
│  │                                                                    │  │
│  │  • Agent-to-Agent messaging (A2A protocol concepts)               │  │
│  │  • Event-driven architecture                                      │  │
│  │  • Real-time activity streaming to UI                             │  │
│  │  • Project-scoped message channels                                │  │
│  │                                                                    │  │
│  │  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                     │
│                                    ▼                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                    LiteLLM Gateway                                 │  │
│  │                                                                    │  │
│  │  • Unified API for 100+ LLM providers                             │  │
│  │  • Automatic failover chains (Opus 4.5 → Codex → Qwen)            │  │
│  │  • Token counting and cost calculation                            │  │
│  │  • Rate limiting and load balancing                               │  │
│  │                                                                    │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Component Responsibilities

| Component | Responsibility | Why This Choice |
|---|---|---|
| LangGraph | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| transitions | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| Celery + Redis | Background task execution, async workflows | Already in stack, battle-tested |
| PostgreSQL | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| Redis Streams | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| LiteLLM | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |

Reboot Survival (Durability)

The architecture fully supports system reboots and crashes:

  1. Workflow State: Persisted to PostgreSQL workflow_instances table
  2. Transition History: Event-sourced in workflow_transitions table
  3. Agent Checkpoints: LangGraph persists to PostgreSQL
  4. Pending Tasks: Celery tasks in Redis (configured with persistence)

Recovery Process:

System Restart
     │
     ▼
Load workflow_instances WHERE status = 'in_progress'
     │
     ▼
For each workflow:
├── Restore state from context JSONB
├── Identify current_state
├── Resume from last checkpoint
└── Continue execution
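The recovery pass above reduces to a small scan-and-resume loop. A sketch under stated assumptions: `WorkflowRow` and the `resume` callable are illustrative stand-ins for a PostgreSQL query and a Celery task dispatch, not actual Syndarix APIs:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowRow:
    """One row of workflow_instances, as loaded at startup."""
    id: str
    status: str
    current_state: str
    context: dict = field(default_factory=dict)

def recover_in_progress(rows: list[WorkflowRow], resume) -> list[str]:
    """Hand every interrupted workflow to a resume callable.

    In production, `rows` would come from
    SELECT ... FROM workflow_instances WHERE status = 'in_progress',
    and `resume` would enqueue a Celery task that rebuilds the state
    machine at row.current_state with row.context.
    """
    resumed = []
    for row in rows:
        if row.status != "in_progress":
            continue  # finished/failed workflows need no recovery
        resume(row)
        resumed.append(row.id)
    return resumed
```

For example, given one interrupted and one completed workflow, only the interrupted one is handed back to the executor:

```python
rows = [
    WorkflowRow("w1", "in_progress", "active"),
    WorkflowRow("w2", "done", "done"),
]
recover_in_progress(rows, resume=lambda row: None)  # → ["w1"]
```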

Self-Hostability Guarantee

All components are fully self-hostable with permissive open-source licenses:

| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|---|---|---|---|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |

No mandatory subscriptions. All paid alternatives are optional cloud-managed offerings.

What We Build vs. What We Use

| Concern | Approach | Rationale |
|---|---|---|
| Agent Logic | USE LangGraph | Don't reinvent state machines |
| LLM Access | USE LiteLLM | Don't reinvent provider abstraction |
| Workflow State | USE transitions + PostgreSQL | Simple, durable, debuggable |
| Background Tasks | USE Celery | Already in stack, proven |
| Messaging | USE Redis Streams | Don't reinvent pub/sub |
| Orchestration | BUILD thin layer | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | BUILD thin layer | Type-Instance pattern specific to Syndarix |
| Cost Attribution | BUILD thin layer | Per-agent, per-project tracking specific to Syndarix |
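The cost-attribution thin layer can be illustrated with a minimal accumulator. This is a sketch, not the Syndarix implementation: the class name is hypothetical, the prices are placeholders rather than real model pricing, and in production the token counts would be taken from LiteLLM response metadata:

```python
from collections import defaultdict

# Placeholder per-1M-token prices: (input $/1M, output $/1M). NOT real pricing.
PRICES = {"claude-opus-4-5": (15.0, 75.0)}

class CostTracker:
    """Thin layer: attribute LLM spend to (project, agent) pairs."""

    def __init__(self) -> None:
        self.totals: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, project_id: str, agent_id: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> float:
        """Compute the cost of one call and add it to the per-agent total."""
        in_price, out_price = PRICES[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.totals[(project_id, agent_id)] += cost
        return cost

    def project_total(self, project_id: str) -> float:
        """Roll up all agents belonging to one project."""
        return sum(c for (p, _), c in self.totals.items() if p == project_id)
```

For example, a call with 10,000 prompt tokens and 2,000 completion tokens at the placeholder prices costs (10,000 × 15 + 2,000 × 75) / 1,000,000 = $0.30, attributed to that agent and rolled up per project.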

Integration Pattern

# Example: How the layers integrate
import asyncio

import litellm
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from transitions.extensions.asyncio import AsyncMachine

# 1. Workflow state machine (transitions library; AsyncMachine so that async
#    callbacks such as persist_state are actually awaited)
class SprintWorkflow(AsyncMachine):
    states = ['planning', 'active', 'review', 'done']

    def __init__(self, sprint_id: str):
        self.sprint_id = sprint_id
        self.context = {}
        AsyncMachine.__init__(
            self,
            states=self.states,
            initial='planning',
            after_state_change='persist_state'
        )
        # spawn_agents and has_approval are methods on this class (omitted here)
        self.add_transition('start', 'planning', 'active', before='spawn_agents')
        self.add_transition('complete_work', 'active', 'review')
        self.add_transition('approve', 'review', 'done', conditions='has_approval')

    async def persist_state(self):
        """Save state to PostgreSQL (survives restarts); db is an asyncpg-style pool."""
        await db.execute("""
            UPDATE workflow_instances
            SET current_state = $1, context = $2, updated_at = NOW()
            WHERE id = $3
        """, self.state, self.context, self.sprint_id)

# 2. Background execution via Celery (a sync task driving the async machine)
@celery_app.task(bind=True, max_retries=3)
def run_sprint_workflow(self, sprint_id: str):
    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
    asyncio.run(workflow.start())              # Triggers agent spawning
    # State is persisted after every transition, so work resumes after a restart

# 3. LangGraph handles individual agent logic (AgentState is a TypedDict defined elsewhere)
def create_agent_graph():
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))

# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-opus-4-5",  # Claude Opus 4.5 (primary)
        messages=state["messages"],
        fallbacks=["gpt-5.1-codex-max", "gemini-3-pro", "qwen3-235b", "deepseek-v3.2"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
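The `message_bus` and `AgentMessage` objects used in `handoff_node` belong to the thin glue layer Syndarix builds itself. Below is a minimal in-process sketch of that contract, under the assumption that each agent has its own inbox; the production version would publish to a project-scoped Redis Stream (XADD) and read via consumer groups (XREADGROUP):

```python
import asyncio
from dataclasses import asdict, dataclass

@dataclass
class AgentMessage:
    source_agent_id: str
    target_agent_id: str
    message_type: str
    payload: dict

class MessageBus:
    """In-process stand-in for the Redis Streams layer (one queue per agent inbox)."""

    def __init__(self) -> None:
        self.queues: dict[str, asyncio.Queue] = {}

    def _inbox(self, agent_id: str) -> asyncio.Queue:
        return self.queues.setdefault(agent_id, asyncio.Queue())

    async def publish(self, msg: AgentMessage) -> None:
        # Production: await redis.xadd(f"project:{...}:agent:{id}", asdict(msg))
        await self._inbox(msg.target_agent_id).put(asdict(msg))

    async def consume(self, agent_id: str) -> dict:
        # Production: XREADGROUP with a per-agent consumer group
        return await self._inbox(agent_id).get()
```

The key property this preserves from the Redis Streams design is decoupling: the publishing agent never holds a reference to the consuming agent, only to its channel.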

Human Approval Checkpoints

For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):

class StoryWorkflow(AsyncMachine):
    async def request_approval_and_wait(self, action: str):
        """Pause workflow and await human decision."""
        # 1. Create approval request
        request = await approval_service.create(
            workflow_id=self.id,
            action=action,
            context=self.context
        )

        # 2. Move to the waiting state and persist it
        self.set_state('awaiting_approval')
        await self.persist_state()

        # 3. Workflow is paused - the Celery task simply returns here.
        # When the user decides, a new task resumes the workflow.

    @classmethod
    async def resume_on_approval(cls, workflow_id: str, approved: bool):
        """Called when the user makes a decision."""
        workflow = await cls.load(workflow_id)
        if approved:
            await workflow.trigger('approved')
        else:
            await workflow.trigger('rejected')
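The pause/resume contract above reduces to two moves: pausing persists state and lets the current task finish; resuming reloads the machine in a fresh task and fires a trigger. A minimal stdlib sketch of just that contract, with a dict standing in for the `workflow_instances` table and simplified, illustrative state names:

```python
# A dict stands in for the workflow_instances table.
DB: dict[str, str] = {}

class PausableWorkflow:
    def __init__(self, workflow_id: str, state: str = "implementing"):
        self.id, self.state = workflow_id, state

    def persist(self) -> None:
        DB[self.id] = self.state          # UPDATE workflow_instances ... in production

    @classmethod
    def load(cls, workflow_id: str) -> "PausableWorkflow":
        return cls(workflow_id, DB[workflow_id])

    def request_approval_and_wait(self) -> None:
        """Pause: persist the waiting state, then simply return (task ends)."""
        self.state = "awaiting_approval"
        self.persist()

    @classmethod
    def resume_on_approval(cls, workflow_id: str, approved: bool) -> "PausableWorkflow":
        """Resume: a *new* task reloads the workflow and applies the decision."""
        wf = cls.load(workflow_id)
        wf.state = "merging" if approved else "rejected"
        wf.persist()
        return wf
```

Because no process waits in memory between the two calls, an arbitrary delay (or a full system restart) between pause and resume is harmless.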

Consequences

Positive

  • Production-tested foundations - LangGraph, Celery, LiteLLM are battle-tested
  • No subscription lock-in - All components self-hostable under permissive licenses
  • Right tool for each job - Specialized components for state, communication, background processing
  • Escape hatches - Can replace any component without full rewrite
  • Simpler operations - Uses existing PostgreSQL + Redis infrastructure, no new services
  • Reboot survival - Full durability via PostgreSQL persistence

Negative

  • Multiple technologies to learn - Team needs LangGraph, transitions, Redis Streams knowledge
  • Integration work - Thin glue layers needed between components
  • Manual recovery logic - Must implement workflow recovery on startup

Mitigation

  • Learning curve - Start with simple 2-3 agent workflows, expand gradually
  • Integration - Create clear abstractions; each layer only knows its immediate neighbors
  • Recovery - Implement startup recovery task that scans for in-progress workflows

Compliance

This decision aligns with:

  • FR-101-105: Agent management requirements (Type-Instance pattern)
  • FR-301-305: Workflow execution requirements
  • NFR-402: Fault tolerance (workflow durability, crash recovery)
  • TC-001: PostgreSQL as primary database
  • Core Principle: Self-hostability (all components MIT/BSD licensed)

Alternatives Not Chosen

LangSmith for Observability

LangSmith is LangChain's paid observability platform. Instead, we will:

  • Use LangFuse (open source, self-hostable) for LLM observability
  • Use standard logging + PostgreSQL queries for workflow visibility
  • Build custom dashboards for Syndarix-specific metrics

Temporal for Durable Workflows

Temporal was initially considered but rejected for this project:

  • Overkill for scale - Syndarix targets 10-50 concurrent agents, not thousands
  • Operational overhead - Requires separate cluster, workers, SDK learning curve
  • Simpler alternative available - transitions + PostgreSQL provides equivalent durability
  • Migration path - If scale demands grow, Temporal can be introduced later

References


This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.