ADR-007: Agentic Framework Selection
Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-002, SPIKE-005, SPIKE-007
Context
Syndarix requires a robust multi-agent orchestration system capable of:
- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
Decision Drivers
- Production Readiness: Must be battle-tested, not experimental
- Self-Hostability: All components must be self-hostable with no mandatory subscriptions
- Flexibility: Must support Syndarix-specific patterns (autonomy levels, client approvals)
- Durability: Workflows must survive failures, restarts, and deployments
- Observability: Full visibility into agent activities and costs
- Scalability: Handle 50+ concurrent agents without architectural changes
Considered Options
Option 1: CrewAI (Full Framework)
Pros:
- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A, 60% Fortune 500)
- LLM-agnostic design
Cons:
- Teams report hitting complexity walls after 6-12 months of use
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- Flows architecture adds learning curve without solving durability
Verdict: Rejected - insufficient flexibility for Syndarix's complex requirements
Option 2: AutoGen 0.4 (Full Framework)
Pros:
- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration
Cons:
- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows
Verdict: Rejected - too opinionated, insufficient durability
Option 3: LangGraph + Custom Infrastructure (Hybrid)
Pros:
- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
Cons:
- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling
Verdict: Selected as foundation
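LangGraph's built-in human-in-the-loop works by checkpointing state and pausing before a designated node (e.g. compiling with `interrupt_before=["execute"]`), then resuming after a human responds. A dependency-free toy sketch of that pause/approve/resume pattern (the class and method names below are illustrative, not LangGraph APIs):

```python
# Toy sketch of the pause/approve/resume pattern that LangGraph's
# `interrupt_before` enables. No LangGraph dependency; all names here
# are hypothetical.

class CheckpointedAgent:
    """Runs think -> execute, but pauses before `execute` until approved."""

    def __init__(self):
        self.state = {"step": "idle", "log": []}

    def run_until_interrupt(self):
        # The THINK phase always runs.
        self.state["log"].append("think")
        # Checkpoint: stop before the side-effecting EXECUTE step.
        self.state["step"] = "awaiting_approval"
        return self.state

    def resume(self, approved):
        # A human decision arrives later; resume from the checkpoint.
        self.state["log"].append("execute" if approved else "aborted")
        self.state["step"] = "done"
        return self.state


agent = CheckpointedAgent()
paused = agent.run_until_interrupt()  # stops before the side-effecting step
assert paused["step"] == "awaiting_approval"
final = agent.resume(approved=True)   # human said yes; finish the run
assert final["log"] == ["think", "execute"]
```

In real LangGraph the checkpoint would be persisted (e.g. to PostgreSQL), so "later" can be hours or days after the pause.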
Option 4: Fully Custom Solution
Pros:
- Complete control
- No external dependencies
- Tailored to exact requirements
Cons:
- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path
Verdict: Rejected - unnecessary when proven components exist
Decision
Adopt a hybrid architecture using LangGraph as the core agent framework, complemented by:
- LangGraph - Agent state machines and logic
- Temporal - Durable workflow execution
- Redis Streams - Agent-to-agent communication
- LiteLLM - Unified LLM access with failover
- PostgreSQL + pgvector - State persistence and RAG
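LiteLLM drives failover internally via its `fallbacks` parameter; conceptually the chain tries each provider in order and returns the first success. A rough, dependency-free sketch of that behavior (the provider callables below are stand-ins, not real API clients):

```python
# Conceptual sketch of a failover chain such as Claude -> GPT-4 -> Ollama.
# LiteLLM implements this internally; everything below is a stand-in.

class ProviderError(Exception):
    """Raised by a provider stand-in when a call fails."""

def call_with_fallbacks(providers, prompt):
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append((name, str(exc)))  # record and fall through
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_claude(prompt):
    raise ProviderError("rate limited")      # primary is down

def healthy_gpt4(prompt):
    return f"answer to: {prompt}"            # first fallback succeeds

chain = [("claude-sonnet", flaky_claude), ("gpt-4-turbo", healthy_gpt4)]
used, answer = call_with_fallbacks(chain, "ping")
assert used == "gpt-4-turbo"
```

The real gateway additionally tracks token counts and cost per call, which feeds the per-agent cost attribution discussed below.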
Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│                      Syndarix Agentic Architecture                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                     Temporal Workflow Engine                      │  │
│  │                                                                   │  │
│  │  • Durable execution (survives crashes, restarts, deployments)    │  │
│  │  • Human approval checkpoints (wait indefinitely for client)      │  │
│  │  • Long-running workflows (projects spanning weeks/months)        │  │
│  │  • Built-in retry policies and timeouts                           │  │
│  │                                                                   │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│                                    ▼                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      LangGraph Agent Runtime                      │  │
│  │                                                                   │  │
│  │  • Graph-based state machines for agent logic                     │  │
│  │  • Persistent checkpoints to PostgreSQL                           │  │
│  │  • Cycles, conditionals, parallel execution                       │  │
│  │  • Human-in-the-loop first-class support                          │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │                      Agent State Graph                      │  │  │
│  │  │     [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING]     │  │  │
│  │  │       ▲            │               │              │         │  │  │
│  │  │       └────────────┴───────────────┴──────────────┘         │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                   │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│                                    ▼                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                 Redis Streams Communication Layer                 │  │
│  │                                                                   │  │
│  │  • Agent-to-Agent messaging (A2A protocol concepts)               │  │
│  │  • Event-driven architecture                                      │  │
│  │  • Real-time activity streaming to UI                             │  │
│  │  • Project-scoped message channels                                │  │
│  │                                                                   │  │
│  │  License: BSD-3 | Self-Hosted: Yes | Subscription: None Required  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│                                    ▼                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                          LiteLLM Gateway                          │  │
│  │                                                                   │  │
│  │  • Unified API for 100+ LLM providers                             │  │
│  │  • Automatic failover chains (Claude → GPT-4 → Ollama)            │  │
│  │  • Token counting and cost calculation                            │  │
│  │  • Rate limiting and load balancing                               │  │
│  │                                                                   │  │
│  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
Component Responsibilities
| Component | Responsibility | Why This Choice |
|---|---|---|
| LangGraph | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| Temporal | Durable workflows, human approvals, long-running orchestration | Purpose-built for week-long workflows that must survive failures and redeploys |
| Redis Streams | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| LiteLLM | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
| PostgreSQL | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
Self-Hostability Guarantee
All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|---|---|---|---|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| Temporal | MIT | Temporal Cloud | No - self-host server |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 (7.2.x and earlier; later releases relicensed RSALv2/SSPLv1/AGPLv3) | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
No mandatory subscriptions. All paid alternatives are optional cloud-managed offerings.
What We Build vs. What We Use
| Concern | Approach | Rationale |
|---|---|---|
| Agent Logic | USE LangGraph | Don't reinvent state machines |
| LLM Access | USE LiteLLM | Don't reinvent provider abstraction |
| Durability | USE Temporal | Don't reinvent durable execution |
| Messaging | USE Redis Streams | Don't reinvent pub/sub |
| Orchestration | BUILD thin layer | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | BUILD thin layer | Type-Instance pattern specific to Syndarix |
| Cost Attribution | BUILD thin layer | Per-agent, per-project tracking specific to Syndarix |
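The cost-attribution layer can stay thin: an aggregator keyed by (project, agent), fed in practice by LiteLLM's success callbacks, which report per-call token usage and cost. A hypothetical sketch (all class and method names below are illustrative, not an existing Syndarix API):

```python
# Hypothetical per-agent, per-project cost aggregator. In practice this
# would be fed by LiteLLM callback data and persisted to PostgreSQL.
from collections import defaultdict

class CostTracker:
    """Aggregates LLM spend keyed by (project_id, agent_id)."""

    def __init__(self):
        self._usage = defaultdict(lambda: {"tokens": 0, "usd": 0.0})

    def record(self, project_id, agent_id, tokens, usd):
        # Keyed by (project, agent) so both roll-ups stay cheap.
        bucket = self._usage[(project_id, agent_id)]
        bucket["tokens"] += tokens
        bucket["usd"] += usd

    def project_total(self, project_id):
        return sum(
            u["usd"] for (p, _), u in self._usage.items() if p == project_id
        )

tracker = CostTracker()
tracker.record("proj-1", "architect-01", tokens=1200, usd=0.036)
tracker.record("proj-1", "coder-02", tokens=800, usd=0.024)
# round(tracker.project_total("proj-1"), 6) -> 0.06
```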
Integration Pattern
```python
# Example: How the layers integrate.
# Syndarix domain types (SprintConfig, SprintResult, AgentState,
# AutonomyLevel), the activities, and message_bus are defined elsewhere.
from datetime import timedelta

import litellm
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph
from temporalio import workflow


# 1. Temporal orchestrates the high-level workflow
@workflow.defn
class SprintWorkflow:
    def __init__(self) -> None:
        self._approved = False

    @workflow.run
    async def run(self, sprint: SprintConfig) -> SprintResult:
        # Spawn agents and wait for completion
        agents = await workflow.execute_activity(spawn_agent_team, sprint)
        # Each agent runs a LangGraph state machine
        results = await workflow.execute_activity(
            run_agent_tasks,
            agents,
            start_to_close_timeout=timedelta(days=7),
        )
        # Human checkpoint (waits indefinitely for the approve signal)
        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
            await workflow.wait_condition(lambda: self._approved)
        return results

    @workflow.signal
    def approve(self) -> None:
        # Client approval arrives as a Temporal signal
        self._approved = True


# 2. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to another agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))


# 3. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=state["messages"],
        fallbacks=["gpt-4-turbo", "ollama/llama3"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}


# 4. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```
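The `AgentMessage` published in step 4 ultimately becomes a flat string map in a Redis stream entry on a project-scoped stream key. A hypothetical sketch of that envelope and its serialization (the field names mirror the example above; the stream-key scheme is an assumption; a real publisher would pass `to_fields()` to redis-py's `xadd`):

```python
# Hypothetical message envelope for the Redis Streams layer. Redis stream
# entries are flat string maps, so nested payloads are JSON-encoded.
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentMessage:
    source_agent_id: str
    target_agent_id: str
    message_type: str
    payload: dict

    @staticmethod
    def stream_key(project_id: str) -> str:
        # Project-scoped channel, e.g. "project:42:agents" (assumed scheme)
        return f"project:{project_id}:agents"

    def to_fields(self) -> dict:
        # Flatten for XADD: everything is a string, payload goes to JSON.
        fields = asdict(self)
        fields["payload"] = json.dumps(fields["payload"])
        return fields

    @classmethod
    def from_fields(cls, fields: dict) -> "AgentMessage":
        fields = dict(fields)
        fields["payload"] = json.loads(fields["payload"])
        return cls(**fields)

msg = AgentMessage("coder-02", "qa-01", "TASK_HANDOFF", {"ticket": "SYN-17"})
fields = msg.to_fields()             # what a publisher would XADD
roundtrip = AgentMessage.from_fields(fields)
assert roundtrip == msg
```

Consumers would read the same stream via consumer groups (`XREADGROUP`), which is what gives each agent at-least-once delivery of its handoffs.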
Consequences
Positive
- Production-tested foundations - LangGraph, Temporal, LiteLLM are battle-tested
- No subscription lock-in - All components self-hostable under permissive licenses
- Right tool for each job - Specialized components for durability, state, communication
- Escape hatches - Can replace any component without full rewrite
- Enterprise patterns - Temporal used by Netflix, Uber, Stripe for similar problems
Negative
- Multiple technologies to learn - Team needs LangGraph, Temporal, Redis Streams knowledge
- Operational complexity - More services to deploy and monitor
- Integration work - Thin glue layers needed between components
Mitigation
- Learning curve - Start with simple 2-3 agent workflows, expand gradually
- Operational complexity - Use Docker Compose locally, consider managed services for production if needed
- Integration - Create clear abstractions; each layer only knows its immediate neighbors
Compliance
This decision aligns with:
- FR-101-105: Agent orchestration requirements
- FR-301-305: Workflow execution requirements
- NFR-501: Self-hosting requirement (all components self-hostable under open-source licenses)
- TC-001: PostgreSQL as primary database
- TC-002: Redis for caching and messaging
Alternatives Not Chosen
LangSmith for Observability
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use LangFuse (open source, self-hostable) for LLM observability
- Use Temporal UI (built-in) for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
Temporal Cloud
Temporal offers a managed cloud service. Instead, we will:
- Self-host Temporal server (single-node for start, cluster for scale)
- Use PostgreSQL as Temporal's persistence backend (already in stack)
References
- LangGraph Documentation
- Temporal.io Documentation
- LiteLLM Documentation
- LangFuse (Open Source LLM Observability)
- SPIKE-002: Agent Orchestration Pattern
- SPIKE-005: LLM Provider Abstraction
This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.