# ADR-007: Agentic Framework Selection

- **Status:** Accepted
- **Date:** 2025-12-29
- **Deciders:** Architecture Team
- **Related Spikes:** SPIKE-002, SPIKE-005, SPIKE-007

## Context
Syndarix requires a robust multi-agent orchestration system capable of:
- Managing 50+ concurrent agent instances
- Supporting long-running workflows (sprints spanning days/weeks)
- Providing durable execution that survives crashes/restarts
- Enabling human-in-the-loop at configurable autonomy levels
- Tracking token usage and costs per agent instance
- Supporting multi-provider LLM failover
We evaluated whether to adopt an existing framework wholesale or build a custom solution.
## Decision Drivers
- Production Readiness: Must be battle-tested, not experimental
- Self-Hostability: All components must be self-hostable with no mandatory subscriptions
- Flexibility: Must support Syndarix-specific patterns (autonomy levels, client approvals)
- Durability: Workflows must survive failures, restarts, and deployments
- Observability: Full visibility into agent activities and costs
- Scalability: Handle 50+ concurrent agents without architectural changes
## Considered Options

### Option 1: CrewAI (Full Framework)

**Pros:**
- Easy to get started (role-based agents)
- Good for sequential/hierarchical workflows
- Strong enterprise traction ($18M Series A; used by 60% of the Fortune 500)
- LLM-agnostic design
**Cons:**
- Teams report hitting complexity walls after 6-12 months
- Multi-agent coordination can cause infinite loops
- Limited ceiling for complex custom patterns
- The Flows architecture adds a learning curve without solving durability
**Verdict:** Rejected - insufficient flexibility for Syndarix's complex requirements

### Option 2: AutoGen 0.4 (Full Framework)

**Pros:**
- Event-driven, async-first architecture
- Cross-language support (.NET, Python)
- Built-in observability (OpenTelemetry)
- Microsoft ecosystem integration
**Cons:**
- Tied to Microsoft patterns
- Less flexible for custom orchestration
- Newer 0.4 version still maturing
- No built-in durability for week-long workflows
**Verdict:** Rejected - too opinionated, insufficient durability

### Option 3: LangGraph + Custom Infrastructure (Hybrid)

**Pros:**
- Fine-grained control over agent flow
- Excellent state management with PostgreSQL persistence
- Human-in-the-loop built-in
- Production-proven (Klarna, Replit, Elastic)
- Fully open source (MIT license)
- Can implement any pattern (supervisor, hierarchical, peer-to-peer)
**Cons:**
- Steep learning curve (graph theory, state machines)
- Needs additional infrastructure for durability (Temporal)
- Observability requires additional tooling
**Verdict:** Selected as foundation

### Option 4: Fully Custom Solution

**Pros:**
- Complete control
- No external dependencies
- Tailored to exact requirements
**Cons:**
- Reinvents production-tested solutions
- Higher development and maintenance cost
- Longer time to market
- More bugs in critical path
**Verdict:** Rejected - unnecessary when proven components exist

## Decision

Adopt a hybrid architecture with LangGraph as the core agent framework:
- LangGraph - Agent state machines and logic
- transitions + PostgreSQL + Celery - Durable workflow state machines
- Redis Streams - Agent-to-agent communication
- LiteLLM - Unified LLM access with failover
- PostgreSQL + pgvector - State persistence and RAG
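
As a sketch of how the failover chain could be wired up with LiteLLM's `Router` (the model names mirror those used later in this ADR, but the provider prefixes and deployment parameters here are illustrative assumptions, not the project's actual config):

```python
from litellm import Router

# Sketch only: model identifiers match this ADR's examples; provider
# prefixes and deployment parameters are assumptions.
router = Router(
    model_list=[
        {"model_name": "claude-opus-4-5",
         "litellm_params": {"model": "anthropic/claude-opus-4-5"}},
        {"model_name": "gpt-5.1-codex-max",
         "litellm_params": {"model": "openai/gpt-5.1-codex-max"}},
    ],
    # If the primary model errors out, LiteLLM retries with the fallback.
    fallbacks=[{"claude-opus-4-5": ["gpt-5.1-codex-max"]}],
)

async def call_with_failover(messages: list[dict]):
    return await router.acompletion(model="claude-opus-4-5", messages=messages)
```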
### Why Not Temporal?
After evaluating both approaches, we chose the simpler transitions + PostgreSQL + Celery stack over Temporal:
| Factor | Temporal | transitions + PostgreSQL |
|---|---|---|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
Temporal is overkill for our scale (10-50 concurrent agents). The simpler approach provides:
- Full durability via PostgreSQL state persistence
- Event sourcing via transition history table
- Background execution via Celery workers
- Simpler debugging with standard tools
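
A minimal sketch of the two tables this durability story rests on follows; the column names are assumptions for illustration, not the actual Syndarix schema:

```python
# Illustrative DDL for the durability tables referenced above.
# Column names are assumptions; the real schema may differ.
WORKFLOW_DDL = """
CREATE TABLE workflow_instances (
    id            UUID PRIMARY KEY,
    workflow_type TEXT        NOT NULL,              -- 'sprint', 'story', 'pr'
    current_state TEXT        NOT NULL,              -- mirrors the transitions state
    context       JSONB       NOT NULL DEFAULT '{}',
    status        TEXT        NOT NULL DEFAULT 'in_progress',
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE workflow_transitions (                  -- event-sourced history
    id          BIGSERIAL   PRIMARY KEY,
    workflow_id UUID        NOT NULL REFERENCES workflow_instances(id),
    from_state  TEXT        NOT NULL,
    to_state    TEXT        NOT NULL,
    trigger     TEXT        NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
"""
```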
## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Syndarix Agentic Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Workflow Engine (transitions + PostgreSQL) │ │
│ │ │ │
│ │ • State persistence to PostgreSQL (survives restarts) │ │
│ │ • Event sourcing via workflow_transitions table │ │
│ │ • Human approval checkpoints (pause workflow, await signal) │ │
│ │ • Background execution via Celery workers │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LangGraph Agent Runtime │ │
│ │ │ │
│ │ • Graph-based state machines for agent logic │ │
│ │ • Persistent checkpoints to PostgreSQL │ │
│ │ • Cycles, conditionals, parallel execution │ │
│ │ • Human-in-the-loop first-class support │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Agent State Graph │ │ │
│ │ │ [IDLE] ──► [THINKING] ──► [EXECUTING] ──► [WAITING] │ │ │
│ │ │ ▲ │ │ │ │ │ │
│ │ │ └─────────────┴──────────────┴──────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Redis Streams Communication Layer │ │
│ │ │ │
│ │ • Agent-to-Agent messaging (A2A protocol concepts) │ │
│ │ • Event-driven architecture │ │
│ │ • Real-time activity streaming to UI │ │
│ │ • Project-scoped message channels │ │
│ │ │ │
│ │ License: BSD-3 | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Gateway │ │
│ │ │ │
│ │ • Unified API for 100+ LLM providers │ │
│ │ • Automatic failover chains (Claude → GPT-4 → Ollama) │ │
│ │ • Token counting and cost calculation │ │
│ │ • Rate limiting and load balancing │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```

## Component Responsibilities
| Component | Responsibility | Why This Choice |
|---|---|---|
| LangGraph | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| transitions | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| Celery + Redis | Background task execution, async workflows | Already in stack, battle-tested |
| PostgreSQL | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| Redis Streams | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| LiteLLM | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
## Reboot Survival (Durability)

The architecture fully supports system reboots and crashes:
- Workflow State: Persisted to the PostgreSQL `workflow_instances` table
- Transition History: Event-sourced in the `workflow_transitions` table
- Agent Checkpoints: LangGraph persists to PostgreSQL
- Pending Tasks: Celery tasks in Redis (configured with persistence)
Recovery Process:

```
System Restart
      │
      ▼
Load workflow_instances WHERE status = 'in_progress'
      │
      ▼
For each workflow:
  ├── Restore state from context JSONB
  ├── Identify current_state
  ├── Resume from last checkpoint
  └── Continue execution
```
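
A hedged sketch of the startup recovery task this flow implies (the helpers `db.fetch_all` and `resume_workflow` are hypothetical names, not actual Syndarix APIs):

```python
@celery_app.task
def recover_in_progress_workflows():
    """Run once at startup: re-enqueue every workflow left 'in_progress'."""
    rows = db.fetch_all(
        "SELECT id FROM workflow_instances WHERE status = 'in_progress'"
    )
    for row in rows:
        # Each workflow reloads current_state + context from PostgreSQL,
        # then resumes from its last persisted checkpoint.
        resume_workflow.delay(str(row["id"]))
```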
## Self-Hostability Guarantee
All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|---|---|---|---|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
No mandatory subscriptions. All paid alternatives are optional cloud-managed offerings.
## What We Build vs. What We Use
| Concern | Approach | Rationale |
|---|---|---|
| Agent Logic | USE LangGraph | Don't reinvent state machines |
| LLM Access | USE LiteLLM | Don't reinvent provider abstraction |
| Workflow State | USE transitions + PostgreSQL | Simple, durable, debuggable |
| Background Tasks | USE Celery | Already in stack, proven |
| Messaging | USE Redis Streams | Don't reinvent pub/sub |
| Orchestration | BUILD thin layer | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | BUILD thin layer | Type-Instance pattern specific to Syndarix |
| Cost Attribution | BUILD thin layer | Per-agent, per-project tracking specific to Syndarix |
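
For the cost-attribution thin layer, one possible shape is a LiteLLM success callback that writes per-agent cost rows; this is a sketch, and the `agent_costs` table, `db` helper, and metadata keys are assumptions:

```python
import litellm

def track_cost(kwargs, completion_response, start_time, end_time):
    """LiteLLM success callback: attribute spend to an agent instance."""
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
    db.execute(  # 'db' and the agent_costs table are assumed helpers
        "INSERT INTO agent_costs (agent_id, model, cost_usd) VALUES ($1, $2, $3)",
        metadata.get("agent_id"),
        kwargs.get("model"),
        kwargs.get("response_cost", 0.0),  # cost computed by LiteLLM
    )

litellm.success_callback = [track_cost]
```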
## Integration Pattern

```python
# Example: How the layers integrate.
# Note: db, celery_app, AgentState, AgentMessage, and message_bus are
# Syndarix helpers defined elsewhere.
from transitions import Machine
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
import litellm

# 1. Workflow state machine (transitions library)
class SprintWorkflow(Machine):
    states = ['planning', 'active', 'review', 'done']

    def __init__(self, sprint_id: str):
        self.sprint_id = sprint_id
        Machine.__init__(
            self,
            states=self.states,
            initial='planning',
            after_state_change='persist_state'
        )
        self.add_transition('start', 'planning', 'active', before='spawn_agents')
        self.add_transition('complete_work', 'active', 'review')
        self.add_transition('approve', 'review', 'done', conditions='has_approval')

    async def persist_state(self):
        """Save state to PostgreSQL (survives restarts)."""
        await db.execute("""
            UPDATE workflow_instances
            SET current_state = $1, context = $2, updated_at = NOW()
            WHERE id = $3
        """, self.state, self.context, self.sprint_id)

# 2. Background execution via Celery
@celery_app.task(bind=True, max_retries=3)
def run_sprint_workflow(self, sprint_id: str):
    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
    workflow.start()  # Triggers agent spawning
    # Workflow persists state, can resume after restart

# 3. LangGraph handles individual agent logic
def create_agent_graph():
    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)      # LLM reasoning
    graph.add_node("execute", execute_node)  # Tool calls via MCP
    graph.add_node("handoff", handoff_node)  # Message to other agent
    # ... state transitions
    return graph.compile(checkpointer=PostgresSaver(...))

# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
    response = await litellm.acompletion(
        model="claude-opus-4-5",  # Claude Opus 4.5 (primary)
        messages=state["messages"],
        fallbacks=["gpt-5.1-codex-max", "gemini-3-pro", "qwen3-235b", "deepseek-v3.2"],
        metadata={"agent_id": state["agent_id"]},
    )
    return {"messages": [response.choices[0].message]}

# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
    await message_bus.publish(AgentMessage(
        source_agent_id=state["agent_id"],
        target_agent_id=state["handoff_target"],
        message_type="TASK_HANDOFF",
        payload=state["handoff_context"],
    ))
    return state
```
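
The `message_bus` used in `handoff_node` is one of the thin Syndarix layers; a minimal sketch over Redis Streams might look like this (the stream-naming convention and `AgentMessage` fields are assumptions):

```python
import json
import redis.asyncio as redis

class MessageBus:
    """Thin Redis Streams wrapper; one inbox stream per agent instance."""

    def __init__(self, url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(url)

    async def publish(self, message) -> None:
        # Stream name per target agent is an assumed convention.
        stream = f"agent:{message.target_agent_id}:inbox"
        await self.redis.xadd(stream, {
            "source": message.source_agent_id,
            "type": message.message_type,
            "payload": json.dumps(message.payload),
        })
```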
## Human Approval Checkpoints

For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):

```python
class StoryWorkflow(Machine):
    async def request_approval_and_wait(self, action: str):
        """Pause workflow and await human decision."""
        # 1. Create approval request
        request = await approval_service.create(
            workflow_id=self.id,
            action=action,
            context=self.context
        )

        # 2. Transition to waiting state (persisted)
        self.state = 'awaiting_approval'
        await self.persist_state()

        # 3. Workflow is paused - Celery task completes.
        #    When user approves, a new task resumes the workflow.

    @classmethod
    async def resume_on_approval(cls, workflow_id: str, approved: bool):
        """Called when user makes a decision."""
        workflow = await cls.load(workflow_id)
        if approved:
            workflow.trigger('approved')
        else:
            workflow.trigger('rejected')
```
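
The resume path could be wired through a small Celery task invoked by the approval endpoint; this is a sketch, and the task name and the `asyncio.run` bridging are assumptions:

```python
import asyncio

@celery_app.task
def apply_approval_decision(workflow_id: str, approved: bool):
    # Bridges the sync Celery worker to the async workflow API; the
    # workflow reloads its persisted state and fires the matching trigger.
    asyncio.run(StoryWorkflow.resume_on_approval(workflow_id, approved))
```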
## Consequences

### Positive
- Production-tested foundations - LangGraph, Celery, LiteLLM are battle-tested
- No subscription lock-in - All components self-hostable under permissive licenses
- Right tool for each job - Specialized components for state, communication, background processing
- Escape hatches - Can replace any component without full rewrite
- Simpler operations - Uses existing PostgreSQL + Redis infrastructure, no new services
- Reboot survival - Full durability via PostgreSQL persistence
### Negative
- Multiple technologies to learn - Team needs LangGraph, transitions, Redis Streams knowledge
- Integration work - Thin glue layers needed between components
- Manual recovery logic - Must implement workflow recovery on startup
### Mitigation
- Learning curve - Start with simple 2-3 agent workflows, expand gradually
- Integration - Create clear abstractions; each layer only knows its immediate neighbors
- Recovery - Implement startup recovery task that scans for in-progress workflows
## Compliance
This decision aligns with:
- FR-101-105: Agent management requirements (Type-Instance pattern)
- FR-301-305: Workflow execution requirements
- NFR-402: Fault tolerance (workflow durability, crash recovery)
- TC-001: PostgreSQL as primary database
- Core Principle: Self-hostability (all components MIT/BSD licensed)
## Alternatives Not Chosen

### LangSmith for Observability
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use LangFuse (open source, self-hostable) for LLM observability
- Use standard logging + PostgreSQL queries for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
### Temporal for Durable Workflows
Temporal was initially considered but rejected for this project:
- Overkill for scale - Syndarix targets 10-50 concurrent agents, not thousands
- Operational overhead - Requires separate cluster, workers, SDK learning curve
- Simpler alternative available - transitions + PostgreSQL provides equivalent durability
- Migration path - If scale demands grow, Temporal can be introduced later
## References

- LangGraph Documentation
- transitions Library
- LiteLLM Documentation
- LangFuse (Open Source LLM Observability)
- SPIKE-002: Agent Orchestration Pattern
- SPIKE-005: LLM Provider Abstraction
- SPIKE-008: Workflow State Machine
- ADR-010: Workflow State Machine
This ADR establishes the foundational framework choices for Syndarix's multi-agent orchestration system.