From de47d9ee4351ebe37bdee551f41ff913f0842d12 Mon Sep 17 00:00:00 2001
From: Felipe Cardoso
Date: Mon, 29 Dec 2025 14:04:37 +0100
Subject: [PATCH] fix: Resolve ADR-007 vs ADR-010 Temporal contradiction
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Remove Temporal from the architecture in favor of the simpler
transitions + PostgreSQL + Celery approach. This aligns ADR-007 with
ADR-010 based on user preference for simpler operations.

Key changes:
- ADR-007 now recommends transitions library instead of Temporal
- Added explicit "Why Not Temporal?" section explaining the trade-off
- Added "Reboot Survival" section documenting durability guarantees
- Updated architecture diagrams and component responsibilities
- Updated ARCHITECTURE.md summary matrix

The simpler approach is more appropriate for Syndarix's scale (10-50
concurrent agents) and uses existing PostgreSQL + Celery infrastructure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 .../ADR-007-agentic-framework-selection.md    | 174 +++++++++++++-----
 docs/architecture/ARCHITECTURE.md             |   2 +-
 2 files changed, 134 insertions(+), 42 deletions(-)

diff --git a/docs/adrs/ADR-007-agentic-framework-selection.md b/docs/adrs/ADR-007-agentic-framework-selection.md
index e88a985..367d54b 100644
--- a/docs/adrs/ADR-007-agentic-framework-selection.md
+++ b/docs/adrs/ADR-007-agentic-framework-selection.md
@@ -99,11 +99,29 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 **Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
 
 1. **LangGraph** - Agent state machines and logic
-2. **Temporal** - Durable workflow execution
+2. **transitions + PostgreSQL + Celery** - Durable workflow state machines
 3. **Redis Streams** - Agent-to-agent communication
 4. **LiteLLM** - Unified LLM access with failover
 5. **PostgreSQL + pgvector** - State persistence and RAG
 
+### Why Not Temporal?
+
+After evaluating both approaches, we chose the simpler **transitions + PostgreSQL + Celery** stack over Temporal:
+
+| Factor | Temporal | transitions + PostgreSQL |
+|--------|----------|-------------------------|
+| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
+| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
+| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
+| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
+| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
+
+**Temporal is overkill for our scale** (10-50 concurrent agents). The simpler approach provides:
+- Full durability via PostgreSQL state persistence
+- Event sourcing via transition history table
+- Background execution via Celery workers
+- Simpler debugging with standard tools
+
 ### Architecture Overview
 
 ```
@@ -112,12 +130,12 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 ├─────────────────────────────────────────────────────────────────────────┤
 │                                                                         │
 │  ┌───────────────────────────────────────────────────────────────────┐  │
-│  │                     Temporal Workflow Engine                      │  │
+│  │            Workflow Engine (transitions + PostgreSQL)             │  │
 │  │                                                                   │  │
-│  │  • Durable execution (survives crashes, restarts, deployments)    │  │
-│  │  • Human approval checkpoints (wait indefinitely for client)      │  │
-│  │  • Long-running workflows (projects spanning weeks/months)        │  │
-│  │  • Built-in retry policies and timeouts                           │  │
+│  │  • State persistence to PostgreSQL (survives restarts)            │  │
+│  │  • Event sourcing via workflow_transitions table                  │  │
+│  │  • Human approval checkpoints (pause workflow, await signal)      │  │
+│  │  • Background execution via Celery workers                        │  │
 │  │                                                                   │  │
 │  │  License: MIT | Self-Hosted: Yes | Subscription: None Required    │  │
 │  └───────────────────────────────────────────────────────────────────┘  │
@@ -173,10 +191,35 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 | Component | Responsibility | Why This Choice |
 |-----------|---------------|-----------------|
 | **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
-| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
+| **transitions** | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
+| **Celery + Redis** | Background task execution, async workflows | Already in stack, battle-tested |
+| **PostgreSQL** | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
 | **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
 | **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
-| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
+
+### Reboot Survival (Durability)
+
+The architecture **fully supports system reboots and crashes**:
+
+1. **Workflow State**: Persisted to PostgreSQL `workflow_instances` table
+2. **Transition History**: Event-sourced in `workflow_transitions` table
+3. **Agent Checkpoints**: LangGraph persists to PostgreSQL
+4. **Pending Tasks**: Celery tasks in Redis (configured with persistence)
+
+**Recovery Process:**
+```
+System Restart
+       │
+       ▼
+Load workflow_instances WHERE status = 'in_progress'
+       │
+       ▼
+For each workflow:
+├── Restore state from context JSONB
+├── Identify current_state
+├── Resume from last checkpoint
+└── Continue execution
+```
 
 ### Self-Hostability Guarantee
 
@@ -185,7 +228,8 @@ All components are fully self-hostable with permissive open-source licenses:
 | Component | License | Paid Cloud Alternative | Required for Syndarix? |
 |-----------|---------|----------------------|----------------------|
 | LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
-| Temporal | MIT | Temporal Cloud | No - self-host server |
+| transitions | MIT | N/A | N/A - simple library |
+| Celery | BSD-3 | Various | No - self-host |
 | LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
 | Redis | BSD-3 | Redis Cloud | No - self-host |
 | PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
@@ -198,7 +242,8 @@ All components are fully self-hostable with permissive open-source licenses:
 |---------|----------|-----------|
 | Agent Logic | **USE LangGraph** | Don't reinvent state machines |
 | LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
-| Durability | **USE Temporal** | Don't reinvent durable execution |
+| Workflow State | **USE transitions + PostgreSQL** | Simple, durable, debuggable |
+| Background Tasks | **USE Celery** | Already in stack, proven |
 | Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
 | Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
 | Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
@@ -209,28 +254,38 @@ All components are fully self-hostable with permissive open-source licenses:
 ```python
 # Example: How the layers integrate
 
-# 1. Temporal orchestrates the high-level workflow
-@workflow.defn
-class SprintWorkflow:
-    @workflow.run
-    async def run(self, sprint: SprintConfig) -> SprintResult:
-        # Spawns agents and waits for completion
-        agents = await workflow.execute_activity(spawn_agent_team, sprint)
+# 1. Workflow state machine (transitions library)
+class SprintWorkflow(Machine):
+    states = ['planning', 'active', 'review', 'done']
 
-        # Each agent runs a LangGraph state machine
-        results = await workflow.execute_activity(
-            run_agent_tasks,
-            agents,
-            start_to_close_timeout=timedelta(days=7),
+    def __init__(self, sprint_id: str):
+        self.sprint_id = sprint_id
+        Machine.__init__(
+            self,
+            states=self.states,
+            initial='planning',
+            after_state_change='persist_state'
         )
+        self.add_transition('start', 'planning', 'active', before='spawn_agents')
+        self.add_transition('complete_work', 'active', 'review')
+        self.add_transition('approve', 'review', 'done', conditions='has_approval')
 
-        # Human checkpoint (waits indefinitely)
-        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
-            await workflow.wait_condition(lambda: self._approved)
+    async def persist_state(self):
+        """Save state to PostgreSQL (survives restarts)"""
+        await db.execute("""
+            UPDATE workflow_instances
+            SET current_state = $1, context = $2, updated_at = NOW()
+            WHERE id = $3
+        """, self.state, self.context, self.sprint_id)
 
-        return results
+# 2. Background execution via Celery
+@celery_app.task(bind=True, max_retries=3)
+def run_sprint_workflow(self, sprint_id: str):
+    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
+    workflow.start()  # Triggers agent spawning
+    # Workflow persists state, can resume after restart
 
-# 2. LangGraph handles individual agent logic
+# 3. LangGraph handles individual agent logic
 def create_agent_graph() -> StateGraph:
     graph = StateGraph(AgentState)
     graph.add_node("think", think_node)  # LLM reasoning
@@ -239,7 +294,7 @@ def create_agent_graph() -> StateGraph:
     # ... state transitions
     return graph.compile(checkpointer=PostgresSaver(...))
 
-# 3. LiteLLM handles LLM calls with failover
+# 4. LiteLLM handles LLM calls with failover
 async def think_node(state: AgentState) -> AgentState:
     response = await litellm.acompletion(
         model="claude-sonnet-4-20250514",
@@ -249,7 +304,7 @@ async def think_node(state: AgentState) -> AgentState:
     )
     return {"messages": [response.choices[0].message]}
 
-# 4. Redis Streams handles agent communication
+# 5. Redis Streams handles agent communication
 async def handoff_node(state: AgentState) -> AgentState:
     await message_bus.publish(AgentMessage(
         source_agent_id=state["agent_id"],
@@ -260,27 +315,60 @@ async def handoff_node(state: AgentState) -> AgentState:
     return state
 ```
 
+### Human Approval Checkpoints
+
+For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):
+
+```python
+class StoryWorkflow(Machine):
+    async def request_approval_and_wait(self, action: str):
+        """Pause workflow and await human decision."""
+        # 1. Create approval request
+        request = await approval_service.create(
+            workflow_id=self.id,
+            action=action,
+            context=self.context
+        )
+
+        # 2. Transition to waiting state (persisted)
+        self.state = 'awaiting_approval'
+        await self.persist_state()
+
+        # 3. Workflow is paused - Celery task completes
+        # When user approves, a new task resumes the workflow
+
+    @classmethod
+    async def resume_on_approval(cls, workflow_id: str, approved: bool):
+        """Called when user makes a decision."""
+        workflow = await cls.load(workflow_id)
+        if approved:
+            workflow.trigger('approved')
+        else:
+            workflow.trigger('rejected')
+```
+
 ## Consequences
 
 ### Positive
 
-- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
+- **Production-tested foundations** - LangGraph, Celery, LiteLLM are battle-tested
 - **No subscription lock-in** - All components self-hostable under permissive licenses
-- **Right tool for each job** - Specialized components for durability, state, communication
+- **Right tool for each job** - Specialized components for state, communication, background processing
 - **Escape hatches** - Can replace any component without full rewrite
-- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
+- **Simpler operations** - Uses existing PostgreSQL + Redis infrastructure, no new services
+- **Reboot survival** - Full durability via PostgreSQL persistence
 
 ### Negative
 
-- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
-- **Operational complexity** - More services to deploy and monitor
+- **Multiple technologies to learn** - Team needs LangGraph, transitions, Redis Streams knowledge
 - **Integration work** - Thin glue layers needed between components
+- **Manual recovery logic** - Must implement workflow recovery on startup
 
 ### Mitigation
 
 - **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
-- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
 - **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
+- **Recovery** - Implement startup recovery task that scans for in-progress workflows
 
 ## Compliance
 
@@ -297,23 +385,27 @@ This decision aligns with:
 
 LangSmith is LangChain's paid observability platform. Instead, we will:
 - Use **LangFuse** (open source, self-hostable) for LLM observability
-- Use **Temporal UI** (built-in) for workflow visibility
+- Use standard logging + PostgreSQL queries for workflow visibility
 - Build custom dashboards for Syndarix-specific metrics
 
-### Temporal Cloud
+### Temporal for Durable Workflows
 
-Temporal offers a managed cloud service. Instead, we will:
-- Self-host Temporal server (single-node for start, cluster for scale)
-- Use PostgreSQL as Temporal's persistence backend (already in stack)
+Temporal was initially considered but rejected for this project:
+- **Overkill for scale** - Syndarix targets 10-50 concurrent agents, not thousands
+- **Operational overhead** - Requires separate cluster, workers, SDK learning curve
+- **Simpler alternative available** - transitions + PostgreSQL provides equivalent durability
+- **Migration path** - If scale demands grow, Temporal can be introduced later
 
 ## References
 
 - [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
-- [Temporal.io Documentation](https://docs.temporal.io/)
+- [transitions Library](https://github.com/pytransitions/transitions)
 - [LiteLLM Documentation](https://docs.litellm.ai/)
 - [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
 - [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
 - [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
+- [SPIKE-008: Workflow State Machine](../spikes/SPIKE-008-workflow-state-machine.md)
+- [ADR-010: Workflow State Machine](./ADR-010-workflow-state-machine.md)
 
 ---
 
diff --git a/docs/architecture/ARCHITECTURE.md b/docs/architecture/ARCHITECTURE.md
index 840dd86..84ac920 100644
--- a/docs/architecture/ARCHITECTURE.md
+++ b/docs/architecture/ARCHITECTURE.md
@@ -90,7 +90,7 @@ Syndarix is an autonomous AI-powered software consulting platform that orchestra
 | ADR-004 | LLM Provider | LiteLLM with failover |
 | ADR-005 | Tech Stack | PragmaStack + extensions |
 | ADR-006 | Agent Orchestration | Type-Instance pattern |
-| ADR-007 | Framework Selection | Hybrid (LangGraph + custom) |
+| ADR-007 | Framework Selection | Hybrid (LangGraph + transitions + Celery) |
 | ADR-008 | Knowledge Base | pgvector for RAG |
 | ADR-009 | Agent Communication | Structured messages + Redis Streams |
 | ADR-010 | Workflows | transitions + PostgreSQL + Celery |
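
The ADR's "Recovery Process" and its "Manual recovery logic" consequence both come down to a startup scan over `workflow_instances`. A minimal sketch of that scan, together with the transition-history write, might look like the following. It is an illustration under stated assumptions, not the project's implementation: `sqlite3` stands in for PostgreSQL so the snippet runs with only the standard library, and the `workflow_transitions` columns, the `completed` status value, and the helper names `apply_trigger`, `recover_in_progress`, and `demo` are hypothetical. Only `current_state`, `context`, and `status = 'in_progress'` come from the ADR text.

```python
import json
import sqlite3

# Hypothetical schema. sqlite3 stands in for PostgreSQL so the sketch is runnable.
SCHEMA = """
CREATE TABLE workflow_instances (
    id TEXT PRIMARY KEY,
    current_state TEXT NOT NULL,
    context TEXT NOT NULL,      -- JSON text here; JSONB in PostgreSQL
    status TEXT NOT NULL
);
CREATE TABLE workflow_transitions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_id TEXT NOT NULL REFERENCES workflow_instances(id),
    from_state TEXT NOT NULL,
    to_state TEXT NOT NULL,
    trigger_name TEXT NOT NULL
);
"""

# Allowed transitions, mirroring the SprintWorkflow states named in the ADR.
TRANSITIONS = {
    ("planning", "start"): "active",
    ("active", "complete_work"): "review",
    ("review", "approve"): "done",
}


def apply_trigger(conn, workflow_id, trigger_name):
    """Advance one workflow and record the transition in the same transaction."""
    row = conn.execute(
        "SELECT current_state FROM workflow_instances WHERE id = ?",
        (workflow_id,),
    ).fetchone()
    if row is None:
        raise KeyError(workflow_id)
    from_state = row[0]
    to_state = TRANSITIONS.get((from_state, trigger_name))
    if to_state is None:
        raise ValueError(f"{trigger_name!r} is not valid from {from_state!r}")
    # 'completed' is an assumed terminal status; the ADR only names 'in_progress'.
    status = "completed" if to_state == "done" else "in_progress"
    with conn:  # commits both writes together, or rolls both back on error
        conn.execute(
            "UPDATE workflow_instances SET current_state = ?, status = ? WHERE id = ?",
            (to_state, status, workflow_id),
        )
        conn.execute(
            "INSERT INTO workflow_transitions"
            " (workflow_id, from_state, to_state, trigger_name) VALUES (?, ?, ?, ?)",
            (workflow_id, from_state, to_state, trigger_name),
        )
    return to_state


def recover_in_progress(conn):
    """Startup scan: return (id, current_state, context) for unfinished workflows."""
    rows = conn.execute(
        "SELECT id, current_state, context FROM workflow_instances"
        " WHERE status = 'in_progress' ORDER BY id"
    ).fetchall()
    return [(wid, state, json.loads(ctx)) for wid, state, ctx in rows]


def demo():
    """Create two workflows, advance both, then run the recovery scan."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    with conn:
        conn.executemany(
            "INSERT INTO workflow_instances VALUES (?, ?, ?, ?)",
            [
                ("sprint-1", "planning", json.dumps({"agents": 3}), "in_progress"),
                ("sprint-2", "review", json.dumps({}), "in_progress"),
            ],
        )
    apply_trigger(conn, "sprint-1", "start")    # planning -> active
    apply_trigger(conn, "sprint-2", "approve")  # review -> done (completed)
    return recover_in_progress(conn)
```

Writing the state update and the history row in one transaction means a crash between the two cannot leave the history table out of step with `workflow_instances`, which is the property the startup scan relies on. In this sketch, `demo()` returns only `sprint-1`, since `sprint-2` has reached its terminal state.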