fix: Resolve ADR-007 vs ADR-010 Temporal contradiction

Remove Temporal from the architecture in favor of the simpler
transitions + PostgreSQL + Celery approach. This aligns ADR-007
with ADR-010 based on user preference for simpler operations.

Key changes:
- ADR-007 now recommends transitions library instead of Temporal
- Added explicit "Why Not Temporal?" section explaining the trade-off
- Added "Reboot Survival" section documenting durability guarantees
- Updated architecture diagrams and component responsibilities
- Updated ARCHITECTURE.md summary matrix

The simpler approach is more appropriate for Syndarix's scale (10-50
concurrent agents) and uses existing PostgreSQL + Celery infrastructure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-29 14:04:37 +01:00
parent 406b25cda0
commit de47d9ee43
2 changed files with 134 additions and 42 deletions


@@ -99,11 +99,29 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
**Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
1. **LangGraph** - Agent state machines and logic
2. **Temporal** - Durable workflow execution
2. **transitions + PostgreSQL + Celery** - Durable workflow state machines
3. **Redis Streams** - Agent-to-agent communication
4. **LiteLLM** - Unified LLM access with failover
5. **PostgreSQL + pgvector** - State persistence and RAG
### Why Not Temporal?
After evaluating both approaches, we chose the simpler **transitions + PostgreSQL + Celery** stack over Temporal:
| Factor | Temporal | transitions + PostgreSQL |
|--------|----------|-------------------------|
| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
**Temporal is overkill for our scale** (10-50 concurrent agents). The simpler approach provides:
- Full durability via PostgreSQL state persistence
- Event sourcing via transition history table
- Background execution via Celery workers
- Simpler debugging with standard tools
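As a rough illustration of what that durability rests on, the two tables might look like the sketch below. Column names and types are assumptions for illustration only; ADR-010 and SPIKE-008 define the actual schema.

```python
# Illustrative only: a possible shape for the durability tables (see ADR-010 for the real schema).
WORKFLOW_TABLES_DDL = """
CREATE TABLE IF NOT EXISTS workflow_instances (
    id            UUID PRIMARY KEY,
    workflow_type TEXT NOT NULL,                  -- e.g. 'sprint', 'story', 'pr'
    current_state TEXT NOT NULL,                  -- state name managed by transitions
    context       JSONB NOT NULL DEFAULT '{}',
    status        TEXT NOT NULL DEFAULT 'in_progress',
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS workflow_transitions (
    id          BIGSERIAL PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state  TEXT NOT NULL,
    to_state    TEXT NOT NULL,
    trigger     TEXT NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
"""

async def create_workflow_tables(db) -> None:
    """One-off migration helper; `db` is the asyncpg-style handle used elsewhere in this ADR."""
    await db.execute(WORKFLOW_TABLES_DDL)
```

Every state change appends a row to `workflow_transitions`, which is what makes the history replayable.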
### Architecture Overview
```
@@ -112,12 +130,12 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Temporal Workflow Engine │ │
│ │ Workflow Engine (transitions + PostgreSQL) │ │
│ │ │ │
│ │ • Durable execution (survives crashes, restarts, deployments) │ │
│ │ • Human approval checkpoints (wait indefinitely for client) │ │
│ │ • Long-running workflows (projects spanning weeks/months) │ │
│ │ • Built-in retry policies and timeouts │ │
│ │ • State persistence to PostgreSQL (survives restarts) │ │
│ │ • Event sourcing via workflow_transitions table │ │
│ │ • Human approval checkpoints (pause workflow, await signal) │ │
│ │ • Background execution via Celery workers │ │
│ │ │ │
│ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
│ └───────────────────────────────────────────────────────────────────┘ │
@@ -173,10 +191,35 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
| Component | Responsibility | Why This Choice |
|-----------|---------------|-----------------|
| **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
| **transitions** | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
| **Celery + Redis** | Background task execution, async workflows | Already in stack, battle-tested |
| **PostgreSQL** | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
| **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
| **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
### Reboot Survival (Durability)
The architecture **fully survives system reboots and crashes**:
1. **Workflow State**: Persisted to PostgreSQL `workflow_instances` table
2. **Transition History**: Event-sourced in `workflow_transitions` table
3. **Agent Checkpoints**: LangGraph persists to PostgreSQL
4. **Pending Tasks**: Celery tasks in Redis (configured with persistence)
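Point 4 only holds if Celery and Redis are actually configured for durability. A hedged sketch of the relevant settings (values and the broker URL are illustrative, not the project's real configuration):

```python
# celery_config.py - durability-related settings (illustrative values)
from celery import Celery

celery_app = Celery("syndarix", broker="redis://redis:6379/0")
celery_app.conf.update(
    task_acks_late=True,              # unacked tasks are redelivered if a worker dies mid-run
    task_reject_on_worker_lost=True,  # a killed worker counts as a failure, not a silent success
    broker_transport_options={"visibility_timeout": 3600},
)

# Redis itself must persist to disk, e.g. in redis.conf:
#   appendonly yes
#   appendfsync everysec
```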
**Recovery Process:**
```
System Restart
Load workflow_instances WHERE status = 'in_progress'
For each workflow:
├── Restore state from context JSONB
├── Identify current_state
├── Resume from last checkpoint
└── Continue execution
```
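A minimal sketch of that recovery pass as a startup task (the task registry, the synchronous `db.fetch_all` helper, and the task paths are assumptions, not the actual implementation):

```python
# Hypothetical startup recovery task, reusing celery_app from the config sketch above
WORKFLOW_ENTRY_TASKS = {
    "sprint": "workflows.run_sprint_workflow",   # illustrative task paths
    "story": "workflows.run_story_workflow",
}

@celery_app.task
def resume_in_progress_workflows() -> int:
    """Scan for workflows interrupted by a restart and re-enqueue them."""
    rows = db.fetch_all(
        "SELECT id, workflow_type FROM workflow_instances WHERE status = 'in_progress'"
    )
    for row in rows:
        celery_app.send_task(WORKFLOW_ENTRY_TASKS[row["workflow_type"]], args=[str(row["id"])])
    return len(rows)
```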
### Self-Hostability Guarantee
@@ -185,7 +228,8 @@ All components are fully self-hostable with permissive open-source licenses:
| Component | License | Paid Cloud Alternative | Required for Syndarix? |
|-----------|---------|----------------------|----------------------|
| LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
| Temporal | MIT | Temporal Cloud | No - self-host server |
| transitions | MIT | N/A | N/A - simple library |
| Celery | BSD-3 | Various | No - self-host |
| LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
| Redis | BSD-3 | Redis Cloud | No - self-host |
| PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
@@ -198,7 +242,8 @@ All components are fully self-hostable with permissive open-source licenses:
|---------|----------|-----------|
| Agent Logic | **USE LangGraph** | Don't reinvent state machines |
| LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
| Durability | **USE Temporal** | Don't reinvent durable execution |
| Workflow State | **USE transitions + PostgreSQL** | Simple, durable, debuggable |
| Background Tasks | **USE Celery** | Already in stack, proven |
| Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
| Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
| Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
@@ -209,28 +254,38 @@ All components are fully self-hostable with permissive open-source licenses:
```python
# Example: How the layers integrate
-# 1. Temporal orchestrates the high-level workflow
-@workflow.defn
-class SprintWorkflow:
-    @workflow.run
-    async def run(self, sprint: SprintConfig) -> SprintResult:
-        # Spawns agents and waits for completion
-        agents = await workflow.execute_activity(spawn_agent_team, sprint)
+# 1. Workflow state machine (transitions library)
+class SprintWorkflow(Machine):
+    states = ['planning', 'active', 'review', 'done']
-        # Each agent runs a LangGraph state machine
-        results = await workflow.execute_activity(
-            run_agent_tasks,
-            agents,
-            start_to_close_timeout=timedelta(days=7),
+    def __init__(self, sprint_id: str):
+        self.sprint_id = sprint_id
+        Machine.__init__(
+            self,
+            states=self.states,
+            initial='planning',
+            after_state_change='persist_state'
        )
+        self.add_transition('start', 'planning', 'active', before='spawn_agents')
+        self.add_transition('complete_work', 'active', 'review')
+        self.add_transition('approve', 'review', 'done', conditions='has_approval')
-        # Human checkpoint (waits indefinitely)
-        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
-            await workflow.wait_condition(lambda: self._approved)
+    async def persist_state(self):
+        """Save state to PostgreSQL (survives restarts)"""
+        await db.execute("""
+            UPDATE workflow_instances
+            SET current_state = $1, context = $2, updated_at = NOW()
+            WHERE id = $3
+        """, self.state, self.context, self.sprint_id)
-        return results
+# 2. Background execution via Celery
+@celery_app.task(bind=True, max_retries=3)
+def run_sprint_workflow(self, sprint_id: str):
+    workflow = SprintWorkflow.load(sprint_id)  # Restore from DB
+    workflow.start()  # Triggers agent spawning
+    # Workflow persists state, can resume after restart
-# 2. LangGraph handles individual agent logic
+# 3. LangGraph handles individual agent logic
def create_agent_graph() -> StateGraph:
graph = StateGraph(AgentState)
graph.add_node("think", think_node) # LLM reasoning
@@ -239,7 +294,7 @@ def create_agent_graph() -> StateGraph:
# ... state transitions
return graph.compile(checkpointer=PostgresSaver(...))
-# 3. LiteLLM handles LLM calls with failover
+# 4. LiteLLM handles LLM calls with failover
async def think_node(state: AgentState) -> AgentState:
response = await litellm.acompletion(
model="claude-sonnet-4-20250514",
@@ -249,7 +304,7 @@ async def think_node(state: AgentState) -> AgentState:
)
return {"messages": [response.choices[0].message]}
-# 4. Redis Streams handles agent communication
+# 5. Redis Streams handles agent communication
async def handoff_node(state: AgentState) -> AgentState:
await message_bus.publish(AgentMessage(
source_agent_id=state["agent_id"],
@@ -260,27 +315,60 @@ async def handoff_node(state: AgentState) -> AgentState:
return state
```
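The `SprintWorkflow.load()` call in the Celery task above carries most of the durability weight. One possible shape for it, reusing the assumed schema and the asyncpg-style `db` handle from earlier (a sketch, not the actual implementation):

```python
# Hypothetical rehydration helper (sketch only)
from transitions import Machine

class SprintWorkflow(Machine):
    # ... states, __init__ and persist_state as shown above ...

    @classmethod
    async def load(cls, sprint_id: str) -> "SprintWorkflow":
        """Rebuild the state machine from its persisted row."""
        row = await db.fetchrow(
            "SELECT current_state, context FROM workflow_instances WHERE id = $1",
            sprint_id,
        )
        workflow = cls(sprint_id)                 # re-registers states and transitions
        workflow.set_state(row["current_state"])  # jump to the persisted state, skipping callbacks
        workflow.context = row["context"]         # persisted JSONB context
        return workflow
```

Because the helper is async while Celery tasks are synchronous, the task entry point would need a small `asyncio.run()` bridge, as in the approval example further down.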
### Human Approval Checkpoints
For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):
```python
class StoryWorkflow(Machine):
async def request_approval_and_wait(self, action: str):
"""Pause workflow and await human decision."""
# 1. Create approval request
request = await approval_service.create(
workflow_id=self.id,
action=action,
context=self.context
)
# 2. Transition to waiting state (persisted)
self.state = 'awaiting_approval'
await self.persist_state()
# 3. Workflow is paused - Celery task completes
# When user approves, a new task resumes the workflow
@classmethod
async def resume_on_approval(cls, workflow_id: str, approved: bool):
"""Called when user makes a decision."""
workflow = await cls.load(workflow_id)
if approved:
workflow.trigger('approved')
else:
workflow.trigger('rejected')
```
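What actually calls `resume_on_approval` is left open above. One minimal wiring, assuming the client's decision is handed off to a Celery task (the task name and the asyncio bridge are illustrative):

```python
import asyncio

@celery_app.task
def handle_approval_decision(workflow_id: str, approved: bool) -> None:
    # Bridge from Celery's synchronous task context into the async workflow API
    asyncio.run(StoryWorkflow.resume_on_approval(workflow_id, approved))
```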
## Consequences
### Positive
- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
- **Production-tested foundations** - LangGraph, Celery, LiteLLM are battle-tested
- **No subscription lock-in** - All components self-hostable under permissive licenses
- **Right tool for each job** - Specialized components for durability, state, communication
- **Right tool for each job** - Specialized components for state, communication, background processing
- **Escape hatches** - Can replace any component without full rewrite
- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
- **Simpler operations** - Uses existing PostgreSQL + Redis infrastructure, no new services
- **Reboot survival** - Full durability via PostgreSQL persistence
### Negative
- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
- **Operational complexity** - More services to deploy and monitor
- **Multiple technologies to learn** - Team needs LangGraph, transitions, Redis Streams knowledge
- **Integration work** - Thin glue layers needed between components
- **Manual recovery logic** - Must implement workflow recovery on startup
### Mitigation
- **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
- **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
- **Recovery** - Implement startup recovery task that scans for in-progress workflows
## Compliance
@@ -297,23 +385,27 @@ This decision aligns with:
LangSmith is LangChain's paid observability platform. Instead, we will:
- Use **LangFuse** (open source, self-hostable) for LLM observability
- Use **Temporal UI** (built-in) for workflow visibility
- Use standard logging + PostgreSQL queries for workflow visibility
- Build custom dashboards for Syndarix-specific metrics
### Temporal Cloud
### Temporal for Durable Workflows
Temporal offers a managed cloud service. Instead, we will:
- Self-host Temporal server (single-node for start, cluster for scale)
- Use PostgreSQL as Temporal's persistence backend (already in stack)
Temporal was initially considered but rejected for this project:
- **Overkill for scale** - Syndarix targets 10-50 concurrent agents, not thousands
- **Operational overhead** - Requires separate cluster, workers, SDK learning curve
- **Simpler alternative available** - transitions + PostgreSQL provides equivalent durability
- **Migration path** - If scale demands grow, Temporal can be introduced later
## References
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [Temporal.io Documentation](https://docs.temporal.io/)
- [transitions Library](https://github.com/pytransitions/transitions)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
- [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
- [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
- [SPIKE-008: Workflow State Machine](../spikes/SPIKE-008-workflow-state-machine.md)
- [ADR-010: Workflow State Machine](./ADR-010-workflow-state-machine.md)
---


@@ -90,7 +90,7 @@ Syndarix is an autonomous AI-powered software consulting platform that orchestra
| ADR-004 | LLM Provider | LiteLLM with failover |
| ADR-005 | Tech Stack | PragmaStack + extensions |
| ADR-006 | Agent Orchestration | Type-Instance pattern |
| ADR-007 | Framework Selection | Hybrid (LangGraph + custom) |
| ADR-007 | Framework Selection | Hybrid (LangGraph + transitions + Celery) |
| ADR-008 | Knowledge Base | pgvector for RAG |
| ADR-009 | Agent Communication | Structured messages + Redis Streams |
| ADR-010 | Workflows | transitions + PostgreSQL + Celery |