fix: Resolve ADR-007 vs ADR-010 Temporal contradiction

Remove Temporal from the architecture in favor of the simpler
transitions + PostgreSQL + Celery approach. This aligns ADR-007 with
ADR-010 based on user preference for simpler operations.

Key changes:
- ADR-007 now recommends transitions library instead of Temporal
- Added explicit "Why Not Temporal?" section explaining the trade-off
- Added "Reboot Survival" section documenting durability guarantees
- Updated architecture diagrams and component responsibilities
- Updated ARCHITECTURE.md summary matrix

The simpler approach is more appropriate for Syndarix's scale
(10-50 concurrent agents) and uses existing PostgreSQL + Celery
infrastructure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@@ -99,11 +99,29 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 **Adopt a hybrid architecture using LangGraph as the core agent framework**, complemented by:
 
 1. **LangGraph** - Agent state machines and logic
-2. **Temporal** - Durable workflow execution
+2. **transitions + PostgreSQL + Celery** - Durable workflow state machines
 3. **Redis Streams** - Agent-to-agent communication
 4. **LiteLLM** - Unified LLM access with failover
 5. **PostgreSQL + pgvector** - State persistence and RAG
 
+### Why Not Temporal?
+
+After evaluating both approaches, we chose the simpler **transitions + PostgreSQL + Celery** stack over Temporal:
+
+| Factor | Temporal | transitions + PostgreSQL |
+|--------|----------|-------------------------|
+| Complexity | High (separate cluster, workers, SDK) | Low (Python library + existing infra) |
+| Learning Curve | Steep (new paradigm) | Gentle (familiar patterns) |
+| Infrastructure | Dedicated cluster required | Uses existing PostgreSQL + Celery |
+| Scale Target | Enterprise (1000s of workflows) | Syndarix (10s of agents) |
+| Debugging | Temporal UI (powerful but complex) | Standard DB queries + logs |
+
+**Temporal is overkill for our scale** (10-50 concurrent agents). The simpler approach provides:
+- Full durability via PostgreSQL state persistence
+- Event sourcing via transition history table
+- Background execution via Celery workers
+- Simpler debugging with standard tools
+
 ### Architecture Overview
 
 ```
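The "event sourcing via transition history table" point in the hunk above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere, the `workflow_transitions` table name comes from the ADR, and the column names and both helper functions are assumptions made for the example.

```python
import sqlite3


def create_transition_log(conn: sqlite3.Connection) -> None:
    """Create an append-only transition history table (hypothetical columns)."""
    conn.execute(
        """
        CREATE TABLE workflow_transitions (
            seq          INTEGER PRIMARY KEY AUTOINCREMENT,
            workflow_id  TEXT NOT NULL,
            trigger_name TEXT NOT NULL,
            from_state   TEXT NOT NULL,
            to_state     TEXT NOT NULL
        )
        """
    )


def record_transition(conn, workflow_id, trigger_name, from_state, to_state):
    """Append one transition event; rows are never updated or deleted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO workflow_transitions "
            "(workflow_id, trigger_name, from_state, to_state) "
            "VALUES (?, ?, ?, ?)",
            (workflow_id, trigger_name, from_state, to_state),
        )


def replay_state(conn, workflow_id, initial="planning"):
    """Rebuild the current state by replaying the history in insertion order."""
    state = initial
    rows = conn.execute(
        "SELECT from_state, to_state FROM workflow_transitions "
        "WHERE workflow_id = ? ORDER BY seq",
        (workflow_id,),
    )
    for from_state, to_state in rows:
        if from_state != state:
            raise ValueError(f"history gap: expected {state!r}, got {from_state!r}")
        state = to_state
    return state
```

Because the log is append-only, the current state can always be cross-checked against the history, which is the debugging benefit the comparison table attributes to "standard DB queries + logs".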
@@ -112,12 +130,12 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 ├─────────────────────────────────────────────────────────────────────────┤
 │ │
 │ ┌───────────────────────────────────────────────────────────────────┐ │
-│ │ Temporal Workflow Engine │ │
+│ │ Workflow Engine (transitions + PostgreSQL) │ │
 │ │ │ │
-│ │ • Durable execution (survives crashes, restarts, deployments) │ │
-│ │ • Human approval checkpoints (wait indefinitely for client) │ │
-│ │ • Long-running workflows (projects spanning weeks/months) │ │
-│ │ • Built-in retry policies and timeouts │ │
+│ │ • State persistence to PostgreSQL (survives restarts) │ │
+│ │ • Event sourcing via workflow_transitions table │ │
+│ │ • Human approval checkpoints (pause workflow, await signal) │ │
+│ │ • Background execution via Celery workers │ │
 │ │ │ │
 │ │ License: MIT | Self-Hosted: Yes | Subscription: None Required │ │
 │ └───────────────────────────────────────────────────────────────────┘ │
@@ -173,10 +191,35 @@ We evaluated whether to adopt an existing framework wholesale or build a custom
 | Component | Responsibility | Why This Choice |
 |-----------|---------------|-----------------|
 | **LangGraph** | Agent state machines, tool execution, reasoning loops | Production-proven, fine-grained control, PostgreSQL checkpointing |
-| **Temporal** | Durable workflows, human approvals, long-running orchestration | Only solution for week-long workflows that survive failures |
+| **transitions** | Workflow state machines (sprint, story, PR) | Lightweight, Pythonic, no external dependencies |
+| **Celery + Redis** | Background task execution, async workflows | Already in stack, battle-tested |
+| **PostgreSQL** | Workflow state persistence, event sourcing | ACID guarantees, survives restarts |
 | **Redis Streams** | Agent messaging, real-time events, pub/sub | Low-latency, persistent streams, consumer groups |
 | **LiteLLM** | LLM abstraction, failover, cost tracking | Unified API, automatic failover, no vendor lock-in |
-| **PostgreSQL** | State persistence, audit logs, agent data | Already in stack, pgvector for RAG |
+
+### Reboot Survival (Durability)
+
+The architecture **fully supports system reboots and crashes**:
+
+1. **Workflow State**: Persisted to PostgreSQL `workflow_instances` table
+2. **Transition History**: Event-sourced in `workflow_transitions` table
+3. **Agent Checkpoints**: LangGraph persists to PostgreSQL
+4. **Pending Tasks**: Celery tasks in Redis (configured with persistence)
+
+**Recovery Process:**
+```
+System Restart
+│
+▼
+Load workflow_instances WHERE status = 'in_progress'
+│
+▼
+For each workflow:
+├── Restore state from context JSONB
+├── Identify current_state
+├── Resume from last checkpoint
+└── Continue execution
+```
 
 ### Self-Hostability Guarantee
 
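The recovery process added in the hunk above could look roughly like the sketch below. It is an illustration under stated assumptions: `sqlite3` stands in for PostgreSQL (so `context` is stored as JSON text rather than JSONB), the `workflow_instances` table and its `status`, `current_state` and `context` columns follow the ADR text, and `recover_workflows` plus its `resume` callback are hypothetical names. In the real system the callback would presumably enqueue a Celery task rather than run inline.

```python
import json
import sqlite3


def create_instances_table(conn: sqlite3.Connection) -> None:
    """Stand-in schema; the real table lives in PostgreSQL with a JSONB context."""
    conn.execute(
        """
        CREATE TABLE workflow_instances (
            id            TEXT PRIMARY KEY,
            status        TEXT NOT NULL,
            current_state TEXT NOT NULL,
            context       TEXT NOT NULL
        )
        """
    )


def load_in_progress(conn):
    """Load every workflow that was still running when the system stopped."""
    rows = conn.execute(
        "SELECT id, current_state, context FROM workflow_instances "
        "WHERE status = 'in_progress' ORDER BY id"
    ).fetchall()
    return [
        {"id": wf_id, "current_state": state, "context": json.loads(ctx)}
        for wf_id, state, ctx in rows
    ]


def recover_workflows(conn, resume):
    """Hand each in-progress workflow to `resume` and return the resumed ids."""
    resumed = []
    for workflow in load_in_progress(conn):
        resume(workflow)  # assumption: enqueues a background task in production
        resumed.append(workflow["id"])
    return resumed
```

Running this once at startup, before workers accept new tasks, is one way to address the "manual recovery logic" that this commit lists as a negative consequence.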
@@ -185,7 +228,8 @@ All components are fully self-hostable with permissive open-source licenses:
 | Component | License | Paid Cloud Alternative | Required for Syndarix? |
 |-----------|---------|----------------------|----------------------|
 | LangGraph | MIT | LangSmith (observability) | No - use LangFuse or custom |
-| Temporal | MIT | Temporal Cloud | No - self-host server |
+| transitions | MIT | N/A | N/A - simple library |
+| Celery | BSD-3 | Various | No - self-host |
 | LiteLLM | MIT | LiteLLM Enterprise | No - self-host proxy |
 | Redis | BSD-3 | Redis Cloud | No - self-host |
 | PostgreSQL | PostgreSQL | Various managed DBs | No - self-host |
@@ -198,7 +242,8 @@ All components are fully self-hostable with permissive open-source licenses:
 |---------|----------|-----------|
 | Agent Logic | **USE LangGraph** | Don't reinvent state machines |
 | LLM Access | **USE LiteLLM** | Don't reinvent provider abstraction |
-| Durability | **USE Temporal** | Don't reinvent durable execution |
+| Workflow State | **USE transitions + PostgreSQL** | Simple, durable, debuggable |
+| Background Tasks | **USE Celery** | Already in stack, proven |
 | Messaging | **USE Redis Streams** | Don't reinvent pub/sub |
 | Orchestration | **BUILD thin layer** | Syndarix-specific (autonomy levels, team structure) |
 | Agent Spawning | **BUILD thin layer** | Type-Instance pattern specific to Syndarix |
@@ -209,28 +254,38 @@ All components are fully self-hostable with permissive open-source licenses:
 ```python
 # Example: How the layers integrate
 
-# 1. Temporal orchestrates the high-level workflow
-@workflow.defn
-class SprintWorkflow:
-    @workflow.run
-    async def run(self, sprint: SprintConfig) -> SprintResult:
-        # Spawns agents and waits for completion
-        agents = await workflow.execute_activity(spawn_agent_team, sprint)
+# 1. Workflow state machine (transitions library)
+class SprintWorkflow(Machine):
+    states = ['planning', 'active', 'review', 'done']
 
-        # Each agent runs a LangGraph state machine
-        results = await workflow.execute_activity(
-            run_agent_tasks,
-            agents,
-            start_to_close_timeout=timedelta(days=7),
+    def __init__(self, sprint_id: str):
+        self.sprint_id = sprint_id
+        Machine.__init__(
+            self,
+            states=self.states,
+            initial='planning',
+            after_state_change='persist_state'
         )
+        self.add_transition('start', 'planning', 'active', before='spawn_agents')
+        self.add_transition('complete_work', 'active', 'review')
+        self.add_transition('approve', 'review', 'done', conditions='has_approval')
 
-        # Human checkpoint (waits indefinitely)
-        if sprint.autonomy_level != AutonomyLevel.AUTONOMOUS:
-            await workflow.wait_condition(lambda: self._approved)
+    async def persist_state(self):
+        """Save state to PostgreSQL (survives restarts)"""
+        await db.execute("""
+            UPDATE workflow_instances
+            SET current_state = $1, context = $2, updated_at = NOW()
+            WHERE id = $3
+        """, self.state, self.context, self.sprint_id)
 
-        return results
+# 2. Background execution via Celery
+@celery_app.task(bind=True, max_retries=3)
+def run_sprint_workflow(self, sprint_id: str):
+    workflow = SprintWorkflow.load(sprint_id) # Restore from DB
+    workflow.start() # Triggers agent spawning
+    # Workflow persists state, can resume after restart
 
-# 2. LangGraph handles individual agent logic
+# 3. LangGraph handles individual agent logic
 def create_agent_graph() -> StateGraph:
     graph = StateGraph(AgentState)
     graph.add_node("think", think_node) # LLM reasoning
@@ -239,7 +294,7 @@ def create_agent_graph() -> StateGraph:
     # ... state transitions
     return graph.compile(checkpointer=PostgresSaver(...))
 
-# 3. LiteLLM handles LLM calls with failover
+# 4. LiteLLM handles LLM calls with failover
 async def think_node(state: AgentState) -> AgentState:
     response = await litellm.acompletion(
         model="claude-sonnet-4-20250514",
@@ -249,7 +304,7 @@ async def think_node(state: AgentState) -> AgentState:
     )
     return {"messages": [response.choices[0].message]}
 
-# 4. Redis Streams handles agent communication
+# 5. Redis Streams handles agent communication
 async def handoff_node(state: AgentState) -> AgentState:
     await message_bus.publish(AgentMessage(
         source_agent_id=state["agent_id"],
@@ -260,27 +315,60 @@ async def handoff_node(state: AgentState) -> AgentState:
     return state
 ```
 
+### Human Approval Checkpoints
+
+For workflows requiring human approval (FULL_CONTROL and MILESTONE modes):
+
+```python
+class StoryWorkflow(Machine):
+    async def request_approval_and_wait(self, action: str):
+        """Pause workflow and await human decision."""
+        # 1. Create approval request
+        request = await approval_service.create(
+            workflow_id=self.id,
+            action=action,
+            context=self.context
+        )
+
+        # 2. Transition to waiting state (persisted)
+        self.state = 'awaiting_approval'
+        await self.persist_state()
+
+        # 3. Workflow is paused - Celery task completes
+        # When user approves, a new task resumes the workflow
+
+    @classmethod
+    async def resume_on_approval(cls, workflow_id: str, approved: bool):
+        """Called when user makes a decision."""
+        workflow = await cls.load(workflow_id)
+        if approved:
+            workflow.trigger('approved')
+        else:
+            workflow.trigger('rejected')
+```
+
 ## Consequences
 
 ### Positive
 
-- **Production-tested foundations** - LangGraph, Temporal, LiteLLM are battle-tested
+- **Production-tested foundations** - LangGraph, Celery, LiteLLM are battle-tested
 - **No subscription lock-in** - All components self-hostable under permissive licenses
-- **Right tool for each job** - Specialized components for durability, state, communication
+- **Right tool for each job** - Specialized components for state, communication, background processing
 - **Escape hatches** - Can replace any component without full rewrite
-- **Enterprise patterns** - Temporal used by Netflix, Uber, Stripe for similar problems
+- **Simpler operations** - Uses existing PostgreSQL + Redis infrastructure, no new services
+- **Reboot survival** - Full durability via PostgreSQL persistence
 
 ### Negative
 
-- **Multiple technologies to learn** - Team needs LangGraph, Temporal, Redis Streams knowledge
-- **Operational complexity** - More services to deploy and monitor
+- **Multiple technologies to learn** - Team needs LangGraph, transitions, Redis Streams knowledge
 - **Integration work** - Thin glue layers needed between components
+- **Manual recovery logic** - Must implement workflow recovery on startup
 
 ### Mitigation
 
 - **Learning curve** - Start with simple 2-3 agent workflows, expand gradually
-- **Operational complexity** - Use Docker Compose locally, consider managed services for production if needed
 - **Integration** - Create clear abstractions; each layer only knows its immediate neighbors
+- **Recovery** - Implement startup recovery task that scans for in-progress workflows
 
 ## Compliance
 
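The pause-and-resume pattern in the "Human Approval Checkpoints" section added above can be exercised end to end with a small stand-alone sketch. It deliberately does not use the transitions library: a plain dict serves as the transition table, and an in-memory `STORE` dict stands in for the `workflow_instances` table. The state names other than `awaiting_approval`, the destinations of the `approved` and `rejected` triggers, and the helper names are assumptions for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical transition table: (current_state, trigger) -> next_state
TRANSITIONS = {
    ("in_progress", "request_approval"): "awaiting_approval",
    ("awaiting_approval", "approved"): "done",
    ("awaiting_approval", "rejected"): "in_progress",
}

STORE = {}  # stand-in for the workflow_instances table


@dataclass
class StoryWorkflow:
    workflow_id: str
    state: str = "in_progress"
    context: dict = field(default_factory=dict)

    def trigger(self, name: str) -> None:
        """Apply a named transition and persist the result immediately."""
        key = (self.state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{name!r} is not allowed from state {self.state!r}")
        self.state = TRANSITIONS[key]
        self.persist_state()

    def persist_state(self) -> None:
        STORE[self.workflow_id] = {"state": self.state, "context": dict(self.context)}

    @classmethod
    def load(cls, workflow_id: str) -> "StoryWorkflow":
        row = STORE[workflow_id]
        return cls(workflow_id, row["state"], dict(row["context"]))


def request_approval(workflow: StoryWorkflow, action: str) -> None:
    """Record the pending action, move to the waiting state, then return.

    Nothing blocks here: the background task simply finishes.
    """
    workflow.context["pending_action"] = action
    workflow.trigger("request_approval")


def resume_on_approval(workflow_id: str, approved: bool) -> StoryWorkflow:
    """Run later, in a fresh task, once the user has decided."""
    workflow = StoryWorkflow.load(workflow_id)
    workflow.trigger("approved" if approved else "rejected")
    return workflow
```

The point of the design is that no process waits while a human decides; everything needed to continue lives in the persisted row, so a restart between request and decision loses nothing.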
@@ -297,23 +385,27 @@ This decision aligns with:
 
 LangSmith is LangChain's paid observability platform. Instead, we will:
 - Use **LangFuse** (open source, self-hostable) for LLM observability
-- Use **Temporal UI** (built-in) for workflow visibility
+- Use standard logging + PostgreSQL queries for workflow visibility
 - Build custom dashboards for Syndarix-specific metrics
 
-### Temporal Cloud
+### Temporal for Durable Workflows
 
-Temporal offers a managed cloud service. Instead, we will:
-- Self-host Temporal server (single-node for start, cluster for scale)
-- Use PostgreSQL as Temporal's persistence backend (already in stack)
+Temporal was initially considered but rejected for this project:
+- **Overkill for scale** - Syndarix targets 10-50 concurrent agents, not thousands
+- **Operational overhead** - Requires separate cluster, workers, SDK learning curve
+- **Simpler alternative available** - transitions + PostgreSQL provides equivalent durability
+- **Migration path** - If scale demands grow, Temporal can be introduced later
 
 ## References
 
 - [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
-- [Temporal.io Documentation](https://docs.temporal.io/)
+- [transitions Library](https://github.com/pytransitions/transitions)
 - [LiteLLM Documentation](https://docs.litellm.ai/)
 - [LangFuse (Open Source LLM Observability)](https://langfuse.com/)
 - [SPIKE-002: Agent Orchestration Pattern](../spikes/SPIKE-002-agent-orchestration-pattern.md)
 - [SPIKE-005: LLM Provider Abstraction](../spikes/SPIKE-005-llm-provider-abstraction.md)
+- [SPIKE-008: Workflow State Machine](../spikes/SPIKE-008-workflow-state-machine.md)
+- [ADR-010: Workflow State Machine](./ADR-010-workflow-state-machine.md)
 
 ---
 
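The "standard logging + PostgreSQL queries for workflow visibility" replacement for the Temporal UI might amount to something as small as the sketch below. As before, `sqlite3` is a stand-in for PostgreSQL, the table and `current_state` column follow the ADR, and the function names are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("workflow.visibility")


def create_demo_table(conn: sqlite3.Connection) -> None:
    """Reduced stand-in for workflow_instances with only the columns used here."""
    conn.execute(
        "CREATE TABLE workflow_instances (id TEXT PRIMARY KEY, current_state TEXT NOT NULL)"
    )


def workflows_by_state(conn: sqlite3.Connection) -> dict:
    """Return {state: count}, the kind of summary a dashboard or log line needs."""
    rows = conn.execute(
        "SELECT current_state, COUNT(*) FROM workflow_instances "
        "GROUP BY current_state ORDER BY current_state"
    ).fetchall()
    summary = dict(rows)
    logger.info("workflow states: %s", summary)
    return summary
```

A query like this can be run ad hoc from `psql` or scheduled as a periodic task; either way it needs no infrastructure beyond the existing database.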
@@ -90,7 +90,7 @@ Syndarix is an autonomous AI-powered software consulting platform that orchestra
 | ADR-004 | LLM Provider | LiteLLM with failover |
 | ADR-005 | Tech Stack | PragmaStack + extensions |
 | ADR-006 | Agent Orchestration | Type-Instance pattern |
-| ADR-007 | Framework Selection | Hybrid (LangGraph + custom) |
+| ADR-007 | Framework Selection | Hybrid (LangGraph + transitions + Celery) |
 | ADR-008 | Knowledge Base | pgvector for RAG |
 | ADR-009 | Agent Communication | Structured messages + Redis Streams |
 | ADR-010 | Workflows | transitions + PostgreSQL + Celery |