# ADR-010: Workflow State Machine Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-008

---

## Context

Syndarix requires durable state machines for orchestrating long-lived workflows that span hours to weeks:
- Sprint execution (1-2 weeks)
- Story implementation (hours to days)
- PR review cycles (hours)
- Approval flows (variable)

Workflows must survive system restarts, handle failures gracefully, and expose their current state and full transition history.

## Decision Drivers

- **Durability:** State must survive crashes and restarts
- **Visibility:** Clear status of all workflows
- **Flexibility:** Support various workflow types
- **Simplicity:** Avoid heavy infrastructure
- **Auditability:** Full history of state transitions

## Considered Options

### Option 1: Temporal.io

**Pros:**
- Durable execution out of the box
- Handles multi-day workflows
- Built-in retries, timeouts, versioning

**Cons:**
- Heavy infrastructure (cluster required)
- Operational burden
- Overkill for Syndarix's scale

### Option 2: Custom Engine + `transitions` Library (Selected)

**Pros:**
- Lightweight, Pythonic
- PostgreSQL persistence (existing infra)
- Full control over behavior
- Event sourcing for audit trail

**Cons:**
- Manual persistence implementation
- No distributed coordination

### Option 3: Prefect

**Pros:** Good for data pipelines
**Cons:** Wrong abstraction for business workflows

## Decision

**Adopt a custom workflow engine using the `transitions` library** with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of a dedicated workflow engine. At Syndarix's scale (dozens, not thousands, of concurrent workflows), this is the right trade-off.

## Implementation

### State Machine Definition

```python
from transitions import Machine


class StoryWorkflow:
    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions: add_transition(trigger, source, dest)
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        # Any state can be blocked; unblocking always resumes at implementation
        self.machine.add_transition('block', '*', 'blocked')
        self.machine.add_transition('unblock', 'blocked', 'implementation')
```

### Persistence Schema

```sql
CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
```

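Because `workflow_transitions` is an append-only event log, the current state of a workflow can in principle be rebuilt by replaying its rows in time order. The sketch below shows that idea with plain Python objects; `TransitionRow` and `replay` are hypothetical names, and the rows are built in memory rather than read from PostgreSQL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TransitionRow:
    """Hypothetical in-memory mirror of one workflow_transitions row."""
    from_state: str
    to_state: str
    trigger: str
    created_at: datetime


def replay(initial_state: str, rows: list[TransitionRow]) -> str:
    """Fold the event log in time order and return the resulting state."""
    state = initial_state
    for row in sorted(rows, key=lambda r: r.created_at):
        if row.from_state != state:
            # A mismatch means the log has a gap or was written out of order
            raise ValueError(
                f"expected a transition from {state!r}, "
                f"got {row.from_state!r} via {row.trigger!r}"
            )
        state = row.to_state
    return state


t0 = datetime(2025, 12, 29, tzinfo=timezone.utc)
log = [
    TransitionRow('analysis', 'design', 'design_complete', t0),
    TransitionRow('design', 'implementation', 'start_coding',
                  t0 + timedelta(hours=2)),
    TransitionRow('implementation', 'review', 'submit_pr',
                  t0 + timedelta(hours=9)),
]
current_state = replay('analysis', log)
```

Comparing the replayed state against `workflow_instances.current_state` is one possible consistency check between the two tables.
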
### Core Workflows

| Workflow | States | Duration | Approval Points |
|----------|--------|----------|-----------------|
| **Sprint** | planning → active → review → done | 1-2 weeks | Start, completion |
| **Story** | analysis → design → implementation → review → testing → done | Hours to days | PR merge |
| **PR Review** | submitted → reviewing → changes_requested → approved → merged | Hours | Merge |
| **Approval** | pending → approved/rejected/expired | Variable | N/A |

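The Approval row above can be written down as plain transition data and sanity-checked before it is wired into a machine. The sketch below uses the list-of-dicts shape (`trigger`, `source`, `dest`) that `transitions` accepts for its `transitions` argument; the trigger names `approve`, `reject`, and `expire` are assumptions, since the table only names the states, and the two helper functions are hypothetical.

```python
from collections import deque

# Trigger names are hypothetical; only the states come from the table.
APPROVAL_TRANSITIONS = [
    {'trigger': 'approve', 'source': 'pending', 'dest': 'approved'},
    {'trigger': 'reject', 'source': 'pending', 'dest': 'rejected'},
    {'trigger': 'expire', 'source': 'pending', 'dest': 'expired'},
]


def reachable_states(transitions, initial):
    """Breadth-first search over the transition graph from the initial state."""
    graph = {}
    for t in transitions:
        graph.setdefault(t['source'], set()).add(t['dest'])
    seen, queue = {initial}, deque([initial])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def terminal_states(transitions, initial):
    """Reachable states that have no outgoing transition."""
    sources = {t['source'] for t in transitions}
    return reachable_states(transitions, initial) - sources


reachable = reachable_states(APPROVAL_TRANSITIONS, 'pending')
terminal = terminal_states(APPROVAL_TRANSITIONS, 'pending')
```

A check like this can catch a state that was listed but never connected, which matters more for the longer Sprint and PR Review workflows.
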
### Integration with Celery

```python
import logging

from transitions import MachineError

# celery_app, WorkflowService, and event_bus are project-level objects defined elsewhere.
logger = logging.getLogger(__name__)


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)

    try:
        # Attempt transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })

    except MachineError:
        # Raised by transitions when the trigger is not valid from the current state
        logger.warning(f"Invalid transition {trigger} from {workflow.state}")
    except Exception as e:
        # Record the failure, then retry with a linearly increasing delay
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)
```

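The retry call above uses `countdown=60 * workflow.retry_count`, which is a linear backoff. As a small worked example, assuming `retry_count` starts at 0 and Celery's default `max_retries` of 3 is left unchanged (the ADR sets neither explicitly), the delays would work out as below. `retry_countdown` is a hypothetical helper that only mirrors the arithmetic.

```python
def retry_countdown(retry_count: int, base_seconds: int = 60) -> int:
    """Delay in seconds before the next attempt, given the updated retry count."""
    return base_seconds * retry_count


# retry_count is incremented before the countdown is computed, so it runs 1..3
delays = [retry_countdown(n) for n in range(1, 4)]
total_wait = sum(delays)
```

Under these assumptions a persistently failing step waits 60, 120, and 180 seconds between attempts, six minutes in total, before Celery gives up.
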
## Consequences

### Positive
- Lightweight, uses existing infrastructure
- Full audit trail via event sourcing
- Easy to understand and modify
- Celery integration for async execution

### Negative
- Manual persistence implementation
- No distributed coordination (single-node)

### Migration Path
If scale requires distributed workflows, migrate to Temporal, carrying over the same state and transition definitions.

## Compliance

This decision aligns with:
- FR-301 to FR-305: Workflow execution requirements
- NFR-402: Fault tolerance
- NFR-602: Audit logging

---

*This ADR establishes the workflow state machine architecture for Syndarix.*