syndarix/docs/adrs/ADR-010-workflow-state-machine.md
Felipe Cardoso 406b25cda0 docs: add remaining ADRs and comprehensive architecture documentation
Added 7 new Architecture Decision Records completing the full set:
- ADR-008: Knowledge Base and RAG (pgvector)
- ADR-009: Agent Communication Protocol (structured messages)
- ADR-010: Workflow State Machine (transitions + PostgreSQL)
- ADR-011: Issue Synchronization (webhook-first + polling)
- ADR-012: Cost Tracking (LiteLLM callbacks + Redis budgets)
- ADR-013: Audit Logging (hash chaining + tiered storage)
- ADR-014: Client Approval Flow (checkpoint-based)

Added comprehensive ARCHITECTURE.md that:
- Summarizes all 14 ADRs in decision matrix
- Documents full system architecture with diagrams
- Explains all component interactions
- Details technology stack with self-hostability guarantee
- Covers security, scalability, and deployment

Updated IMPLEMENTATION_ROADMAP.md to mark Phase 0 completed items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-29 13:54:43 +01:00

# ADR-010: Workflow State Machine Architecture
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-008
---
## Context
Syndarix requires durable state machines for orchestrating long-lived workflows that span hours to weeks:
- Sprint execution (1-2 weeks)
- Story implementation (hours to days)
- PR review cycles (hours)
- Approval flows (variable)
Workflows must survive system restarts, handle failures gracefully, and provide full visibility.
## Decision Drivers
- **Durability:** State must survive crashes and restarts
- **Visibility:** Clear status of all workflows
- **Flexibility:** Support various workflow types
- **Simplicity:** Avoid heavy infrastructure
- **Auditability:** Full history of state transitions
## Considered Options
### Option 1: Temporal.io
**Pros:**
- Durable execution out of the box
- Handles multi-day workflows
- Built-in retries, timeouts, versioning
**Cons:**
- Heavy infrastructure (cluster required)
- Operational burden
- Overkill for Syndarix's scale
### Option 2: Custom + transitions Library (Selected)
**Pros:**
- Lightweight, Pythonic
- PostgreSQL persistence (existing infra)
- Full control over behavior
- Event sourcing for audit trail
**Cons:**
- Manual persistence implementation
- No distributed coordination
### Option 3: Prefect
**Pros:** Good for data pipelines
**Cons:** Wrong abstraction for business workflows
## Decision
**Adopt a custom workflow engine built on the `transitions` library**, with PostgreSQL persistence and Celery task execution.
This approach provides durability and flexibility without the operational overhead of a dedicated workflow engine. At Syndarix's scale (dozens, not thousands, of concurrent workflows), this is the right trade-off.
## Implementation
### State Machine Definition
```python
from transitions import Machine


class StoryWorkflow:
    """State machine for a single story's lifecycle."""

    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        # Machine attaches `state` and one method per trigger to `self`.
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions as (trigger, source, dest)
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        # '*' allows blocking from any state; unblock always resumes at implementation.
        self.machine.add_transition('block', '*', 'blocked')
        self.machine.add_transition('unblock', 'blocked', 'implementation')
```
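Because `block` accepts any source state and `unblock` always returns to `implementation`, it can be useful to sanity-check the transition table before wiring it into a `Machine`. The following is a minimal, library-free sketch and not part of the ADR's implementation: it restates the transitions above as plain tuples and checks that every state is reachable from `analysis`. The names `TRANSITIONS`, `expand`, `reachable_states`, and `triggers_from` are illustrative assumptions.

```python
from collections import deque

STATES = ['analysis', 'design', 'implementation',
          'review', 'testing', 'done', 'blocked']

# (trigger, source, dest) copied from StoryWorkflow; '*' means "any state".
TRANSITIONS = [
    ('design_complete', 'analysis', 'design'),
    ('start_coding', 'design', 'implementation'),
    ('submit_pr', 'implementation', 'review'),
    ('request_changes', 'review', 'implementation'),
    ('approve', 'review', 'testing'),
    ('tests_pass', 'testing', 'done'),
    ('tests_fail', 'testing', 'implementation'),
    ('block', '*', 'blocked'),
    ('unblock', 'blocked', 'implementation'),
]


def expand(transitions, states):
    """Yield (trigger, source, dest) with wildcard sources expanded."""
    for trigger, source, dest in transitions:
        sources = states if source == '*' else [source]
        for src in sources:
            yield trigger, src, dest


def reachable_states(initial, transitions, states):
    """Breadth-first search over the transition graph from `initial`."""
    edges = {}
    for _, source, dest in expand(transitions, states):
        edges.setdefault(source, set()).add(dest)
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def triggers_from(state):
    """Return the triggers that are valid from `state`, sorted by name."""
    return sorted({t for t, s, _ in expand(TRANSITIONS, STATES) if s == state})


unreachable = set(STATES) - reachable_states('analysis', TRANSITIONS, STATES)
print(unreachable)            # set()
print(triggers_from('done'))  # ['block']
```

The second line of output shows a side effect of the wildcard: `done` is not strictly terminal, since a finished story can still be blocked and then re-enter `implementation` via `unblock`. Whether that is intended is a design question this sketch only surfaces.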
### Persistence Schema
```sql
CREATE TABLE workflow_instances (
    id            UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id     VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id    UUID NOT NULL,
    context       JSONB DEFAULT '{}',
    error         TEXT,
    retry_count   INTEGER DEFAULT 0,
    created_at    TIMESTAMPTZ NOT NULL,
    updated_at    TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id           UUID PRIMARY KEY,
    workflow_id  UUID NOT NULL REFERENCES workflow_instances(id),
    from_state   VARCHAR(100) NOT NULL,
    to_state     VARCHAR(100) NOT NULL,
    trigger      VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata     JSONB DEFAULT '{}',
    created_at   TIMESTAMPTZ NOT NULL
);
```
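Since `workflow_transitions` is append-only, the current state of a workflow can in principle be rebuilt, or cross-checked against `workflow_instances.current_state`, by replaying its rows in `created_at` order. The following is a minimal in-memory sketch of that replay, assuming the rows have already been fetched and sorted. `TransitionRow` and `replay` are hypothetical names, not part of the ADR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransitionRow:
    """Subset of workflow_transitions columns needed for replay."""
    from_state: str
    to_state: str
    trigger: str


def replay(initial_state: str, rows: List[TransitionRow]) -> str:
    """Fold transition rows (oldest first) into the resulting state.

    Raises ValueError if a row's from_state does not match the state
    produced by the previous row, which indicates a gap or reordering.
    """
    state = initial_state
    for index, row in enumerate(rows):
        if row.from_state != state:
            raise ValueError(
                f"row {index}: expected from_state={state!r}, "
                f"got {row.from_state!r}"
            )
        state = row.to_state
    return state


history = [
    TransitionRow('analysis', 'design', 'design_complete'),
    TransitionRow('design', 'implementation', 'start_coding'),
    TransitionRow('implementation', 'review', 'submit_pr'),
    TransitionRow('review', 'implementation', 'request_changes'),
]
print(replay('analysis', history))  # implementation
```

A mismatch between the replayed state and `current_state` would suggest that an instance row was updated without a matching transition row. One way to rule this out is to write both rows in the same database transaction.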
### Core Workflows
| Workflow | States | Duration | Approval Points |
|----------|--------|----------|-----------------|
| **Sprint** | planning → active → review → done | 1-2 weeks | Start, completion |
| **Story** | analysis → design → implementation → review → testing → done | Hours-days | PR merge |
| **PR Review** | submitted → reviewing → changes_requested → approved → merged | Hours | Merge |
| **Approval** | pending → approved/rejected/expired | Variable | N/A |
### Integration with Celery
```python
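import logging

from transitions import MachineError

# celery_app, WorkflowService, and event_bus are project-level objects
# defined elsewhere in the codebase.
logger = logging.getLogger(__name__)


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)
    try:
        # Attempt transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })
    except MachineError:
        # transitions raises MachineError when the trigger is not valid
        # from the current state; log it and do not retry.
        logger.warning("Invalid transition %s from %s", trigger, workflow.state)
    except Exception as e:
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        # Linear backoff: 60s, 120s, 180s, ...
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)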
```
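The `countdown=60 * workflow.retry_count` expression gives a linear backoff. Because `retry_count` is incremented before the delay is computed, the first retry waits 60 seconds. The sketch below isolates that calculation so the schedule is easy to see. The `retry_countdown` helper and its optional `max_seconds` cap are illustrative assumptions, not part of the ADR.

```python
from typing import Optional


def retry_countdown(retry_count: int,
                    base_seconds: int = 60,
                    max_seconds: Optional[int] = None) -> int:
    """Return the delay in seconds before the next retry.

    Mirrors `60 * workflow.retry_count` from the task above. The
    `max_seconds` cap is a hypothetical addition to bound the delay.
    """
    delay = base_seconds * retry_count
    if max_seconds is not None:
        delay = min(delay, max_seconds)
    return delay


print([retry_countdown(n) for n in range(1, 5)])  # [60, 120, 180, 240]
print(retry_countdown(10, max_seconds=300))       # 300
```

Celery tasks retry at most three times by default (`max_retries`), so under default settings only the first three delays would be used unless the task sets `max_retries` explicitly.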
## Consequences
### Positive
- Lightweight, uses existing infrastructure
- Full audit trail via event sourcing
- Easy to understand and modify
- Celery integration for async execution
### Negative
- Manual persistence implementation
- No distributed coordination (single-node)
### Migration Path
If scale later requires distributed workflows, migrate to Temporal, using the existing state and transition definitions as the specification for the equivalent Temporal workflows.
## Compliance
This decision aligns with:
- FR-301 through FR-305: Workflow execution requirements
- NFR-402: Fault tolerance
- NFR-602: Audit logging
---
*This ADR establishes the workflow state machine architecture for Syndarix.*