# ADR-010: Workflow State Machine Architecture

- Status: Accepted
- Date: 2025-12-29
- Deciders: Architecture Team
- Related Spikes: SPIKE-008
## Context
Syndarix requires durable state machines for orchestrating long-lived workflows that span hours to weeks:
- Sprint execution (1-2 weeks)
- Story implementation (hours to days)
- PR review cycles (hours)
- Approval flows (variable)
Workflows must survive system restarts, handle failures gracefully, and provide full visibility.
## Decision Drivers
- Durability: State must survive crashes and restarts
- Visibility: Clear status of all workflows
- Flexibility: Support various workflow types
- Simplicity: Avoid heavy infrastructure
- Auditability: Full history of state transitions
## Considered Options

### Option 1: Temporal.io
Pros:
- Durable execution out of the box
- Handles multi-day workflows
- Built-in retries, timeouts, versioning
Cons:
- Heavy infrastructure (cluster required)
- Operational burden
- Overkill for Syndarix's scale
### Option 2: Custom + transitions Library (Selected)
Pros:
- Lightweight, Pythonic
- PostgreSQL persistence (existing infra)
- Full control over behavior
- Event sourcing for audit trail
Cons:
- Manual persistence implementation
- No distributed coordination
### Option 3: Prefect

Pros:
- Good for data pipelines

Cons:
- Wrong abstraction for business workflows
## Decision
Adopt a custom workflow engine built on the `transitions` library, with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of a dedicated workflow engine. At Syndarix's scale (dozens of concurrent workflows, not thousands), this is the right trade-off.
## Implementation

### State Machine Definition
```python
from transitions import Machine


class StoryWorkflow:
    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        self.machine.add_transition('block', '*', 'blocked')
        self.machine.add_transition('unblock', 'blocked', 'implementation')
```
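With `transitions`, each trigger becomes a method on the model, so a story can be driven through its lifecycle directly. A short illustration (the story ID is made up):

```python
wf = StoryWorkflow(story_id="STORY-123")  # hypothetical story ID
assert wf.state == 'analysis'

wf.design_complete()    # analysis -> design
wf.start_coding()       # design -> implementation
wf.submit_pr()          # implementation -> review
wf.request_changes()    # review -> implementation (rework loop)
wf.submit_pr()          # implementation -> review
wf.approve()            # review -> testing
wf.tests_pass()         # testing -> done
assert wf.state == 'done'
```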
### Persistence Schema
```sql
CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
```
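Persistence is implemented by hand (see Cons above). As a rough sketch of what that looks like against this schema, the snippet below appends an event-sourced transition row and updates the denormalized current state in one transaction; it assumes an `asyncpg` connection, and the function name is illustrative rather than part of the ADR:

```python
import uuid
from datetime import datetime, timezone

import asyncpg  # assumed PostgreSQL driver; any client with transactions works


async def record_transition(conn: asyncpg.Connection, workflow_id: uuid.UUID,
                            from_state: str, to_state: str, trigger: str,
                            triggered_by: str = "system") -> None:
    """Append to workflow_transitions and keep workflow_instances in sync."""
    now = datetime.now(timezone.utc)
    async with conn.transaction():
        # Event sourcing: transition rows are append-only, never updated.
        await conn.execute(
            """
            INSERT INTO workflow_transitions
                (id, workflow_id, from_state, to_state, trigger, triggered_by, created_at)
            VALUES ($1, $2, $3, $4, $5, $6, $7)
            """,
            uuid.uuid4(), workflow_id, from_state, to_state, trigger, triggered_by, now,
        )
        # Current state is denormalized onto the instance row for cheap lookups.
        await conn.execute(
            "UPDATE workflow_instances SET current_state = $1, updated_at = $2 WHERE id = $3",
            to_state, now, workflow_id,
        )
```

One option is to invoke this from the `Machine`'s `after_state_change` callback so every transition is captured without extra calls in workflow code.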
### Core Workflows
| Workflow | States | Duration | Approval Points |
|---|---|---|---|
| Sprint | planning → active → review → done | 1-2 weeks | Start, completion |
| Story | analysis → design → implementation → review → testing → done | Hours-days | PR merge |
| PR Review | submitted → reviewing → changes_requested → approved → merged | Hours | Merge |
| Approval | pending → approved/rejected/expired | Variable | N/A |
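Each of these workflows is a separate `Machine` definition following the same pattern as `StoryWorkflow`. For example, the PR review workflow from the table could look like the sketch below (trigger names are illustrative; only the story workflow is specified in this ADR):

```python
from transitions import Machine


class PRReviewWorkflow:
    states = ['submitted', 'reviewing', 'changes_requested', 'approved', 'merged']

    def __init__(self, pr_id: str):
        self.pr_id = pr_id
        self.machine = Machine(model=self, states=self.states, initial='submitted')
        self.machine.add_transition('start_review', 'submitted', 'reviewing')
        self.machine.add_transition('request_changes', 'reviewing', 'changes_requested')
        self.machine.add_transition('resubmit', 'changes_requested', 'reviewing')
        self.machine.add_transition('approve', 'reviewing', 'approved')
        self.machine.add_transition('merge', 'approved', 'merged')  # approval point: merge
```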
### Integration with Celery
```python
from transitions.core import MachineError  # transitions raises MachineError on invalid triggers


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)
    try:
        # Attempt transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })
    except MachineError:
        # Invalid trigger for the current state: log and drop, do not retry.
        logger.warning(f"Invalid transition {trigger} from {workflow.state}")
    except Exception as e:
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)
```
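Callers enqueue a trigger rather than mutating workflow state in-process, for example from the webhook handler that fires when an agent opens a pull request (the call below is illustrative):

```python
# story_workflow_id is the UUID from workflow_instances for this story (hypothetical).
# Retries and backoff are handled inside the task itself.
execute_workflow_step.delay(workflow_id=story_workflow_id, trigger="submit_pr")
```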
## Consequences

### Positive
- Lightweight, uses existing infrastructure
- Full audit trail via event sourcing
- Easy to understand and modify
- Celery integration for async execution
### Negative
- Manual persistence implementation
- No distributed coordination (single-node)
## Migration Path
If scale requires distributed workflows, migrate to Temporal with the same state machine definitions.
## Compliance
This decision aligns with:
- FR-301-305: Workflow execution requirements
- NFR-402: Fault tolerance
- NFR-602: Audit logging
This ADR establishes the workflow state machine architecture for Syndarix.