syndarix/docs/adrs/ADR-010-workflow-state-machine.md
Felipe Cardoso 406b25cda0 docs: add remaining ADRs and comprehensive architecture documentation
2025-12-29 13:54:43 +01:00

ADR-010: Workflow State Machine Architecture

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-008


Context

Syndarix requires durable state machines to orchestrate long-lived workflows that span hours to weeks:

  • Sprint execution (1-2 weeks)
  • Story implementation (hours to days)
  • PR review cycles (hours)
  • Approval flows (variable)

Workflows must survive system restarts, handle failures gracefully, and provide full visibility.

Decision Drivers

  • Durability: State must survive crashes and restarts
  • Visibility: Clear status of all workflows
  • Flexibility: Support various workflow types
  • Simplicity: Avoid heavy infrastructure
  • Auditability: Full history of state transitions

Considered Options

Option 1: Temporal.io

Pros:

  • Durable execution out of the box
  • Handles multi-day workflows
  • Built-in retries, timeouts, versioning

Cons:

  • Heavy infrastructure (cluster required)
  • Operational burden
  • Overkill for Syndarix's scale

Option 2: Custom + transitions Library (Selected)

Pros:

  • Lightweight, Pythonic
  • PostgreSQL persistence (existing infra)
  • Full control over behavior
  • Event sourcing for audit trail

Cons:

  • Manual persistence implementation
  • No distributed coordination

Option 3: Prefect

Pros:

  • Good for data pipelines

Cons:

  • Wrong abstraction for business workflows

Decision

Adopt a custom workflow engine built on the transitions library, with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of dedicated workflow engines. At Syndarix's scale (dozens, not thousands of concurrent workflows), this is the right trade-off.

Implementation

State Machine Definition

from transitions import Machine

class StoryWorkflow:
    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
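        # '*' matches every source state, so 'block' is also accepted from 'done' and 'blocked'
        self.machine.add_transition('block', '*', 'blocked')
        # 'unblock' always resumes at 'implementation'; the pre-block state is not restored
        self.machine.add_transition('unblock', 'blocked', 'implementation')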
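
To make the transition rules above easier to check at a glance, here is a minimal table-driven sketch of the same story workflow. It deliberately uses only the standard library rather than the transitions library, and the names `STORY_TRANSITIONS`, `next_state`, and `run` are hypothetical helpers introduced for illustration, not part of Syndarix.

```python
from typing import Iterable, Optional

# (trigger, source state) -> destination state, mirroring the add_transition calls.
STORY_TRANSITIONS = {
    ("design_complete", "analysis"): "design",
    ("start_coding", "design"): "implementation",
    ("submit_pr", "implementation"): "review",
    ("request_changes", "review"): "implementation",
    ("approve", "review"): "testing",
    ("tests_pass", "testing"): "done",
    ("tests_fail", "testing"): "implementation",
    ("unblock", "blocked"): "implementation",
}


def next_state(state: str, trigger: str) -> Optional[str]:
    """Return the destination state, or None if the trigger is not valid here."""
    if trigger == "block":  # wildcard source '*': accepted from every state
        return "blocked"
    return STORY_TRANSITIONS.get((trigger, state))


def run(triggers: Iterable[str], state: str = "analysis") -> str:
    """Apply triggers in order and return the final state."""
    for trigger in triggers:
        destination = next_state(state, trigger)
        if destination is None:
            raise ValueError(f"{trigger!r} is not valid from {state!r}")
        state = destination
    return state


happy_path = ["design_complete", "start_coding", "submit_pr", "approve", "tests_pass"]
print(run(happy_path))  # done

# Blocking during review and then unblocking lands in 'implementation', not 'review'.
blocked_in_review = ["design_complete", "start_coding", "submit_pr", "block", "unblock"]
print(run(blocked_in_review))  # implementation
```

The second run shows a consequence of the definition worth keeping in mind: a story blocked during review resumes in implementation, because the state it was in before blocking is not recorded by the machine itself.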

Persistence Schema

CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
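
The two tables are meant to be written together: each transition updates `current_state` and appends one row to the event table. The following is a minimal sketch of that pattern, assuming SQLite from the standard library as a stand-in for PostgreSQL and a trimmed-down copy of the schema. The helpers `record_transition` and `replay` are hypothetical names, not the Syndarix persistence layer.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.executescript("""
CREATE TABLE workflow_instances (
    id TEXT PRIMARY KEY, workflow_type TEXT NOT NULL,
    current_state TEXT NOT NULL, entity_id TEXT NOT NULL, updated_at TEXT NOT NULL
);
CREATE TABLE workflow_transitions (
    id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL REFERENCES workflow_instances(id),
    from_state TEXT NOT NULL, to_state TEXT NOT NULL, "trigger" TEXT NOT NULL,
    triggered_by TEXT, created_at TEXT NOT NULL
);
""")


def now() -> str:
    return datetime.now(timezone.utc).isoformat()


def record_transition(conn, workflow_id, to_state, trigger, triggered_by="system"):
    """Update the current state and append the event in a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        row = conn.execute(
            "SELECT current_state FROM workflow_instances WHERE id = ?", (workflow_id,)
        ).fetchone()
        if row is None:
            raise KeyError(workflow_id)
        conn.execute(
            "UPDATE workflow_instances SET current_state = ?, updated_at = ? WHERE id = ?",
            (to_state, now(), workflow_id),
        )
        conn.execute(
            """INSERT INTO workflow_transitions
               (id, workflow_id, from_state, to_state, "trigger", triggered_by, created_at)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (str(uuid.uuid4()), workflow_id, row[0], to_state, trigger, triggered_by, now()),
        )


def replay(conn, workflow_id, initial="analysis"):
    """Rebuild the current state from the event log alone."""
    state = initial
    # rowid gives insertion order in SQLite; PostgreSQL would order by created_at
    # or a dedicated sequence column instead.
    rows = conn.execute(
        "SELECT from_state, to_state FROM workflow_transitions "
        "WHERE workflow_id = ? ORDER BY rowid",
        (workflow_id,),
    )
    for from_state, to_state in rows:
        if from_state != state:
            raise ValueError(f"event log gap: expected {state!r}, got {from_state!r}")
        state = to_state
    return state


wf_id = str(uuid.uuid4())
with conn:
    conn.execute(
        "INSERT INTO workflow_instances VALUES (?, ?, ?, ?, ?)",
        (wf_id, "story", "analysis", "story-123", now()),
    )
record_transition(conn, wf_id, "design", "design_complete", "agent-1")
record_transition(conn, wf_id, "implementation", "start_coding", "agent-1")
print(replay(conn, wf_id))  # implementation
```

Writing both rows in one transaction keeps `current_state` and the event log from drifting apart after a crash, and `replay` gives a cheap consistency check between the two.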

Core Workflows

| Workflow  | States                                                        | Duration   | Approval Points   |
|-----------|---------------------------------------------------------------|------------|-------------------|
| Sprint    | planning → active → review → done                             | 1-2 weeks  | Start, completion |
| Story     | analysis → design → implementation → review → testing → done  | Hours-days | PR merge          |
| PR Review | submitted → reviewing → changes_requested → approved → merged | Hours      | Merge             |
| Approval  | pending → approved/rejected/expired                           | Variable   | N/A               |

Integration with Celery

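import logging

from transitions import MachineError

# celery_app, WorkflowService, and event_bus are assumed to be provided by the application.
logger = logging.getLogger(__name__)


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)

    try:
        # Attempt transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })

    except MachineError:
        # transitions raises MachineError when the trigger is not valid from the
        # current state; retrying cannot help, so log and drop the task.
        logger.warning("Invalid transition %s from %s", trigger, workflow.state)
    except Exception as e:
        # Record the failure, then retry with a linearly growing delay (60s, 120s, ...)
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)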
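
The retry delay in the task above grows linearly: `retry_count` is incremented before the countdown is computed, so the first retry waits 60 seconds. As a small worked example, assuming the task keeps Celery's default `max_retries` of 3 (the task shown does not override it), the schedule looks like this. `retry_countdown` is a hypothetical helper that only restates the arithmetic.

```python
def retry_countdown(retry_count: int, base_seconds: int = 60) -> int:
    """Delay before the next attempt, matching countdown=60 * workflow.retry_count."""
    return base_seconds * retry_count


MAX_RETRIES = 3  # Celery's default Task.max_retries; an assumption about this task

schedule = [retry_countdown(n) for n in range(1, MAX_RETRIES + 1)]
print(schedule)       # [60, 120, 180]
print(sum(schedule))  # 360
```

Under these assumptions a persistently failing step is abandoned after roughly six minutes of waiting, which is short relative to workflows that run for hours or days. Longer outages would need a higher retry limit or a separate recovery path.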

Consequences

Positive

  • Lightweight, uses existing infrastructure
  • Full audit trail via event sourcing
  • Easy to understand and modify
  • Celery integration for async execution

Negative

  • Manual persistence implementation
  • No distributed coordination (single-node)

Migration Path

If scale later requires distributed workflows, migrate to Temporal, carrying over the same states and transitions.

Compliance

This decision aligns with:

  • FR-301-305: Workflow execution requirements
  • NFR-402: Fault tolerance
  • NFR-602: Audit logging

This ADR establishes the workflow state machine architecture for Syndarix.