# ADR-010: Workflow State Machine Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-008

---

## Context

Syndarix requires durable state machines for orchestrating long-lived workflows that span hours to days:

- Sprint execution (1-2 weeks)
- Story implementation (hours to days)
- PR review cycles (hours)
- Approval flows (variable)

Workflows must survive system restarts, handle failures gracefully, and provide full visibility.

## Decision Drivers

- **Durability:** State must survive crashes and restarts
- **Visibility:** Clear status of all workflows
- **Flexibility:** Support various workflow types
- **Simplicity:** Avoid heavy infrastructure
- **Auditability:** Full history of state transitions

## Considered Options

### Option 1: Temporal.io

**Pros:**

- Durable execution out of the box
- Handles multi-day workflows
- Built-in retries, timeouts, versioning

**Cons:**

- Heavy infrastructure (cluster required)
- Operational burden
- Overkill for Syndarix's scale

### Option 2: Custom + transitions Library (Selected)

**Pros:**

- Lightweight, Pythonic
- PostgreSQL persistence (existing infra)
- Full control over behavior
- Event sourcing for audit trail

**Cons:**

- Manual persistence implementation
- No distributed coordination

### Option 3: Prefect

**Pros:**

- Good for data pipelines

**Cons:**

- Wrong abstraction for business workflows

## Decision

**Adopt a custom workflow engine using the `transitions` library** with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of dedicated workflow engines. At Syndarix's scale (dozens, not thousands of concurrent workflows), this is the right trade-off.

## Implementation

### State Machine Definition

```python
from transitions import Machine


class StoryWorkflow:
    """State machine for a single story, from analysis through completion."""

    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions: trigger, source state, destination state
        self.machine.add_transition('analysis_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        self.machine.add_transition('block', '*', 'blocked')  # any state can block
        self.machine.add_transition('unblock', 'blocked', 'implementation')
```

### Persistence Schema

```sql
CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
```
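
The "manual persistence" trade-off amounts to a thin layer that maps the state machine onto these two tables. The sketch below shows one way that layer could look, assuming the `psycopg` driver and reusing `StoryWorkflow` from the snippet above; the class name `WorkflowPersistence`, the `DSN` constant, and the method names are illustrative assumptions, not part of the decision. The `WorkflowService` used in the Celery task further down would wrap this same pattern.

```python
import uuid
from datetime import datetime, timezone

import psycopg
from psycopg.rows import dict_row

# Assumed connection string; pooling and error handling omitted for brevity.
DSN = "postgresql://syndarix@localhost/syndarix"


class WorkflowPersistence:
    """Sketch of the persistence half of the workflow service: rehydrate a
    workflow at its stored state and append every transition as an event."""

    @staticmethod
    def load(workflow_id: str) -> StoryWorkflow:
        with psycopg.connect(DSN, row_factory=dict_row) as conn:
            row = conn.execute(
                "SELECT * FROM workflow_instances WHERE id = %s",
                (uuid.UUID(workflow_id),),
            ).fetchone()
        workflow = StoryWorkflow(row["entity_id"])
        # Resume from the persisted state rather than the default initial state.
        workflow.machine.set_state(row["current_state"])
        return workflow

    @staticmethod
    def record_transition(workflow_id: str, from_state: str, to_state: str,
                          trigger: str, triggered_by: str = "system") -> None:
        now = datetime.now(timezone.utc)
        # Update the instance row and append the event in one transaction,
        # so the audit trail never disagrees with current_state.
        with psycopg.connect(DSN) as conn:
            conn.execute(
                "UPDATE workflow_instances"
                " SET current_state = %s, updated_at = %s WHERE id = %s",
                (to_state, now, uuid.UUID(workflow_id)),
            )
            conn.execute(
                "INSERT INTO workflow_transitions"
                " (id, workflow_id, from_state, to_state, trigger, triggered_by, created_at)"
                " VALUES (%s, %s, %s, %s, %s, %s, %s)",
                (uuid.uuid4(), uuid.UUID(workflow_id), from_state, to_state,
                 trigger, triggered_by, now),
            )
```

Hooking `record_transition` into the machine's `after_state_change` callback would persist every trigger automatically, keeping the event log in lockstep with `current_state`.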
### Core Workflows

| Workflow | States | Duration | Approval Points |
|----------|--------|----------|-----------------|
| **Sprint** | planning → active → review → done | 1-2 weeks | Start, completion |
| **Story** | analysis → design → implementation → review → testing → done | Hours-days | PR merge |
| **PR Review** | submitted → reviewing → changes_requested → approved → merged | Hours | Merge |
| **Approval** | pending → approved/rejected/expired | Variable | N/A |

### Integration with Celery

```python
from transitions import MachineError

# celery_app, WorkflowService, event_bus, and logger are project-level singletons.


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)

    try:
        # Attempt the transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })
    except MachineError:
        # transitions raises MachineError when the trigger is not valid
        # from the current state
        logger.warning(f"Invalid transition {trigger} from {workflow.state}")
    except Exception as e:
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)
```

## Consequences

### Positive

- Lightweight, uses existing infrastructure
- Full audit trail via event sourcing
- Easy to understand and modify
- Celery integration for async execution

### Negative

- Manual persistence implementation
- No distributed coordination (single-node)

### Migration Path

If scale requires distributed workflows, migrate to Temporal with the same state machine definitions.

## Compliance

This decision aligns with:

- FR-301-305: Workflow execution requirements
- NFR-402: Fault tolerance
- NFR-602: Audit logging

---

*This ADR establishes the workflow state machine architecture for Syndarix.*