Felipe Cardoso 406b25cda0 docs: add remaining ADRs and comprehensive architecture documentation

Added 7 new Architecture Decision Records completing the full set:
- ADR-008: Knowledge Base and RAG (pgvector)
- ADR-009: Agent Communication Protocol (structured messages)
- ADR-010: Workflow State Machine (transitions + PostgreSQL)
- ADR-011: Issue Synchronization (webhook-first + polling)
- ADR-012: Cost Tracking (LiteLLM callbacks + Redis budgets)
- ADR-013: Audit Logging (hash chaining + tiered storage)
- ADR-014: Client Approval Flow (checkpoint-based)

Added comprehensive ARCHITECTURE.md that:
- Summarizes all 14 ADRs in decision matrix
- Documents full system architecture with diagrams
- Explains all component interactions
- Details technology stack with self-hostability guarantee
- Covers security, scalability, and deployment

Updated IMPLEMENTATION_ROADMAP.md to mark Phase 0 completed items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-29 13:54:43 +01:00

ADR-010: Workflow State Machine Architecture

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-008

Context

Syndarix requires durable state machines to orchestrate long-lived workflows that span hours to weeks:

Sprint execution (1-2 weeks)
Story implementation (hours to days)
PR review cycles (hours)
Approval flows (variable)

Workflows must survive system restarts, handle failures gracefully, and provide full visibility.

Decision Drivers

Durability: State must survive crashes and restarts
Visibility: Clear status of all workflows
Flexibility: Support various workflow types
Simplicity: Avoid heavy infrastructure
Auditability: Full history of state transitions

Considered Options

Option 1: Temporal.io

Pros:

Durable execution out of the box
Handles multi-day workflows
Built-in retries, timeouts, versioning

Cons:

Heavy infrastructure (cluster required)
Operational burden
Overkill for Syndarix's scale

Option 2: Custom + transitions Library (Selected)

Pros:

Lightweight, Pythonic
PostgreSQL persistence (existing infra)
Full control over behavior
Event sourcing for audit trail

Cons:

Manual persistence implementation
No distributed coordination

Option 3: Prefect

Pros:

Good for data pipelines

Cons:

Wrong abstraction for business workflows

Decision

Adopt a custom workflow engine built on the transitions library, with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of a dedicated workflow engine. At Syndarix's scale (dozens of concurrent workflows, not thousands), this is the right trade-off.

Implementation

State Machine Definition

from transitions import Machine

class StoryWorkflow:
    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        self.machine.add_transition('block', '*', 'blocked')
        self.machine.add_transition('unblock', 'blocked', 'implementation')
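
The transition table above can be sanity-checked without the transitions library. The sketch below copies the (trigger, source, dest) triples from the add_transition calls, expands the '*' wildcard to every state, and walks the resulting graph from the initial state. The names STATES, TRANSITIONS, build_graph, and reachable are illustrative and not part of the Syndarix codebase.

```python
from collections import deque

STATES = ["analysis", "design", "implementation", "review", "testing", "done", "blocked"]

# (trigger, source, dest), copied from the add_transition calls; "*" means any state.
TRANSITIONS = [
    ("design_complete", "analysis", "design"),
    ("start_coding", "design", "implementation"),
    ("submit_pr", "implementation", "review"),
    ("request_changes", "review", "implementation"),
    ("approve", "review", "testing"),
    ("tests_pass", "testing", "done"),
    ("tests_fail", "testing", "implementation"),
    ("block", "*", "blocked"),
    ("unblock", "blocked", "implementation"),
]


def build_graph(states, transitions):
    """Expand wildcard sources and return {state: {trigger: dest}}."""
    graph = {state: {} for state in states}
    for trigger, source, dest in transitions:
        sources = states if source == "*" else [source]
        for state in sources:
            graph[state][trigger] = dest
    return graph


def reachable(graph, start):
    """Breadth-first search over the transition graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for dest in graph[state].values():
            if dest not in seen:
                seen.add(dest)
                queue.append(dest)
    return seen


graph = build_graph(STATES, TRANSITIONS)
unreachable = set(STATES) - reachable(graph, "analysis")
print("unreachable states:", sorted(unreachable))
print("triggers accepted in 'done':", sorted(graph["done"]))
```

Two properties of the definition show up in this check: because block uses the '*' source, a story in 'done' still accepts block, and unblock always resumes at 'implementation' regardless of the state the story was in when it was blocked. Whether either is intended is a design question this ADR leaves open.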

Persistence Schema

CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
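
As a rough illustration of how the two tables work together, the sketch below updates workflow_instances and appends to workflow_transitions in a single transaction, then rebuilds the current state from the event log alone. It is a sketch under stated assumptions, not the Syndarix persistence layer: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, a reduced set of columns, and hypothetical helper names (record_transition, replay_state).

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def now():
    return datetime.now(timezone.utc).isoformat()


def create_schema(conn):
    # Reduced, SQLite-flavoured version of the two tables above (assumption).
    conn.executescript("""
        CREATE TABLE workflow_instances (
            id TEXT PRIMARY KEY,
            workflow_type TEXT NOT NULL,
            current_state TEXT NOT NULL,
            entity_id TEXT NOT NULL,
            updated_at TEXT NOT NULL
        );
        CREATE TABLE workflow_transitions (
            id TEXT PRIMARY KEY,
            workflow_id TEXT NOT NULL REFERENCES workflow_instances(id),
            from_state TEXT NOT NULL,
            to_state TEXT NOT NULL,
            "trigger" TEXT NOT NULL,
            triggered_by TEXT,
            metadata TEXT DEFAULT '{}',
            created_at TEXT NOT NULL
        );
    """)


def record_transition(conn, workflow_id, from_state, to_state, trigger,
                      triggered_by="system", metadata=None):
    """Move the instance and append the event in one transaction.

    Returns False and writes nothing if the row is no longer in from_state.
    """
    ts = now()
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE workflow_instances SET current_state = ?, updated_at = ? "
            "WHERE id = ? AND current_state = ?",
            (to_state, ts, workflow_id, from_state),
        )
        if cur.rowcount != 1:
            return False  # stale caller: another transition got there first
        conn.execute(
            "INSERT INTO workflow_transitions "
            '(id, workflow_id, from_state, to_state, "trigger", '
            "triggered_by, metadata, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), workflow_id, from_state, to_state, trigger,
             triggered_by, json.dumps(metadata or {}), ts),
        )
    return True


def replay_state(conn, workflow_id, initial_state):
    """Rebuild the current state from the event log alone."""
    state = initial_state
    rows = conn.execute(
        "SELECT from_state, to_state FROM workflow_transitions "
        "WHERE workflow_id = ? ORDER BY rowid",  # insertion order in SQLite
        (workflow_id,),
    )
    for from_state, to_state in rows:
        if from_state != state:
            raise ValueError(f"log is inconsistent at {from_state} -> {to_state}")
        state = to_state
    return state


conn = sqlite3.connect(":memory:")
create_schema(conn)
wf_id = str(uuid.uuid4())
with conn:
    conn.execute(
        "INSERT INTO workflow_instances VALUES (?, ?, ?, ?, ?)",
        (wf_id, "story", "analysis", "STORY-1", now()),
    )

ok_first = record_transition(conn, wf_id, "analysis", "design", "design_complete", "agent-1")
ok_stale = record_transition(conn, wf_id, "analysis", "design", "design_complete", "agent-2")
ok_second = record_transition(conn, wf_id, "design", "implementation", "start_coding")
print(ok_first, ok_stale, ok_second, replay_state(conn, wf_id, "analysis"))
```

The WHERE clause on current_state acts as an optimistic check: a caller holding a stale view of the workflow writes neither the update nor the event, so the log and the instance row cannot drift apart.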

Core Workflows

Workflow	States	Duration	Approval Points
Sprint	planning → active → review → done	1-2 weeks	Start, completion
Story	analysis → design → implementation → review → testing → done	Hours-days	PR merge
PR Review	submitted → reviewing → changes_requested → approved → merged	Hours	Merge
Approval	pending → approved/rejected/expired	Variable	N/A

Integration with Celery

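import logging

from transitions import MachineError

# celery_app, WorkflowService, and event_bus are Syndarix application objects
# defined elsewhere in the project.
logger = logging.getLogger(__name__)


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)

    try:
        # Attempt the transition; transitions raises MachineError when the
        # trigger is not valid from the current state
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })

    except MachineError:
        # An invalid transition is logged, not retried: repeating it cannot succeed
        logger.warning("Invalid transition %s from %s", trigger, workflow.state)
    except Exception as e:
        # Record the failure, then retry with a linearly growing delay
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)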
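
The countdown expression in the task gives a linear backoff. The sketch below works out the delays, assuming retry_count starts at 0 for the step, is incremented only by this task, and that the task keeps Celery's default max_retries of 3. The function name retry_countdowns is illustrative.

```python
def retry_countdowns(max_retries: int = 3, base_seconds: int = 60) -> list[int]:
    """Delay before each retry when countdown = base_seconds * retry_count."""
    return [base_seconds * attempt for attempt in range(1, max_retries + 1)]


delays = retry_countdowns()
print(delays)       # [60, 120, 180]
print(sum(delays))  # 360 seconds of waiting before the task gives up
```

Since retry_count is stored on the workflow row and the task shown never resets it, a workflow that has failed on earlier steps would start from a longer delay than this schedule suggests.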

Consequences

Positive

Lightweight, uses existing infrastructure
Full audit trail via event sourcing
Easy to understand and modify
Celery integration for async execution

Negative

Manual persistence implementation
No distributed coordination (single-node)

Migration Path

If Syndarix outgrows a single node and needs distributed workflows, migrate to Temporal and carry over the same state machine definitions.

Compliance

This decision aligns with:

FR-301 to FR-305: Workflow execution requirements
NFR-402: Fault tolerance
NFR-602: Audit logging

This ADR establishes the workflow state machine architecture for Syndarix.
