[SPIKE-002] Agent Orchestration & State Machine #2

Closed
opened 2025-12-29 03:50:14 +00:00 by cardosofelipe · 1 comment

Objective

Design the agent orchestration system that manages agent lifecycle, task assignment, and inter-agent communication.

Key Questions

  1. How do we represent agent state? (idle, working, blocked, waiting_review)
  2. Which state machine library should we use? (transitions for Python, xstate for the frontend?)
  3. How do agents communicate with each other?
  4. How do we handle parallel agent execution?
  5. How do we track agent "conversations" and context?
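As one possible answer to question 1, the four states listed above can be modelled as an enum with an explicit transition table. This is a minimal stdlib-only sketch, not the design the spike settled on; the event names (`assign`, `block`, `unblock`, `submit`, `approve`, `reject`) and the `AgentLifecycle` class are assumptions made for illustration.

```python
from enum import Enum


class AgentState(str, Enum):
    IDLE = "idle"
    WORKING = "working"
    BLOCKED = "blocked"
    WAITING_REVIEW = "waiting_review"


# Allowed transitions: (current state, event) -> next state.
# The event names are illustrative assumptions, not taken from the issue.
TRANSITIONS = {
    (AgentState.IDLE, "assign"): AgentState.WORKING,
    (AgentState.WORKING, "block"): AgentState.BLOCKED,
    (AgentState.BLOCKED, "unblock"): AgentState.WORKING,
    (AgentState.WORKING, "submit"): AgentState.WAITING_REVIEW,
    (AgentState.WAITING_REVIEW, "approve"): AgentState.IDLE,
    (AgentState.WAITING_REVIEW, "reject"): AgentState.WORKING,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the agent's current state."""


class AgentLifecycle:
    def __init__(self, name):
        self.name = name
        self.state = AgentState.IDLE
        self.history = [AgentState.IDLE]  # every state the agent has been in

    def fire(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise InvalidTransition(
                f"{self.name}: cannot '{event}' while {self.state.value}"
            )
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Keeping the table as plain data makes it easy to review which transitions exist, and a library such as transitions could later replace `AgentLifecycle` without changing the state names.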

Research Areas

  • Python state machine libraries (transitions, published by the pytransitions project, and python-statemachine)
  • Actor model patterns (similar to Akka but for Python)
  • Message queue patterns for agent communication
  • Context management across agent interactions

Expected Deliverables

  • State machine design for agent lifecycle
  • Communication protocol between agents
  • Proof-of-concept with 2 agents collaborating
  • ADR documenting the pattern

Acceptance Criteria

  • Agents can transition through defined states
  • Agents can pass context/messages to each other
  • Parallel execution works correctly
  • State is persisted and recoverable
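The "persisted and recoverable" criterion can be checked with a simple round trip: write the agent's state, then restore it and compare. The sketch below uses a JSON file purely as a stand-in for whatever store is eventually chosen; `AgentSnapshot`, `save_snapshot`, and `load_snapshot` are hypothetical names.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict


@dataclass
class AgentSnapshot:
    name: str
    state: str = "idle"
    context: dict = field(default_factory=dict)


def save_snapshot(snapshot, path):
    # Write to a temp file in the same directory, then rename it into place,
    # so a crash mid-write never leaves a half-written snapshot behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(snapshot), handle)
    os.replace(tmp_path, path)


def load_snapshot(path):
    with open(path, encoding="utf-8") as handle:
        return AgentSnapshot(**json.load(handle))
```

A recovery test then amounts to saving a snapshot, discarding the in-memory object, and asserting that `load_snapshot` returns an equal value.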

Labels

spike, architecture, agents


SPIKE-002 Research Completed

The comprehensive spike document has been created at docs/spikes/SPIKE-002-agent-orchestration-pattern.md.

Executive Summary

After researching leading multi-agent orchestration frameworks (AutoGen 0.4, CrewAI, LangGraph) and enterprise patterns, we recommend a hybrid architecture for Syndarix:

| Component | Technology | Purpose |
|-----------|------------|---------|
| Agent Logic | LangGraph | Graph-based state machines for agent reasoning |
| Durability | Temporal | Long-running workflow persistence (hours/days) |
| Communication | Redis Streams | Event-driven agent-to-agent messaging |
| Topology | Hierarchical Supervisor | Project Manager coordinates teams |
| LLM Access | LiteLLM | Unified provider abstraction with failover |

Key Findings

1. Framework Comparison:

  • AutoGen 0.4: Microsoft-centric, event-driven, good for standardized patterns
  • CrewAI: Easy to start, but teams report hitting walls at 6-12 months
  • LangGraph: Most flexible, steep learning curve, best for complex stateful workflows
  • Recommendation: Don't adopt any single framework wholesale; build a custom layer using the best components of each

2. Orchestration Patterns:

  • Supervisor pattern for sprint coordination (Project Manager as supervisor)
  • Hierarchical pattern for team structure (Architect leads Dev team)
  • Peer-to-peer for brainstorming within teams
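The supervisor pattern listed above can be sketched with plain `asyncio`: a supervisor fans tasks out to several agents, lets them run concurrently, and gathers the results. This is only an illustration under assumptions; `run_agent` stands in for real agent work, and neither function reflects LangGraph or Temporal APIs.

```python
import asyncio


async def run_agent(name, task):
    # Stand-in for real agent work (an LLM call, a tool run, and so on).
    await asyncio.sleep(0.01)
    return f"{name} finished {task}"


async def supervise(assignments):
    """Run every agent's task concurrently and collect results by agent name."""
    names = list(assignments)
    results = await asyncio.gather(
        *(run_agent(name, assignments[name]) for name in names),
        return_exceptions=True,  # a failing agent is reported, not propagated
    )
    return dict(zip(names, results))


if __name__ == "__main__":
    print(asyncio.run(supervise({"Dave": "api", "Ellis": "ui"})))
```

With `return_exceptions=True`, a failed agent shows up as an exception object in the result map, so the supervisor can decide whether to retry or reassign that task.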

3. Long-Running Workflows:

  • Temporal provides durable execution that survives crashes and restarts
  • Workflows can run for hours/days without losing state
  • Temporal has an official integration with the OpenAI Agents SDK (announced 2025)

4. Agent Communication:

  • Event-driven via Redis Streams, borrowing concepts from the A2A protocol
  • Each agent has its own inbox stream, plus a shared broadcast stream per project
  • Supports task handoffs, context sharing, reviews
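To make the inbox/broadcast layout concrete, here is a small in-memory model of it. It deliberately does not use a Redis client: `InMemoryStreams` is a stand-in whose `add` and `read` loosely mirror what the Redis `XADD` and `XREAD` commands do, and the key names and message fields are hypothetical.

```python
import itertools
from collections import defaultdict


class InMemoryStreams:
    """Append-only streams keyed by name; a stand-in for Redis Streams."""

    def __init__(self):
        self._streams = defaultdict(list)
        self._ids = itertools.count(1)  # monotonically increasing entry ids

    def add(self, stream, fields):
        entry_id = next(self._ids)
        self._streams[stream].append((entry_id, dict(fields)))
        return entry_id

    def read(self, stream, after_id=0):
        # Return only entries newer than the caller's last-seen id.
        return [(i, f) for i, f in self._streams[stream] if i > after_id]


def inbox_key(agent):  # hypothetical key naming
    return f"agent:{agent}:inbox"


def broadcast_key(project):  # hypothetical key naming
    return f"project:{project}:broadcast"


def send_handoff(streams, sender, recipient, task, context):
    """Put a task handoff, with its context, into the recipient's inbox."""
    return streams.add(inbox_key(recipient), {
        "type": "task_handoff",
        "from": sender,
        "task": task,
        "context": context,
    })
```

Each consumer remembers the last entry id it processed and passes it as `after_id`, which is what lets an agent resume reading its inbox after a restart without reprocessing old messages.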

5. Token/Cost Tracking:

  • Per-agent instance tracking (not just per-type)
  • Budget guards with automatic caps
  • Real-time cost visibility via dashboards
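A budget guard of the kind described above could look roughly like this: one budget object per agent instance, checked before each LLM call and updated afterwards. The class and method names are assumptions for illustration, not the TokenUsage model from the spike document.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a planned call would push an agent past its token cap."""


@dataclass
class TokenBudget:
    agent_instance: str  # tracked per instance (e.g. "Dave"), not per type
    cap_tokens: int
    used_tokens: int = 0

    def ensure_can_spend(self, estimated_tokens):
        # Call before an LLM request; refuses work that would exceed the cap.
        if self.used_tokens + estimated_tokens > self.cap_tokens:
            raise BudgetExceeded(
                f"{self.agent_instance}: {self.used_tokens} used, "
                f"{estimated_tokens} requested, cap {self.cap_tokens}"
            )

    def record(self, prompt_tokens, completion_tokens):
        # Call after a request with the real usage; returns tokens remaining.
        self.used_tokens += prompt_tokens + completion_tokens
        return self.cap_tokens - self.used_tokens
```

Splitting the check from the recording keeps actual usage accurate even when the pre-call estimate was off, since tokens already consumed are always counted.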

Syndarix-Specific Considerations

| Requirement | Solution |
|-------------|----------|
| 50+ concurrent agents | LangGraph + Temporal workers (horizontally scalable) |
| Multiple instances of same type | AgentInstance model with unique names (Dave, Ellis, Kate) |
| Individual context/memory | Redis-backed working memory + PostgreSQL persistence |
| Token tracking per agent | TokenUsage model linked to AgentInstance |
| LLM failover per agent | LiteLLM Router with per-agent-type model chains |
| Real-time visibility | WebSocket/SSE activity streams |
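The idea behind "per-agent-type model chains" can be illustrated without any provider library: each agent type maps to an ordered list of models, and a call falls through to the next model when one fails. This is a generic sketch, not LiteLLM's Router API; the model names, the agent types, and the `call_model` callable are all hypothetical.

```python
class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""


# Ordered fallback chain per agent type (placeholder model names).
MODEL_CHAINS = {
    "architect": ["primary-model", "fallback-model"],
    "developer": ["fallback-model"],
}


def complete_with_failover(prompt, model_chain, call_model):
    """Try each model in order; return (model, response) from the first success."""
    errors = []
    for model in model_chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # any provider error triggers the next model
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

Returning the model name alongside the response matters for the token tracking row above, because cost per token differs between the primary and the fallback.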

Risks Identified

  1. Complexity: Temporal + LangGraph learning curve (mitigate: phased rollout)
  2. Cost explosion: Multi-agent systems can use on the order of 15x more tokens than a single chat interaction (mitigate: budget guards)
  3. Debugging: Distributed agents hard to trace (mitigate: LangSmith, Temporal UI)

Implementation Roadmap

  • Weeks 1-2: Foundation (Temporal, LangGraph basics, Redis Streams)
  • Weeks 3-4: Core orchestration (supervisor, spawning, communication)
  • Weeks 5-6: Durability (workflows, checkpoints, recovery)
  • Week 7: Observability (dashboards, tracing, analytics)

Next Steps

  1. Review spike document for completeness
  2. Create ADR-002 documenting the architectural decision
  3. Set up Temporal server in development environment
  4. Create proof-of-concept with 2 agents collaborating

Full document: docs/spikes/SPIKE-002-agent-orchestration-pattern.md (https://gitea.pragmazest.com/cardosofelipe/syndarix/src/branch/main/docs/spikes/SPIKE-002-agent-orchestration-pattern.md)