[SPIKE-002] Agent Orchestration & State Machine #2


cardosofelipe · 2025-12-29T03:50:14Z

cardosofelipe commented

2025-12-29 03:50:14 +00:00

Objective

Design the agent orchestration system that manages agent lifecycle, task assignment, and inter-agent communication.

Key Questions

How do we represent agent state? (idle, working, blocked, waiting_review)
Which state machine library should we use? (transitions for Python, xstate for frontend?)
How do agents communicate with each other?
How do we handle parallel agent execution?
How do we track agent "conversations" and context?
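
The four states named in the first question can be sketched as an enum plus an explicit transition table. This is a minimal, library-free sketch: the set of allowed transitions and the names `AgentState`, `ALLOWED`, `InvalidTransition`, and `Agent` are assumptions for illustration, not a decided design.

```python
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    WORKING = "working"
    BLOCKED = "blocked"
    WAITING_REVIEW = "waiting_review"


# Hypothetical transition table: which states may follow each state.
ALLOWED = {
    AgentState.IDLE: {AgentState.WORKING},
    AgentState.WORKING: {
        AgentState.BLOCKED,
        AgentState.WAITING_REVIEW,
        AgentState.IDLE,
    },
    AgentState.BLOCKED: {AgentState.WORKING},
    AgentState.WAITING_REVIEW: {AgentState.WORKING, AgentState.IDLE},
}


class InvalidTransition(Exception):
    """Raised when a transition is not listed in ALLOWED."""


class Agent:
    def __init__(self, name):
        self.name = name
        self.state = AgentState.IDLE
        self.history = [AgentState.IDLE]  # kept so the lifecycle can be audited

    def transition(self, target):
        # Reject anything the table does not explicitly permit.
        if target not in ALLOWED[self.state]:
            raise InvalidTransition(
                f"{self.name}: {self.state.value} -> {target.value}"
            )
        self.state = target
        self.history.append(target)
```

A library such as transitions would replace the hand-written table and guard, but the shape of the data (states plus permitted edges) would likely stay the same.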

Research Areas

Python state machine libraries (transitions, which is published by the pytransitions project; python-statemachine)
Actor model patterns (similar to Akka but for Python)
Message queue patterns for agent communication
Context management across agent interactions

Expected Deliverables

State machine design for agent lifecycle
Communication protocol between agents
Proof-of-concept with 2 agents collaborating
ADR documenting the pattern

Acceptance Criteria

Agents can transition through defined states
Agents can pass context/messages to each other
Parallel execution works correctly
State is persisted and recoverable
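
As a rough illustration of what a proof-of-concept might need to demonstrate for these criteria, the sketch below runs two agents concurrently with `asyncio`, passes a handoff message between them, and round-trips the result through JSON. The agent roles, message fields, and function names are all hypothetical; real persistence would go to a database or workflow engine rather than a string.

```python
import asyncio
import json


async def developer(outbox: asyncio.Queue) -> None:
    # Hypothetical "developer" agent: does some work, then hands off for review.
    await asyncio.sleep(0)  # stand-in for real work (LLM call, tool use)
    await outbox.put(
        {"from": "developer", "type": "handoff", "context": {"task": "T-1"}}
    )


async def reviewer(inbox: asyncio.Queue) -> dict:
    # Hypothetical "reviewer" agent: waits for a handoff and returns a verdict.
    msg = await inbox.get()
    return {
        "from": "reviewer",
        "task": msg["context"]["task"],
        "verdict": "approved",
    }


async def run_pair() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    # Both coroutines are scheduled together; the reviewer blocks on the queue
    # until the developer publishes its handoff.
    _, review = await asyncio.gather(developer(queue), reviewer(queue))
    return review


def snapshot(state: dict) -> str:
    # Serialize state so it could be written to durable storage.
    return json.dumps(state, sort_keys=True)


def restore(blob: str) -> dict:
    # Rebuild state from a previously stored snapshot.
    return json.loads(blob)
```

Calling `asyncio.run(run_pair())` runs the pair once; `restore(snapshot(result))` returning an equal dict is the minimal form of "persisted and recoverable".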

Labels

spike, architecture, agents


cardosofelipe commented

2025-12-29 12:22:50 +00:00

SPIKE-002 Research Completed

The comprehensive spike document has been created at docs/spikes/SPIKE-002-agent-orchestration-pattern.md.

Executive Summary

After researching leading multi-agent orchestration frameworks (AutoGen 0.4, CrewAI, LangGraph) and enterprise patterns, we recommend a hybrid architecture for Syndarix:

| Component | Technology | Purpose |
|-----------|------------|---------|
| Agent Logic | LangGraph | Graph-based state machines for agent reasoning |
| Durability | Temporal | Long-running workflow persistence (hours/days) |
| Communication | Redis Streams | Event-driven agent-to-agent messaging |
| Topology | Hierarchical Supervisor | Project Manager coordinates teams |
| LLM Access | LiteLLM | Unified provider abstraction with failover |

Key Findings

1. Framework Comparison:

AutoGen 0.4: Microsoft-centric, event-driven, good for standardized patterns
CrewAI: Easy to start, but teams report hitting walls at 6-12 months
LangGraph: Most flexible, steep learning curve, best for complex stateful workflows
Recommendation: Don't adopt any single framework wholesale; build a custom layer using the best components

2. Orchestration Patterns:

Supervisor pattern for sprint coordination (Project Manager as supervisor)
Hierarchical pattern for team structure (Architect leads Dev team)
Peer-to-peer for brainstorming within teams
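
To make the supervisor pattern concrete, here is a minimal sketch in which a supervisor assigns each task to the least-loaded worker with a matching role. The `Worker` and `Supervisor` classes, the role strings, and the least-loaded rule are assumptions for illustration; the spike does not specify an assignment policy.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    role: str
    tasks: list = field(default_factory=list)


@dataclass
class Supervisor:
    workers: list

    def assign(self, task: str, role: str) -> Worker:
        # Only workers with the requested role are eligible.
        candidates = [w for w in self.workers if w.role == role]
        if not candidates:
            raise LookupError(f"no worker with role {role!r}")
        # Hypothetical policy: pick the worker with the fewest assigned tasks;
        # ties go to whichever worker was registered first.
        chosen = min(candidates, key=lambda w: len(w.tasks))
        chosen.tasks.append(task)
        return chosen
```

A hierarchical topology could reuse the same shape by letting a worker itself hold a `Supervisor` for its own team.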

3. Long-Running Workflows:

Temporal provides durable execution that survives crashes and restarts
Workflows can run for hours/days without losing state
Temporal has an official integration with the OpenAI Agents SDK (announced 2025)

4. Agent Communication:

Event-driven via Redis Streams (A2A protocol concepts)
Each agent has inbox stream + project broadcast stream
Supports task handoffs, context sharing, reviews
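
The inbox-plus-broadcast layout could look roughly like the sketch below. The key naming scheme, the message fields, and the `send` helper are assumptions, not the spike's actual protocol. The client is typed as anything exposing `xadd(name, fields)`, which is the call shape redis-py's `Redis.xadd` accepts, so the sketch runs without a Redis server.

```python
import json
from typing import Any, Optional, Protocol


class StreamClient(Protocol):
    # Any object with this method works, e.g. a redis-py client or a test fake.
    def xadd(self, name: str, fields: dict) -> Any: ...


def inbox_key(project_id: str, agent_id: str) -> str:
    # Hypothetical naming scheme: one inbox stream per agent instance.
    return f"project:{project_id}:agent:{agent_id}:inbox"


def broadcast_key(project_id: str) -> str:
    # Hypothetical naming scheme: one broadcast stream per project.
    return f"project:{project_id}:broadcast"


def send(
    client: StreamClient,
    project_id: str,
    sender: str,
    recipient: Optional[str],
    msg_type: str,
    payload: dict,
) -> Any:
    # Stream entries are flat field/value maps, so the nested payload is
    # stored as a JSON string.
    fields = {"from": sender, "type": msg_type, "payload": json.dumps(payload)}
    # recipient=None means "everyone in the project".
    if recipient is None:
        key = broadcast_key(project_id)
    else:
        key = inbox_key(project_id, recipient)
    return client.xadd(key, fields)
```

A task handoff would then be `send(client, "p1", "dave", "kate", "handoff", {...})`, and a status update to the whole project would pass `None` as the recipient.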

5. Token/Cost Tracking:

Per-agent instance tracking (not just per-type)
Budget guards with automatic caps
Real-time cost visibility via dashboards
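
A budget guard per agent instance might be as small as the sketch below. The class name, the token-based cap, and the check-before/record-after split are assumptions; the spike only states that guards with automatic caps are wanted.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request would push an agent past its cap."""


@dataclass
class TokenBudget:
    agent_instance: str  # tracked per instance (e.g. "dave"), not per agent type
    max_tokens: int
    used_tokens: int = 0

    def check(self, estimated_tokens: int) -> None:
        # Call before an LLM request; refuses requests that would exceed the cap.
        if self.used_tokens + estimated_tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.agent_instance}: {self.used_tokens} used, "
                f"{estimated_tokens} requested, cap {self.max_tokens}"
            )

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Call after the request with the usage the provider actually reported.
        self.used_tokens += prompt_tokens + completion_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.used_tokens)
```

Because actual usage can exceed the estimate, `used_tokens` may overshoot the cap slightly; the next `check` call then fails, which is the "automatic cap" behavior in this sketch.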

Syndarix-Specific Considerations

| Requirement | Solution |
|-------------|----------|
| 50+ concurrent agents | LangGraph + Temporal workers (horizontally scalable) |
| Multiple instances of same type | AgentInstance model with unique names (Dave, Ellis, Kate) |
| Individual context/memory | Redis-backed working memory + PostgreSQL persistence |
| Token tracking per agent | TokenUsage model linked to AgentInstance |
| LLM failover per agent | LiteLLM Router with per-agent-type model chains |
| Real-time visibility | WebSocket/SSE activity streams |
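
The idea of per-agent-type model chains with failover can be sketched generically as follows. This does not use LiteLLM's API; the chain contents, the placeholder model names, and `complete_with_failover` are hypothetical, and the `call_model` callable stands in for whatever client actually issues the request.

```python
from typing import Callable

# Hypothetical per-agent-type chains; model names are placeholders.
MODEL_CHAINS = {
    "architect": ["primary-large", "fallback-large"],
    "developer": ["primary-medium", "fallback-medium", "fallback-small"],
}


def complete_with_failover(
    agent_type: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> tuple:
    """Try each model in the agent type's chain; return (model, response)."""
    errors = []
    for model in MODEL_CHAINS[agent_type]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{model}: {exc}")
    # Every model in the chain failed; surface all errors together.
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the response lets the caller attribute token usage to the model that actually served the request.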

Risks Identified

Complexity: Temporal + LangGraph learning curve (mitigate: phased rollout)
Cost explosion: Multi-agent systems can use roughly 15x more tokens than a single chat interaction (mitigate: budget guards)
Debugging: Distributed agents hard to trace (mitigate: LangSmith, Temporal UI)

Implementation Roadmap

Weeks 1-2: Foundation (Temporal, LangGraph basics, Redis Streams)
Weeks 3-4: Core orchestration (supervisor, spawning, communication)
Weeks 5-6: Durability (workflows, checkpoints, recovery)
Week 7: Observability (dashboards, tracing, analytics)

Next Steps

Review spike document for completeness
Create ADR-002 documenting the architectural decision
Set up Temporal server in development environment
Create proof-of-concept with 2 agents collaborating

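Full document: docs/spikes/SPIKE-002-agent-orchestration-pattern.md (https://gitea.pragmazest.com/cardosofelipe/syndarix/src/branch/main/docs/spikes/SPIKE-002-agent-orchestration-pattern.md)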


cardosofelipe referenced this issue from a commit

2025-12-29 12:31:10 +00:00

docs: add architecture spikes and deep analysis documentation

cardosofelipe closed this issue

2025-12-29 12:31:46 +00:00


Reference: cardosofelipe/syndarix#2