Syndarix Architecture Deep Analysis
Version: 1.0 Date: 2025-12-29 Status: Draft - Architectural Thinking
Executive Summary
This document captures deep architectural thinking about Syndarix beyond the immediate spikes. It addresses complex challenges that arise when building a truly autonomous multi-agent system and proposes solutions based on first principles.
1. Agent Memory and Context Management
The Challenge
Agents in Syndarix may work on projects for weeks or months. LLM context windows are finite (128K-200K tokens), but project context grows without bound. How do we maintain coherent agent "memory" over time?
Analysis
Context Window Constraints:
| Model | Context Window | Practical Limit (with tools) |
|---|---|---|
| Claude 3.5 Sonnet | 200K tokens | ~150K usable |
| GPT-4 Turbo | 128K tokens | ~100K usable |
| Llama 3 / 3.1 (70B) | 8K (Llama 3) - 128K (Llama 3.1) tokens | ~80K usable |
Memory Types Needed:
- Working Memory - Current task context (fits in context window)
- Short-term Memory - Recent conversation history (RAG-retrievable)
- Long-term Memory - Project knowledge, past decisions (RAG + summarization)
- Episodic Memory - Specific past events/mistakes to learn from
Proposed Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Agent Memory System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Working │ │ Short-term │ │ Long-term │ │
│ │ Memory │ │ Memory │ │ Memory │ │
│ │ (Context) │ │ (Redis) │ │ (pgvector) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └───────────────────┼──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Context Assembler │ │
│ │ │ │
│ │ 1. System prompt (agent personality, role) │ │
│ │ 2. Project context (from long-term memory) │ │
│ │ 3. Task context (current issue, requirements) │ │
│ │ 4. Relevant history (from short-term memory) │ │
│ │ 5. User message │ │
│ │ │ │
│ │ Total: Fit within context window limits │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Context Compression Strategy:
```python
class ContextManager:
    """Manages agent context to fit within LLM limits."""

    MAX_CONTEXT_TOKENS = 100_000  # Leave room for response

    async def build_context(
        self,
        agent: AgentInstance,
        task: Task,
        user_message: str
    ) -> list[Message]:
        # Fixed costs
        system_prompt = self._get_system_prompt(agent)  # ~2K tokens
        task_context = self._get_task_context(task)     # ~1K tokens

        # Variable budget
        remaining = self.MAX_CONTEXT_TOKENS - token_count(system_prompt, task_context, user_message)

        # Allocate remaining to memories
        long_term = await self._query_long_term(agent, task, budget=remaining * 0.4)
        short_term = await self._get_short_term(agent, budget=remaining * 0.4)
        episodic = await self._get_relevant_episodes(agent, task, budget=remaining * 0.2)

        return self._assemble_messages(
            system_prompt, task_context, long_term, short_term, episodic, user_message
        )
```
Conversation Summarization:
- After every N turns (e.g., 10), summarize conversation and archive
- Use smaller/cheaper model for summarization
- Store summaries in pgvector for semantic retrieval
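A minimal sketch of this summarization step, assuming a LiteLLM-style `acompletion` call (per SPIKE-010) and a hypothetical `store_summary` helper that embeds and writes to pgvector; the model name and turn threshold are illustrative:

```python
import litellm

SUMMARIZE_EVERY_N_TURNS = 10   # illustrative threshold
SUMMARY_MODEL = "gpt-4o-mini"  # any cheap summarization model

async def maybe_summarize(conversation: list[dict], agent_id: str, store_summary) -> list[dict]:
    """After every N turns, compress older turns into a summary and archive it for retrieval."""
    if len(conversation) < SUMMARIZE_EVERY_N_TURNS:
        return conversation

    old_turns, recent_turns = conversation[:-4], conversation[-4:]  # keep the last few turns verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)

    response = await litellm.acompletion(
        model=SUMMARY_MODEL,
        messages=[
            {"role": "system", "content": "Summarize this agent conversation, preserving decisions and open questions."},
            {"role": "user", "content": transcript},
        ],
    )
    summary = response.choices[0].message.content

    # store_summary is a hypothetical helper: embed the text and insert it into pgvector.
    await store_summary(agent_id=agent_id, text=summary)

    # Replace the archived turns with one compact summary message.
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent_turns
```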
Recommendation
Implement a tiered memory system with automatic context compression and semantic retrieval. Use Redis for hot short-term memory, pgvector for cold long-term memory, and automatic summarization to prevent context overflow.
2. Cross-Project Knowledge Sharing
The Challenge
Each project has isolated knowledge, but agents could benefit from cross-project learnings:
- Common patterns (authentication, testing, CI/CD)
- Technology expertise (how to configure Kubernetes)
- Anti-patterns (what didn't work before)
Analysis
Privacy Considerations:
- Client data must remain isolated (contractual, legal)
- Technical patterns are generally shareable
- Need clear data classification
Knowledge Categories:
| Category | Scope | Examples |
|---|---|---|
| Client Data | Project-only | Requirements, business logic, code |
| Technical Patterns | Global | Best practices, configurations |
| Agent Learnings | Global | What approaches worked/failed |
| Anti-patterns | Global | Common mistakes to avoid |
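One way to make this classification explicit in code is a scope enum on each knowledge entry that defaults to the most restrictive level; the names below are a sketch, not a settled schema:

```python
from dataclasses import dataclass
from enum import Enum

class KnowledgeScope(str, Enum):
    PROJECT_ONLY = "project_only"  # client data: requirements, business logic, code
    GLOBAL = "global"              # technical patterns, agent learnings, anti-patterns

@dataclass
class KnowledgeEntry:
    content: str
    category: str                                        # e.g. "client_data", "technical_pattern"
    scope: KnowledgeScope = KnowledgeScope.PROJECT_ONLY  # default to the restrictive scope
    reviewed_by: str | None = None                       # set once a human approves promotion
```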
Proposed Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Knowledge Graph │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GLOBAL KNOWLEDGE │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Patterns │ │ Anti-patterns│ │ Expertise │ │ │
│ │ │ Library │ │ Library │ │ Index │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Curated extraction │
│ │ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Project A │ │ Project B │ │ Project C │ │
│ │ Knowledge │ │ Knowledge │ │ Knowledge │ │
│ │ (Isolated) │ │ (Isolated) │ │ (Isolated) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Knowledge Extraction Pipeline:
```python
class KnowledgeExtractor:
    """Extracts shareable learnings from project work."""

    async def extract_learnings(self, project_id: str) -> list[Learning]:
        """
        Run periodically or after sprints to extract learnings.
        Human review required before promoting to global.
        """
        # Get completed work
        completed_issues = await self.get_completed_issues(project_id)

        # Extract patterns using LLM
        patterns = await self.llm.extract_patterns(
            completed_issues,
            categories=["architecture", "testing", "deployment", "security"]
        )

        # Classify privacy
        for pattern in patterns:
            pattern.privacy_level = await self.llm.classify_privacy(pattern)

        # Return only shareable patterns for review
        return [p for p in patterns if p.privacy_level == "public"]
```
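The human review gate itself could be a small API that promotes an approved learning into the global index. This is a sketch only; `get_current_admin`, `learning_repo`, and `global_knowledge` are hypothetical names, not part of the current design:

```python
from uuid import UUID
from fastapi import APIRouter, Depends, HTTPException

router = APIRouter()

@router.post("/learnings/{learning_id}/promote")
async def promote_learning(
    learning_id: UUID,
    current_user=Depends(get_current_admin),  # hypothetical: only privileged reviewers may promote
):
    """Human review gate: promote a reviewed project learning into the global knowledge base."""
    learning = await learning_repo.get(learning_id)  # hypothetical repository
    if learning is None or learning.privacy_level != "public":
        raise HTTPException(status_code=400, detail="Learning is not eligible for promotion")

    learning.promoted_by = current_user.id
    await global_knowledge.add(learning)  # copy into the global pgvector index
    return {"status": "promoted", "learning_id": str(learning_id)}
```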
Recommendation
Implement privacy-aware knowledge extraction with human review gate. Project knowledge stays isolated by default; only explicitly approved patterns flow to global knowledge.
3. Agent Specialization vs Generalization Trade-offs
The Challenge
Should each agent type be highly specialized (depth) or have overlapping capabilities (breadth)?
Analysis
Specialization Benefits:
- Deeper expertise in domain
- Cleaner system prompts
- Less confusion about responsibilities
- Easier to optimize prompts per role
Generalization Benefits:
- Fewer agent types to maintain
- Smoother handoffs (shared context)
- More flexible team composition
- Graceful degradation if agent unavailable
Current Agent Types (10):
| Role | Primary Domain | Potential Overlap |
|---|---|---|
| Product Owner | Requirements | Business Analyst |
| Business Analyst | Documentation | Product Owner |
| Project Manager | Planning | Product Owner |
| Software Architect | Design | Senior Engineer |
| Software Engineer | Coding | Architect, QA |
| UI/UX Designer | Interface | Frontend Engineer |
| QA Engineer | Testing | Software Engineer |
| DevOps Engineer | Infrastructure | Senior Engineer |
| AI/ML Engineer | ML/AI | Software Engineer |
| Security Expert | Security | All |
Proposed Approach: Layered Specialization
┌─────────────────────────────────────────────────────────────────┐
│ Agent Capability Layers │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 3: Role-Specific Expertise │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Product │ │ Architect│ │Engineer │ │ QA │ │
│ │ Owner │ │ │ │ │ │ │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ Layer 2: Shared Professional Skills │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Technical Communication | Code Understanding | Git │ │
│ │ Documentation | Research | Problem Decomposition │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ Layer 1: Foundation Model Capabilities │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Reasoning | Analysis | Writing | Coding (LLM Base) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Capability Inheritance:
```python
class AgentTypeBuilder:
    """Builds agent types with layered capabilities."""

    BASE_CAPABILITIES = [
        "reasoning", "analysis", "writing", "coding_assist"
    ]

    PROFESSIONAL_SKILLS = [
        "technical_communication", "code_understanding",
        "git_operations", "documentation", "research"
    ]

    ROLE_SPECIFIC = {
        "ENGINEER": ["code_generation", "code_review", "testing", "debugging"],
        "ARCHITECT": ["system_design", "adr_writing", "tech_selection"],
        "QA": ["test_planning", "test_automation", "bug_reporting"],
        # ...
    }

    def build_capabilities(self, role: AgentRole) -> list[str]:
        return (
            self.BASE_CAPABILITIES +
            self.PROFESSIONAL_SKILLS +
            self.ROLE_SPECIFIC[role]
        )
```
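For illustration, a system prompt could then be composed from these layers by mapping each capability to a prompt fragment; `CAPABILITY_PROMPTS` and `build_system_prompt` are assumed names for this sketch, not existing interfaces:

```python
# Hypothetical mapping from capability name to a reusable prompt fragment.
CAPABILITY_PROMPTS = {
    "code_generation": "You write production-quality code with tests.",
    "code_review": "You review diffs for correctness, style, and security.",
    "system_design": "You produce architecture proposals and ADRs.",
    # ... one fragment per capability
}

def build_system_prompt(builder: AgentTypeBuilder, role: str) -> str:
    """Compose a system prompt from base + professional + role-specific capability fragments."""
    fragments = [
        CAPABILITY_PROMPTS[cap]
        for cap in builder.build_capabilities(role)
        if cap in CAPABILITY_PROMPTS
    ]
    return "\n".join([f"You are the team's {role.title()}.", *fragments])

# e.g. build_system_prompt(AgentTypeBuilder(), "ENGINEER")
```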
Recommendation
Adopt layered specialization where all agents share foundational and professional capabilities, with role-specific expertise on top. This enables smooth collaboration while maintaining clear responsibilities.
4. Human-Agent Collaboration Model
The Challenge
Beyond approval gates, how do humans effectively collaborate with autonomous agents during active work?
Interaction Patterns
| Pattern | Use Case | Frequency |
|---|---|---|
| Approval | Confirm before action | Per checkpoint |
| Guidance | Steer direction | On-demand |
| Override | Correct mistake | Rare |
| Pair Working | Work together | Optional |
| Review | Evaluate output | Post-completion |
Proposed Collaboration Interface
┌─────────────────────────────────────────────────────────────────┐
│ Human-Agent Collaboration Dashboard │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Activity Stream │ │
│ │ ────────────────────────────────────────────────────── │ │
│ │ [10:23] Dave (Engineer) is implementing login API │ │
│ │ [10:24] Dave created auth/service.py │ │
│ │ [10:25] Dave is writing unit tests │ │
│ │ [LIVE] Dave: "I'm adding JWT validation. Using HS256..." │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Intervention Panel │ │
│ │ │ │
│ │ [💬 Chat] [⏸️ Pause] [↩️ Undo Last] [📝 Guide] │ │
│ │ │ │
│ │ Quick Guidance: │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ "Use RS256 instead of HS256 for JWT signing" │ │ │
│ │ │ [Send] 📤 │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Intervention API:
```python
from uuid import UUID

from fastapi import APIRouter, Depends, HTTPException

router = APIRouter()

@router.post("/agents/{agent_id}/intervene")
async def intervene(
    agent_id: UUID,
    intervention: InterventionRequest,
    current_user: User = Depends(get_current_user)
):
    """Allow a human to intervene in agent work."""
    match intervention.type:
        case "pause":
            await orchestrator.pause_agent(agent_id)
        case "resume":
            await orchestrator.resume_agent(agent_id)
        case "guide":
            await orchestrator.send_guidance(agent_id, intervention.message)
        case "undo":
            await orchestrator.undo_last_action(agent_id)
        case "override":
            await orchestrator.override_decision(agent_id, intervention.decision)
        case _:
            raise HTTPException(status_code=400, detail=f"Unknown intervention type: {intervention.type}")
    return {"status": "accepted", "type": intervention.type}
```
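The `InterventionRequest` schema referenced above is not pinned down yet; a plausible Pydantic sketch:

```python
from typing import Literal
from pydantic import BaseModel

class InterventionRequest(BaseModel):
    """Payload for the intervene endpoint; fields beyond `type` are optional and type-dependent."""
    type: Literal["pause", "resume", "guide", "undo", "override"]
    message: str | None = None     # used with "guide"
    decision: dict | None = None   # used with "override"
```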
Recommendation
Build a real-time collaboration dashboard with intervention capabilities. Humans should be able to observe, guide, pause, and correct agents without stopping the entire workflow.
5. Testing Strategy for Autonomous AI Systems
The Challenge
Traditional testing (unit, integration, E2E) doesn't capture the non-deterministic, open-ended behavior of autonomous agents. How do we ensure quality?
Testing Pyramid for AI Agents
▲
╱ ╲
╱ ╲
╱ E2E ╲ Agent Scenarios
╱ Agent ╲ (Full workflows)
╱─────────╲
╱ Integration╲ Tool + LLM Integration
╱ (with mocks) ╲ (Deterministic responses)
╱─────────────────╲
╱ Unit Tests ╲ Orchestrator, Services
╱ (no LLM needed) ╲ (Pure logic)
╱───────────────────────╲
╱ Prompt Testing ╲ System prompt evaluation
╱ (LLM evals) ╲(Quality metrics)
╱─────────────────────────────╲
Test Categories
1. Prompt Testing (Eval Framework):
```python
class PromptEvaluator:
    """Evaluate system prompt quality."""

    TEST_CASES = [
        EvalCase(
            name="requirement_extraction",
            input="Client wants a mobile app for food delivery",
            expected_behaviors=[
                "asks clarifying questions",
                "identifies stakeholders",
                "considers non-functional requirements"
            ]
        ),
        EvalCase(
            name="code_review_thoroughness",
            input="Review this PR: [vulnerable SQL code]",
            expected_behaviors=[
                "identifies SQL injection",
                "suggests parameterized queries",
                "mentions security best practices"
            ]
        )
    ]

    async def evaluate(self, agent_type: AgentType) -> EvalReport:
        results = []
        for case in self.TEST_CASES:
            response = await self.llm.complete(
                system=agent_type.system_prompt,
                user=case.input
            )
            score = await self.judge_response(response, case.expected_behaviors)
            results.append(score)
        return EvalReport(results)
```
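`judge_response` is left abstract above; one common approach is LLM-as-judge, sketched here as a standalone function using a LiteLLM-style call (the evaluator method could delegate to it). The rubric format and judge model are assumptions:

```python
import json
import litellm

async def judge_response(response: str, expected_behaviors: list[str], judge_model: str = "gpt-4o") -> float:
    """Ask a judge model which expected behaviors the response exhibits; return the fraction satisfied."""
    rubric = "\n".join(f"- {b}" for b in expected_behaviors)
    judgment = await litellm.acompletion(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "For each expected behavior, answer true or false, as a JSON list of booleans only.\n"
                f"Expected behaviors:\n{rubric}\n\nResponse to evaluate:\n{response}"
            ),
        }],
    )
    # A production version would constrain the output format and handle parse failures.
    verdicts = json.loads(judgment.choices[0].message.content)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```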
2. Integration Testing (Mock LLM):
```python
import pytest

@pytest.fixture
def mock_llm():
    """Deterministic LLM responses for integration tests."""
    responses = {
        "analyze requirements": "...",
        "generate code": "def hello(): return 'world'",
        "review code": "LGTM"
    }
    return MockLLM(responses)

# Async tests assume an async-capable pytest setup (e.g. pytest-asyncio in auto mode).
async def test_story_implementation_workflow(mock_llm):
    """Test full workflow with predictable responses."""
    orchestrator = AgentOrchestrator(llm=mock_llm)

    result = await orchestrator.execute_workflow(
        workflow="implement_story",
        inputs={"story_id": "TEST-123"}
    )

    assert result.status == "completed"
    assert "hello" in result.artifacts["code"]
```
3. Agent Scenario Testing:
```python
class AgentScenarioTest:
    """End-to-end agent behavior testing."""

    @scenario("engineer_handles_bug_report")
    async def test_bug_resolution(self):
        """Engineer agent should fix bugs correctly."""
        # Setup
        project = await create_test_project()
        engineer = await spawn_agent("engineer", project)

        # Act
        bug = await create_issue(
            project,
            title="Login button not working",
            type="bug"
        )
        result = await engineer.handle(bug)

        # Assert
        assert result.pr_created
        assert result.tests_pass
        assert "button" in result.changes_summary.lower()
```
Recommendation
Implement a multi-layer testing strategy with prompt evals, deterministic integration tests, and scenario-based agent testing. Use LLM-as-judge for evaluating open-ended responses.
6. Rollback and Recovery
The Challenge
Autonomous agents will make mistakes. How do we recover gracefully?
Error Categories
| Category | Example | Recovery Strategy |
|---|---|---|
| Reversible | Wrong code generated | Revert commit, regenerate |
| Partially Reversible | Merged bad PR | Revert PR, fix, re-merge |
| Non-reversible | Deployed to production | Forward-fix or rollback deploy |
| External Side Effects | Email sent to client | Apology + correction |
Recovery Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Recovery System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Action Log │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ Action ID | Agent | Type | Reversible | State │ │ │
│ │ ├──────────────────────────────────────────────────┤ │ │
│ │ │ a-001 | Dave | commit | Yes | completed │ │ │
│ │ │ a-002 | Dave | push | Yes | completed │ │ │
│ │ │ a-003 | Dave | create_pr | Yes | completed │ │ │
│ │ │ a-004 | Kate | merge_pr | Partial | completed │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Rollback Engine │ │
│ │ │ │
│ │ rollback_to(action_id) -> Reverses all actions after │ │
│ │ undo_action(action_id) -> Reverses single action │ │
│ │ compensate(action_id) -> Creates compensating action │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Action Logging:
```python
from datetime import datetime
from uuid import UUID, uuid4

class ActionLog:
    """Immutable log of all agent actions for recovery."""

    async def record(
        self,
        agent_id: UUID,
        action_type: str,
        inputs: dict,
        outputs: dict,
        reversible: bool,
        reverse_action: str | None = None
    ) -> ActionRecord:
        record = ActionRecord(
            id=uuid4(),
            agent_id=agent_id,
            action_type=action_type,
            inputs=inputs,
            outputs=outputs,
            reversible=reversible,
            reverse_action=reverse_action,
            timestamp=datetime.utcnow()
        )
        await self.db.add(record)
        return record

    async def rollback_to(self, action_id: UUID) -> RollbackResult:
        """Rollback all actions after the given action."""
        actions = await self.get_actions_after(action_id)
        results = []
        for action in reversed(actions):
            if action.reversible:
                result = await self._execute_reverse(action)
                results.append(result)
            else:
                results.append(RollbackSkipped(action, reason="non-reversible"))
        return RollbackResult(results)
```
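`_execute_reverse` presumes some mapping from an action type to its inverse operation. A sketch of that registry follows; `git_revert`, `close_pr`, and `force_push_previous` are hypothetical tool functions, not existing APIs:

```python
# Hypothetical registry: action type -> callable that reverses it given the original record.
REVERSE_HANDLERS = {
    "commit": lambda record: git_revert(record.outputs["commit_sha"]),
    "create_pr": lambda record: close_pr(record.outputs["pr_number"]),
    "push": lambda record: force_push_previous(record.inputs["branch"], record.inputs["before_sha"]),
}

async def execute_reverse(record):
    """Look up and run the reverse handler for an action record."""
    handler = REVERSE_HANDLERS.get(record.action_type)
    if handler is None:
        raise ValueError(f"No reverse handler registered for {record.action_type}")
    return await handler(record)
```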
Compensation Pattern:
```python
class CompensationEngine:
    """Handles compensating actions for non-reversible operations."""

    COMPENSATIONS = {
        "email_sent": "send_correction_email",
        "deployment": "rollback_deployment",
        "external_api_call": "create_reversal_request"
    }

    async def compensate(self, action: ActionRecord) -> CompensationResult:
        if action.action_type in self.COMPENSATIONS:
            compensation = self.COMPENSATIONS[action.action_type]
            return await self._execute_compensation(compensation, action)
        else:
            return CompensationResult(
                status="manual_required",
                message=f"No automatic compensation for {action.action_type}"
            )
```
Recommendation
Implement comprehensive action logging with rollback capabilities. Define compensation strategies for non-reversible actions. Enable point-in-time recovery for project state.
7. Security Considerations for Autonomous Agents
Threat Model
| Threat | Risk | Mitigation |
|---|---|---|
| Agent executes malicious code | High | Sandboxed execution, code review gates |
| Agent exfiltrates data | High | Network isolation, output filtering |
| Prompt injection via user input | Medium | Input sanitization, prompt hardening |
| Agent credential abuse | Medium | Least-privilege tokens, short TTL |
| Agent collusion | Low | Independent agent instances, monitoring |
Security Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Security Layers │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 4: Output Filtering │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Code scan before commit │ │
│ │ - Secrets detection │ │
│ │ - Policy compliance check │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Layer 3: Action Authorization │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Role-based permissions │ │
│ │ - Project scope enforcement │ │
│ │ - Sensitive action approval │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Input Sanitization │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Prompt injection detection │ │
│ │ - Content filtering │ │
│ │ - Schema validation │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Layer 1: Infrastructure Isolation │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ - Container sandboxing │ │
│ │ - Network segmentation │ │
│ │ - File system restrictions │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
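As one concrete example of Layer 4 output filtering, a pre-commit secrets check over agent-produced diffs; the patterns below are illustrative only, and a real deployment would use a dedicated scanner such as detect-secrets or gitleaks:

```python
import re

# Illustrative patterns only; real scanners maintain far broader rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                                   # AWS access key ID
    re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),       # private key material
    re.compile(r"(?i)(api[_-]?key|secret|password)\s*[:=]\s*\S{8,}"),  # inline credentials
]

def scan_for_secrets(diff: str) -> list[str]:
    """Return secret-like matches in an agent-produced diff; block the commit if any are found."""
    findings: list[str] = []
    for pattern in SECRET_PATTERNS:
        findings.extend(match.group(0) for match in pattern.finditer(diff))
    return findings
```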
Recommendation
Implement defense-in-depth with multiple security layers. Assume agents can be compromised and design for containment.
Summary of Recommendations
| Area | Recommendation | Priority |
|---|---|---|
| Memory | Tiered memory with context compression | High |
| Knowledge | Privacy-aware extraction with human gate | Medium |
| Specialization | Layered capabilities with role-specific top | Medium |
| Collaboration | Real-time dashboard with intervention | High |
| Testing | Multi-layer with prompt evals | High |
| Recovery | Action logging with rollback engine | High |
| Security | Defense-in-depth, assume compromise | High |
Next Steps
- Validate with spike research - Update based on spike findings
- Create detailed ADRs - For memory, recovery, security
- Prototype critical paths - Memory system, rollback engine
- Security review - External audit before production
This document captures architectural thinking to guide implementation. It should be updated as spikes complete and design evolves.