forked from cardosofelipe/fast-next-template
docs: add remaining ADRs and comprehensive architecture documentation
Added 7 new Architecture Decision Records completing the full set:

- ADR-008: Knowledge Base and RAG (pgvector)
- ADR-009: Agent Communication Protocol (structured messages)
- ADR-010: Workflow State Machine (transitions + PostgreSQL)
- ADR-011: Issue Synchronization (webhook-first + polling)
- ADR-012: Cost Tracking (LiteLLM callbacks + Redis budgets)
- ADR-013: Audit Logging (hash chaining + tiered storage)
- ADR-014: Client Approval Flow (checkpoint-based)

Added a comprehensive ARCHITECTURE.md that:

- Summarizes all 14 ADRs in a decision matrix
- Documents the full system architecture with diagrams
- Explains all component interactions
- Details the technology stack with a self-hostability guarantee
- Covers security, scalability, and deployment

Updated IMPLEMENTATION_ROADMAP.md to mark Phase 0 completed items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
docs/adrs/ADR-008-knowledge-base-rag.md (new file, 170 lines)

# ADR-008: Knowledge Base and RAG Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-006

---

## Context

Syndarix agents require access to project-specific knowledge bases for Retrieval-Augmented Generation (RAG). This enables agents to reference requirements, codebase context, documentation, and past decisions when performing tasks.

## Decision Drivers

- **Operational Simplicity:** Minimize infrastructure complexity
- **Performance:** Sub-100ms query latency
- **Isolation:** Per-project knowledge separation
- **Cost:** Avoid expensive dedicated vector databases
- **Flexibility:** Support multiple content types (code, docs, conversations)

## Considered Options

### Option 1: Dedicated Vector Database (Pinecone, Qdrant)

**Pros:**
- Purpose-built for vector search
- Excellent query performance at scale
- Managed offerings available

**Cons:**
- Additional infrastructure
- Cost at scale ($27-$70/month per 1M vectors)
- Data sync complexity with PostgreSQL

### Option 2: pgvector Extension (Selected)

**Pros:**
- Already using PostgreSQL (zero additional infrastructure)
- ACID transactions with application data
- Row-level security for multi-tenant isolation
- Handles 10-100M vectors effectively
- Hybrid search with PostgreSQL full-text

**Cons:**
- Less performant than dedicated solutions at billion-scale
- Requires PostgreSQL 15+

### Option 3: Weaviate (Self-hosted)

**Pros:**
- Multi-modal support
- Knowledge graph features

**Cons:**
- Additional service to manage
- Overkill for our scale

## Decision

**Adopt pgvector** as the vector store for RAG functionality.

Syndarix's per-project isolation means knowledge bases remain in the thousands to millions of vectors per tenant, well within pgvector's optimal range. The operational simplicity of using existing PostgreSQL infrastructure outweighs the performance benefits of dedicated vector databases.

## Implementation

### Embedding Model Strategy

| Content Type | Embedding Model | Dimensions | Rationale |
|---------------|------------------------|------------|--------------------------------------|
| Code files | voyage-code-3 | 1024 | State-of-the-art for code retrieval |
| Documentation | text-embedding-3-small | 1536 | Good cost/quality balance |
| Conversations | text-embedding-3-small | 1536 | General purpose |

### Chunking Strategy

| Content Type | Strategy | Chunk Size |
|----------------|----------------------------|--------------|
| Python/JS code | AST-based (function/class) | Per function |
| Markdown docs | Heading-based | Per section |
| PDF specs | Page-level + semantic | 1000 tokens |
| Conversations | Turn-based | Per exchange |
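
As an illustration of the heading-based strategy, a minimal Markdown chunker could split on headings like this (a sketch; the function name and regex are assumptions, not the shipped implementation):

```python
import re

def chunk_markdown(text: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading section."""
    # Split before every ATX heading (#, ##, ...), keeping the heading line
    # at the start of each section.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        heading = re.match(r"#{1,6}\s+(.*)", section)
        chunks.append({
            "content": section.strip(),
            "metadata": {"heading": heading.group(1) if heading else None},
        })
    return chunks
```

Each chunk would then be embedded with the model from the table above and inserted into the `knowledge_chunks` table below.
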
### Database Schema

```sql
CREATE TABLE knowledge_chunks (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL REFERENCES projects(id),
    source_type VARCHAR(50) NOT NULL,  -- 'code', 'doc', 'conversation'
    source_path VARCHAR(500),
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON knowledge_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
CREATE INDEX ON knowledge_chunks (project_id);
CREATE INDEX ON knowledge_chunks USING gin (metadata);
```

### Hybrid Search

```python
async def hybrid_search(
    project_id: str,
    query: str,
    top_k: int = 10,
    vector_weight: float = 0.7
) -> list[Chunk]:
    """Combine vector similarity with keyword matching."""
    query_embedding = await embed(query)

    results = await db.execute("""
        WITH vector_results AS (
            SELECT id, content, metadata,
                   1 - (embedding <=> $1) AS vector_score
            FROM knowledge_chunks
            WHERE project_id = $2
            ORDER BY embedding <=> $1
            LIMIT $3 * 2
        ),
        keyword_results AS (
            SELECT id, content, metadata,
                   ts_rank(to_tsvector(content), plainto_tsquery($4)) AS text_score
            FROM knowledge_chunks
            WHERE project_id = $2
              AND to_tsvector(content) @@ plainto_tsquery($4)
            ORDER BY text_score DESC
            LIMIT $3 * 2
        )
        SELECT id, content, metadata,
               COALESCE(v.vector_score, 0) * $5 +
               COALESCE(k.text_score, 0) * (1 - $5) AS combined_score
        FROM vector_results v
        FULL OUTER JOIN keyword_results k USING (id, content, metadata)
        ORDER BY combined_score DESC
        LIMIT $3
    """, query_embedding, project_id, top_k, query, vector_weight)

    return results
```

## Consequences

### Positive
- Zero additional infrastructure
- Transactional consistency with application data
- Unified backup/restore
- Row-level security for tenant isolation

### Negative
- May need migration to a dedicated vector DB if scaling beyond 100M vectors
- Index tuning required for optimal performance

### Migration Path
If scale requires it, migrate to Qdrant (self-hosted, open-source) with the same embedding models, preserving vectors.

## Compliance

This decision aligns with:
- FR-103: Agent domain knowledge (RAG)
- NFR-501: Self-hostability requirement
- TC-001: PostgreSQL as primary database

---

*This ADR establishes the knowledge base and RAG architecture for Syndarix.*

docs/adrs/ADR-009-agent-communication-protocol.md (new file, 166 lines)

# ADR-009: Agent Communication Protocol

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-007

---

## Context

Syndarix requires a robust protocol for inter-agent communication. 10+ specialized AI agents must collaborate on software projects, sharing context, delegating tasks, and resolving conflicts.

## Decision Drivers

- **Auditability:** All communication must be traceable
- **Flexibility:** Support various communication patterns
- **Performance:** Low latency for interactive collaboration
- **Reliability:** Messages must not be lost

## Considered Options

### Option 1: Pure Natural Language
Agents communicate via free-form text messages.

**Pros:** Simple, flexible
**Cons:** Difficult to route, parse, and audit

### Option 2: Rigid RPC Protocol
Strongly-typed function calls between agents.

**Pros:** Predictable, type-safe
**Cons:** Loses LLM reasoning flexibility

### Option 3: Structured Envelope + Natural Language Payload (Selected)
JSON envelope for routing/auditing with natural language content.

**Pros:** Best of both worlds: routable and auditable while preserving LLM capabilities
**Cons:** Slightly more complex

## Decision

**Adopt structured message envelopes with natural language payloads**, inspired by Google's A2A protocol concepts.

## Implementation

### Message Schema

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal
from uuid import UUID

@dataclass
class AgentMessage:
    id: UUID                        # Unique message ID
    type: Literal["request", "response", "broadcast", "notification"]

    # Routing
    from_agent: AgentIdentity       # Source agent
    to_agent: AgentIdentity | None  # Target (None = broadcast)
    routing: Literal["direct", "role", "broadcast", "topic"]

    # Action
    action: str                     # e.g., "request_guidance", "task_handoff"
    priority: Literal["low", "normal", "high", "urgent"]

    # Context
    project_id: str
    conversation_id: str | None     # For threading
    correlation_id: UUID | None     # For request/response matching

    # Content
    content: str                    # Natural language message
    attachments: list[Attachment]   # Code snippets, files, etc.

    # Metadata
    created_at: datetime
    expires_at: datetime | None
    requires_response: bool
```

### Routing Strategies

| Strategy | Syntax | Use Case |
|-------------|----------------------|---------------------|
| Direct | `to: "agent-123"` | Specific agent |
| Role-based | `to: "@engineers"` | All agents of role |
| Broadcast | `to: "@all"` | Project-wide |
| Topic-based | `to: "#auth-module"` | Subscribed agents |
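
A sketch of how these strategies could map onto Redis Stream keys (the helper mirrors the channel names used by `subscribe()` below; the `topic` attribute is hypothetical, since the schema above carries topics in the `to` syntax):

```python
def get_channel(message: AgentMessage) -> str:
    """Resolve a message envelope to the Redis Stream it is published on."""
    if message.routing == "direct":
        return f"agent:{message.to_agent.id}"    # one specific agent
    if message.routing == "role":
        return f"role:{message.to_agent.role}"   # e.g. "@engineers"
    if message.routing == "topic":
        return f"topic:{message.topic}"          # hypothetical topic field
    return f"project:{message.project_id}"       # broadcast to the project
```
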
### Communication Modes

```python
from enum import Enum

class MessageMode(str, Enum):
    SYNC = "sync"              # Await response (< 30s)
    ASYNC = "async"            # Queue, callback later
    FIRE_AND_FORGET = "fire"   # No response expected
    STREAM = "stream"          # Continuous updates
```
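
For SYNC mode, a sender can correlate the reply via `correlation_id`. A sketch, under the assumption that the bus keeps a `pending_responses` registry of futures that it resolves when a matching response arrives:

```python
import asyncio
from uuid import uuid4

async def request_sync(bus: AgentMessageBus, message: AgentMessage,
                       timeout: float = 30.0) -> AgentMessage:
    """Send a request and await the matching response (SYNC mode)."""
    message.correlation_id = uuid4()
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    # Hypothetical registry: the bus resolves this future when a response
    # carrying the same correlation_id is consumed.
    bus.pending_responses[message.correlation_id] = future
    await bus.send(message)
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    finally:
        bus.pending_responses.pop(message.correlation_id, None)
```
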
### Message Bus Implementation

```python
class AgentMessageBus:
    """Redis Streams-based message bus for agent communication."""

    async def send(self, message: AgentMessage) -> None:
        # Persist to PostgreSQL for audit
        await self.store.save(message)

        # Publish to Redis for real-time delivery
        channel = self._get_channel(message)
        await self.redis.xadd(channel, message.to_dict())

        # Publish SSE event for UI visibility
        await self.event_bus.publish(
            f"project:{message.project_id}",
            {"type": "agent_message", "preview": message.content[:100]}
        )

    async def subscribe(self, agent_id: str) -> AsyncIterator[AgentMessage]:
        """Subscribe to messages for an agent."""
        agent = await self.registry.get(agent_id)  # resolve role/project for channel names
        channels = [
            f"agent:{agent_id}",            # Direct messages
            f"role:{agent.role}",           # Role-based
            f"project:{agent.project_id}",  # Broadcasts
        ]
        # ... Redis Streams consumer group logic
```

### Context Hierarchy

1. **Conversation Context** (short-term): Current thread, last N exchanges
2. **Session Context** (medium-term): Sprint goals, recent decisions
3. **Project Context** (long-term): Architecture, requirements, knowledge base

### Conflict Resolution

When agents disagree:
1. **Peer Resolution:** Agents attempt consensus (2 attempts)
2. **Supervisor Escalation:** Product Owner or Architect decides
3. **Human Override:** Client approval if configured

## Consequences

### Positive
- Full audit trail of all agent communication
- Flexible routing supports various collaboration patterns
- Natural language preserves LLM reasoning quality
- Real-time UI visibility into agent collaboration

### Negative
- Additional complexity vs simple function calls
- Message persistence storage requirements

### Mitigation
- Archival policy for old messages
- Compression for large attachments

## Compliance

This decision aligns with:
- FR-104: Inter-agent communication
- FR-105: Agent activity monitoring
- NFR-602: Comprehensive audit logging

---

*This ADR establishes the agent communication protocol for Syndarix.*

docs/adrs/ADR-010-workflow-state-machine.md (new file, 189 lines)

# ADR-010: Workflow State Machine Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-008

---

## Context

Syndarix requires durable state machines for orchestrating long-lived workflows that span hours to days:
- Sprint execution (1-2 weeks)
- Story implementation (hours to days)
- PR review cycles (hours)
- Approval flows (variable)

Workflows must survive system restarts, handle failures gracefully, and provide full visibility.

## Decision Drivers

- **Durability:** State must survive crashes and restarts
- **Visibility:** Clear status of all workflows
- **Flexibility:** Support various workflow types
- **Simplicity:** Avoid heavy infrastructure
- **Auditability:** Full history of state transitions

## Considered Options

### Option 1: Temporal.io

**Pros:**
- Durable execution out of the box
- Handles multi-day workflows
- Built-in retries, timeouts, versioning

**Cons:**
- Heavy infrastructure (cluster required)
- Operational burden
- Overkill for Syndarix's scale

### Option 2: Custom + transitions Library (Selected)

**Pros:**
- Lightweight, Pythonic
- PostgreSQL persistence (existing infra)
- Full control over behavior
- Event sourcing for audit trail

**Cons:**
- Manual persistence implementation
- No distributed coordination

### Option 3: Prefect

**Pros:** Good for data pipelines
**Cons:** Wrong abstraction for business workflows

## Decision

**Adopt a custom workflow engine using the `transitions` library** with PostgreSQL persistence and Celery task execution.

This approach provides durability and flexibility without the operational overhead of dedicated workflow engines. At Syndarix's scale (dozens, not thousands, of concurrent workflows), this is the right trade-off.

## Implementation

### State Machine Definition

```python
from transitions import Machine


class StoryWorkflow:
    states = [
        'analysis', 'design', 'implementation',
        'review', 'testing', 'done', 'blocked'
    ]

    def __init__(self, story_id: str):
        self.story_id = story_id
        self.machine = Machine(
            model=self,
            states=self.states,
            initial='analysis'
        )

        # Define transitions
        self.machine.add_transition('design_complete', 'analysis', 'design')
        self.machine.add_transition('start_coding', 'design', 'implementation')
        self.machine.add_transition('submit_pr', 'implementation', 'review')
        self.machine.add_transition('request_changes', 'review', 'implementation')
        self.machine.add_transition('approve', 'review', 'testing')
        self.machine.add_transition('tests_pass', 'testing', 'done')
        self.machine.add_transition('tests_fail', 'testing', 'implementation')
        self.machine.add_transition('block', '*', 'blocked')
        self.machine.add_transition('unblock', 'blocked', 'implementation')
```

### Persistence Schema

```sql
CREATE TABLE workflow_instances (
    id UUID PRIMARY KEY,
    workflow_type VARCHAR(50) NOT NULL,
    current_state VARCHAR(100) NOT NULL,
    entity_id VARCHAR(100) NOT NULL,  -- story_id, sprint_id, etc.
    project_id UUID NOT NULL,
    context JSONB DEFAULT '{}',
    error TEXT,
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

-- Event sourcing table
CREATE TABLE workflow_transitions (
    id UUID PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_instances(id),
    from_state VARCHAR(100) NOT NULL,
    to_state VARCHAR(100) NOT NULL,
    trigger VARCHAR(100) NOT NULL,
    triggered_by VARCHAR(100),  -- agent_id, user_id, or 'system'
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL
);
```
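
Bridging the `transitions` model and this schema needs only a thin persistence layer. A sketch of load/save (the synchronous `db.fetch_one`/`db.execute` helpers are assumptions; `Machine.set_state` restores a state without firing transition callbacks):

```python
class PersistentStoryWorkflow(StoryWorkflow):
    """StoryWorkflow plus the load/save used by the Celery task below."""

    def __init__(self, story_id: str, instance_id: str):
        super().__init__(story_id)
        self.instance_id = instance_id

    @classmethod
    def load(cls, workflow_id: str) -> "PersistentStoryWorkflow":
        row = db.fetch_one(  # hypothetical sync DB helper
            "SELECT entity_id, current_state FROM workflow_instances WHERE id = %s",
            (workflow_id,),
        )
        workflow = cls(story_id=row["entity_id"], instance_id=workflow_id)
        # Restore the persisted state directly, without firing callbacks.
        workflow.machine.set_state(row["current_state"])
        return workflow

    def save(self) -> None:
        db.execute(
            "UPDATE workflow_instances SET current_state = %s, updated_at = NOW() "
            "WHERE id = %s",
            (self.state, self.instance_id),
        )
```
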
### Core Workflows

| Workflow | States | Duration | Approval Points |
|---------------|---------------------------------------------------------------|------------|-------------------|
| **Sprint** | planning → active → review → done | 1-2 weeks | Start, completion |
| **Story** | analysis → design → implementation → review → testing → done | Hours-days | PR merge |
| **PR Review** | submitted → reviewing → changes_requested → approved → merged | Hours | Merge |
| **Approval** | pending → approved/rejected/expired | Variable | N/A |

### Integration with Celery

```python
from transitions.core import MachineError


@celery_app.task(bind=True)
def execute_workflow_step(self, workflow_id: str, trigger: str):
    """Execute a workflow state transition as a Celery task."""
    workflow = WorkflowService.load(workflow_id)

    try:
        # Attempt transition
        workflow.trigger(trigger)
        workflow.save()

        # Publish state change event
        event_bus.publish(f"project:{workflow.project_id}", {
            "type": "workflow_transition",
            "workflow_id": workflow_id,
            "new_state": workflow.state
        })

    except MachineError:
        # Raised by `transitions` when the trigger is invalid for the current state
        logger.warning(f"Invalid transition {trigger} from {workflow.state}")
    except Exception as e:
        workflow.error = str(e)
        workflow.retry_count += 1
        workflow.save()
        raise self.retry(exc=e, countdown=60 * workflow.retry_count)
```

## Consequences

### Positive
- Lightweight, uses existing infrastructure
- Full audit trail via event sourcing
- Easy to understand and modify
- Celery integration for async execution

### Negative
- Manual persistence implementation
- No distributed coordination (single-node)

### Migration Path
If scale requires distributed workflows, migrate to Temporal with the same state machine definitions.

## Compliance

This decision aligns with:
- FR-301-305: Workflow execution requirements
- NFR-402: Fault tolerance
- NFR-602: Audit logging

---

*This ADR establishes the workflow state machine architecture for Syndarix.*

docs/adrs/ADR-011-issue-synchronization.md (new file, 232 lines)

# ADR-011: Issue Synchronization Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-009

---

## Context

Syndarix must synchronize issues bi-directionally with external trackers (Gitea, GitHub, GitLab). Agents create and update issues internally, which must be reflected in the external systems. External changes must flow back to Syndarix.

## Decision Drivers

- **Real-time:** Changes visible within seconds
- **Consistency:** Eventual consistency acceptable
- **Conflict Resolution:** Clear rules when edits conflict
- **Multi-provider:** Support Gitea (primary), GitHub, GitLab
- **Reliability:** Handle network failures gracefully

## Considered Options

### Option 1: Polling Only
Periodically fetch all issues from external trackers.

**Pros:** Simple, reliable
**Cons:** High latency (minutes), API quota waste

### Option 2: Webhooks Only
Rely solely on external webhooks.

**Pros:** Real-time
**Cons:** May miss events during outages

### Option 3: Webhook-First + Polling Fallback (Selected)
Primary: webhooks for real-time. Secondary: polling for reconciliation.

**Pros:** Real-time with reliability
**Cons:** Slightly more complex

## Decision

**Adopt a webhook-first architecture with polling fallback** and Last-Writer-Wins (LWW) conflict resolution.

External trackers are the source of truth. Syndarix maintains local mirrors for unified agent access.

## Implementation

### Sync Architecture

```
External Trackers (Gitea/GitHub/GitLab)
          │
┌─────────┴─────────┐
│     Webhooks      │ (real-time)
└─────────┬─────────┘
          │
┌─────────┴─────────┐
│  Webhook Handler  │ → Redis Queue → Sync Engine
└─────────┬─────────┘
          │
┌─────────┴─────────┐
│  Polling Worker   │ (reconciliation every 15 min)
└─────────┬─────────┘
          │
┌─────────┴─────────┐
│    PostgreSQL     │
│ (issues, sync_log)│
└───────────────────┘
```

### Provider Abstraction

```python
class IssueProvider(ABC):
    """Abstract interface for issue tracker providers."""

    @abstractmethod
    async def get_issue(self, issue_id: str) -> ExternalIssue: ...

    @abstractmethod
    async def list_issues(self, repo: str, since: datetime) -> list[ExternalIssue]: ...

    @abstractmethod
    async def create_issue(self, repo: str, issue: IssueCreate) -> ExternalIssue: ...

    @abstractmethod
    async def update_issue(self, issue_id: str, issue: IssueUpdate) -> ExternalIssue: ...

    @abstractmethod
    def parse_webhook(self, payload: dict) -> WebhookEvent: ...


# Provider implementations
class GiteaProvider(IssueProvider): ...
class GitHubProvider(IssueProvider): ...
class GitLabProvider(IssueProvider): ...
```

### Conflict Resolution

| Scenario | Resolution |
|----------------------------------|----------------------------|
| Same field, different timestamps | Last-Writer-Wins (LWW) |
| Same field, concurrent edits | Mark conflict, notify user |
| Different fields modified | Merge both changes |
| Delete vs Update | Delete wins (configurable) |

### Database Schema

```sql
CREATE TABLE issues (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL,
    external_id VARCHAR(100),
    external_provider VARCHAR(50),  -- 'gitea', 'github', 'gitlab'
    external_url VARCHAR(500),

    -- Canonical fields
    title VARCHAR(500) NOT NULL,
    body TEXT,
    state VARCHAR(50) NOT NULL,
    labels JSONB DEFAULT '[]',
    assignees JSONB DEFAULT '[]',

    -- Sync metadata
    external_updated_at TIMESTAMPTZ,
    local_updated_at TIMESTAMPTZ,
    sync_status VARCHAR(50) DEFAULT 'synced',
    sync_conflict JSONB,

    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE issue_sync_log (
    id UUID PRIMARY KEY,
    issue_id UUID NOT NULL,
    direction VARCHAR(10) NOT NULL,  -- 'inbound', 'outbound'
    action VARCHAR(50) NOT NULL,     -- 'create', 'update', 'delete'
    before_state JSONB,
    after_state JSONB,
    provider VARCHAR(50) NOT NULL,
    sync_time TIMESTAMPTZ NOT NULL
);
```

### Webhook Handler

```python
@router.post("/webhooks/{provider}")
async def handle_webhook(provider: str, request: Request):
    """Handle incoming webhooks from issue trackers."""
    payload = await request.json()

    # Validate signature
    provider_impl = get_provider(provider)
    if not provider_impl.verify_signature(request.headers, payload):
        raise HTTPException(401, "Invalid signature")

    # Queue for processing (deduplication in Redis)
    event = provider_impl.parse_webhook(payload)
    await redis.xadd(
        f"sync:webhooks:{provider}",
        {"event": event.json()},
        id="*",
        maxlen=10000
    )

    return {"status": "queued"}
```
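
The polling side reconciles anything webhooks missed. A sketch of the 15-minute reconciliation task (the db helper names are assumptions; the LWW comparison uses the sync metadata columns from the schema above):

```python
import asyncio

@celery_app.task
def reconcile_issues():
    """Polling fallback: reconcile local mirrors against each provider."""
    asyncio.run(_reconcile())

async def _reconcile():
    for repo in await db.list_synced_repos():  # hypothetical helper
        provider = get_provider(repo.provider)
        for external in await provider.list_issues(repo.name, since=repo.last_reconciled_at):
            local = await db.get_issue_by_external_id(external.id)
            if local is None:
                await db.create_issue_from_external(external)
            elif external.updated_at > local.external_updated_at:
                # Last-Writer-Wins: the external tracker is the source of truth.
                await db.apply_external_update(local.id, external)
        await db.mark_reconciled(repo.id)
```
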
### Outbox Pattern for Outbound Sync

```python
import asyncio
import json
from datetime import datetime, timedelta


class SyncOutbox:
    """Reliable outbound sync with retry."""

    async def queue_update(self, issue_id: str, changes: dict):
        await db.execute("""
            INSERT INTO sync_outbox (issue_id, changes, status, created_at)
            VALUES ($1, $2, 'pending', NOW())
        """, issue_id, json.dumps(changes))


@celery_app.task
def process_sync_outbox():
    """Process pending outbound syncs with exponential backoff."""
    # Celery tasks are synchronous entry points; run the async logic inside.
    asyncio.run(_process_pending())


async def _process_pending():
    pending = await db.query(
        "SELECT * FROM sync_outbox WHERE status = 'pending' LIMIT 100"
    )

    for item in pending:
        try:
            issue = await db.get_issue(item.issue_id)
            provider = get_provider(issue.external_provider)
            await provider.update_issue(issue.external_id, item.changes)

            item.status = 'completed'
        except Exception:
            item.retry_count += 1
            item.next_retry = datetime.now() + timedelta(minutes=2 ** item.retry_count)
            if item.retry_count > 5:
                item.status = 'failed'
```

## Consequences

### Positive
- Real-time sync via webhooks
- Reliable reconciliation via polling
- Clear conflict resolution rules
- Provider-agnostic design

### Negative
- Eventual consistency (brief inconsistency windows)
- Webhook infrastructure required

### Mitigation
- Manual refresh available in UI
- Conflict notification alerts users

## Compliance

This decision aligns with:
- FR-201-205: Issue tracking integration
- NFR-201: Multi-provider support

---

*This ADR establishes the issue synchronization architecture for Syndarix.*

docs/adrs/ADR-012-cost-tracking.md (new file, 199 lines)

# ADR-012: Cost Tracking and Budget Management

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-010

---

## Context

Syndarix agents make potentially expensive LLM API calls. Without proper cost tracking and budget enforcement, projects could incur unexpected charges. We need:
- Real-time cost visibility
- Per-project budget enforcement
- Cost optimization strategies
- Historical analytics

## Decision Drivers

- **Visibility:** Real-time cost tracking per agent/project
- **Control:** Budget enforcement with soft/hard limits
- **Optimization:** Identify and reduce unnecessary costs
- **Attribution:** Clear cost allocation for billing

## Decision

**Implement multi-layered cost tracking** using:
1. **LiteLLM Callbacks** for real-time usage capture
2. **Redis** for budget enforcement
3. **PostgreSQL** for persistent analytics
4. **SSE Events** for dashboard updates

## Implementation

### Cost Attribution Hierarchy

```
Organization (Billing Entity)
└── Project (Cost Center)
    └── Sprint (Time-bounded Budget)
        └── Agent Instance (Worker)
            └── LLM Request (Atomic Cost Unit)
```

### LiteLLM Callback

```python
from datetime import datetime

from litellm.integrations.custom_logger import CustomLogger


class SyndarixCostLogger(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        agent_id = kwargs.get("metadata", {}).get("agent_id")
        project_id = kwargs.get("metadata", {}).get("project_id")
        model = kwargs.get("model")
        cost = kwargs.get("response_cost", 0)
        usage = response_obj.usage

        # Real-time budget check (Redis)
        await self.budget_service.increment(
            project_id=project_id,
            cost=cost,
            tokens=usage.total_tokens
        )

        # Persistent record (async queue to PostgreSQL)
        await self.usage_queue.enqueue({
            "agent_id": agent_id,
            "project_id": project_id,
            "model": model,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "cost_usd": cost,
            "timestamp": datetime.utcnow()
        })

        # Check budget status
        budget_status = await self.budget_service.check_status(project_id)
        if budget_status == "exceeded":
            await self.notify_budget_exceeded(project_id)
```
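
The `budget_service.increment` call above can be a pair of Redis counters whose keys match `check_budget` below. A sketch (the pipeline usage is standard redis-py; the midnight-TTL helper is an assumption):

```python
class BudgetService:
    async def increment(self, project_id: str, cost: float, tokens: int) -> None:
        """Accumulate spend in Redis counters keyed by period."""
        daily_key = f"cost:{project_id}:daily"
        pipe = self.redis.pipeline()
        pipe.incrbyfloat(daily_key, cost)
        pipe.incrby(f"tokens:{project_id}:daily", tokens)
        # Expire at the end of the UTC day so counters reset automatically
        # (_seconds_until_midnight_utc is a hypothetical helper).
        pipe.expire(daily_key, self._seconds_until_midnight_utc())
        await pipe.execute()
```
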
### Budget Enforcement

```python
class BudgetService:
    async def check_budget(self, project_id: str) -> BudgetStatus:
        """Check current budget status."""
        budget = await self.get_budget(project_id)
        usage = float(await self.redis.get(f"cost:{project_id}:daily") or 0)

        percentage = (usage / budget.daily_limit) * 100

        if percentage >= 100 and budget.enforcement == "hard":
            return BudgetStatus.BLOCKED
        elif percentage >= 100:
            return BudgetStatus.EXCEEDED
        elif percentage >= 80:
            return BudgetStatus.WARNING
        elif percentage >= 50:
            return BudgetStatus.APPROACHING
        else:
            return BudgetStatus.OK

    async def enforce(self, project_id: str) -> bool:
        """Returns True if the request should proceed."""
        status = await self.check_budget(project_id)

        if status == BudgetStatus.BLOCKED:
            raise BudgetExceededException(project_id)

        if status in [BudgetStatus.EXCEEDED, BudgetStatus.WARNING]:
            # Auto-downgrade to a cheaper model
            await self.set_model_override(project_id, "cost-optimized")

        return True
```

### Database Schema

```sql
CREATE TABLE token_usage (
    id UUID PRIMARY KEY,
    agent_id UUID,
    project_id UUID NOT NULL,
    model VARCHAR(100) NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    cost_usd DECIMAL(10, 6) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL
);

CREATE TABLE project_budgets (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL UNIQUE,
    daily_limit_usd DECIMAL(10, 2) DEFAULT 50.00,
    weekly_limit_usd DECIMAL(10, 2) DEFAULT 250.00,
    monthly_limit_usd DECIMAL(10, 2) DEFAULT 1000.00,
    enforcement VARCHAR(20) DEFAULT 'soft',  -- 'soft', 'hard'
    alert_thresholds JSONB DEFAULT '[50, 80, 100]'
);

-- Materialized view for analytics
CREATE MATERIALIZED VIEW daily_cost_summary AS
SELECT
    project_id,
    DATE(timestamp) AS date,
    SUM(cost_usd) AS total_cost,
    SUM(total_tokens) AS total_tokens,
    COUNT(*) AS request_count
FROM token_usage
GROUP BY project_id, DATE(timestamp);
```

### Model Pricing

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|-------------------|---------------------|----------------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Ollama (local) | $0.00 | $0.00 |

### Cost Optimization Strategies

| Strategy | Savings | Implementation |
|--------------------|---------------|--------------------------------------|
| Semantic caching | 15-30% | Redis cache for repeated queries |
| Model cascading | 60-80% | Start with Haiku, escalate to Sonnet |
| Prompt compression | 10-20% | Remove redundant context |
| Local fallback | 100% for some | Ollama for simple tasks |

## Consequences

### Positive
- Complete cost visibility at all levels
- Automatic budget enforcement
- Cost optimization reduces spend significantly
- Real-time dashboard updates

### Negative
- Redis dependency for real-time tracking
- Additional complexity in the LLM gateway

### Mitigation
- Redis already required for other features
- Clear separation of concerns in the cost tracking module

## Compliance

This decision aligns with:
- FR-401: Cost tracking per agent/project
- FR-402: Budget enforcement
- NFR-302: Budget alert system

---

*This ADR establishes the cost tracking and budget management architecture for Syndarix.*

docs/adrs/ADR-013-audit-logging.md (new file, 228 lines)

# ADR-013: Audit Logging Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-011

---

## Context

As an autonomous AI-powered system, Syndarix requires comprehensive audit logging for:
- Compliance (SOC2, GDPR)
- Debugging agent behavior
- Client trust and transparency
- Security investigation

Every action taken by agents must be traceable and tamper-evident.

## Decision Drivers

- **Completeness:** Log all significant events
- **Immutability:** Tamper-evident audit trail
- **Queryability:** Fast search and filtering
- **Scalability:** Handle high event volumes
- **Retention:** Configurable retention policies

## Decision

**Implement structured audit logging** using:
- **Structlog** for JSON event formatting
- **PostgreSQL** for hot storage (0-90 days)
- **S3-compatible storage** for cold archival
- **Cryptographic hash chaining** for immutability
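
On the structlog side, a minimal JSON configuration might look like this (a sketch; the processor chain is an assumption, not the shipped config):

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,                     # severity field
        structlog.processors.TimeStamper(fmt="iso", utc=True),  # ISO timestamps
        structlog.processors.JSONRenderer(),                    # one JSON object per event
    ],
)

log = structlog.get_logger()
log.info("agent.action.completed", project_id="proj-123", agent_id="eng-001")
```
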
## Implementation

### Event Categories

| Category | Event Types |
|--------------|------------------------------------------------------------------|
| **Agent** | spawned, action_started, action_completed, decision, terminated |
| **LLM** | request, response, error, tool_call |
| **MCP** | tool_invoked, tool_result, tool_error |
| **Approval** | requested, granted, rejected, timeout |
| **Git** | commit, branch_created, pr_created, pr_merged |
| **Project** | created, sprint_started, milestone_completed |

### Event Schema

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class AuditEvent(BaseModel):
    # Identity
    event_id: str                # UUID v7 (time-ordered)
    trace_id: str | None         # OpenTelemetry correlation
    parent_event_id: str | None  # Event chain

    # Timestamp
    timestamp: datetime
    timestamp_unix_ms: int

    # Classification
    event_type: str              # e.g., "agent.action.completed"
    event_category: str          # e.g., "agent"
    severity: Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

    # Context
    project_id: str | None
    agent_id: str | None
    user_id: str | None

    # Content
    action: str                  # Human-readable description
    data: dict                   # Event-specific payload
    before_state: dict | None
    after_state: dict | None

    # Immutability
    previous_hash: str | None    # Hash of previous event
    event_hash: str              # SHA-256 of this event
```

### Hash Chain Implementation

```python
import hashlib
import json


class AuditLogger:
    def __init__(self):
        # On startup this should be seeded from the most recent stored
        # event's hash; the sketch keeps the chain head in memory only.
        self._last_hash: str | None = None

    async def log(self, event: AuditEvent) -> None:
        # Set hash chain
        event.previous_hash = self._last_hash
        event.event_hash = self._compute_hash(event)
        self._last_hash = event.event_hash

        # Persist
        await self._store(event)

    def _compute_hash(self, event: AuditEvent) -> str:
        payload = json.dumps({
            "event_id": event.event_id,
            "timestamp_unix_ms": event.timestamp_unix_ms,
            "event_type": event.event_type,
            "data": event.data,
            "previous_hash": event.previous_hash
        }, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    async def verify_chain(self, events: list[AuditEvent]) -> bool:
        """Verify audit trail integrity."""
        for i, event in enumerate(events):
            expected_hash = self._compute_hash(event)
            if expected_hash != event.event_hash:
                return False
            if i > 0 and event.previous_hash != events[i-1].event_hash:
                return False
        return True
```

### Database Schema

```sql
CREATE TABLE audit_events (
    event_id VARCHAR(36) PRIMARY KEY,
    trace_id VARCHAR(36),
    parent_event_id VARCHAR(36),

    timestamp TIMESTAMPTZ NOT NULL,
    timestamp_unix_ms BIGINT NOT NULL,

    event_type VARCHAR(100) NOT NULL,
    event_category VARCHAR(50) NOT NULL,
    severity VARCHAR(20) NOT NULL,

    project_id UUID,
    agent_id UUID,
    user_id UUID,

    action TEXT NOT NULL,
    data JSONB NOT NULL,
    before_state JSONB,
    after_state JSONB,

    previous_hash VARCHAR(64),
    event_hash VARCHAR(64) NOT NULL
);

-- Indexes for common queries
CREATE INDEX idx_audit_timestamp ON audit_events (timestamp DESC);
CREATE INDEX idx_audit_project ON audit_events (project_id, timestamp DESC);
CREATE INDEX idx_audit_agent ON audit_events (agent_id, timestamp DESC);
CREATE INDEX idx_audit_type ON audit_events (event_type, timestamp DESC);
```

### Storage Tiers

| Tier | Storage | Retention | Query Speed |
|------|------------|-----------|-------------|
| Hot | PostgreSQL | 0-90 days | Fast |
| Cold | S3/MinIO | 90+ days | Slow |

### Archival Process

```python
@celery_app.task
def archive_old_events():
    """Move events older than 90 days to cold storage."""
    cutoff = datetime.utcnow() - timedelta(days=90)

    # Export to S3 in daily batches
    events = db.query("""
        SELECT * FROM audit_events
        WHERE timestamp < $1
        ORDER BY timestamp
    """, cutoff)

    for date, batch in group_by_date(events):
        s3.put_object(
            Bucket="syndarix-audit",
            Key=f"audit/{date.isoformat()}.jsonl.gz",
            Body=gzip.compress(batch.to_jsonl())
        )

    # Delete from PostgreSQL
    db.execute("DELETE FROM audit_events WHERE timestamp < $1", cutoff)
```

### Audit Viewer API

```python
@router.get("/projects/{project_id}/audit")
async def get_audit_trail(
    project_id: str,
    event_type: str | None = None,
    agent_id: str | None = None,
    start_time: datetime | None = None,
    end_time: datetime | None = None,
    limit: int = 100
) -> list[AuditEvent]:
    """Query audit trail with filters."""
    ...
```

## Consequences

### Positive
- Complete audit trail of all agent actions
- Tamper-evident through hash chaining
- Fast queries for recent events
- Cost-effective long-term storage

### Negative
- Storage requirements grow with activity
- Hash chain verification adds complexity

### Mitigation
- Tiered storage with archival
- Batch verification for chain integrity

## Compliance

This decision aligns with:
- NFR-602: Comprehensive audit logging
- Compliance: SOC2, GDPR requirements

---

*This ADR establishes the audit logging architecture for Syndarix.*

docs/adrs/ADR-014-client-approval-flow.md (new file, 280 lines)

# ADR-014: Client Approval Flow Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-012

---

## Context

Syndarix supports configurable autonomy levels. Depending on the level, agents may require client approval before proceeding with certain actions. We need a flexible approval system that:
- Respects autonomy level configuration
- Provides clear approval UX
- Handles timeouts gracefully
- Supports mobile-friendly approvals

## Decision Drivers

- **Configurability:** Per-project autonomy settings
- **Usability:** Easy approve/reject with context
- **Reliability:** Approvals must not be lost
- **Flexibility:** Support batch and individual approvals
- **Responsiveness:** Real-time notifications

## Decision

**Implement a checkpoint-based approval system** with:
- Queue-based approval management
- Confidence-aware routing
- Multi-channel notifications (SSE, email, mobile push)
- Configurable timeout and escalation policies

## Implementation

### Autonomy Levels

| Level | Description | Approval Required |
|------------------|----------------------------------|-----------------------------------|
| **FULL_CONTROL** | Approve every significant action | All actions |
| **MILESTONE** | Approve at sprint boundaries | Sprint start/end, major decisions |
| **AUTONOMOUS** | Only critical decisions | Budget, production, architecture |

### Approval Categories

```python
from enum import Enum

class ApprovalCategory(str, Enum):
    CRITICAL = "critical"        # Always require approval
    MILESTONE = "milestone"      # MILESTONE and FULL_CONTROL
    ROUTINE = "routine"          # FULL_CONTROL only
    UNCERTAINTY = "uncertainty"  # Low confidence decisions
    EXPERTISE = "expertise"      # Agent requests human input
```

### Approval Matrix

| Action | FULL_CONTROL | MILESTONE | AUTONOMOUS |
|----------------------------|--------------|-----------|------------|
| Requirements approval | Required | Required | Required |
| Architecture decisions | Required | Required | Required |
| Sprint start | Required | Required | Auto |
| Story implementation | Required | Auto | Auto |
| PR merge | Required | Auto | Auto |
| Sprint completion | Required | Required | Auto |
| Budget threshold exceeded | Required | Required | Required |
| Production deployment | Required | Required | Required |
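
This matrix translates directly into the `_needs_approval` check used by the service below (a sketch; the mapping encodes the table above, and the constant name is an assumption):

```python
# Lowest autonomy level at which each category stops requiring approval.
AUTO_APPROVED_AT = {
    ApprovalCategory.CRITICAL: None,           # never auto-approved
    ApprovalCategory.MILESTONE: "AUTONOMOUS",  # auto only in AUTONOMOUS mode
    ApprovalCategory.ROUTINE: "MILESTONE",     # auto in MILESTONE and AUTONOMOUS
}

def _needs_approval(autonomy_level: str, category: ApprovalCategory) -> bool:
    """Return True if this category requires human approval at this level."""
    if autonomy_level == "FULL_CONTROL":
        return True  # every significant action is approved by hand
    threshold = AUTO_APPROVED_AT.get(category)
    if threshold is None:
        return True  # CRITICAL (and UNCERTAINTY/EXPERTISE) always escalate
    order = ["FULL_CONTROL", "MILESTONE", "AUTONOMOUS"]
    return order.index(autonomy_level) < order.index(threshold)
```
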
### Database Schema

```sql
CREATE TABLE approval_requests (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL,

    -- What needs approval
    category VARCHAR(50) NOT NULL,
    action_type VARCHAR(100) NOT NULL,
    title VARCHAR(500) NOT NULL,
    description TEXT,
    context JSONB NOT NULL,

    -- Who requested
    requested_by_agent_id UUID,
    requested_at TIMESTAMPTZ NOT NULL,

    -- Status
    status VARCHAR(50) DEFAULT 'pending',  -- pending, approved, rejected, expired
    decided_by_user_id UUID,
    decided_at TIMESTAMPTZ,
    decision_comment TEXT,

    -- Timeout handling
    expires_at TIMESTAMPTZ,
    escalation_policy JSONB,

    -- AI context
    confidence_score FLOAT,
    ai_recommendation VARCHAR(50),
    reasoning TEXT
);
```

### Approval Service

```python
class ApprovalService:
    async def request_approval(
        self,
        project_id: str,
        action_type: str,
        category: ApprovalCategory,
        context: dict,
        requested_by: str,
        confidence: float | None = None,
        ai_recommendation: str | None = None
    ) -> ApprovalRequest:
        """Create an approval request and notify stakeholders."""

        project = await self.get_project(project_id)

        # Check if approval is needed based on the autonomy level
        if not self._needs_approval(project.autonomy_level, category):
            return ApprovalRequest(status="auto_approved")

        # Create request
        request = ApprovalRequest(
            project_id=project_id,
            category=category,
            action_type=action_type,
            context=context,
            requested_by_agent_id=requested_by,
            confidence_score=confidence,
            ai_recommendation=ai_recommendation,
            expires_at=datetime.utcnow() + self._get_timeout(category)
        )
        await self.db.add(request)

        # Send notifications
        await self._notify_approvers(project, request)

        return request

    async def await_decision(
        self,
        request_id: str,
        timeout: timedelta = timedelta(hours=24)
    ) -> ApprovalDecision:
        """Wait for an approval decision (used in workflows)."""
        deadline = datetime.utcnow() + timeout

        while datetime.utcnow() < deadline:
            request = await self.get_request(request_id)

            if request.status == "approved":
                return ApprovalDecision.APPROVED
            elif request.status == "rejected":
                return ApprovalDecision.REJECTED
            elif request.status == "expired":
                return await self._handle_expiration(request)

            await asyncio.sleep(5)

        return await self._handle_timeout(await self.get_request(request_id))

    async def _handle_timeout(self, request: ApprovalRequest) -> ApprovalDecision:
        """Handle approval timeout based on the escalation policy."""
        policy = request.escalation_policy or {"action": "block"}

        if policy["action"] == "auto_approve":
            request.status = "auto_approved"
            return ApprovalDecision.APPROVED
        elif policy["action"] == "escalate":
            await self._escalate(request, policy["escalate_to"])
            return await self.await_decision(request.id, timedelta(hours=24))
        else:  # block
            request.status = "expired"
            return ApprovalDecision.BLOCKED
```

### Notification Channels

```python
class ApprovalNotifier:
    async def notify(self, project: Project, request: ApprovalRequest):
        # SSE for real-time dashboard
        await self.event_bus.publish(f"project:{project.id}", {
            "type": "approval_required",
            "request_id": str(request.id),
            "title": request.title,
            "category": request.category
        })

        # Email for async notification
        await self.email_service.send_approval_request(
            to=project.owner.email,
            request=request
        )

        # Mobile push if configured
        if project.push_enabled:
            await self.push_service.send(
                user_id=project.owner_id,
                title="Approval Required",
                body=request.title,
                data={"request_id": str(request.id)}
            )
```

### Batch Approval UI

For FULL_CONTROL mode with many routine approvals:

```
┌─────────────────────────────────────────────────────────┐
│ APPROVAL QUEUE (12 pending)                             │
├─────────────────────────────────────────────────────────┤
│ ☑ PR #45: Add user authentication  [ROUTINE]    2h ago  │
│ ☑ PR #46: Fix login validation     [ROUTINE]    2h ago  │
│ ☑ PR #47: Update dependencies      [ROUTINE]    1h ago  │
│ ☐ Sprint 4 Start                   [MILESTONE]  30m     │
│ ☐ Production Deploy v1.2           [CRITICAL]   15m     │
├─────────────────────────────────────────────────────────┤
│ [Approve Selected (3)] [Reject Selected] [Review All]   │
└─────────────────────────────────────────────────────────┘
```

### Decision Context Display

```python
class ApprovalContextBuilder:
    def build_context(self, request: ApprovalRequest) -> ApprovalContext:
        """Build rich context for an approval decision."""
        return ApprovalContext(
            summary=request.title,
            description=request.description,

            # What the AI recommends
            ai_recommendation=request.ai_recommendation,
            confidence=request.confidence_score,
            reasoning=request.reasoning,

            # Impact assessment
            affected_files=request.context.get("files", []),
            estimated_impact=request.context.get("impact", "unknown"),

            # Agent info
            requesting_agent=self._get_agent_info(request.requested_by_agent_id),

            # Quick actions
            approve_url=f"/api/approvals/{request.id}/approve",
            reject_url=f"/api/approvals/{request.id}/reject"
        )
```

## Consequences

### Positive
- Flexible autonomy levels support various client preferences
- Real-time notifications ensure timely responses
- Batch approval reduces friction in FULL_CONTROL mode
- AI confidence routing escalates appropriately

### Negative
- Approval latency can slow autonomous workflows
- Complex state management for pending approvals

### Mitigation
- Encourage MILESTONE mode for efficiency
- Configurable timeouts with auto-approve options
- Mobile notifications for quick responses

## Compliance

This decision aligns with:
- FR-601-605: Human-in-the-loop requirements
- FR-102: Autonomy level configuration

---

*This ADR establishes the client approval flow architecture for Syndarix.*

docs/architecture/ARCHITECTURE.md (new file, 425 lines)

# Syndarix Architecture
|
||||
|
||||
**Version:** 1.0
|
||||
**Date:** 2025-12-29
|
||||
**Status:** Approved
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Syndarix is an autonomous AI-powered software consulting platform that orchestrates specialized AI agents to deliver complete software solutions. This document describes the chosen architecture, key decisions, and component interactions.
|
||||
|
||||
### Core Principles
|
||||
|
||||
1. **Self-Hostable First:** All components are fully self-hostable with permissive licenses (MIT/BSD)
|
||||
2. **Production-Ready:** Use battle-tested technologies, not experimental frameworks
|
||||
3. **Hybrid Architecture:** Combine best-in-class tools rather than monolithic frameworks
|
||||
4. **Auditability:** Every agent action is logged and traceable
|
||||
5. **Human-in-the-Loop:** Configurable autonomy with approval checkpoints
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────────┐
|
||||
│ SYNDARIX PLATFORM │
|
||||
├─────────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ FRONTEND (Next.js 16) │ │
|
||||
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
|
||||
│ │ │ Dashboard │ │ Project │ │ Agent │ │ Approval │ │ │
|
||||
│ │ │ Pages │ │ Views │ │ Monitor │ │ Queue │ │ │
|
||||
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
|
||||
│ └──────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ REST + SSE + WebSocket │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ BACKEND (FastAPI) │ │
|
||||
│ │ │ │
|
||||
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
|
||||
│ │ │ ORCHESTRATION LAYER │ │ │
|
||||
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │
|
||||
│ │ │ │ Agent │ │ Workflow │ │ Approval │ │ │ │
|
||||
│ │ │ │ Orchestrator│ │ Engine │ │ Service │ │ │ │
|
||||
│ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │
|
||||
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
|
||||
│ │ │ │
|
||||
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
|
||||
│ │ │ INTEGRATION LAYER │ │ │
|
||||
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │
|
||||
│ │ │ │ LLM Gateway │ │ MCP Client │ │ Event │ │ │ │
|
||||
│ │ │ │ (LiteLLM) │ │ Manager │ │ Bus │ │ │ │
|
||||
│ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │
|
||||
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
|
||||
│ └──────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ ┌───────────────────────────┼───────────────────────────┐ │
|
||||
│ ▼ ▼ ▼ │
|
||||
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
|
||||
│ │ PostgreSQL │ │ Redis │ │ Celery Workers│ │
|
||||
│ │ + pgvector │ │ (Cache/Queue) │ │ (Background) │ │
|
||||
│ └────────────────┘ └────────────────┘ └────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ MCP SERVERS │ │
|
||||
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
|
||||
│ │ │ LLM │ │Knowledge │ │ Git │ │ Issues │ │ File │ │ │
|
||||
│ │ │ Gateway │ │ Base │ │ MCP │ │ MCP │ │ System │ │ │
|
||||
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
|
||||
│ └──────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Architecture Decisions
|
||||
|
||||
### ADR Summary Matrix
|
||||
|
||||
| ADR | Decision | Key Technology |
|
||||
|-----|----------|----------------|
|
||||
| ADR-001 | MCP Integration | FastMCP 2.0, Unified Singletons |
|
||||
| ADR-002 | Real-time Communication | SSE primary, WebSocket for chat |
|
||||
| ADR-003 | Background Tasks | Celery + Redis |
|
||||
| ADR-004 | LLM Provider | LiteLLM with failover |
|
||||
| ADR-005 | Tech Stack | PragmaStack + extensions |
|
||||
| ADR-006 | Agent Orchestration | Type-Instance pattern |
|
||||
| ADR-007 | Framework Selection | Hybrid (LangGraph + custom) |
|
||||
| ADR-008 | Knowledge Base | pgvector for RAG |
|
||||
| ADR-009 | Agent Communication | Structured messages + Redis Streams |
|
||||
| ADR-010 | Workflows | transitions + PostgreSQL + Celery |
|
||||
| ADR-011 | Issue Sync | Webhook-first + polling fallback |
|
||||
| ADR-012 | Cost Tracking | LiteLLM callbacks + Redis budgets |
|
||||
| ADR-013 | Audit Logging | Structlog + hash chaining |
|
||||
| ADR-014 | Client Approval | Checkpoint-based + notifications |
|
||||
|
||||
---
|
||||
|
||||
## Component Deep Dives
|
||||
|
||||
### 1. Agent Orchestration
|
||||
|
||||
**Pattern:** Type-Instance
|
||||
|
||||
- **Agent Types:** Templates defining model, expertise, personality, capabilities
- **Agent Instances:** Runtime instances spawned from types, assigned to projects
- **Orchestrator:** Manages lifecycle, routing, and resource tracking

```
Agent Type (Template)              Agent Instance (Runtime)
┌─────────────────────┐            ┌─────────────────────┐
│ name: "Engineer"    │───spawn───▶│ id: "eng-001"       │
│ model: "sonnet"     │            │ name: "Dave"        │
│ expertise: [py, js] │            │ project: "proj-123" │
│ capabilities: [...] │            │ context: {...}      │
└─────────────────────┘            │ status: ACTIVE      │
                                   └─────────────────────┘
```
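
To make the pattern concrete, here is a minimal Python sketch. The names (`AgentType`, `AgentInstance`, `spawn`) and field shapes are illustrative assumptions for this document, not the actual Syndarix API:

```
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class InstanceStatus(str, Enum):
    ACTIVE = "active"
    IDLE = "idle"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class AgentType:
    """Template: defines model, expertise, and capabilities."""
    name: str
    model: str
    expertise: list[str]
    capabilities: list[str]


@dataclass
class AgentInstance:
    """Runtime instance spawned from a type and bound to one project."""
    type: AgentType
    id: str
    name: str
    project_id: str
    context: dict = field(default_factory=dict)
    status: InstanceStatus = InstanceStatus.ACTIVE


def spawn(agent_type: AgentType, name: str, project_id: str) -> AgentInstance:
    """Orchestrator hook: create a runtime instance from a template."""
    return AgentInstance(type=agent_type, id=f"agent-{uuid4().hex[:6]}",
                         name=name, project_id=project_id)


engineer = AgentType("Engineer", "sonnet", ["py", "js"], ["code", "review"])
dave = spawn(engineer, name="Dave", project_id="proj-123")
```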

### 2. LLM Gateway (LiteLLM)

**Failover Chain:**

```
Claude 3.5 Sonnet (Primary)
            │
            ▼ (on failure)
GPT-4 Turbo (Fallback)
            │
            ▼ (on failure)
Ollama/Llama 3 (Local)
```

**Model Groups:**

| Group | Use Case | Primary Model |
|-------|----------|---------------|
| high-reasoning | Architecture, complex analysis | Claude 3.5 Sonnet |
| fast-response | Quick tasks, status updates | Claude 3 Haiku |
| cost-optimized | High-volume, non-critical | Local Llama 3 |
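
A minimal sketch of how the failover chain and a model group could be wired with LiteLLM's `Router`. The deployment names and model identifiers are illustrative assumptions, not the project's actual configuration (API keys are read from the environment):

```
from litellm import Router

# Map logical group names to concrete deployments; model IDs below
# are illustrative -- substitute the versions you actually run.
router = Router(
    model_list=[
        {"model_name": "high-reasoning",
         "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"}},
        {"model_name": "gpt4-fallback",
         "litellm_params": {"model": "openai/gpt-4-turbo"}},
        {"model_name": "local-llama",
         "litellm_params": {"model": "ollama/llama3",
                            "api_base": "http://localhost:11434"}},
    ],
    # On failure, walk the chain: Sonnet -> GPT-4 Turbo -> local Llama 3.
    fallbacks=[{"high-reasoning": ["gpt4-fallback", "local-llama"]}],
)

response = router.completion(
    model="high-reasoning",
    messages=[{"role": "user", "content": "Summarize ADR-004."}],
)
```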

### 3. Knowledge Base (RAG)

**Stack:** pgvector + LiteLLM embeddings

**Chunking Strategy:**

| Content | Strategy | Model |
|---------|----------|-------|
| Code | AST-based (function/class) | voyage-code-3 |
| Docs | Heading-based | text-embedding-3-small |
| Conversations | Turn-based | text-embedding-3-small |

**Search:** Hybrid (70% vector + 30% keyword)
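
A minimal sketch of the 70/30 blend as a single pgvector query. The `kb_chunks` table and its columns (`content_tsv` tsvector, `embedding` vector) are assumptions for illustration:

```
from sqlalchemy import text

# 0.7 * vector similarity + 0.3 * keyword rank, scoped to one project.
HYBRID_SEARCH = text("""
    SELECT id, content,
           0.7 * (1 - (embedding <=> :query_embedding))
         + 0.3 * ts_rank(content_tsv, plainto_tsquery('english', :query))
           AS score
    FROM kb_chunks
    WHERE project_id = :project_id
    ORDER BY score DESC
    LIMIT :k
""")


def hybrid_search(session, project_id, query, query_embedding, k=10):
    return session.execute(HYBRID_SEARCH, {
        "project_id": project_id,
        "query": query,
        "query_embedding": str(query_embedding),  # pgvector accepts '[...]' text
        "k": k,
    }).fetchall()
```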

### 4. Workflow Engine

**Stack:** transitions library + PostgreSQL + Celery

**Core Workflows:**
- **Sprint Workflow:** planning → active → review → done
- **Story Workflow:** analysis → design → implementation → review → testing → done
- **PR Workflow:** submitted → reviewing → changes_requested → approved → merged

**Durability:** Event sourcing with state persistence to PostgreSQL
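
A minimal sketch of the story workflow using the `transitions` library; the trigger names and the persistence hook are illustrative assumptions:

```
from transitions import Machine


class Story:
    """Workflow host object; transitions attaches trigger methods to it."""

    def on_enter_review(self):
        # Illustrative hook: persist the new state / emit an event here.
        pass


story = Story()
machine = Machine(
    model=story,
    states=["analysis", "design", "implementation", "review", "testing", "done"],
    initial="analysis",
    transitions=[
        {"trigger": "advance", "source": "analysis", "dest": "design"},
        {"trigger": "advance", "source": "design", "dest": "implementation"},
        {"trigger": "advance", "source": "implementation", "dest": "review"},
        {"trigger": "advance", "source": "review", "dest": "testing"},
        {"trigger": "advance", "source": "testing", "dest": "done"},
        # Review can also send the story back for rework.
        {"trigger": "request_changes", "source": "review", "dest": "implementation"},
    ],
)

story.advance()   # analysis -> design
story.advance()   # design -> implementation
story.advance()   # implementation -> review (fires on_enter_review)
```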

### 5. Real-time Communication

**SSE (90% of use cases):**
- Agent activity streams
- Project progress updates
- Approval notifications
- Issue change notifications

**WebSocket (10% - bidirectional):**
- Interactive chat with agents
- Real-time debugging

**Event Bus:** Redis Pub/Sub for cross-instance distribution
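
A minimal sketch of the fan-out from Redis Pub/Sub to an SSE endpoint; the channel naming scheme and route path are assumptions for illustration:

```
import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
r = redis.Redis()


@app.get("/projects/{project_id}/events")
async def project_events(project_id: str):
    async def stream():
        pubsub = r.pubsub()
        channel = f"events:{project_id}"  # illustrative channel name
        await pubsub.subscribe(channel)
        try:
            async for msg in pubsub.listen():
                if msg["type"] == "message":
                    # Forward each published event as one SSE frame.
                    yield f"data: {msg['data'].decode()}\n\n"
        finally:
            await pubsub.unsubscribe(channel)

    return StreamingResponse(stream(), media_type="text/event-stream")
```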

### 6. Issue Synchronization

**Architecture:** Webhook-first + polling fallback

**Supported Providers:**
- Gitea (primary)
- GitHub
- GitLab

**Conflict Resolution:** Last-Writer-Wins with version vectors
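
A minimal sketch of the resolution rule; the record shape (`vv`, `updated_at`) is an assumption for illustration:

```
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if vector a has observed every update recorded in b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def resolve(local: dict, remote: dict) -> dict:
    """Causally ordered updates win outright; concurrent updates fall
    back to Last-Writer-Wins on the wall-clock timestamp."""
    if dominates(local["vv"], remote["vv"]):
        return local
    if dominates(remote["vv"], local["vv"]):
        return remote
    return max(local, remote, key=lambda version: version["updated_at"])


local  = {"vv": {"syndarix": 3, "gitea": 1}, "updated_at": 1_700_000_100, "title": "A"}
remote = {"vv": {"syndarix": 2, "gitea": 2}, "updated_at": 1_700_000_200, "title": "B"}
assert resolve(local, remote)["title"] == "B"  # concurrent -> newer timestamp wins
```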

### 7. Cost Tracking

**Real-time Pipeline:**

```
LLM Request → LiteLLM Callback → Redis INCR → Budget Check
                     │
              Async Queue → PostgreSQL → SSE Dashboard Update
```

**Budget Enforcement:**
- Soft limits: Alerts + model downgrade
- Hard limits: Block requests
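
A minimal sketch of the callback plus budget check, assuming LiteLLM's custom success-callback signature and that the computed `response_cost` is present in the callback kwargs; the Redis key names and limits are illustrative assumptions:

```
import litellm
import redis

r = redis.Redis()
SOFT_LIMIT, HARD_LIMIT = 50.0, 100.0  # illustrative USD budgets


def track_cost(kwargs, completion_response, start_time, end_time):
    """LiteLLM success callback: accumulate spend per project in Redis."""
    cost = kwargs.get("response_cost") or 0.0
    metadata = kwargs.get("litellm_params", {}).get("metadata") or {}
    project_id = metadata.get("project_id", "unknown")
    spent = float(r.incrbyfloat(f"budget:{project_id}", cost))
    if spent >= HARD_LIMIT:
        r.set(f"budget:{project_id}:blocked", 1)  # checked before the next call
    elif spent >= SOFT_LIMIT:
        r.publish("events:budget", f"{project_id} passed soft limit")


litellm.success_callback = [track_cost]
```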

### 8. Audit Logging

**Immutability:** SHA-256 hash chaining

**Storage Tiers:**

| Tier | Storage | Retention |
|------|---------|-----------|
| Hot | PostgreSQL | 0-90 days |
| Cold | S3/MinIO | 90+ days |
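
A minimal sketch of the chaining rule: each entry's hash commits to the previous hash plus the canonical JSON of the event, so altering any entry breaks every later hash:

```
import hashlib
import json


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash over the previous hash and the canonical event payload."""
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()


GENESIS = "0" * 64
h1 = chain_hash(GENESIS, {"actor": "eng-001", "action": "pr.create"})
h2 = chain_hash(h1, {"actor": "client", "action": "pr.approve"})
# Verification walks the log from genesis, recomputing and comparing hashes.
```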

### 9. Client Approval Flow

**Autonomy Levels:**

| Level | Description |
|-------|-------------|
| FULL_CONTROL | Approve every action |
| MILESTONE | Approve sprint boundaries |
| AUTONOMOUS | Only critical decisions |

**Notifications:** SSE + Email + Mobile Push
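
A minimal sketch of the checkpoint gate; the checkpoint names below are illustrative assumptions, not the checkpoint set ADR-014 actually specifies:

```
from enum import Enum


class AutonomyLevel(str, Enum):
    FULL_CONTROL = "full_control"   # approve every action
    MILESTONE = "milestone"         # approve sprint boundaries
    AUTONOMOUS = "autonomous"       # only critical decisions


# Illustrative checkpoint sets per level.
CHECKPOINTS = {
    AutonomyLevel.FULL_CONTROL: {"any_action"},
    AutonomyLevel.MILESTONE: {"sprint_start", "sprint_end", "pr_merge"},
    AutonomyLevel.AUTONOMOUS: {"production_deploy", "budget_exceeded"},
}


def requires_approval(level: AutonomyLevel, checkpoint: str) -> bool:
    """Decide whether an agent must pause and request client approval."""
    gates = CHECKPOINTS[level]
    return "any_action" in gates or checkpoint in gates
```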

---

## Technology Stack

### Core Technologies

| Layer | Technology | Version | License |
|-------|------------|---------|---------|
| Backend | FastAPI | 0.115+ | MIT |
| Frontend | Next.js | 16 | MIT |
| Database | PostgreSQL + pgvector | 15+ | PostgreSQL |
| Cache/Queue | Redis | 7.0+ | BSD-3 |
| Task Queue | Celery | 5.3+ | BSD-3 |
| LLM Gateway | LiteLLM | Latest | MIT |
| MCP Framework | FastMCP | 2.0+ | MIT |

### Self-Hostability Guarantee

**All components are fully self-hostable with no mandatory subscriptions:**

| Component | Self-Hosted | Managed Alternative (Optional) |
|-----------|-------------|--------------------------------|
| PostgreSQL | Yes | RDS, Neon, Supabase |
| Redis | Yes | Redis Cloud |
| LiteLLM | Yes | LiteLLM Enterprise |
| Celery | Yes | - |
| FastMCP | Yes | - |

---

## Data Flow Diagrams

### Agent Task Execution

```
1. Client creates story in Syndarix
         │
         ▼
2. Story workflow transitions to "implementation"
         │
         ▼
3. Agent Orchestrator spawns Engineer instance
         │
         ▼
4. Engineer queries Knowledge Base (RAG)
         │
         ▼
5. Engineer calls LLM Gateway for code generation
         │
         ▼
6. Engineer calls Git MCP to create branch & commit
         │
         ▼
7. Engineer creates PR via Git MCP
         │
         ▼
8. Workflow transitions to "review"
         │
         ▼
9. If autonomy_level != AUTONOMOUS:
   └── Approval request created
        └── Client notified via SSE + email
         │
         ▼
10. Client approves → PR merged → Workflow to "testing"
```

### Real-time Event Flow

```
Agent Action
     │
     ▼
Event Bus (Redis Pub/Sub)
     │
     ├──▶ SSE Endpoint ──▶ Frontend Dashboard
     │
     ├──▶ Audit Logger ──▶ PostgreSQL
     │
     └──▶ Other Backend Instances (horizontal scaling)
```

---

## Security Architecture

### Authentication Flow

- **Users:** JWT dual-token (access + refresh) via PragmaStack
- **Agents:** Service tokens for MCP communication
- **MCP Servers:** Internal network only, validated service tokens

### Multi-Tenancy

- **Project Isolation:** All queries scoped by project_id
- **Row-Level Security:** PostgreSQL RLS for knowledge base
- **Agent Scoping:** Every MCP tool requires project_id + agent_id
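
A minimal sketch of the RLS setup for the knowledge base; the table name and the `app.current_project_id` session variable are assumptions for illustration:

```
from sqlalchemy import text

# One-time DDL (illustrative): rows are only visible when their
# project_id matches the session's pinned project.
ENABLE_RLS = text("""
    ALTER TABLE kb_chunks ENABLE ROW LEVEL SECURITY;
    CREATE POLICY project_isolation ON kb_chunks
        USING (project_id = current_setting('app.current_project_id')::uuid);
""")

# Per-request: pin the session to the caller's project before querying
# (set_config with is_local=true scopes it to the current transaction).
SET_PROJECT = text(
    "SELECT set_config('app.current_project_id', :project_id, true)"
)
```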

### Audit Trail

- **Hash Chaining:** Tamper-evident event log
- **Complete Coverage:** All agent actions, LLM calls, MCP tool invocations

---

## Scalability Considerations

### Horizontal Scaling

| Component | Scaling Strategy |
|-----------|-----------------|
| FastAPI | Multiple instances behind load balancer |
| Celery Workers | Add workers per queue as needed |
| PostgreSQL | Read replicas, connection pooling |
| Redis | Cluster mode for high availability |

### Expected Scale

| Metric | Target |
|--------|--------|
| Concurrent Projects | 50+ |
| Concurrent Agent Instances | 200+ |
| Background Jobs/minute | 500+ |
| SSE Connections | 200+ |

---

## Deployment Architecture

### Local Development

```
docker-compose up
├── PostgreSQL (+ pgvector)
├── Redis
├── FastAPI Backend
├── Next.js Frontend
├── Celery Workers (agent, git, sync queues)
├── Celery Beat (scheduler)
├── Flower (monitoring)
└── MCP Servers (7 containers)
```

### Production

```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
└─────────────────────────────┬───────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  API Instance 1 │  │  API Instance 2 │  │  API Instance N │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   PostgreSQL    │  │  Redis Cluster  │  │ Celery Workers  │
│   (Primary +    │  │                 │  │  (Auto-scaled)  │
│   Replicas)     │  │                 │  │                 │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

---

## Related Documents

- [Implementation Roadmap](./IMPLEMENTATION_ROADMAP.md)
- [Architecture Deep Analysis](./ARCHITECTURE_DEEP_ANALYSIS.md)
- [ADRs](../adrs/) - All architecture decision records
- [Spikes](../spikes/) - Research documents

---

## Appendix: Full ADR List

1. [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md)
2. [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md)
3. [ADR-003: Background Task Architecture](../adrs/ADR-003-background-task-architecture.md)
4. [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md)
5. [ADR-005: Technology Stack Selection](../adrs/ADR-005-tech-stack-selection.md)
6. [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md)
7. [ADR-007: Agentic Framework Selection](../adrs/ADR-007-agentic-framework-selection.md)
8. [ADR-008: Knowledge Base and RAG](../adrs/ADR-008-knowledge-base-rag.md)
9. [ADR-009: Agent Communication Protocol](../adrs/ADR-009-agent-communication-protocol.md)
10. [ADR-010: Workflow State Machine](../adrs/ADR-010-workflow-state-machine.md)
11. [ADR-011: Issue Synchronization](../adrs/ADR-011-issue-synchronization.md)
12. [ADR-012: Cost Tracking](../adrs/ADR-012-cost-tracking.md)
13. [ADR-013: Audit Logging](../adrs/ADR-013-audit-logging.md)
14. [ADR-014: Client Approval Flow](../adrs/ADR-014-client-approval-flow.md)

---

*This document serves as the authoritative architecture reference for Syndarix.*

docs/IMPLEMENTATION_ROADMAP.md

@@ -17,9 +17,11 @@ This roadmap outlines the phased implementation approach for Syndarix, prioritiz
 
 ### 0.1 Repository Setup
 - [x] Fork PragmaStack to Syndarix
-- [x] Create spike backlog in Gitea
+- [x] Create spike backlog in Gitea (12 issues)
 - [x] Complete architecture documentation
-- [ ] Rebrand codebase (Issue #13 - in progress)
+- [x] Complete all spike research (SPIKE-001 through SPIKE-012)
+- [x] Create all ADRs (ADR-001 through ADR-014)
+- [x] Rebrand codebase (all URLs, names, configs updated)
 - [ ] Configure CI/CD pipelines
 - [ ] Set up development environment documentation
 
@@ -31,9 +33,12 @@ This roadmap outlines the phased implementation approach for Syndarix, prioritiz
 - [ ] Set up Docker Compose for local development
 
 ### Deliverables
-- Fully branded Syndarix repository
-- Working local development environment
-- CI/CD pipeline running tests
+- [x] Fully branded Syndarix repository
+- [x] Complete architecture documentation (ARCHITECTURE.md)
+- [x] All spike research completed (12 spikes)
+- [x] All ADRs documented (14 ADRs)
+- [ ] Working local development environment (Docker Compose)
+- [ ] CI/CD pipeline running tests
 
 ---