docs: add architecture decision records (ADRs) for key technical choices

- Added the following ADRs to `docs/adrs/` directory:
  - ADR-001: MCP Integration Architecture
  - ADR-002: Real-time Communication Architecture
  - ADR-003: Background Task Architecture
  - ADR-004: LLM Provider Abstraction
  - ADR-005: Technology Stack Selection
- Each ADR details the context, decision drivers, considered options, final decisions, and implementation plans.
- Documentation aligns technical choices with architecture principles and system requirements for Syndarix.

docs/adrs/ADR-003-background-task-architecture.md
@@ -0,0 +1,179 @@

# ADR-003: Background Task Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-004

---

## Context

Syndarix requires background task processing for:
- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)

These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring.

## Decision Drivers

- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI

## Considered Options

### Option 1: FastAPI BackgroundTasks
Use FastAPI's built-in background tasks.

**Pros:**
- Simple, no additional infrastructure
- Direct async integration

**Cons:**
- No persistence (lost on restart)
- No retry mechanism
- No distributed workers

### Option 2: Celery + Redis (Selected)
Use Celery for task queue with Redis as broker/backend.

**Pros:**
- Mature, battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower

**Cons:**
- Additional infrastructure
- Sync-only task execution (bridge needed for async)
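
The "built-in retry with backoff" point refers to re-running a failed task after delays that grow exponentially. The following is a minimal plain-Python sketch of such a schedule, not Celery's own implementation; the function name and the `base` and `cap` values are illustrative assumptions rather than settings taken from this ADR.

```python
def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 600.0) -> list[float]:
    """Return the wait (in seconds) before each retry attempt.

    The delay doubles on every attempt and is clamped to `cap`.
    `base` and `cap` are illustrative values, not settings from this ADR.
    """
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

In Celery itself, comparable behavior is usually configured through task options such as `autoretry_for`, `retry_backoff`, `retry_backoff_max`, and `max_retries`; which values Syndarix should use is left open here.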

### Option 3: Dramatiq + Redis
Use Dramatiq as a simpler Celery alternative.

**Pros:**
- Simpler API than Celery
- Good async support

**Cons:**
- Less mature ecosystem
- Fewer monitoring tools

### Option 4: ARQ (Async Redis Queue)
Use ARQ for native async task processing.

**Pros:**
- Native async
- Simple API

**Cons:**
- Less feature-rich
- Smaller community

## Decision

**Adopt Option 2: Celery + Redis.**

Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.

## Implementation

### Queue Architecture

```
┌─────────────────────────────────────────────────┐
│            Redis (Broker + Backend)             │
├─────────────┬─────────────┬─────────────────────┤
│ agent_queue │ git_queue   │ sync_queue          │
│ (prefetch=1)│ (prefetch=4)│ (prefetch=4)        │
└──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │                 │
       ▼             ▼                 ▼
  ┌─────────┐   ┌─────────┐       ┌─────────┐
  │  Agent  │   │   Git   │       │  Sync   │
  │ Workers │   │ Workers │       │ Workers │
  └─────────┘   └─────────┘       └─────────┘
```
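
The diagram gives `agent_queue` a prefetch of 1 while the other queues use 4. One plausible reason is that prefetching long tasks lets a single worker reserve work that idle workers could have started. The sketch below uses a deliberately simplified model (a worker reserves a batch only when it runs dry, then executes it sequentially); the 60-second task duration and the function name are assumptions for illustration, and real Celery reservation behavior is more nuanced.

```python
from collections import deque


def makespan(durations: list[float], workers: int, prefetch: int) -> float:
    """Total wall-clock time to drain the queue in a simplified model.

    Whichever worker runs dry first reserves up to `prefetch` tasks from the
    broker and executes them one after another before reserving again.
    """
    broker = deque(durations)
    busy_until = [0.0] * workers
    while broker:
        # The worker that becomes idle earliest reserves the next batch.
        w = min(range(workers), key=lambda i: busy_until[i])
        for _ in range(min(prefetch, len(broker))):
            busy_until[w] += broker.popleft()
    return max(busy_until)


long_tasks = [60.0] * 4  # four LLM-style tasks of 60 s each (illustrative)
print(makespan(long_tasks, workers=2, prefetch=4))  # 240.0: one worker hoards all four
print(makespan(long_tasks, workers=2, prefetch=1))  # 120.0: work is spread evenly
```

For short, uniform tasks such as most Git or sync operations, the same model shows little penalty from a larger prefetch, which is consistent with the higher values chosen for those queues.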

### Queue Configuration

| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |
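
One way the table could translate into routing is a small router function that picks a queue from the task name. This is a sketch under assumptions: the task-name prefixes (`agent.`, `git.`, and so on) are hypothetical, since the ADR does not define a naming scheme, and the function is only shaped like a Celery router rather than taken from the project.

```python
# Queue settings copied from the Queue Configuration table.
QUEUES = {
    "agent_queue": {"prefetch": 1, "concurrency": 4},
    "git_queue": {"prefetch": 4, "concurrency": 8},
    "sync_queue": {"prefetch": 4, "concurrency": 4},
    "cicd_queue": {"prefetch": 4, "concurrency": 4},
}

# Hypothetical task-name prefixes; the ADR does not define a naming scheme.
PREFIX_TO_QUEUE = {
    "agent.": "agent_queue",
    "git.": "git_queue",
    "sync.": "sync_queue",
    "cicd.": "cicd_queue",
}


def route_task(name, args=None, kwargs=None, options=None, task=None, **kw):
    """Pick a queue from the task name; shaped like a Celery router function."""
    for prefix, queue in PREFIX_TO_QUEUE.items():
        if name.startswith(prefix):
            return {"queue": queue}
    return None  # no match: fall back to the default queue


print(route_task("agent.implement_story"))  # {'queue': 'agent_queue'}
```

In a Celery deployment a function like this would typically be listed in the `task_routes` setting, with one worker started per queue using the `-Q`, `--concurrency`, and `--prefetch-multiplier` options. Whether the table's "Prefetch" column is meant as that multiplier is an assumption here.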

### Task Patterns

**Progress Reporting:**
```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    # `steps`, `event_bus`, and `execute_step` are assumed to be defined
    # elsewhere; this ADR does not specify them.
    for i, step in enumerate(steps):
        # Record progress in the result backend so clients can poll it
        self.update_state(
            state="PROGRESS",
            meta={"current": i + 1, "total": len(steps)},
        )
        # Publish SSE event for real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i + 1,
            "total": len(steps),
        })
        execute_step(step)
```
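
To make the "SSE event" in the task above more concrete, the sketch below builds the same payload and serializes it as one Server-Sent Events message. How the event bus actually hands events to the SSE endpoint is outside this ADR, so both helper names are hypothetical and the wiring is an assumption.

```python
import json


def progress_event(step: int, total: int) -> dict:
    """Build the payload shape published by the progress-reporting task."""
    return {"type": "agent_progress", "step": step, "total": total}


def to_sse_frame(event: dict) -> str:
    """Serialize an event as one Server-Sent Events message.

    An SSE message is a set of `field: value` lines ended by a blank line.
    """
    return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"


print(to_sse_frame(progress_event(2, 5)), end="")
```

A browser `EventSource` subscribed to the stream could then listen for `agent_progress` events and update a progress bar from `step` and `total`.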

**Task Chaining:**
```python
from celery import chain

# Each task's return value is passed as the first argument to the next task.
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s(),
)
```
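
The data flow of a chain can be illustrated without Celery: each step receives the previous step's return value. The following is a plain-Python analogy only; `run_chain` and the three stand-in steps are hypothetical and do none of the real work (LLM calls, Git, CI).

```python
from functools import reduce


def run_chain(first_arg, *steps):
    """Call each step with the previous step's return value."""
    return reduce(lambda result, step: step(result), steps, first_arg)


# Stand-in steps (hypothetical); real tasks would call LLMs, Git, and CI.
def analyze(story_id):
    return {"story": story_id, "done": ["analyze"]}


def design(state):
    return {**state, "done": state["done"] + ["design"]}


def implement(state):
    return {**state, "done": state["done"] + ["implement"]}


result = run_chain("STORY-1", analyze, design, implement)
print(result["done"])  # ['analyze', 'design', 'implement']
```

As in this analogy, an exception in one step prevents the later steps from running. In Celery, a step that should ignore the previous result can be declared with an immutable signature (`.si()`) instead of `.s()`.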

### Monitoring

- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Failed tasks for investigation
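
The dead letter queue idea can be sketched as "retry a bounded number of times, then park the failure for inspection". The ADR does not say how the queue is implemented, and Redis as a broker does not appear to offer dead-lettering natively, so the sketch below assumes an application-level approach; `run_with_dead_letter` and the record fields are hypothetical.

```python
def run_with_dead_letter(task, max_retries, dead_letters, name="task"):
    """Run `task`; after `max_retries` failed retries, park it in `dead_letters`.

    Returns the task result, or None if the task was dead-lettered.
    """
    attempts = 0
    while True:
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure counts
            attempts += 1
            if attempts > max_retries:
                # Keep enough context for someone to investigate later.
                dead_letters.append(
                    {"task": name, "attempts": attempts, "error": repr(exc)}
                )
                return None


def always_fails():
    raise RuntimeError("boom")


dlq = []
run_with_dead_letter(always_fails, max_retries=2, dead_letters=dlq, name="git.push")
print(dlq)  # [{'task': 'git.push', 'attempts': 3, 'error': "RuntimeError('boom')"}]
```

In production the list would be replaced by a durable store (for example a dedicated queue or database table), and retries would be delayed rather than immediate.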

## Consequences

### Positive
- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools

### Negative
- Additional infrastructure (Redis, workers)
- Celery tasks run synchronously (an event-loop bridge is needed to call async code)
- Learning curve for task patterns

### Mitigation
- Use existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or `async_to_sync` (from `asgiref`)
- Document common task patterns
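
The async bridge mentioned in the mitigation list can be as small as one `asyncio.run()` call inside the synchronous task body. A minimal sketch, assuming a hypothetical async client function `call_llm`; the Celery decorator is omitted so the example runs with the standard library alone.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Stand-in for an async client call (hypothetical)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"response to: {prompt}"


def generate_code(prompt: str) -> str:
    """Synchronous entry point, shaped like a Celery task body."""
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop, so no loop leaks between tasks.
    return asyncio.run(call_llm(prompt))


print(generate_code("add a health check"))  # response to: add a health check
```

`asyncio.run()` raises an error if an event loop is already running in the same thread, so this pattern fits process- or thread-based worker pools; other pool types would need a different bridge.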

## Compliance

This decision aligns with:
- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance
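
As a rough sanity check on NFR-102, the concurrency values from the Queue Configuration table bound how long an average task may run. The calculation below assumes a single worker deployment per queue at the listed concurrency and every slot continuously busy; both are simplifying assumptions, not statements from the requirements.

```python
# Concurrency values from the Queue Configuration table.
CONCURRENCY = {"agent_queue": 4, "git_queue": 8, "sync_queue": 4, "cicd_queue": 4}
TARGET_JOBS_PER_MINUTE = 500  # NFR-102

slots = sum(CONCURRENCY.values())
# With every slot busy, throughput = slots / avg_duration, so the average
# task may take at most slots / target_rate seconds.
max_avg_seconds = slots / (TARGET_JOBS_PER_MINUTE / 60)
print(slots, round(max_avg_seconds, 2))  # 20 2.4
```

Under these assumptions, 20 slots sustain 500 jobs per minute only if tasks average about 2.4 seconds. Since LLM-backed agent tasks are likely to run much longer, meeting the target would depend on adding worker instances or on most jobs being short sync and Git operations.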

---

*This ADR supersedes any previous decisions regarding background task processing.*