# ADR-003: Background Task Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-004

---

## Context

Syndarix requires background task processing for:
- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)

These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring.

## Decision Drivers

- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI

## Considered Options

### Option 1: FastAPI BackgroundTasks
Use FastAPI's built-in background tasks.

**Pros:**
- Simple, no additional infrastructure
- Direct async integration

**Cons:**
- No persistence (lost on restart)
- No retry mechanism
- No distributed workers

### Option 2: Celery + Redis (Selected)
Use Celery for task queue with Redis as broker/backend.

**Pros:**
- Mature, battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower

**Cons:**
- Additional infrastructure
- Sync-only task execution (bridge needed for async)

### Option 3: Dramatiq + Redis
Use Dramatiq as a simpler Celery alternative.

**Pros:**
- Simpler API than Celery
- Good async support

**Cons:**
- Less mature ecosystem
- Fewer monitoring tools

### Option 4: ARQ (Async Redis Queue)
Use ARQ for native async task processing.

**Pros:**
- Native async
- Simple API

**Cons:**
- Less feature-rich
- Smaller community

## Decision

**Adopt Option 2: Celery + Redis.**

Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.

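As a rough illustration of this decision, a Celery settings module might look like the sketch below. The module name, Redis URLs, and timeout values are assumptions rather than values from this ADR; the setting names themselves are standard Celery configuration keys.

```python
# celeryconfig.py (hypothetical module name), typically loaded with
# celery_app.config_from_object("celeryconfig").

# Redis as both broker and result backend; the URLs are placeholders.
broker_url = "redis://localhost:6379/0"
result_backend = "redis://localhost:6379/1"

# Acknowledge a task only after it finishes, so a task that was running
# when its worker died is redelivered instead of being lost.
task_acks_late = True
task_reject_on_worker_lost = True

# With the Redis transport, an unacknowledged task is redelivered after
# this many seconds; it should exceed the longest expected task runtime.
broker_transport_options = {"visibility_timeout": 6 * 60 * 60}

# Keep results for a day so progress and outcomes stay queryable.
result_expires = 24 * 60 * 60

# JSON keeps task payloads readable and avoids pickle.
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]
```

Late acknowledgement means a task can run more than once after a worker crash, so tasks configured this way should be idempotent.
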
## Implementation

### Queue Architecture

```
┌─────────────────────────────────────────────────┐
│            Redis (Broker + Backend)             │
├─────────────┬─────────────┬─────────────────────┤
│ agent_queue │ git_queue   │ sync_queue          │
│ (prefetch=1)│ (prefetch=4)│ (prefetch=4)        │
└──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │                 │
       ▼             ▼                 ▼
  ┌─────────┐   ┌─────────┐       ┌─────────┐
  │  Agent  │   │   Git   │       │  Sync   │
  │ Workers │   │ Workers │       │ Workers │
  └─────────┘   └─────────┘       └─────────┘
```

### Queue Configuration

| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |

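One way to apply the table is to run a dedicated worker per queue. The sketch below builds the corresponding `celery worker` command lines; it assumes the "Prefetch" column maps to Celery's `--prefetch-multiplier` option, and the application path `app.celery_app` is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSpec:
    name: str
    prefetch: int
    concurrency: int


# Values copied from the Queue Configuration table.
QUEUES = [
    QueueSpec("agent_queue", prefetch=1, concurrency=4),
    QueueSpec("git_queue", prefetch=4, concurrency=8),
    QueueSpec("sync_queue", prefetch=4, concurrency=4),
    QueueSpec("cicd_queue", prefetch=4, concurrency=4),
]


def worker_command(spec: QueueSpec, app: str = "app.celery_app") -> list[str]:
    """Build the argv for one worker that consumes a single queue."""
    return [
        "celery", "-A", app, "worker",
        "-Q", spec.name,
        f"--concurrency={spec.concurrency}",
        f"--prefetch-multiplier={spec.prefetch}",
        "-n", f"{spec.name}@%h",  # unique node name per queue
    ]


for spec in QUEUES:
    print(" ".join(worker_command(spec)))
```

The prefetch multiplier applies per worker process, so a worker reserves up to `prefetch × concurrency` messages at a time. A multiplier of 1 on `agent_queue` keeps long LLM tasks from piling up behind a single busy process.
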
### Task Patterns

**Progress Reporting:**
```python
# `celery_app`, `event_bus`, `load_steps`, and `execute_step` are assumed
# to be defined elsewhere in the application.
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    # Hypothetical helper; the original snippet does not show where `steps` comes from.
    steps = load_steps(story_id)
    for i, step in enumerate(steps):
        # Store progress in the result backend so clients can poll it
        self.update_state(
            state="PROGRESS",
            meta={"current": i + 1, "total": len(steps)},
        )
        # Publish SSE event for real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i + 1,
            "total": len(steps),
        })
        execute_step(step)
```

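On the reading side, an API endpoint can turn the stored task state into a response body. The sketch below is one possible shape, not part of the ADR: `progress_payload` is a hypothetical helper, and it only relies on the `.state` and `.info` attributes that `celery.result.AsyncResult` exposes, so a stand-in object is used to keep the example runnable without a broker.

```python
from types import SimpleNamespace


def progress_payload(result) -> dict:
    """Turn a Celery-style result object into a JSON-friendly dict.

    `result` only needs `.state` and `.info`, as exposed by
    `celery.result.AsyncResult`.
    """
    payload = {"state": result.state, "current": 0, "total": None}
    if result.state == "PROGRESS" and isinstance(result.info, dict):
        # For the custom PROGRESS state, .info is the `meta` dict
        # passed to update_state().
        payload["current"] = result.info.get("current", 0)
        payload["total"] = result.info.get("total")
    elif result.state == "FAILURE":
        # For failed tasks, .info holds the raised exception.
        payload["error"] = str(result.info)
    return payload


# Stand-in for AsyncResult(task_id) so the sketch runs without Redis.
fake_result = SimpleNamespace(state="PROGRESS", info={"current": 2, "total": 5})
print(progress_payload(fake_result))
```

In the application, the stand-in would be replaced by `AsyncResult(task_id, app=celery_app)`.
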
**Task Chaining:**
```python
from celery import chain

# The tasks are assumed to be Celery tasks defined elsewhere.
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s(),
)
```

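In a Celery chain, each task's return value is passed as the first argument to the next task. The plain-Python model below illustrates only that data flow; the step functions are hypothetical stand-ins for the real tasks, and nothing here touches Celery or Redis.

```python
from functools import reduce
from typing import Any, Callable


def run_chain(first_arg: Any, *steps: Callable[[Any], Any]) -> Any:
    """Call each step with the previous step's return value."""
    return reduce(lambda value, step: step(value), steps, first_arg)


# Hypothetical stand-ins for the real tasks.
def analyze_requirements(story_id: str) -> dict:
    return {"story_id": story_id, "stages": ["analyzed"]}


def design_solution(ctx: dict) -> dict:
    return {**ctx, "stages": ctx["stages"] + ["designed"]}


def implement_code(ctx: dict) -> dict:
    return {**ctx, "stages": ctx["stages"] + ["implemented"]}


result = run_chain("story-42", analyze_requirements, design_solution, implement_code)
print(result)
```

The practical consequence is that every task in the chain should return a serializable value that the next task accepts as its first parameter. If a task raises, the remaining tasks do not run.
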
### Monitoring

- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Failed tasks for investigation

## Consequences

### Positive
- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools

### Negative
- Additional infrastructure (Redis, workers)
- Celery tasks run synchronously, so calling async code requires an event-loop bridge
- Learning curve for task patterns

### Mitigation
- Use existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or `async_to_sync`
- Document common task patterns

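The `asyncio.run()` bridge mentioned above can be sketched as follows. `call_llm` is a hypothetical async client call that sleeps instead of doing network I/O, and `generate_code` stands for the body of a synchronous task, which in the application would carry the `@celery_app.task` decorator.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Hypothetical async client call; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


def generate_code(prompt: str) -> str:
    """Synchronous task body that needs the result of an async call."""
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop before returning the result.
    return asyncio.run(call_llm(prompt))


print(generate_code("write a unit test"))
```

This assumes no event loop is already running in the thread that executes the task, which holds for Celery's default prefork pool. `asyncio.run()` raises `RuntimeError` if called from inside a running loop.
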
## Compliance

This decision aligns with:
- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance

---

*This ADR supersedes any previous decisions regarding background task processing.*