# ADR-003: Background Task Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-004

---

## Context

Syndarix requires background task processing for:
- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)

These tasks take too long to complete within a synchronous API request, so they need durable queuing, retries, and monitoring.

## Decision Drivers

- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI

## Considered Options

### Option 1: FastAPI BackgroundTasks
Use FastAPI's built-in background tasks.

**Pros:**
- Simple, no additional infrastructure
- Direct async integration

**Cons:**
- No persistence (lost on restart)
- No retry mechanism
- No distributed workers

### Option 2: Celery + Redis (Selected)
Use Celery for task queue with Redis as broker/backend.

**Pros:**
- Mature, battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower

**Cons:**
- Additional infrastructure
- Sync-only task execution (bridge needed for async)

### Option 3: Dramatiq + Redis
Use Dramatiq as a simpler Celery alternative.

**Pros:**
- Simpler API than Celery
- Good async support

**Cons:**
- Less mature ecosystem
- Fewer monitoring tools

### Option 4: ARQ (Async Redis Queue)
Use ARQ for native async task processing.

**Pros:**
- Native async
- Simple API

**Cons:**
- Less feature-rich
- Smaller community

## Decision

**Adopt Option 2: Celery + Redis.**

Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.
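
As a rough illustration of "Redis as both broker and result backend", the following is a minimal `celeryconfig.py`-style sketch. The setting names are standard Celery settings; the host, database numbers, and timeout values are assumptions chosen for the example, not values taken from this ADR.

```python
# celeryconfig.py (sketch), loaded with celery_app.config_from_object("celeryconfig").
# Host and database numbers are placeholders, not project values.
REDIS_HOST = "redis://localhost:6379"

broker_url = f"{REDIS_HOST}/0"       # task messages
result_backend = f"{REDIS_HOST}/1"   # task state and results

# Acknowledge a message only after the task finishes, so a task that was
# running when its worker died is redelivered instead of lost.
task_acks_late = True
task_reject_on_worker_lost = True

# With the Redis transport, an unacknowledged message is redelivered after
# this many seconds. It should exceed the longest expected task runtime
# (assumed here to be under one hour).
broker_transport_options = {"visibility_timeout": 3600}

# Expire stored results after one day (assumed) so Redis does not grow unbounded.
result_expires = 24 * 60 * 60
```

Late acknowledgement means a task may run more than once after a worker crash, so tasks configured this way should be idempotent.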

## Implementation

### Queue Architecture

```
┌───────────────────────────────────────────────────────────┐
│                 Redis (Broker + Backend)                  │
├──────────────┬──────────────┬──────────────┬──────────────┤
│ agent_queue  │  git_queue   │  sync_queue  │  cicd_queue  │
│ (prefetch=1) │ (prefetch=4) │ (prefetch=4) │ (prefetch=4) │
└──────┬───────┴──────┬───────┴──────┬───────┴──────┬───────┘
       │              │              │              │
       ▼              ▼              ▼              ▼
  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
  │ Agent   │    │  Git    │    │  Sync   │    │ CI/CD   │
  │ Workers │    │ Workers │    │ Workers │    │ Workers │
  └─────────┘    └─────────┘    └─────────┘    └─────────┘
```

### Queue Configuration

| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |
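
One way to turn the table above into a running setup is a dedicated worker per queue. The sketch below assumes the "Prefetch" column maps to Celery's per-worker prefetch multiplier; the module paths in `task_routes` and the `app.celery_app` application path are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSpec:
    name: str
    prefetch: int      # assumed to be the worker prefetch multiplier
    concurrency: int   # worker processes


# Values copied from the Queue Configuration table.
QUEUES = [
    QueueSpec("agent_queue", prefetch=1, concurrency=4),
    QueueSpec("git_queue", prefetch=4, concurrency=8),
    QueueSpec("sync_queue", prefetch=4, concurrency=4),
    QueueSpec("cicd_queue", prefetch=4, concurrency=4),
]

# Celery's task_routes setting accepts glob patterns on task names.
# The module layout below is hypothetical.
task_routes = {
    "app.tasks.agent.*": {"queue": "agent_queue"},
    "app.tasks.git.*": {"queue": "git_queue"},
    "app.tasks.sync.*": {"queue": "sync_queue"},
    "app.tasks.cicd.*": {"queue": "cicd_queue"},
}


def worker_command(spec: QueueSpec, app: str = "app.celery_app") -> str:
    """Build the CLI invocation for a worker that consumes a single queue."""
    return (
        f"celery -A {app} worker -Q {spec.name} "
        f"--concurrency={spec.concurrency} "
        f"--prefetch-multiplier={spec.prefetch}"
    )


if __name__ == "__main__":
    for spec in QUEUES:
        print(worker_command(spec))
```

A worker reserves roughly `concurrency × prefetch multiplier` messages at a time, so a multiplier of 1 on `agent_queue` keeps slow LLM tasks from piling up behind a single busy worker.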

### Task Patterns

**Progress Reporting:**
```python
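# celery_app is the project's Celery application; event_bus and execute_step
# are project helpers assumed to be defined elsewhere.
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    # load_steps is a hypothetical helper that returns the ordered steps for a story
    steps = load_steps(story_id)
    total = len(steps)
    for i, step in enumerate(steps, start=1):
        # Record progress in the result backend so clients can poll it
        self.update_state(
            state="PROGRESS",
            meta={"current": i, "total": total}
        )
        # Publish SSE event for real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i,
            "total": total
        })
        execute_step(step)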
```
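
On the consuming side, the `PROGRESS` state and its `meta` dictionary can be read back through `AsyncResult.state` and `AsyncResult.info`. The helper below is a sketch of how an API handler might convert those two values into a response payload; the function name and payload shape are assumptions.

```python
from typing import Any


def progress_response(state: str, info: Any) -> dict:
    """Convert a task's state and info into a JSON-friendly progress payload.

    `state` and `info` correspond to AsyncResult.state and AsyncResult.info.
    """
    if state == "PROGRESS" and isinstance(info, dict):
        current = info.get("current", 0)
        total = info.get("total", 0)
        percent = round(100 * current / total) if total else 0
        return {"state": state, "current": current, "total": total, "percent": percent}
    if state == "SUCCESS":
        return {"state": state, "percent": 100}
    if state == "FAILURE":
        # For failed tasks, info holds the raised exception instance.
        return {"state": state, "error": str(info)}
    # PENDING, STARTED, RETRY, ...: no progress metadata available
    return {"state": state, "percent": 0}


# Usage inside an API handler (requires Celery, shown as a comment only):
#   from celery.result import AsyncResult
#   result = AsyncResult(task_id, app=celery_app)
#   return progress_response(result.state, result.info)
```

Celery also reports `PENDING` for task IDs it has never seen, so a handler cannot distinguish "queued" from "unknown" by state alone.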

**Task Chaining:**
```python
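from celery import chain

# Each task receives the previous task's return value as its first argument.
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s()
)
# Enqueue the chain; the returned AsyncResult refers to the final task (create_pr).
result = workflow.apply_async()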
```
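
The data flow of the chain above can be illustrated without Celery. The sketch below is plain Python, not the Celery API: the step bodies and the shape of the context dictionary are invented for illustration, and only the "output of one step becomes the input of the next" behaviour mirrors `chain`.

```python
from functools import reduce
from typing import Any, Callable


def analyze_requirements(story_id: str) -> dict:
    # First step: receives the explicit argument, like .s(story_id) in the chain
    return {"story_id": story_id, "stages": ["analyze"]}


def stage(name: str) -> Callable[[dict], dict]:
    """Build a stand-in step that appends its name to the running context."""
    def step(ctx: dict) -> dict:
        return {**ctx, "stages": [*ctx["stages"], name]}
    return step


# Stand-ins for the remaining tasks in the workflow.
design_solution = stage("design")
implement_code = stage("implement")
run_tests = stage("test")
create_pr = stage("pr")


def run_like_chain(first_arg: Any, *steps: Callable[[Any], Any]) -> Any:
    """Pass each step's return value to the next step as its only argument."""
    return reduce(lambda value, step: step(value), steps, first_arg)


result = run_like_chain(
    "story-42", analyze_requirements, design_solution,
    implement_code, run_tests, create_pr,
)
print(result["stages"])
```

In Celery the steps may run on different workers, so each return value must be serializable, and a failure in one task stops the rest of the chain.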

### Monitoring

- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Holds tasks that failed permanently, for later investigation (with a Redis broker this is implemented in application code)
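
Celery has no built-in dead letter queue for the Redis broker, so one possible approach is to capture failed tasks in a task base class and push a record onto a Redis list. The sketch below shows only the record-building part with the standard library; the key name, record fields, and the wiring in the trailing comment are assumptions.

```python
import json
import traceback
from datetime import datetime, timezone

DEAD_LETTER_KEY = "dead_letter:tasks"  # hypothetical Redis list name


def dead_letter_record(task_name, task_id, args, kwargs, exc) -> str:
    """Serialize a failed task so it can be inspected or re-queued later."""
    return json.dumps(
        {
            "task": task_name,
            "task_id": task_id,
            "args": list(args),
            "kwargs": dict(kwargs),
            "error_type": type(exc).__name__,
            "error": str(exc),
            "traceback": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
            "failed_at": datetime.now(timezone.utc).isoformat(),
        },
        default=str,  # fall back to str() for arguments that are not JSON-serializable
    )


# Wiring sketch (requires celery and redis-py; redis_client is hypothetical):
#   class DeadLetterTask(celery.Task):
#       def on_failure(self, exc, task_id, args, kwargs, einfo):
#           record = dead_letter_record(self.name, task_id, args, kwargs, exc)
#           redis_client.rpush(DEAD_LETTER_KEY, record)
#
#   @celery_app.task(base=DeadLetterTask)
#   def sync_issues(project_id): ...
```

`Task.on_failure` runs when a task ends in the failed state rather than on each retry, so the list only collects tasks that will not be attempted again.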

## Consequences

### Positive
- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools
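
Retry with exponential backoff is enabled per task in Celery through options such as `autoretry_for`, `retry_backoff`, `retry_backoff_max`, `retry_jitter`, and `max_retries`. The sketch below reproduces the resulting delay schedule in plain Python, assuming Celery's documented behaviour (delay doubles per retry, capped at a maximum, optionally randomized); the factor and cap are example values.

```python
import random


def backoff_delay(retries: int, factor: int = 1, maximum: int = 600,
                  jitter: bool = False) -> int:
    """Seconds to wait before the next attempt, given the retries so far (0-based)."""
    delay = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: pick a random delay between 0 and the computed value
        delay = random.randrange(delay + 1)
    return delay


# Schedule for a task declared roughly like this (exception type is an example):
#   @celery_app.task(autoretry_for=(TimeoutError,), retry_backoff=2,
#                    retry_backoff_max=60, retry_jitter=False, max_retries=6)
schedule = [backoff_delay(n, factor=2, maximum=60) for n in range(6)]
print(schedule)
```

Jitter is on by default in Celery (`retry_jitter=True`), which spreads retries out when many tasks fail at once, for example after an LLM API rate-limit response.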

### Negative
- Additional infrastructure (Redis, workers)
- Celery tasks run synchronously, so calling async code requires an event-loop bridge
- Learning curve for task patterns

### Mitigation
- Use existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or `async_to_sync` (from `asgiref`)
- Document common task patterns
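
The `asyncio.run()` bridge mentioned above can be sketched as follows. `fetch_completion` is a hypothetical stand-in for an async LLM client call, and `generate_code` stands in for a task body that would carry the `@celery_app.task` decorator in the real system.

```python
import asyncio


async def fetch_completion(prompt: str) -> str:
    """Stand-in for an async LLM client call (hypothetical)."""
    await asyncio.sleep(0)  # simulates awaiting network I/O
    return f"completion for: {prompt}"


def generate_code(prompt: str) -> str:
    """Synchronous task body that calls async code.

    In the real system this function would be decorated with @celery_app.task.
    """
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop before returning the result.
    return asyncio.run(fetch_completion(prompt))


if __name__ == "__main__":
    print(generate_code("add a health-check endpoint"))
```

Because each call gets a new event loop, async resources tied to a loop (such as connection pools) should be created inside the coroutine rather than at module import time. `asyncio.run()` also raises an error if a loop is already running in the same thread, which is not the case in Celery's default prefork workers.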

## Compliance

This decision aligns with:
- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance

---

*This ADR supersedes any previous decisions regarding background task processing.*