# ADR-003: Background Task Architecture

**Status:** Accepted

**Date:** 2025-12-29

**Deciders:** Architecture Team

**Related Spikes:** SPIKE-004

---

## Context

Syndarix requires background task processing for:

- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)

These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring.

## Decision Drivers

- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI

## Considered Options

### Option 1: FastAPI BackgroundTasks

Use FastAPI's built-in background tasks.

**Pros:**

- Simple, no additional infrastructure
- Direct async integration

**Cons:**

- No persistence (lost on restart)
- No retry mechanism
- No distributed workers

### Option 2: Celery + Redis (Selected)

Use Celery for task queue with Redis as broker/backend.

**Pros:**

- Mature, battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower

**Cons:**

- Additional infrastructure
- Sync-only task execution (bridge needed for async)

### Option 3: Dramatiq + Redis

Use Dramatiq as a simpler Celery alternative.

**Pros:**

- Simpler API than Celery
- Good async support

**Cons:**

- Less mature ecosystem
- Fewer monitoring tools

### Option 4: ARQ (Async Redis Queue)

Use ARQ for native async task processing.

**Pros:**

- Native async
- Simple API

**Cons:**

- Less feature-rich
- Smaller community

## Decision

**Adopt Option 2: Celery + Redis.**

Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads.
Redis serves as both broker and result backend.

## Implementation

### Queue Architecture

```
┌─────────────────────────────────────────────────┐
│            Redis (Broker + Backend)             │
├─────────────┬─────────────┬─────────────────────┤
│ agent_queue │ git_queue   │ sync_queue          │
│ (prefetch=1)│ (prefetch=4)│ (prefetch=4)        │
└──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │                 │
       ▼             ▼                 ▼
  ┌─────────┐   ┌─────────┐       ┌─────────┐
  │  Agent  │   │   Git   │       │  Sync   │
  │ Workers │   │ Workers │       │ Workers │
  └─────────┘   └─────────┘       └─────────┘
```

### Queue Configuration

| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |

### Task Patterns

**Progress Reporting:**

```python
# `celery_app`, `steps`, `event_bus`, and `execute_step` are assumed to be
# defined elsewhere; only the progress-reporting pattern is shown here.
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    for i, step in enumerate(steps):
        # Record progress in the result backend so clients can poll it
        self.update_state(
            state="PROGRESS",
            meta={"current": i + 1, "total": len(steps)}
        )
        # Publish SSE event for real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i + 1,
            "total": len(steps)
        })
        execute_step(step)
```

**Task Chaining:**

```python
from celery import chain

# Each task's return value is passed as the first argument to the next task
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s()
)
```

### Monitoring

- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Failed tasks for investigation

## Consequences

### Positive

- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools

### Negative

- Additional infrastructure (Redis, workers)
- Celery tasks run synchronously (an event-loop bridge is needed for async calls)
- Learning curve for task patterns

### Mitigation

- Use the
existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or asgiref's `async_to_sync`
- Document common task patterns

## Compliance

This decision aligns with:

- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance

---

*This ADR supersedes any previous decisions regarding background task processing.*
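
The `asyncio.run()` mitigation listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `call_llm` and `generate_code_task` are hypothetical names, and the task body is shown as a plain function so the snippet runs without Celery installed. In the real system it would presumably be registered as a Celery task.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Hypothetical async client call; stands in for a real LLM request."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return f"response to: {prompt}"


def generate_code_task(prompt: str) -> str:
    """Synchronous task body, as a worker process would invoke it.

    asyncio.run() creates a new event loop, runs the coroutine to
    completion, and closes the loop, so each call is isolated.
    """
    return asyncio.run(call_llm(prompt))


if __name__ == "__main__":
    print(generate_code_task("add a health endpoint"))
```

`asyncio.run()` raises `RuntimeError` when the calling thread already has a running event loop, so this pattern assumes a worker pool that does not run one in the task's thread.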