docs: add architecture decision records (ADRs) for key technical choices

- Added the following ADRs to `docs/adrs/` directory:
  - ADR-001: MCP Integration Architecture
  - ADR-002: Real-time Communication Architecture
  - ADR-003: Background Task Architecture
  - ADR-004: LLM Provider Abstraction
  - ADR-005: Technology Stack Selection
- Each ADR details the context, decision drivers, considered options, final decisions, and implementation plans.
- Documentation aligns technical choices with architecture principles and system requirements for Syndarix.
2025-12-29 13:16:02 +01:00
parent a6a336b66e
commit 6e3cdebbfb
7 changed files with 1565 additions and 0 deletions
--- a/docs/adrs/ADR-003-background-task-architecture.md
+++ b/docs/adrs/ADR-003-background-task-architecture.md
@@ -0,0 +1,179 @@
+# ADR-003: Background Task Architecture
+
+**Status:** Accepted
+**Date:** 2025-12-29
+**Deciders:** Architecture Team
+**Related Spikes:** SPIKE-004
+
+---
+
+## Context
+
+Syndarix requires background task processing for:
+- Agent actions (LLM calls, code generation)
+- Git operations (clone, commit, push, PR creation)
+- External synchronization (issue sync with Gitea/GitHub/GitLab)
+- CI/CD pipeline triggers
+- Long-running workflows (sprints, story implementation)
+
+These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring.
+
+## Decision Drivers
+
+- **Reliability:** Tasks must complete even if workers restart
+- **Visibility:** Progress tracking for long-running operations
+- **Scalability:** Handle concurrent agent operations
+- **Rate Limiting:** Respect LLM API rate limits
+- **Async Compatibility:** Work with async FastAPI
+
+## Considered Options
+
+### Option 1: FastAPI BackgroundTasks
+Use FastAPI's built-in background tasks.
+
+**Pros:**
+- Simple, no additional infrastructure
+- Direct async integration
+
+**Cons:**
+- No persistence (lost on restart)
+- No retry mechanism
+- No distributed workers
+
+### Option 2: Celery + Redis (Selected)
+Use Celery for task queue with Redis as broker/backend.
+
+**Pros:**
+- Mature, battle-tested
+- Persistent task queue
+- Built-in retry with backoff
+- Distributed workers
+- Task chaining and workflows
+- Monitoring with Flower
+
+**Cons:**
+- Additional infrastructure
+- Sync-only task execution (bridge needed for async)
+
+### Option 3: Dramatiq + Redis
+Use Dramatiq as a simpler Celery alternative.
+
+**Pros:**
+- Simpler API than Celery
+- Good async support
+
+**Cons:**
+- Less mature ecosystem
+- Fewer monitoring tools
+
+### Option 4: ARQ (Async Redis Queue)
+Use ARQ for native async task processing.
+
+**Pros:**
+- Native async
+- Simple API
+
+**Cons:**
+- Less feature-rich
+- Smaller community
+
+## Decision
+
+**Adopt Option 2: Celery + Redis.**
+
+Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.
+
+## Implementation
+
+### Queue Architecture
+
+```
+┌─────────────────────────────────────────────────┐
+│                 Redis (Broker + Backend)         │
+├─────────────┬─────────────┬─────────────────────┤
+│ agent_queue │  git_queue  │     sync_queue      │
+│ (prefetch=1)│ (prefetch=4)│    (prefetch=4)     │
+└──────┬──────┴──────┬──────┴──────────┬──────────┘
+       │             │                 │
+       ▼             ▼                 ▼
+  ┌─────────┐  ┌─────────┐       ┌─────────┐
+  │ Agent   │  │  Git    │       │  Sync   │
+  │ Workers │  │ Workers │       │ Workers │
+  └─────────┘  └─────────┘       └─────────┘
+```
+
+The diagram shows three of the four queues; `cicd_queue`, listed in the table below, is presumably served by its own workers in the same way.
+
+### Queue Configuration
+
+| Queue | Prefetch | Concurrency | Purpose |
+|-------|----------|-------------|---------|
+| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
+| `git_queue` | 4 | 8 | Git operations |
+| `sync_queue` | 4 | 4 | External sync |
+| `cicd_queue` | 4 | 4 | Pipeline operations |
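The table above could translate into worker start commands along these lines. This is a minimal sketch, assuming the Prefetch column maps to Celery's prefetch multiplier, that each queue gets its own worker, and that the Celery app lives in a module named `app.celery_app` (a hypothetical path). `QueueSpec` and `worker_command` are illustrative names; `-Q`, `--concurrency`, and `--prefetch-multiplier` are standard `celery worker` options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSpec:
    name: str
    prefetch: int
    concurrency: int


# Values copied from the queue configuration table.
QUEUES = [
    QueueSpec("agent_queue", prefetch=1, concurrency=4),
    QueueSpec("git_queue", prefetch=4, concurrency=8),
    QueueSpec("sync_queue", prefetch=4, concurrency=4),
    QueueSpec("cicd_queue", prefetch=4, concurrency=4),
]

APP_MODULE = "app.celery_app"  # hypothetical module path, not from the ADR


def worker_command(spec: QueueSpec, app: str = APP_MODULE) -> str:
    """Build the shell command that starts one worker bound to one queue."""
    return (
        f"celery -A {app} worker -Q {spec.name} "
        f"--concurrency={spec.concurrency} "
        f"--prefetch-multiplier={spec.prefetch}"
    )


if __name__ == "__main__":
    for spec in QUEUES:
        print(worker_command(spec))
```

Keeping `agent_queue` at a prefetch multiplier of 1 means a worker process reserves only one message at a time, so long LLM tasks do not hold back queued work that another process could pick up.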
+
+### Task Patterns
+
+**Progress Reporting:**
+```python
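+# Assumes `celery_app`, `event_bus`, `plan_steps`, and `execute_step` are
+# defined elsewhere in the project; `plan_steps` is a hypothetical helper.
+@celery_app.task(bind=True)
+def implement_story(self, story_id: str, agent_id: str, project_id: str):
+    steps = plan_steps(story_id)  # hypothetical: returns the ordered work items
+    total = len(steps)
+    for current, step in enumerate(steps, start=1):
+        # Store progress in the result backend so clients can poll it
+        self.update_state(
+            state="PROGRESS",
+            meta={"current": current, "total": total}
+        )
+        # Publish SSE event for real-time UI update
+        event_bus.publish(f"project:{project_id}", {
+            "type": "agent_progress",
+            "step": current,
+            "total": total
+        })
+        execute_step(step)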
+```
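The `event_bus` object is not defined in this ADR. For unit-testing the progress loop, an in-memory stand-in may be enough. The sketch below assumes only that the bus exposes `publish(channel, payload)` as used above; `InMemoryEventBus`, `subscribe`, and `report_progress` are hypothetical names, and a production bus would more likely sit on Redis pub/sub.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]


class InMemoryEventBus:
    """Hypothetical stand-in for the project's event bus (tests only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, payload: dict[str, Any]) -> None:
        # Deliver synchronously to every handler registered on the channel.
        for handler in self._subscribers[channel]:
            handler(payload)


def report_progress(bus: InMemoryEventBus, project_id: str, steps: list[str]) -> None:
    """Emit one progress event per step, using the payload shape shown above."""
    total = len(steps)
    for current, _step in enumerate(steps, start=1):
        bus.publish(
            f"project:{project_id}",
            {"type": "agent_progress", "step": current, "total": total},
        )


received: list[dict[str, Any]] = []
bus = InMemoryEventBus()
bus.subscribe("project:demo", received.append)
report_progress(bus, "demo", ["plan", "code", "test"])
```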
+
+**Task Chaining:**
+```python
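+from celery import chain
+
+# Each task's return value is passed as the first argument to the next task
+workflow = chain(
+    analyze_requirements.s(story_id),
+    design_solution.s(),
+    implement_code.s(),
+    run_tests.s(),
+    create_pr.s()
+)
+workflow.apply_async()  # enqueue the chain; building it alone runs nothing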
+```
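In a Celery `chain`, each task's return value becomes the first argument of the next signature, which is why only the first `.s()` call receives `story_id`. That data flow can be sketched without Celery as a plain function pipeline. The stage bodies below are placeholders that only reuse the task names for readability, and `run_pipeline` is a hypothetical helper.

```python
from functools import reduce
from typing import Any, Callable


def run_pipeline(initial: Any, stages: list[Callable[[Any], Any]]) -> Any:
    """Feed each stage's return value into the next stage, like a chain."""
    return reduce(lambda value, stage: stage(value), stages, initial)


# Placeholder stages: each receives the previous stage's result.
def analyze_requirements(story_id: str) -> dict:
    return {"story_id": story_id, "log": ["analyzed"]}


def design_solution(context: dict) -> dict:
    return {**context, "log": context["log"] + ["designed"]}


def implement_code(context: dict) -> dict:
    return {**context, "log": context["log"] + ["implemented"]}


result = run_pipeline(
    "STORY-1",
    [analyze_requirements, design_solution, implement_code],
)
```

One consequence of this design is that every task in the chain must return something the next task can consume, so the shape of the shared context is worth agreeing on early.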
+
+### Monitoring
+
+- **Flower:** Web UI for task monitoring (port 5555)
+- **Prometheus:** Metrics export for alerting
+- **Dead Letter Queue:** Failed tasks for investigation
+
+## Consequences
+
+### Positive
+- Reliable task execution with persistence
+- Retry with exponential backoff, configured per task
+- Progress tracking for long operations
+- Distributed workers for scalability
+- Rich monitoring and debugging tools
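In Celery, retries are opt-in per task, typically through the task options `autoretry_for`, `retry_backoff`, `retry_backoff_max`, and `max_retries`. Ignoring jitter, the resulting delay schedule is a capped doubling sequence. The sketch below reproduces that arithmetic with illustrative values; `backoff_delay` is a hypothetical helper, not a Celery function.

```python
def backoff_delay(retries: int, factor: float = 1.0, maximum: float = 600.0) -> float:
    """Delay in seconds before retry number `retries` (0-based), without jitter."""
    return min(maximum, factor * (2 ** retries))


# Illustrative values: start at 2 seconds, cap at 60 seconds.
schedule = [backoff_delay(n, factor=2.0, maximum=60.0) for n in range(6)]

if __name__ == "__main__":
    print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The cap matters for rate-limited LLM calls: without it, a handful of retries would push the next attempt out by many minutes.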
+
+### Negative
+- Additional infrastructure (Redis, workers)
+- Celery tasks run synchronously, so async calls need an event-loop bridge
+- Learning curve for task patterns
+
+### Mitigation
+- Use existing Redis instance (already needed for SSE)
+- Wrap async calls with `asyncio.run()` or asgiref's `async_to_sync`
+- Document common task patterns
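A minimal sketch of the `asyncio.run()` bridge mentioned above: a synchronous function, standing in for a Celery task body, drives a coroutine to completion. `fetch_completion` and `call_llm_task` are hypothetical names, and the coroutine is a placeholder for an async LLM client call.

```python
import asyncio


async def fetch_completion(prompt: str) -> str:
    """Placeholder for an async LLM client call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"echo: {prompt}"


def call_llm_task(prompt: str) -> str:
    """Synchronous entry point, as a Celery task body would be."""
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop before returning the result.
    return asyncio.run(fetch_completion(prompt))


if __name__ == "__main__":
    print(call_llm_task("hello"))
```

Because each call gets a new event loop, loop-bound resources such as async connection pools cannot be reused across tasks with this approach. `asyncio.run()` also raises `RuntimeError` if an event loop is already running in the same thread.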
+
+## Compliance
+
+This decision aligns with:
+- FR-304: Long-running implementation workflow
+- NFR-102: 500+ background jobs per minute
+- NFR-402: Task reliability and fault tolerance
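As a rough sanity check on NFR-102, the concurrency figures in the queue configuration table imply an upper bound on average task duration. The calculation below assumes exactly one worker per queue at the listed concurrency and an even spread of jobs across all slots; the ADR states neither, so the result is only an order-of-magnitude guide.

```python
# Concurrency per queue, copied from the queue configuration table.
CONCURRENCY = {
    "agent_queue": 4,
    "git_queue": 8,
    "sync_queue": 4,
    "cicd_queue": 4,
}
TARGET_JOBS_PER_MINUTE = 500  # NFR-102

# Total number of tasks that can run at the same moment.
total_slots = sum(CONCURRENCY.values())

# Each slot offers 60 seconds of work per minute; dividing the total by the
# target job count gives the longest average duration that still meets it.
max_avg_seconds = total_slots * 60 / TARGET_JOBS_PER_MINUTE

if __name__ == "__main__":
    print(f"{total_slots} slots, max average task time {max_avg_seconds:.1f}s")
```

Under these assumptions the 20 slots sustain 500 jobs per minute only if tasks average 2.4 seconds or less. LLM-backed tasks on `agent_queue` are likely to run longer, so meeting the target would probably depend on adding workers or on most jobs being short.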
+
+---
+
+*This ADR supersedes any previous decisions regarding background task processing.*