# ADR-004: LLM Provider Abstraction

**Status:** Accepted

**Date:** 2025-12-29

**Deciders:** Architecture Team

**Related Spikes:** SPIKE-005

---

## Context

Syndarix agents require access to large language models (LLMs) from multiple providers:

- **Anthropic** (Claude Opus 4.5) - Primary provider, highest reasoning capability
- **Google** (Gemini 3 Pro/Flash) - Strong multimodal, fast inference
- **OpenAI** (GPT 5.1 Codex max) - Code generation specialist
- **Alibaba** (Qwen3-235B) - Cost-effective alternative
- **DeepSeek** (V3.2) - Open-weights, self-hostable option

We need a unified abstraction layer that provides:

- Consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance

## Decision Drivers

- **Reliability:** Automatic failover on provider outages
- **Cost Control:** Track and limit API spending
- **Flexibility:** Easy to add/swap providers
- **Consistency:** Single interface for all agents
- **Async Support:** Compatible with async FastAPI

## Considered Options

### Option 1: Direct Provider SDKs

Use each provider's SDK (Anthropic, OpenAI, Google, etc.) directly behind a custom abstraction.

**Pros:**

- Full control over implementation
- No external dependencies

**Cons:**

- Significant development effort
- Must maintain failover logic
- Must track token costs manually

### Option 2: LiteLLM (Selected)

Use LiteLLM as a unified abstraction layer.

**Pros:**

- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Cost tracking built-in
- Redis caching support
- Active community

**Cons:**

- External dependency
- May lag behind provider SDK updates

### Option 3: LangChain

Use LangChain's LLM abstraction.

**Pros:**

- Large ecosystem
- Many integrations

**Cons:**

- Heavy dependency
- Overkill for just LLM abstraction
- Complexity overhead

## Decision

**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**

LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.

## Implementation

### Model Groups

| Group Name | Use Case | Primary Model | Fallback Chain |
|------------|----------|---------------|----------------|
| `high-reasoning` | Complex analysis, architecture | Claude Opus 4.5 | GPT 5.1 Codex max → Gemini 3 Pro |
| `code-generation` | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 → DeepSeek V3.2 |
| `fast-response` | Quick tasks, simple queries | Gemini 3 Flash | Qwen3-235B → DeepSeek V3.2 |
| `cost-optimized` | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 (self-hosted) |
| `self-hosted` | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |

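A minimal sketch of how these groups could be declared as a LiteLLM `model_list` (the model identifier strings and environment variable names are placeholders, not confirmed values):

```python
import os

# Each entry maps a group alias ("model_name") to a concrete provider deployment.
# Multiple entries may share an alias; LiteLLM load-balances across them.
model_list = [
    {
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "anthropic/claude-opus-4-5",  # placeholder model ID
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
    {
        "model_name": "code-generation",
        "litellm_params": {
            "model": "openai/gpt-5.1-codex-max",  # placeholder model ID
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "fast-response",
        "litellm_params": {
            "model": "gemini/gemini-3-flash",  # placeholder model ID
            "api_key": os.environ["GEMINI_API_KEY"],
        },
    },
]
```
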
### Failover Chain (Primary)

```
Claude Opus 4.5 (Anthropic)
    │
    ▼ (on failure/rate limit)
GPT 5.1 Codex max (OpenAI)
    │
    ▼ (on failure/rate limit)
Gemini 3 Pro (Google)
    │
    ▼ (on failure/rate limit)
Qwen3-235B (Alibaba/Self-hosted)
    │
    ▼ (on failure)
DeepSeek V3.2 (Self-hosted)
    │
    ▼ (all failed)
Error with exponential backoff retry
```

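The same ordering can be expressed through the Router's `fallbacks` parameter, assuming each model is also registered under its own alias in `model_list` (the aliases below are illustrative):

```python
# Illustrative fallback ordering for the primary chain; each alias must exist
# as a "model_name" entry in model_list. Fallback groups are tried left to
# right once the primary group fails or hits a rate limit.
failover_fallbacks = [
    {
        "claude-opus-4.5": [
            "gpt-5.1-codex-max",
            "gemini-3-pro",
            "qwen3-235b",
            "deepseek-v3.2",
        ],
    },
]

router_kwargs = {
    "fallbacks": failover_fallbacks,
    "num_retries": 3,  # retry attempts per call; backoff handled by LiteLLM
}
```
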
### LLM Gateway Service

```python
from litellm import Router


class LLMGateway:
    def __init__(self, model_list: list[dict]):
        # model_list carries the model group deployments from the table above.
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # LiteLLM resolves the group alias, applies fallbacks, and retries.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Persist token usage and cost per agent/project (see Cost Tracking).
        await self._track_usage(agent_id, project_id, response)
        return response
```

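As an illustration, an agent might call the gateway like this (agent and project identifiers are hypothetical, and `model_list` refers to the configuration sketched under Model Groups):

```python
import asyncio


async def main() -> None:
    gateway = LLMGateway(model_list=model_list)

    # A Software Engineer agent uses the "code-generation" group
    # (see Agent Type Mapping below).
    response = await gateway.complete(
        agent_id="agent-se-01",  # hypothetical identifiers
        project_id="proj-42",
        messages=[{"role": "user", "content": "Refactor this function for readability."}],
        model_preference="code-generation",
    )
    print(response.choices[0].message.content)


asyncio.run(main())
```
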
### Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|-------|----------------------|------------------------|-------|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or self-host: $0) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |

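Per-call cost follows directly from this table: (prompt tokens / 1M) × input price + (completion tokens / 1M) × output price. A minimal sketch of the calculation (the `MODEL_PRICES` dict and its keys are illustrative, not an existing module):

```python
# Per-1M-token prices (USD) from the Cost Tracking table above.
MODEL_PRICES = {
    "claude-opus-4.5": {"input": 15.00, "output": 75.00},
    "gpt-5.1-codex-max": {"input": 12.00, "output": 60.00},
    "gemini-3-pro": {"input": 3.50, "output": 10.50},
    "gemini-3-flash": {"input": 0.35, "output": 1.05},
    "qwen3-235b": {"input": 2.00, "output": 6.00},
    "deepseek-v3.2": {"input": 0.00, "output": 0.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single completion call."""
    prices = MODEL_PRICES[model]
    return (input_tokens / 1_000_000) * prices["input"] + (
        output_tokens / 1_000_000
    ) * prices["output"]


# Example: 10k prompt tokens + 2k completion tokens on Claude Opus 4.5
# -> $0.15 + $0.15 = $0.30
assert abs(call_cost("claude-opus-4.5", 10_000, 2_000) - 0.30) < 1e-9
```
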
### Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|------------|------------------|-----------|
| Product Owner | high-reasoning | Complex requirements analysis needs Claude Opus 4.5 |
| Software Architect | high-reasoning | Architecture decisions need top-tier reasoning |
| Software Engineer | code-generation | GPT 5.1 Codex max optimized for code |
| QA Engineer | code-generation | Test code generation |
| DevOps Engineer | fast-response | Config generation (Gemini 3 Flash) |
| Project Manager | fast-response | Status updates, quick responses |
| Business Analyst | high-reasoning | Document analysis needs strong reasoning |

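A minimal sketch of how this mapping might be carried in code (the dictionary, key names, and fallback choice are assumptions, not an existing API):

```python
# Mirrors the Agent Type Mapping table above.
AGENT_MODEL_PREFERENCES: dict[str, str] = {
    "product_owner": "high-reasoning",
    "software_architect": "high-reasoning",
    "software_engineer": "code-generation",
    "qa_engineer": "code-generation",
    "devops_engineer": "fast-response",
    "project_manager": "fast-response",
    "business_analyst": "high-reasoning",
}


def model_preference_for(agent_type: str) -> str:
    # Assumption: unknown agent types fall back to the high-reasoning group.
    return AGENT_MODEL_PREFERENCES.get(agent_type, "high-reasoning")
```
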
### Caching Strategy

- **Redis-backed cache** for repeated queries
- **TTL:** 1 hour for general queries
- **Skip cache:** For context-dependent generation
- **Cache key:** Hash of (model, messages, temperature); see the sketch below

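A minimal sketch of the cache key derivation described above (the `llm-cache:` prefix and SHA-256 choice are assumptions; LiteLLM's built-in Redis cache handles this internally when enabled):

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic Redis key for a completion request."""
    # Canonical JSON so that equivalent requests hash to the same key.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm-cache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```
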
## Consequences

### Positive

- Single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs

### Negative

- Dependency on LiteLLM library
- May lag behind provider SDK features
- Additional abstraction layer

### Mitigation

- Pin LiteLLM version, test before upgrades
- Direct SDK access available if needed
- Monitor LiteLLM updates for breaking changes

## Compliance

This decision aligns with:

- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation

---

*This ADR supersedes any previous decisions regarding LLM integration.*