# ADR-004: LLM Provider Abstraction

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-005

---

## Context

Syndarix agents require access to large language models (LLMs) from multiple providers:

- **Anthropic** (Claude Opus 4.5) - Primary provider, highest reasoning capability
- **Google** (Gemini 3 Pro/Flash) - Strong multimodal, fast inference
- **OpenAI** (GPT 5.1 Codex max) - Code generation specialist
- **Alibaba** (Qwen3-235B) - Cost-effective alternative
- **DeepSeek** (V3.2) - Open-weights, self-hostable option

We need a unified abstraction layer that provides:

- Consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance

## Decision Drivers

- **Reliability:** Automatic failover on provider outages
- **Cost Control:** Track and limit API spending
- **Flexibility:** Easy to add or swap providers
- **Consistency:** Single interface for all agents
- **Async Support:** Compatible with async FastAPI

## Considered Options

### Option 1: Direct Provider SDKs

Use each provider's SDK (Anthropic, OpenAI, etc.) directly behind a custom abstraction layer.

**Pros:**

- Full control over implementation
- No external dependencies

**Cons:**

- Significant development effort
- Must maintain failover logic ourselves
- Must track token costs manually

### Option 2: LiteLLM (Selected)

Use LiteLLM as the unified abstraction layer.

**Pros:**

- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Built-in cost tracking
- Redis caching support
- Active community

**Cons:**

- External dependency
- May lag behind provider SDK updates

### Option 3: LangChain

Use LangChain's LLM abstraction.

**Pros:**

- Large ecosystem
- Many integrations

**Cons:**

- Heavy dependency
- Overkill when only an LLM abstraction is needed
- Complexity overhead

## Decision

**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**

LiteLLM provides the reliability, monitoring, and multi-provider support we need with minimal overhead. The unified interface is illustrated in the sketch below.
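As a minimal, non-normative sketch, the snippet below shows the same `litellm.acompletion` call targeting two different providers; only the model string changes. The model identifiers are placeholders assumed for illustration and must be verified against LiteLLM's provider documentation; API keys are read from the standard provider environment variables.

```python
# Sketch only: LiteLLM's unified completion API across providers.
# NOTE: the model identifier strings are placeholders; verify the exact
# provider/model names against the LiteLLM provider docs.
import asyncio

import litellm


async def main() -> None:
    messages = [{"role": "user", "content": "Summarize ADR-004 in one sentence."}]

    # Same call shape regardless of provider; only the model string changes.
    anthropic_response = await litellm.acompletion(
        model="anthropic/claude-opus-4-5",  # placeholder identifier
        messages=messages,
    )
    gemini_response = await litellm.acompletion(
        model="gemini/gemini-3-flash",  # placeholder identifier
        messages=messages,
    )

    # Both responses follow the OpenAI-style schema that LiteLLM normalizes to.
    print(anthropic_response.choices[0].message.content)
    print(gemini_response.choices[0].message.content)


if __name__ == "__main__":
    asyncio.run(main())
```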
## Implementation

### Model Groups

| Group Name | Use Case | Primary Model | Fallback Chain |
|------------|----------|---------------|----------------|
| `high-reasoning` | Complex analysis, architecture | Claude Opus 4.5 | GPT 5.1 Codex max → Gemini 3 Pro |
| `code-generation` | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 → DeepSeek V3.2 |
| `fast-response` | Quick tasks, simple queries | Gemini 3 Flash | Qwen3-235B → DeepSeek V3.2 |
| `cost-optimized` | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 (self-hosted) |
| `self-hosted` | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |

### Failover Chain (Primary)

```
Claude Opus 4.5 (Anthropic)
  │
  ▼ (on failure/rate limit)
GPT 5.1 Codex max (OpenAI)
  │
  ▼ (on failure/rate limit)
Gemini 3 Pro (Google)
  │
  ▼ (on failure/rate limit)
Qwen3-235B (Alibaba/Self-hosted)
  │
  ▼ (on failure)
DeepSeek V3.2 (Self-hosted)
  │
  ▼ (all failed)
Error with exponential backoff retry
```

### LLM Gateway Service

The gateway wraps a LiteLLM `Router` and records usage per agent and project. A non-normative sketch of the `model_list` and fallback configuration it assumes appears in the appendix at the end of this ADR.

```python
from litellm import Router


class LLMGateway:
    def __init__(self):
        # model_list maps the model-group names above to concrete provider
        # deployments (see the appendix for a configuration sketch).
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # Route to the preferred model group; the Router handles failover.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Record token usage and cost against the calling agent and project.
        await self._track_usage(agent_id, project_id, response)
        return response
```

### Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|-------|-----------------------|------------------------|-------|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or self-host: $0) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |

### Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|------------|------------------|-----------|
| Product Owner | high-reasoning | Complex requirements analysis needs Claude Opus 4.5 |
| Software Architect | high-reasoning | Architecture decisions need top-tier reasoning |
| Software Engineer | code-generation | GPT 5.1 Codex max optimized for code |
| QA Engineer | code-generation | Test code generation |
| DevOps Engineer | fast-response | Config generation (Gemini 3 Flash) |
| Project Manager | fast-response | Status updates, quick responses |
| Business Analyst | high-reasoning | Document analysis needs strong reasoning |

### Caching Strategy

- **Redis-backed cache** for repeated queries
- **TTL:** 1 hour for general queries
- **Skip cache:** For context-dependent generation
- **Cache key:** Hash of (model, messages, temperature)

## Consequences

### Positive

- Single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs

### Negative

- Dependency on the LiteLLM library
- May lag behind provider SDK features
- Additional abstraction layer

### Mitigation

- Pin the LiteLLM version; test before upgrades
- Direct SDK access remains available if needed
- Monitor LiteLLM updates for breaking changes

## Compliance

This decision aligns with:

- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation
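## Appendix: Router Configuration Sketch (Non-Normative)

The following is a minimal sketch of the `model_list` and fallback configuration that the gateway above assumes. The deployment entries, model identifier strings, the local `api_base`, and the environment variable handling are illustrative assumptions only and must be verified against LiteLLM's routing documentation and each provider's actual model names.

```python
# Non-normative sketch: expressing the model groups above as a LiteLLM
# Router model_list. Model identifiers and the local api_base are placeholders.
import os

from litellm import Router

model_list = [
    # "high-reasoning" group: entries sharing a model_name are treated by the
    # Router as interchangeable deployments for that group.
    {
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "anthropic/claude-opus-4-5",  # placeholder identifier
            "api_key": os.environ.get("ANTHROPIC_API_KEY"),
        },
    },
    {
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "openai/gpt-5.1-codex-max",  # placeholder identifier
            "api_key": os.environ.get("OPENAI_API_KEY"),
        },
    },
    # "local-fallback" group: self-hosted DeepSeek behind an OpenAI-compatible
    # endpoint (the api_base is an assumption about the local deployment).
    {
        "model_name": "local-fallback",
        "litellm_params": {
            "model": "openai/deepseek-v3.2",  # placeholder identifier
            "api_base": "http://localhost:8000/v1",
            "api_key": "not-needed",
        },
    },
]

router = Router(
    model_list=model_list,
    # If every "high-reasoning" deployment fails, fall back to the self-hosted
    # group, mirroring the failover chain described in the Implementation section.
    fallbacks=[{"high-reasoning": ["local-fallback"]}],
    routing_strategy="latency-based-routing",
    num_retries=3,
)
```

Each additional group from the Model Groups table would get its own set of entries in the same way; the Router load-balances across entries that share a `model_name` and only consults `fallbacks` when all deployments in the requested group fail.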
---

*This ADR supersedes any previous decisions regarding LLM integration.*