# ADR-004: LLM Provider Abstraction

**Status:** Accepted

**Date:** 2025-12-29

**Deciders:** Architecture Team

**Related Spikes:** SPIKE-005

---

## Context

Syndarix agents require access to large language models (LLMs) from multiple providers:

- **Anthropic** (Claude Opus 4.5) - Primary provider, highest reasoning capability
- **Google** (Gemini 3 Pro/Flash) - Strong multimodal, fast inference
- **OpenAI** (GPT 5.1 Codex max) - Code generation specialist
- **Alibaba** (Qwen3-235B) - Cost-effective alternative
- **DeepSeek** (V3.2) - Open-weights, self-hostable option

We need a unified abstraction layer that provides:

- Consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance

## Decision Drivers

- **Reliability:** Automatic failover on provider outages
- **Cost Control:** Track and limit API spending
- **Flexibility:** Easy to add/swap providers
- **Consistency:** Single interface for all agents
- **Async Support:** Compatible with async FastAPI

## Considered Options

### Option 1: Direct Provider SDKs

Use each provider's SDK (Anthropic, OpenAI, Google, etc.) directly behind a custom abstraction.

**Pros:**

- Full control over implementation
- No external dependencies

**Cons:**

- Significant development effort
- Must maintain failover logic
- Must track token costs manually

### Option 2: LiteLLM (Selected)

Use LiteLLM as a unified abstraction layer.

**Pros:**

- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Cost tracking built-in
- Redis caching support
- Active community

**Cons:**

- External dependency
- May lag behind provider SDK updates

### Option 3: LangChain

Use LangChain's LLM abstraction.

**Pros:**

- Large ecosystem
- Many integrations

**Cons:**

- Heavy dependency
- Overkill for just LLM abstraction
- Complexity overhead

## Decision

**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**

LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.

## Implementation

### Model Groups

| Group Name | Use Case | Primary Model | Fallback Chain |
|------------|----------|---------------|----------------|
| `high-reasoning` | Complex analysis, architecture | Claude Opus 4.5 | GPT 5.1 Codex max → Gemini 3 Pro |
| `code-generation` | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 → DeepSeek V3.2 |
| `fast-response` | Quick tasks, simple queries | Gemini 3 Flash | Qwen3-235B → DeepSeek V3.2 |
| `cost-optimized` | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 (self-hosted) |
| `self-hosted` | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |

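A minimal sketch of how these groups could be declared as a LiteLLM `model_list` (the model identifier strings and environment variable names are placeholders, not confirmed values):

```python
import os

# Each entry maps a group alias ("model_name") to a concrete provider deployment.
# Multiple entries may share an alias; LiteLLM load-balances across them.
model_list = [
    {
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "anthropic/claude-opus-4-5",  # placeholder model ID
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
    {
        "model_name": "code-generation",
        "litellm_params": {
            "model": "openai/gpt-5.1-codex-max",  # placeholder model ID
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "fast-response",
        "litellm_params": {
            "model": "gemini/gemini-3-flash",  # placeholder model ID
            "api_key": os.environ["GEMINI_API_KEY"],
        },
    },
]
```
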
### Failover Chain (Primary)

```
Claude Opus 4.5 (Anthropic)
    │
    ▼ (on failure/rate limit)
GPT 5.1 Codex max (OpenAI)
    │
    ▼ (on failure/rate limit)
Gemini 3 Pro (Google)
    │
    ▼ (on failure/rate limit)
Qwen3-235B (Alibaba/Self-hosted)
    │
    ▼ (on failure)
DeepSeek V3.2 (Self-hosted)
    │
    ▼ (all failed)
Error with exponential backoff retry
```

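The same ordering can be expressed through the Router's `fallbacks` parameter, assuming each model is also registered under its own alias in `model_list` (the aliases below are illustrative):

```python
# Illustrative fallback ordering for the primary chain; each alias must exist
# as a "model_name" entry in model_list. Fallback groups are tried left to
# right once the primary group fails or hits a rate limit.
failover_fallbacks = [
    {
        "claude-opus-4.5": [
            "gpt-5.1-codex-max",
            "gemini-3-pro",
            "qwen3-235b",
            "deepseek-v3.2",
        ],
    },
]

router_kwargs = {
    "fallbacks": failover_fallbacks,
    "num_retries": 3,  # retry attempts per call; backoff handled by LiteLLM
}
```
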
### LLM Gateway Service

```python
from litellm import Router


class LLMGateway:
    def __init__(self, model_list: list[dict]):
        # model_list carries the model group deployments from the table above.
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # LiteLLM resolves the group alias, applies fallbacks, and retries.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Persist token usage and cost per agent/project (see Cost Tracking).
        await self._track_usage(agent_id, project_id, response)
        return response
```

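As an illustration, an agent might call the gateway like this (agent and project identifiers are hypothetical, and `model_list` refers to the configuration sketched under Model Groups):

```python
import asyncio


async def main() -> None:
    gateway = LLMGateway(model_list=model_list)

    # A Software Engineer agent uses the "code-generation" group
    # (see Agent Type Mapping below).
    response = await gateway.complete(
        agent_id="agent-se-01",  # hypothetical identifiers
        project_id="proj-42",
        messages=[{"role": "user", "content": "Refactor this function for readability."}],
        model_preference="code-generation",
    )
    print(response.choices[0].message.content)


asyncio.run(main())
```
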
### Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|-------|----------------------|------------------------|-------|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or self-host: $0) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |

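Per-call cost follows directly from this table: (prompt tokens / 1M) × input price + (completion tokens / 1M) × output price. A minimal sketch of the calculation (the `MODEL_PRICES` dict and its keys are illustrative, not an existing module):

```python
# Per-1M-token prices (USD) from the Cost Tracking table above.
MODEL_PRICES = {
    "claude-opus-4.5": {"input": 15.00, "output": 75.00},
    "gpt-5.1-codex-max": {"input": 12.00, "output": 60.00},
    "gemini-3-pro": {"input": 3.50, "output": 10.50},
    "gemini-3-flash": {"input": 0.35, "output": 1.05},
    "qwen3-235b": {"input": 2.00, "output": 6.00},
    "deepseek-v3.2": {"input": 0.00, "output": 0.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single completion call."""
    prices = MODEL_PRICES[model]
    return (input_tokens / 1_000_000) * prices["input"] + (
        output_tokens / 1_000_000
    ) * prices["output"]


# Example: 10k prompt tokens + 2k completion tokens on Claude Opus 4.5
# -> $0.15 + $0.15 = $0.30
assert abs(call_cost("claude-opus-4.5", 10_000, 2_000) - 0.30) < 1e-9
```
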
### Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|------------|------------------|-----------|
| Product Owner | high-reasoning | Complex requirements analysis needs Claude Opus 4.5 |
| Software Architect | high-reasoning | Architecture decisions need top-tier reasoning |
| Software Engineer | code-generation | GPT 5.1 Codex max optimized for code |
| QA Engineer | code-generation | Test code generation |
| DevOps Engineer | fast-response | Config generation (Gemini 3 Flash) |
| Project Manager | fast-response | Status updates, quick responses |
| Business Analyst | high-reasoning | Document analysis needs strong reasoning |

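A minimal sketch of how this mapping might be carried in code (the dictionary, key names, and fallback choice are assumptions, not an existing API):

```python
# Mirrors the Agent Type Mapping table above.
AGENT_MODEL_PREFERENCES: dict[str, str] = {
    "product_owner": "high-reasoning",
    "software_architect": "high-reasoning",
    "software_engineer": "code-generation",
    "qa_engineer": "code-generation",
    "devops_engineer": "fast-response",
    "project_manager": "fast-response",
    "business_analyst": "high-reasoning",
}


def model_preference_for(agent_type: str) -> str:
    # Assumption: unknown agent types fall back to the high-reasoning group.
    return AGENT_MODEL_PREFERENCES.get(agent_type, "high-reasoning")
```
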
### Caching Strategy

- **Redis-backed cache** for repeated queries
- **TTL:** 1 hour for general queries
- **Skip cache:** For context-dependent generation
- **Cache key:** Hash of (model, messages, temperature); see the sketch below

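A minimal sketch of the cache key derivation described above (the `llm-cache:` prefix and SHA-256 choice are assumptions; LiteLLM's built-in Redis cache handles this internally when enabled):

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic Redis key for a completion request."""
    # Canonical JSON so that equivalent requests hash to the same key.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm-cache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```
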
## Consequences

### Positive

- Single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs

### Negative

- Dependency on LiteLLM library
- May lag behind provider SDK features
- Additional abstraction layer

### Mitigation

- Pin LiteLLM version, test before upgrades
- Direct SDK access available if needed
- Monitor LiteLLM updates for breaking changes

## Compliance

This decision aligns with:

- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation

---

*This ADR supersedes any previous decisions regarding LLM integration.*