# ADR-004: LLM Provider Abstraction

- **Status:** Accepted
- **Date:** 2025-12-29
- **Deciders:** Architecture Team
- **Related Spikes:** SPIKE-005

## Context
Syndarix agents require access to large language models (LLMs) from multiple providers:
- Anthropic (Claude) - Primary provider
- OpenAI (GPT-4) - Fallback provider
- Local models (Ollama/Llama) - Cost optimization, privacy
We need a unified abstraction layer that provides:
- Consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance
## Decision Drivers
- Reliability: Automatic failover on provider outages
- Cost Control: Track and limit API spending
- Flexibility: Easy to add/swap providers
- Consistency: Single interface for all agents
- Async Support: Compatible with async FastAPI
## Considered Options

### Option 1: Direct Provider SDKs

Use the Anthropic and OpenAI SDKs directly, behind a custom in-house abstraction.
Pros:
- Full control over implementation
- No external dependencies
Cons:
- Significant development effort
- Must maintain failover logic
- Must track token costs manually
### Option 2: LiteLLM (Selected)

Use LiteLLM as a unified abstraction layer.
Pros:
- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Cost tracking built-in
- Redis caching support
- Active community
Cons:
- External dependency
- May lag behind provider SDK updates
### Option 3: LangChain
Use LangChain's LLM abstraction.
Pros:
- Large ecosystem
- Many integrations
Cons:
- Heavy dependency
- Overkill for just LLM abstraction
- Complexity overhead
## Decision

Adopt **Option 2: LiteLLM** for unified LLM provider abstraction.
LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.
## Implementation

### Model Groups

| Group Name | Use Case | Primary Model | Fallback |
|---|---|---|---|
| high-reasoning | Complex analysis, architecture | Claude 3.5 Sonnet | GPT-4 Turbo |
| fast-response | Quick tasks, simple queries | Claude 3 Haiku | GPT-4o Mini |
| cost-optimized | High-volume, non-critical | Local Llama 3 | Claude 3 Haiku |
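
As a rough sketch, these groups could be expressed as a LiteLLM Router `model_list`. The concrete model identifiers, environment-variable names, and the `local-fallback` alias (referenced by the gateway code below) are illustrative assumptions, not values prescribed by this ADR.

```python
import os

# Illustrative LiteLLM Router configuration for the model groups above.
model_list = [
    {
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "anthropic/claude-3-5-sonnet-20241022",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
    {
        # Same group name: an additional deployment the Router can route or fall back to.
        "model_name": "high-reasoning",
        "litellm_params": {
            "model": "openai/gpt-4-turbo",
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "fast-response",
        "litellm_params": {
            "model": "anthropic/claude-3-haiku-20240307",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
    {
        "model_name": "cost-optimized",
        "litellm_params": {
            "model": "ollama/llama3",
            "api_base": "http://localhost:11434",
        },
    },
    # A "local-fallback" alias used by the gateway's fallbacks could point at the
    # same Ollama deployment.
]
```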
### Failover Chain

```
Claude 3.5 Sonnet (Anthropic)
        │
        ▼ (on failure)
GPT-4 Turbo (OpenAI)
        │
        ▼ (on failure)
Llama 3 (Ollama/Local)
        │
        ▼ (on failure)
Error with retry
```
### LLM Gateway Service
```python
from litellm import Router


class LLMGateway:
    def __init__(self):
        # model_list is the Router configuration built from the Model Groups above
        # (defined at module level / loaded from settings).
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # Route to the preferred model group; LiteLLM handles retries and fallback.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Persist token usage and cost per agent/project (implemented elsewhere).
        await self._track_usage(agent_id, project_id, response)
        return response
```
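
For context, a usage sketch under the assumptions above (a module-level `model_list`, an existing `_track_usage` implementation, and purely illustrative agent/project identifiers):

```python
import asyncio


async def demo() -> None:
    gateway = LLMGateway()
    response = await gateway.complete(
        agent_id="agent-qa-1",          # hypothetical identifiers
        project_id="proj-syndarix",
        messages=[{"role": "user", "content": "Draft test cases for the login flow."}],
        model_preference="fast-response",
    )
    # LiteLLM returns an OpenAI-style response object.
    print(response.choices[0].message.content)


asyncio.run(demo())
```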
### Cost Tracking

| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Ollama (local) | $0.00 | $0.00 |
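
As a worked example of how these rates translate into per-request cost (the pricing constants and function name below are illustrative, not project code):

```python
# USD per 1M tokens, (input_rate, output_rate), from the table above.
PRICING = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o-mini": (0.15, 0.60),
    "ollama-llama3": (0.00, 0.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_rate, output_rate = PRICING[model]
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate


# e.g. a 2,000-token prompt with an 800-token completion on Claude 3.5 Sonnet:
# 0.002 * 3.00 + 0.0008 * 15.00 ≈ $0.018
print(request_cost("claude-3-5-sonnet", 2_000, 800))
```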
### Agent Type Mapping
| Agent Type | Model Preference | Rationale |
|---|---|---|
| Product Owner | high-reasoning | Complex requirements analysis |
| Software Architect | high-reasoning | Architecture decisions |
| Software Engineer | high-reasoning | Code generation |
| QA Engineer | fast-response | Test case generation |
| DevOps Engineer | fast-response | Config generation |
| Project Manager | fast-response | Status updates |
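
One way this mapping might be carried in code; the constant name and key spellings are hypothetical, not part of the ADR:

```python
# Hypothetical agent-type → model-group mapping, mirroring the table above.
AGENT_MODEL_PREFERENCE: dict[str, str] = {
    "product_owner": "high-reasoning",
    "software_architect": "high-reasoning",
    "software_engineer": "high-reasoning",
    "qa_engineer": "fast-response",
    "devops_engineer": "fast-response",
    "project_manager": "fast-response",
}

# An agent would then pass its preference into the gateway, e.g.:
# await gateway.complete(agent_id, project_id, messages,
#                        model_preference=AGENT_MODEL_PREFERENCE["qa_engineer"])
```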
### Caching Strategy
- Redis-backed cache for repeated queries
- TTL: 1 hour for general queries
- Skip cache: For context-dependent generation
- Cache key: Hash of (model, messages, temperature)
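
A minimal sketch of how such a cache key could be derived; the function name and key prefix are illustrative, and LiteLLM's built-in Redis caching could serve the same purpose:

```python
import hashlib
import json

# One-hour TTL for general queries, per the strategy above.
CACHE_TTL_SECONDS = 3600


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic cache key: hash of (model, messages, temperature)."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return "llm:cache:" + hashlib.sha256(payload.encode()).hexdigest()


# With redis-py, a lookup would then be roughly:
#   cached = await redis_client.get(cache_key(model, messages, temperature))
#   if cached is None:
#       response = await gateway.complete(...)   # skip cache for context-dependent calls
#       await redis_client.set(key, serialized_response, ex=CACHE_TTL_SECONDS)
```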
## Consequences

### Positive
- Single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs
### Negative
- Dependency on LiteLLM library
- May lag behind provider SDK features
- Additional abstraction layer
### Mitigation
- Pin LiteLLM version, test before upgrades
- Direct SDK access available if needed
- Monitor LiteLLM updates for breaking changes
## Compliance
This decision aligns with:
- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation
This ADR supersedes any previous decisions regarding LLM integration.