forked from cardosofelipe/pragma-stack
docs: add architecture decision records (ADRs) for key technical choices
- Added the following ADRs to `docs/adrs/` directory: - ADR-001: MCP Integration Architecture - ADR-002: Real-time Communication Architecture - ADR-003: Background Task Architecture - ADR-004: LLM Provider Abstraction - ADR-005: Technology Stack Selection - Each ADR details the context, decision drivers, considered options, final decisions, and implementation plans. - Documentation aligns technical choices with architecture principles and system requirements for Syndarix.
This commit is contained in:
189
docs/adrs/ADR-004-llm-provider-abstraction.md
Normal file
189
docs/adrs/ADR-004-llm-provider-abstraction.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# ADR-004: LLM Provider Abstraction
|
||||
|
||||
**Status:** Accepted
|
||||
**Date:** 2025-12-29
|
||||
**Deciders:** Architecture Team
|
||||
**Related Spikes:** SPIKE-005
|
||||
|
||||
---
|
||||
|
||||
## Context
|
||||
|
||||
Syndarix agents require access to large language models (LLMs) from multiple providers:
|
||||
- **Anthropic** (Claude) - Primary provider
|
||||
- **OpenAI** (GPT-4) - Fallback provider
|
||||
- **Local models** (Ollama/Llama) - Cost optimization, privacy
|
||||
|
||||
We need a unified abstraction layer that provides:
|
||||
- Consistent API across providers
|
||||
- Automatic failover on errors
|
||||
- Usage tracking and cost management
|
||||
- Rate limiting compliance
|
||||
|
||||
## Decision Drivers
|
||||
|
||||
- **Reliability:** Automatic failover on provider outages
|
||||
- **Cost Control:** Track and limit API spending
|
||||
- **Flexibility:** Easy to add/swap providers
|
||||
- **Consistency:** Single interface for all agents
|
||||
- **Async Support:** Compatible with async FastAPI
|
||||
|
||||
## Considered Options
|
||||
|
||||
### Option 1: Direct Provider SDKs
|
||||
Use Anthropic and OpenAI SDKs directly with custom abstraction.
|
||||
|
||||
**Pros:**
|
||||
- Full control over implementation
|
||||
- No external dependencies
|
||||
|
||||
**Cons:**
|
||||
- Significant development effort
|
||||
- Must maintain failover logic
|
||||
- Must track token costs manually
|
||||
|
||||
### Option 2: LiteLLM (Selected)
|
||||
Use LiteLLM as unified abstraction layer.
|
||||
|
||||
**Pros:**
|
||||
- Unified API for 100+ providers
|
||||
- Built-in failover and routing
|
||||
- Automatic token counting
|
||||
- Cost tracking built-in
|
||||
- Redis caching support
|
||||
- Active community
|
||||
|
||||
**Cons:**
|
||||
- External dependency
|
||||
- May lag behind provider SDK updates
|
||||
|
||||
### Option 3: LangChain
|
||||
Use LangChain's LLM abstraction.
|
||||
|
||||
**Pros:**
|
||||
- Large ecosystem
|
||||
- Many integrations
|
||||
|
||||
**Cons:**
|
||||
- Heavy dependency
|
||||
- Overkill for just LLM abstraction
|
||||
- Complexity overhead
|
||||
|
||||
## Decision
|
||||
|
||||
**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**
|
||||
|
||||
LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.
|
||||
|
||||
## Implementation
|
||||
|
||||
### Model Groups
|
||||
|
||||
| Group Name | Use Case | Primary Model | Fallback |
|
||||
|------------|----------|---------------|----------|
|
||||
| `high-reasoning` | Complex analysis, architecture | Claude 3.5 Sonnet | GPT-4 Turbo |
|
||||
| `fast-response` | Quick tasks, simple queries | Claude 3 Haiku | GPT-4o Mini |
|
||||
| `cost-optimized` | High-volume, non-critical | Local Llama 3 | Claude 3 Haiku |
|
||||
|
||||
### Failover Chain
|
||||
|
||||
```
|
||||
Claude 3.5 Sonnet (Anthropic)
|
||||
│
|
||||
▼ (on failure)
|
||||
GPT-4 Turbo (OpenAI)
|
||||
│
|
||||
▼ (on failure)
|
||||
Llama 3 (Ollama/Local)
|
||||
│
|
||||
▼ (on failure)
|
||||
Error with retry
|
||||
```
|
||||
|
||||
### LLM Gateway Service
|
||||
|
||||
```python
|
||||
class LLMGateway:
|
||||
def __init__(self):
|
||||
self.router = Router(
|
||||
model_list=model_list,
|
||||
fallbacks=[
|
||||
{"high-reasoning": ["high-reasoning", "local-fallback"]},
|
||||
],
|
||||
routing_strategy="latency-based-routing",
|
||||
num_retries=3,
|
||||
)
|
||||
|
||||
async def complete(
|
||||
self,
|
||||
agent_id: str,
|
||||
project_id: str,
|
||||
messages: list[dict],
|
||||
model_preference: str = "high-reasoning",
|
||||
) -> dict:
|
||||
response = await self.router.acompletion(
|
||||
model=model_preference,
|
||||
messages=messages,
|
||||
)
|
||||
await self._track_usage(agent_id, project_id, response)
|
||||
return response
|
||||
```
|
||||
|
||||
### Cost Tracking
|
||||
|
||||
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|
||||
|-------|----------------------|------------------------|
|
||||
| Claude 3.5 Sonnet | $3.00 | $15.00 |
|
||||
| Claude 3 Haiku | $0.25 | $1.25 |
|
||||
| GPT-4 Turbo | $10.00 | $30.00 |
|
||||
| GPT-4o Mini | $0.15 | $0.60 |
|
||||
| Ollama (local) | $0.00 | $0.00 |
|
||||
|
||||
### Agent Type Mapping
|
||||
|
||||
| Agent Type | Model Preference | Rationale |
|
||||
|------------|------------------|-----------|
|
||||
| Product Owner | high-reasoning | Complex requirements analysis |
|
||||
| Software Architect | high-reasoning | Architecture decisions |
|
||||
| Software Engineer | high-reasoning | Code generation |
|
||||
| QA Engineer | fast-response | Test case generation |
|
||||
| DevOps Engineer | fast-response | Config generation |
|
||||
| Project Manager | fast-response | Status updates |
|
||||
|
||||
### Caching Strategy
|
||||
|
||||
- **Redis-backed cache** for repeated queries
|
||||
- **TTL:** 1 hour for general queries
|
||||
- **Skip cache:** For context-dependent generation
|
||||
- **Cache key:** Hash of (model, messages, temperature)
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
- Single interface for all LLM operations
|
||||
- Automatic failover improves reliability
|
||||
- Built-in cost tracking and budgeting
|
||||
- Easy to add new providers
|
||||
- Caching reduces API costs
|
||||
|
||||
### Negative
|
||||
- Dependency on LiteLLM library
|
||||
- May lag behind provider SDK features
|
||||
- Additional abstraction layer
|
||||
|
||||
### Mitigation
|
||||
- Pin LiteLLM version, test before upgrades
|
||||
- Direct SDK access available if needed
|
||||
- Monitor LiteLLM updates for breaking changes
|
||||
|
||||
## Compliance
|
||||
|
||||
This decision aligns with:
|
||||
- FR-101: Agent type model configuration
|
||||
- NFR-103: Agent response time targets
|
||||
- NFR-402: Failover requirements
|
||||
- TR-001: LLM API unavailability mitigation
|
||||
|
||||
---
|
||||
|
||||
*This ADR supersedes any previous decisions regarding LLM integration.*
|
||||
Reference in New Issue
Block a user