syndarix/docs/adrs/ADR-004-llm-provider-abstraction.md
Felipe Cardoso 88cf4e0abc feat: Update to production model stack and fix remaining inconsistencies
## Model Stack Updates (User's Actual Models)

Updated all documentation to reflect production models:
- Claude Opus 4.5 (primary reasoning)
- GPT 5.1 Codex max (code generation specialist)
- Gemini 3 Pro/Flash (multimodal, fast inference)
- Qwen3-235B (cost-effective, self-hostable)
- DeepSeek V3.2 (self-hosted, open weights)

### Files Updated:
- ADR-004: Full model groups, failover chains, cost tables
- ADR-007: Code example with correct model identifiers
- ADR-012: Cost tracking with new model prices
- ARCHITECTURE.md: Model groups, failover diagram
- IMPLEMENTATION_ROADMAP.md: External services list

## Architecture Diagram Updates

- Added LangGraph Runtime to orchestration layer
- Added technology labels (Type-Instance, transitions)

## Self-Hostability Table Expanded

Added entries for:
- LangGraph (MIT)
- transitions (MIT)
- DeepSeek V3.2 (MIT)
- Qwen3-235B (Apache 2.0)

## Metric Alignments

- Response time: Split into API (<200ms) and Agent (<10s/<60s)
- Cost per project: Adjusted to $100/sprint for Opus 4.5 pricing
- Added concurrent projects (10+) and agents (50+) metrics

## Infrastructure Updates

- Celery workers: 4-8 instances (was 2-4) across 4 queues
- MCP servers: Clarified Phase 2 + Phase 5 deployment
- Sync interval: Clarified 60s fallback + 15min reconciliation


ADR-004: LLM Provider Abstraction

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-005


Context

Syndarix agents require access to large language models (LLMs) from multiple providers:

  • Anthropic (Claude Opus 4.5) - Primary provider, highest reasoning capability
  • Google (Gemini 3 Pro/Flash) - Strong multimodal, fast inference
  • OpenAI (GPT 5.1 Codex max) - Code generation specialist
  • Alibaba (Qwen3-235B) - Cost-effective alternative
  • DeepSeek (V3.2) - Open-weights, self-hostable option

We need a unified abstraction layer that provides:

  • Consistent API across providers
  • Automatic failover on errors
  • Usage tracking and cost management
  • Rate limiting compliance

Decision Drivers

  • Reliability: Automatic failover on provider outages
  • Cost Control: Track and limit API spending
  • Flexibility: Easy to add/swap providers
  • Consistency: Single interface for all agents
  • Async Support: Compatible with async FastAPI

Considered Options

Option 1: Direct Provider SDKs

Use each provider's SDK (Anthropic, OpenAI, Google, etc.) directly, wrapped in a custom in-house abstraction.

Pros:

  • Full control over implementation
  • No external dependencies

Cons:

  • Significant development effort
  • Must maintain failover logic
  • Must track token costs manually

Option 2: LiteLLM (Selected)

Use LiteLLM as a unified abstraction layer.

Pros:

  • Unified API for 100+ providers
  • Built-in failover and routing
  • Automatic token counting
  • Cost tracking built-in
  • Redis caching support
  • Active community

Cons:

  • External dependency
  • May lag behind provider SDK updates

Option 3: LangChain

Use LangChain's LLM abstraction.

Pros:

  • Large ecosystem
  • Many integrations

Cons:

  • Heavy dependency
  • Overkill for just LLM abstraction
  • Complexity overhead

Decision

Adopt Option 2: LiteLLM for unified LLM provider abstraction.

LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.

Implementation

Model Groups

| Group Name | Use Case | Primary Model | Fallback Chain |
|---|---|---|---|
| high-reasoning | Complex analysis, architecture | Claude Opus 4.5 | GPT 5.1 Codex max → Gemini 3 Pro |
| code-generation | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 → DeepSeek V3.2 |
| fast-response | Quick tasks, simple queries | Gemini 3 Flash | Qwen3-235B → DeepSeek V3.2 |
| cost-optimized | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 (self-hosted) |
| self-hosted | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |
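
As a rough sketch of how these groups could be wired into LiteLLM, each group name becomes a model_name alias in the Router's model_list, with one entry per concrete deployment. The provider identifier strings and environment variable names below are illustrative assumptions, not confirmed values:

import os

# Illustrative LiteLLM model_list: one entry per (group alias, deployment) pair.
model_list = [
    {
        "model_name": "high-reasoning",            # group alias used by agents
        "litellm_params": {
            "model": "anthropic/claude-opus-4-5",  # assumed provider identifier
            "api_key": os.environ.get("ANTHROPIC_API_KEY"),
        },
    },
    {
        "model_name": "code-generation",
        "litellm_params": {
            "model": "openai/gpt-5.1-codex-max",   # assumed provider identifier
            "api_key": os.environ.get("OPENAI_API_KEY"),
        },
    },
    {
        "model_name": "fast-response",
        "litellm_params": {
            "model": "gemini/gemini-3-flash",      # assumed provider identifier
            "api_key": os.environ.get("GEMINI_API_KEY"),
        },
    },
]

The remaining groups (cost-optimized, self-hosted) follow the same pattern, pointing at the Qwen3-235B and DeepSeek V3.2 deployments.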

Failover Chain (Primary)

Claude Opus 4.5 (Anthropic)
         │
         ▼ (on failure/rate limit)
    GPT 5.1 Codex max (OpenAI)
         │
         ▼ (on failure/rate limit)
    Gemini 3 Pro (Google)
         │
         ▼ (on failure/rate limit)
    Qwen3-235B (Alibaba/Self-hosted)
         │
         ▼ (on failure)
    DeepSeek V3.2 (Self-hosted)
         │
         ▼ (all failed)
    Error with exponential backoff retry
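
Expressed as LiteLLM Router configuration, this chain could look roughly like the following, assuming each model is also registered in model_list under its own alias (the alias strings here are illustrative):

# Fallback order for the primary chain; keys and values are model_list aliases.
primary_fallbacks = [
    {
        "claude-opus-4-5": [
            "gpt-5-1-codex-max",
            "gemini-3-pro",
            "qwen3-235b",
            "deepseek-v3-2",
        ],
    },
]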

LLM Gateway Service

from litellm import Router

# model_list maps the group aliases above (high-reasoning, code-generation, ...)
# to concrete provider deployments; it is loaded from configuration.

class LLMGateway:
    def __init__(self):
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # Route to the preferred model group; the Router handles failover and retries.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Record token usage and cost for budgeting (see Cost Tracking below).
        await self._track_usage(agent_id, project_id, response)
        return response
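
A call site might look like this; the agent and project identifiers are illustrative:

async def review_architecture(gateway: LLMGateway) -> dict:
    # Uses the high-reasoning group; the Router selects the concrete deployment.
    return await gateway.complete(
        agent_id="software-architect-01",
        project_id="project-42",
        messages=[{"role": "user", "content": "Review this service boundary."}],
        model_preference="high-reasoning",
    )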

Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or self-host: $0) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |
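
A minimal sketch of the _track_usage hook referenced in the gateway above, using LiteLLM's built-in per-call cost calculation; the usage_repo persistence helper is an assumed component, not an existing API:

import litellm

class LLMGateway:
    # ... Router setup as above ...

    async def _track_usage(self, agent_id: str, project_id: str, response) -> None:
        # LiteLLM derives the cost from its pricing map; self-hosted models
        # (DeepSeek V3.2, Qwen3-235B) may need custom prices registered.
        cost_usd = litellm.completion_cost(completion_response=response)
        usage = response.usage  # prompt_tokens / completion_tokens
        await self.usage_repo.record(  # assumed persistence layer
            agent_id=agent_id,
            project_id=project_id,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
            cost_usd=cost_usd,
        )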

Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|---|---|---|
| Product Owner | high-reasoning | Complex requirements analysis needs Claude Opus 4.5 |
| Software Architect | high-reasoning | Architecture decisions need top-tier reasoning |
| Software Engineer | code-generation | GPT 5.1 Codex max optimized for code |
| QA Engineer | code-generation | Test code generation |
| DevOps Engineer | fast-response | Config generation (Gemini 3 Flash) |
| Project Manager | fast-response | Status updates, quick responses |
| Business Analyst | high-reasoning | Document analysis needs strong reasoning |
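
This mapping can be kept as a simple lookup from agent type to model group; the agent-type keys below are illustrative identifiers:

# Agent type → model group preference (keys are illustrative).
AGENT_MODEL_PREFERENCE: dict[str, str] = {
    "product_owner": "high-reasoning",
    "software_architect": "high-reasoning",
    "software_engineer": "code-generation",
    "qa_engineer": "code-generation",
    "devops_engineer": "fast-response",
    "project_manager": "fast-response",
    "business_analyst": "high-reasoning",
}

The gateway's model_preference argument is then filled from this table when an agent issues a request.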

Caching Strategy

  • Redis-backed cache for repeated queries
  • TTL: 1 hour for general queries
  • Skip cache: For context-dependent generation
  • Cache key: Hash of (model, messages, temperature)
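
A rough sketch of this setup with LiteLLM's Redis-backed cache; connection details are illustrative and the exact parameter names should be checked against the LiteLLM caching documentation:

import os
import litellm
from litellm.caching import Cache

# Enable the shared Redis response cache. LiteLLM keys entries on the request
# contents (model, messages, sampling params), matching the key scheme above.
litellm.cache = Cache(
    type="redis",
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=os.environ.get("REDIS_PORT", "6379"),
)

# Context-dependent generations can opt out per call, e.g.:
#   await router.acompletion(model="high-reasoning", messages=msgs, caching=False)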

Consequences

Positive

  • Single interface for all LLM operations
  • Automatic failover improves reliability
  • Built-in cost tracking and budgeting
  • Easy to add new providers
  • Caching reduces API costs

Negative

  • Dependency on LiteLLM library
  • May lag behind provider SDK features
  • Additional abstraction layer

Mitigation

  • Pin LiteLLM version, test before upgrades
  • Direct SDK access available if needed
  • Monitor LiteLLM updates for breaking changes

Compliance

This decision aligns with:

  • FR-101: Agent type model configuration
  • NFR-103: Agent response time targets
  • NFR-402: Failover requirements
  • TR-001: LLM API unavailability mitigation

This ADR supersedes any previous decisions regarding LLM integration.