syndarix/docs/adrs/ADR-004-llm-provider-abstraction.md
Felipe Cardoso 88cf4e0abc feat: Update to production model stack and fix remaining inconsistencies
## Model Stack Updates (User's Actual Models)

Updated all documentation to reflect production models:
- Claude Opus 4.5 (primary reasoning)
- GPT 5.1 Codex max (code generation specialist)
- Gemini 3 Pro/Flash (multimodal, fast inference)
- Qwen3-235B (cost-effective, self-hostable)
- DeepSeek V3.2 (self-hosted, open weights)

### Files Updated:
- ADR-004: Full model groups, failover chains, cost tables
- ADR-007: Code example with correct model identifiers
- ADR-012: Cost tracking with new model prices
- ARCHITECTURE.md: Model groups, failover diagram
- IMPLEMENTATION_ROADMAP.md: External services list

## Architecture Diagram Updates

- Added LangGraph Runtime to orchestration layer
- Added technology labels (Type-Instance, transitions)

## Self-Hostability Table Expanded

Added entries for:
- LangGraph (MIT)
- transitions (MIT)
- DeepSeek V3.2 (MIT)
- Qwen3-235B (Apache 2.0)

## Metric Alignments

- Response time: Split into API (<200ms) and Agent (<10s/<60s)
- Cost per project: Adjusted to $100/sprint for Opus 4.5 pricing
- Added concurrent projects (10+) and agents (50+) metrics

## Infrastructure Updates

- Celery workers: 4-8 instances (was 2-4) across 4 queues
- MCP servers: Clarified Phase 2 + Phase 5 deployment
- Sync interval: Clarified 60s fallback + 15min reconciliation


ADR-004: LLM Provider Abstraction

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-005


Context

Syndarix agents require access to large language models (LLMs) from multiple providers:

  • Anthropic (Claude Opus 4.5) - Primary provider, highest reasoning capability
  • Google (Gemini 3 Pro/Flash) - Strong multimodal, fast inference
  • OpenAI (GPT 5.1 Codex max) - Code generation specialist
  • Alibaba (Qwen3-235B) - Cost-effective alternative
  • DeepSeek (V3.2) - Open-weights, self-hostable option

We need a unified abstraction layer that provides:

  • Consistent API across providers
  • Automatic failover on errors
  • Usage tracking and cost management
  • Rate limiting compliance

Decision Drivers

  • Reliability: Automatic failover on provider outages
  • Cost Control: Track and limit API spending
  • Flexibility: Easy to add/swap providers
  • Consistency: Single interface for all agents
  • Async Support: Compatible with async FastAPI

Considered Options

Option 1: Direct Provider SDKs

Use each provider's SDK (Anthropic, OpenAI, Google, etc.) directly, wrapped in a custom in-house abstraction.

Pros:

  • Full control over implementation
  • No external dependencies

Cons:

  • Significant development effort
  • Must maintain failover logic
  • Must track token costs manually

Option 2: LiteLLM (Selected)

Use LiteLLM as a unified abstraction layer.

Pros:

  • Unified API for 100+ providers
  • Built-in failover and routing
  • Automatic token counting
  • Cost tracking built-in
  • Redis caching support
  • Active community

Cons:

  • External dependency
  • May lag behind provider SDK updates

Option 3: LangChain

Use LangChain's LLM abstraction.

Pros:

  • Large ecosystem
  • Many integrations

Cons:

  • Heavy dependency
  • Overkill for just LLM abstraction
  • Complexity overhead

Decision

Adopt Option 2: LiteLLM for unified LLM provider abstraction.

LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.

Implementation

Model Groups

| Group Name | Use Case | Primary Model | Fallback Chain |
|---|---|---|---|
| high-reasoning | Complex analysis, architecture | Claude Opus 4.5 | GPT 5.1 Codex max → Gemini 3 Pro |
| code-generation | Code writing, refactoring | GPT 5.1 Codex max | Claude Opus 4.5 → DeepSeek V3.2 |
| fast-response | Quick tasks, simple queries | Gemini 3 Flash | Qwen3-235B → DeepSeek V3.2 |
| cost-optimized | High-volume, non-critical | Qwen3-235B | DeepSeek V3.2 (self-hosted) |
| self-hosted | Privacy-sensitive, air-gapped | DeepSeek V3.2 | Qwen3-235B |
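
As a rough sketch of how these groups could be wired into LiteLLM, each group name becomes a model_name alias in the Router's model_list, with one entry per concrete deployment. The provider identifier strings and environment variable names below are illustrative assumptions, not confirmed values:

import os

# Illustrative LiteLLM model_list: one entry per (group alias, deployment) pair.
model_list = [
    {
        "model_name": "high-reasoning",            # group alias used by agents
        "litellm_params": {
            "model": "anthropic/claude-opus-4-5",  # assumed provider identifier
            "api_key": os.environ.get("ANTHROPIC_API_KEY"),
        },
    },
    {
        "model_name": "code-generation",
        "litellm_params": {
            "model": "openai/gpt-5.1-codex-max",   # assumed provider identifier
            "api_key": os.environ.get("OPENAI_API_KEY"),
        },
    },
    {
        "model_name": "fast-response",
        "litellm_params": {
            "model": "gemini/gemini-3-flash",      # assumed provider identifier
            "api_key": os.environ.get("GEMINI_API_KEY"),
        },
    },
]

The remaining groups (cost-optimized, self-hosted) follow the same pattern, pointing at the Qwen3-235B and DeepSeek V3.2 deployments.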

Failover Chain (Primary)

Claude Opus 4.5 (Anthropic)
         │
         ▼ (on failure/rate limit)
    GPT 5.1 Codex max (OpenAI)
         │
         ▼ (on failure/rate limit)
    Gemini 3 Pro (Google)
         │
         ▼ (on failure/rate limit)
    Qwen3-235B (Alibaba/Self-hosted)
         │
         ▼ (on failure)
    DeepSeek V3.2 (Self-hosted)
         │
         ▼ (all failed)
    Error with exponential backoff retry
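
Expressed as LiteLLM Router configuration, this chain could look roughly like the following, assuming each model is also registered in model_list under its own alias (the alias strings here are illustrative):

# Fallback order for the primary chain; keys and values are model_list aliases.
primary_fallbacks = [
    {
        "claude-opus-4-5": [
            "gpt-5-1-codex-max",
            "gemini-3-pro",
            "qwen3-235b",
            "deepseek-v3-2",
        ],
    },
]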

LLM Gateway Service

from litellm import Router

# model_list maps the group aliases above (high-reasoning, code-generation, ...)
# to concrete provider deployments; it is loaded from configuration.

class LLMGateway:
    def __init__(self):
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        # Route to the preferred model group; the Router handles failover and retries.
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        # Record token usage and cost for budgeting (see Cost Tracking below).
        await self._track_usage(agent_id, project_id, response)
        return response
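
A call site might look like this; the agent and project identifiers are illustrative:

async def review_architecture(gateway: LLMGateway) -> dict:
    # Uses the high-reasoning group; the Router selects the concrete deployment.
    return await gateway.complete(
        agent_id="software-architect-01",
        project_id="project-42",
        messages=[{"role": "user", "content": "Review this service boundary."}],
        model_preference="high-reasoning",
    )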

Cost Tracking

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or self-host: $0) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |
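
A minimal sketch of the _track_usage hook referenced in the gateway above, using LiteLLM's built-in per-call cost calculation; the usage_repo persistence helper is an assumed component, not an existing API:

import litellm

class LLMGateway:
    # ... Router setup as above ...

    async def _track_usage(self, agent_id: str, project_id: str, response) -> None:
        # LiteLLM derives the cost from its pricing map; self-hosted models
        # (DeepSeek V3.2, Qwen3-235B) may need custom prices registered.
        cost_usd = litellm.completion_cost(completion_response=response)
        usage = response.usage  # prompt_tokens / completion_tokens
        await self.usage_repo.record(  # assumed persistence layer
            agent_id=agent_id,
            project_id=project_id,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
            cost_usd=cost_usd,
        )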

Agent Type Mapping

| Agent Type | Model Preference | Rationale |
|---|---|---|
| Product Owner | high-reasoning | Complex requirements analysis needs Claude Opus 4.5 |
| Software Architect | high-reasoning | Architecture decisions need top-tier reasoning |
| Software Engineer | code-generation | GPT 5.1 Codex max optimized for code |
| QA Engineer | code-generation | Test code generation |
| DevOps Engineer | fast-response | Config generation (Gemini 3 Flash) |
| Project Manager | fast-response | Status updates, quick responses |
| Business Analyst | high-reasoning | Document analysis needs strong reasoning |
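
This mapping can be kept as a simple lookup from agent type to model group; the agent-type keys below are illustrative identifiers:

# Agent type → model group preference (keys are illustrative).
AGENT_MODEL_PREFERENCE: dict[str, str] = {
    "product_owner": "high-reasoning",
    "software_architect": "high-reasoning",
    "software_engineer": "code-generation",
    "qa_engineer": "code-generation",
    "devops_engineer": "fast-response",
    "project_manager": "fast-response",
    "business_analyst": "high-reasoning",
}

The gateway's model_preference argument is then filled from this table when an agent issues a request.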

Caching Strategy

  • Redis-backed cache for repeated queries
  • TTL: 1 hour for general queries
  • Skip cache: For context-dependent generation
  • Cache key: Hash of (model, messages, temperature)
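
A rough sketch of this setup with LiteLLM's Redis-backed cache; connection details are illustrative and the exact parameter names should be checked against the LiteLLM caching documentation:

import os
import litellm
from litellm.caching import Cache

# Enable the shared Redis response cache. LiteLLM keys entries on the request
# contents (model, messages, sampling params), matching the key scheme above.
litellm.cache = Cache(
    type="redis",
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=os.environ.get("REDIS_PORT", "6379"),
)

# Context-dependent generations can opt out per call, e.g.:
#   await router.acompletion(model="high-reasoning", messages=msgs, caching=False)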

Consequences

Positive

  • Single interface for all LLM operations
  • Automatic failover improves reliability
  • Built-in cost tracking and budgeting
  • Easy to add new providers
  • Caching reduces API costs

Negative

  • Dependency on LiteLLM library
  • May lag behind provider SDK features
  • Additional abstraction layer

Mitigation

  • Pin LiteLLM version, test before upgrades
  • Direct SDK access available if needed
  • Monitor LiteLLM updates for breaking changes

Compliance

This decision aligns with:

  • FR-101: Agent type model configuration
  • NFR-103: Agent response time targets
  • NFR-402: Failover requirements
  • TR-001: LLM API unavailability mitigation

This ADR supersedes any previous decisions regarding LLM integration.