feat(mcp): Implement LLM Gateway MCP Server #56

Closed
opened 2026-01-03 01:24:54 +00:00 by cardosofelipe · 1 comment

Summary

Implement the LLM Gateway MCP server, which provides unified access to multiple LLM providers with intelligent routing, failover, and cost tracking. This is the highest-priority MCP server, as all agent interactions depend on it.

Sub-Tasks

1. Project Setup

  • Initialize FastMCP project in mcp-servers/llm-gateway/
  • Create pyproject.toml with dependencies
  • Add fastmcp>=0.4.0, litellm>=1.50.0, redis>=5.0.0
  • Create Docker configuration (Dockerfile, .dockerignore)
  • Add to docker-compose.dev.yml
  • Create README.md with setup instructions

2. LiteLLM Integration (providers.py)

  • Configure Anthropic provider (Claude Opus 4.5, Sonnet 4, Haiku 3.5)
  • Configure OpenAI provider (GPT-5.1 variants, Codex)
  • Configure Google provider (Gemini 3 Pro, Flash)
  • Configure DeepSeek provider (V3.2, Coder)
  • Add provider health check functions
  • Implement provider-specific error handling
  • Add rate limit tracking per provider
  • Support custom API base URLs (for proxies)
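A minimal sketch of a per-provider health check built on litellm's async completion call. The HEALTH_CHECK_MODELS mapping, the probe model identifiers, and the returned dict shape are illustrative assumptions, not the final design:

# Hypothetical provider health check: send a one-token probe and report latency.
import asyncio
import time

import litellm

HEALTH_CHECK_MODELS = {
    "anthropic": "anthropic/claude-3-5-haiku-20241022",  # assumed probe models
    "openai": "openai/gpt-4o-mini",
}

async def check_provider(provider: str, timeout: float = 10.0) -> dict:
    """Probe one provider and return health, latency, or the error."""
    model = HEALTH_CHECK_MODELS[provider]
    start = time.monotonic()
    try:
        await asyncio.wait_for(
            litellm.acompletion(
                model=model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            ),
            timeout=timeout,
        )
        return {"provider": provider, "healthy": True,
                "latency_ms": (time.monotonic() - start) * 1000}
    except Exception as exc:  # report any failure as unhealthy
        return {"provider": provider, "healthy": False, "error": str(exc)}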

3. Model Group Configuration (models.py)

  • Define reasoning group: Claude Opus 4.5 → GPT-5.1 → Gemini 3 Pro
  • Define code group: Claude Sonnet 4 → Codex Max → DeepSeek Coder
  • Define fast group: Claude Haiku 3.5 → GPT-5.1 Mini → Gemini Flash
  • Define vision group: Claude Opus 4.5 → GPT-5.1 Vision → Gemini Pro Vision
  • Define embedding group: text-embedding-3-large → ada-002
  • Add model metadata (context window, pricing, capabilities)
  • Create model selection logic based on task requirements
  • Support custom model groups via configuration
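A minimal sketch of the model group registry described above. The ModelSpec fields mirror the metadata bullet; the model identifier strings are placeholders, not confirmed litellm names:

# Hypothetical model metadata and group ordering; failover walks each list left to right.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str                  # provider/model identifier passed to litellm
    context_window: int        # maximum tokens
    input_cost_per_1k: float   # USD per 1K input tokens
    output_cost_per_1k: float  # USD per 1K output tokens
    capabilities: frozenset[str]

MODEL_GROUPS: dict[str, list[str]] = {
    "reasoning": ["claude-opus-4-5", "gpt-5.1", "gemini-3-pro"],
    "code": ["claude-sonnet-4", "codex-max", "deepseek-coder"],
    "fast": ["claude-haiku-3-5", "gpt-5.1-mini", "gemini-flash"],
    "vision": ["claude-opus-4-5", "gpt-5.1-vision", "gemini-pro-vision"],
    "embedding": ["text-embedding-3-large", "text-embedding-ada-002"],
}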

4. Failover Chain (failover.py)

  • Implement primary model attempt with configurable timeout
  • Add automatic fallback to secondary model on failure
  • Add tertiary fallback as last resort
  • Track failure counts per model
  • Implement circuit breaker (5 failures = 30s cooldown)
  • Add circuit breaker half-open state for recovery testing
  • Log failover events with reason
  • Emit failover metrics to EventBus
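A minimal sketch of the circuit breaker policy above, using the 5-failure / 30-second thresholds from the checklist; class and method names are illustrative:

# Hypothetical circuit breaker: open after N failures, re-probe after the cooldown.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: allow probe traffic for recovery testing
        return False     # open: fail fast without calling the provider

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # (re)start the cooldown window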

5. Routing Logic (routing.py)

  • Create ModelRouter class
  • Implement model selection by group name
  • Add capability-based routing (vision, long-context, etc.)
  • Implement cost-aware routing (prefer cheaper when equivalent)
  • Add latency-aware routing (prefer faster models)
  • Support explicit model override in requests
  • Add request queuing for rate-limited providers
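A minimal sketch of ModelRouter's selection order: an explicit override wins, otherwise the group's preference list is filtered by required capabilities and circuit breaker state. ModelSpec and CircuitBreaker refer to the sketches above; cost- and latency-aware tie-breaking would refine the returned order:

# Hypothetical routing core; returns candidates in failover order.
class ModelRouter:
    def __init__(self, groups: dict[str, list[str]],
                 specs: dict[str, "ModelSpec"],
                 breakers: dict[str, "CircuitBreaker"]) -> None:
        self.groups = groups
        self.specs = specs
        self.breakers = breakers

    def select(self, group: str, required: frozenset[str] = frozenset(),
               override: str | None = None) -> list[str]:
        """Return candidate models for this request, in failover order."""
        if override:
            return [override]
        return [
            model for model in self.groups[group]
            if required <= self.specs[model].capabilities
            and self.breakers[model].allow_request()
        ]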

6. Cost Tracking (cost_tracking.py)

  • Define cost per 1K tokens for each model (input/output)
  • Create UsageRecord model with all cost fields
  • Implement real-time cost calculation per request
  • Store usage in Redis with TTL (30 days)
  • Aggregate costs by project_id
  • Aggregate costs by agent_id
  • Aggregate costs by user_id
  • Create cost report generation functions
  • Emit cost events to EventBus for real-time dashboard
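A minimal sketch of per-request cost calculation and Redis storage using the redis>=5 async client. The key layout, field names, and the ModelSpec lookup are assumptions:

# Hypothetical usage recording: per-request record with TTL plus a running aggregate.
import json
import time

import redis.asyncio as redis

USAGE_TTL_S = 30 * 24 * 3600  # 30 days, per the checklist

async def record_usage(r: redis.Redis, spec: "ModelSpec", project_id: str,
                       agent_id: str, input_tokens: int, output_tokens: int) -> float:
    cost = (input_tokens / 1000) * spec.input_cost_per_1k \
         + (output_tokens / 1000) * spec.output_cost_per_1k
    record = {
        "model": spec.name, "project_id": project_id, "agent_id": agent_id,
        "input_tokens": input_tokens, "output_tokens": output_tokens,
        "cost_usd": cost, "ts": time.time(),
    }
    key = f"usage:{project_id}:{int(time.time() * 1000)}"  # assumed key scheme
    await r.set(key, json.dumps(record), ex=USAGE_TTL_S)
    # Running per-project aggregate for cheap dashboard reads.
    await r.hincrbyfloat(f"usage:agg:{project_id}", "cost_usd", cost)
    return cost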

7. Token Usage Logging (usage.py)

  • Count input tokens using tiktoken
  • Count output tokens from response
  • Log usage to structured logger
  • Store usage records in Redis
  • Create usage analytics functions
  • Add token budget enforcement (optional per-request limit)
  • Implement project-level token budgets
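A minimal sketch of input-token counting with tiktoken. The cl100k_base encoding is an assumption; each provider tokenizes differently, so counts are estimates used for budgeting:

# Hypothetical prompt token estimate from message contents.
import tiktoken

_ENCODING = tiktoken.get_encoding("cl100k_base")

def count_input_tokens(messages: list[dict]) -> int:
    """Approximate prompt tokens by encoding each message's content."""
    return sum(len(_ENCODING.encode(m.get("content", ""))) for m in messages)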

8. MCP Tools Implementation (server.py)

  • Implement complete tool (non-streaming)
    • Accept messages, model_group, max_tokens, temperature
    • Accept project_id, agent_id for tracking
    • Return completion with usage stats
  • Implement stream_complete tool (streaming)
    • Yield tokens as they arrive
    • Handle streaming errors gracefully
    • Calculate final usage after stream completes
  • Implement get_usage tool
    • Filter by project_id, date range
    • Return aggregated token counts and costs
  • Implement health_check tool
    • Check each provider status
    • Return latency per provider
    • Return circuit breaker states
  • Implement list_models tool
    • Return available models with capabilities
    • Include current health status

9. Streaming Support (streaming.py)

  • Create async generator for token streaming
  • Handle partial response chunks
  • Implement stream cancellation
  • Add timeout for stalled streams
  • Buffer tokens for word-level streaming
  • Handle provider-specific streaming formats
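A minimal sketch of the stream wrapper with a stall timeout. It assumes litellm's OpenAI-style streaming chunks (choices[0].delta.content) and Python 3.10+ for aiter/anext:

# Hypothetical streaming wrapper: forward text deltas, abort if the stream stalls.
import asyncio
from collections.abc import AsyncGenerator

import litellm

async def stream_tokens(model: str, messages: list[dict],
                        stall_timeout: float = 30.0) -> AsyncGenerator[str, None]:
    response = await litellm.acompletion(model=model, messages=messages, stream=True)
    iterator = aiter(response)
    while True:
        try:
            chunk = await asyncio.wait_for(anext(iterator), timeout=stall_timeout)
        except StopAsyncIteration:
            return  # stream finished normally
        except asyncio.TimeoutError as exc:
            raise TimeoutError(f"stream from {model} stalled") from exc
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta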

10. Error Handling

  • Create LLMError base exception
  • Create ProviderError for provider failures
  • Create RateLimitError for rate limiting
  • Create ContextLengthError for token overflow
  • Create ContentFilterError for blocked content
  • Map provider errors to standard errors
  • Add retry logic for transient errors
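A minimal sketch of the exception hierarchy above plus a retry wrapper for transient errors; the backoff parameters and which errors count as transient are illustrative choices:

# Hypothetical error hierarchy and retry helper.
import asyncio

class LLMError(Exception):
    """Base class for all gateway errors."""

class ProviderError(LLMError):
    """The upstream provider returned an unexpected failure."""

class RateLimitError(LLMError):
    """The provider rejected the request due to rate limiting."""

class ContextLengthError(LLMError):
    """The prompt exceeded the model's context window."""

class ContentFilterError(LLMError):
    """The provider blocked the content."""

async def with_retries(fn, *, attempts: int = 3, base_delay: float = 1.0):
    """Retry transient errors with exponential backoff; re-raise everything else."""
    for attempt in range(attempts):
        try:
            return await fn()
        except (ProviderError, RateLimitError):
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)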

11. Configuration

  • Create config.yaml for provider settings
  • Support environment variable overrides for API keys
  • Add model group configuration file
  • Add cost configuration file (updateable)
  • Implement hot-reload for configuration changes
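A minimal sketch of config loading with environment-variable overrides for API keys. The config.yaml layout and the PROVIDER_API_KEY naming convention are assumptions:

# Hypothetical config loader: YAML for non-secret settings, env vars win for keys.
import os

import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}
    for provider in config.get("providers", {}):
        env_var = f"{provider.upper()}_API_KEY"  # e.g. ANTHROPIC_API_KEY (assumed)
        if env_var in os.environ:
            config["providers"][provider]["api_key"] = os.environ[env_var]
    return config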

12. Docker & Deployment

  • Create optimized Dockerfile (multi-stage build)
  • Add health check endpoint for container orchestration
  • Configure environment variables in docker-compose
  • Add resource limits (memory, CPU)
  • Create startup/shutdown hooks

13. Testing

  • Create unit tests for providers.py
  • Create unit tests for failover.py
  • Create unit tests for routing.py
  • Create unit tests for cost_tracking.py
  • Create integration tests with LiteLLM mocks
  • Create E2E test with real API (manual, not CI)
  • Add load testing script
  • Achieve >90% code coverage
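A minimal sketch of a failover unit test using AsyncMock stand-ins for provider calls (pytest-asyncio assumed). FakeProviderError and the inline call_with_failover are local stand-ins for illustration; real tests would exercise failover.py directly:

# Hypothetical unit test: primary fails, secondary succeeds, order is asserted.
from unittest.mock import AsyncMock

import pytest

class FakeProviderError(Exception):
    """Stand-in for the gateway's ProviderError."""

async def call_with_failover(candidates):
    """Try each candidate in order, returning the first successful response."""
    last_error = None
    for candidate in candidates:
        try:
            return await candidate()
        except FakeProviderError as exc:
            last_error = exc
    raise last_error

@pytest.mark.asyncio
async def test_failover_uses_secondary_model():
    primary = AsyncMock(side_effect=FakeProviderError("primary down"))
    secondary = AsyncMock(return_value={"content": "ok"})

    result = await call_with_failover([primary, secondary])

    assert result == {"content": "ok"}
    primary.assert_awaited_once()
    secondary.assert_awaited_once()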

14. Documentation

  • Document all MCP tools with examples
  • Create provider setup guide
  • Document failover behavior
  • Add cost tracking documentation
  • Create troubleshooting guide

Technical Specifications

MCP Tools

# Tool signatures for the gateway server. Message, CompletionResult, StreamChunk,
# UsageReport, ProviderHealth, and ModelInfo are domain models (e.g. Pydantic)
# defined elsewhere in the server package.
from collections.abc import AsyncGenerator

from fastmcp import FastMCP

mcp = FastMCP("llm-gateway")

@mcp.tool()
async def complete(
    messages: list[Message],
    model_group: str = "reasoning",
    max_tokens: int = 4096,
    temperature: float = 0.7,
    project_id: str | None = None,
    agent_id: str | None = None,
) -> CompletionResult:
    """Generate LLM completion with automatic failover."""

@mcp.tool()
async def stream_complete(
    messages: list[Message],
    model_group: str = "reasoning",
    max_tokens: int = 4096,
    temperature: float = 0.7,
    project_id: str | None = None,
    agent_id: str | None = None,
) -> AsyncGenerator[StreamChunk, None]:
    """Stream LLM completion tokens."""

@mcp.tool()
async def get_usage(
    project_id: str | None = None,
    start_date: str | None = None,
    end_date: str | None = None,
) -> UsageReport:
    """Get token and cost usage report."""

@mcp.tool()
async def health_check() -> ProviderHealth:
    """Check health of all LLM providers."""

@mcp.tool()
async def list_models() -> list[ModelInfo]:
    """List available models with capabilities."""

Model Pricing (per 1K tokens)

| Model | Input | Output |
|-------|-------|--------|
| Claude Opus 4.5 | $0.015 | $0.075 |
| Claude Sonnet 4 | $0.003 | $0.015 |
| Claude Haiku 3.5 | $0.00025 | $0.00125 |
| GPT-5.1 Turbo | $0.01 | $0.03 |
| Gemini 3 Pro | $0.00125 | $0.005 |
| DeepSeek V3.2 | $0.0001 | $0.0002 |
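
For example, a Claude Sonnet 4 request with 2,000 input tokens and 500 output tokens costs 2 × $0.003 + 0.5 × $0.015 = $0.0135.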

Acceptance Criteria

  • LiteLLM integration working with all 4 providers
  • Model group routing selects appropriate models
  • Failover chain activates on provider failure
  • Circuit breaker prevents cascading failures
  • Cost tracking logs all token usage accurately
  • Streaming completions work correctly
  • Health check reports provider status
  • All MCP tools documented and working
  • Unit tests >90% coverage
  • Integration tests with mock providers
  • Docker container builds and runs
  • Documentation complete

Dependencies

  • Depends on: #55 (MCP Client Infrastructure)
  • Blocks: Phase 3 Agent Orchestration, #57 (Knowledge Base needs embeddings)

Assignable To

backend-engineer agent

cardosofelipe added the mcp, phase-2, priority:high labels 2026-01-03 01:25:45 +00:00
cardosofelipe (Author, Owner) commented:

Implementation complete! PR #71 is ready for review.

Summary:

  • All 4 MCP tools implemented: chat_completion, list_models, get_usage, count_tokens
  • Multi-provider failover with circuit breaker pattern
  • Redis-based cost tracking per project/agent
  • 209 tests passing with 92.35% coverage
  • Multi-sweep code review completed (all 5 sweeps passed)

PR: https://gitea.pragmazest.com/cardosofelipe/syndarix/pulls/71
