# ADR-012: Cost Tracking and Budget Management

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-010

---

## Context

Syndarix agents make potentially expensive LLM API calls. Without proper cost tracking and budget enforcement, projects could incur unexpected charges. We need:
- Real-time cost visibility
- Per-project budget enforcement
- Cost optimization strategies
- Historical analytics

## Decision Drivers

- **Visibility:** Real-time cost tracking per agent/project
- **Control:** Budget enforcement with soft/hard limits
- **Optimization:** Identify and reduce unnecessary costs
- **Attribution:** Clear cost allocation for billing

## Decision

**Implement multi-layered cost tracking** using:
1. **LiteLLM Callbacks** for real-time usage capture
2. **Redis** for budget enforcement
3. **PostgreSQL** for persistent analytics
4. **SSE Events** for dashboard updates

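The fourth layer pushes cost updates to the dashboard over Server-Sent Events. As a rough illustration of what one such update could look like on the wire, the sketch below serializes a payload in SSE format. The event name `cost_update` and the payload fields are assumptions for illustration; the ADR does not define them.

```python
import json


def format_sse_event(event: str, data: dict) -> str:
    """Serialize one message in Server-Sent Events wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    # An SSE message is a set of "field: value" lines ended by a blank line.
    return f"event: {event}\ndata: {payload}\n\n"


message = format_sse_event(
    "cost_update",  # hypothetical event name
    {"project_id": "proj-123", "cost_usd": 0.3, "daily_total_usd": 12.45},
)
print(message, end="")
```

Keeping the payload on a single `data:` line avoids having to split JSON across multiple SSE fields.
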
## Implementation

### Cost Attribution Hierarchy

```
Organization (Billing Entity)
    └── Project (Cost Center)
        └── Sprint (Time-bounded Budget)
            └── Agent Instance (Worker)
                └── LLM Request (Atomic Cost Unit)
```

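Because every LLM request sits at the bottom of this hierarchy, a request tagged with the IDs of all its ancestors can be summed at any level. The following is a minimal sketch of that roll-up; the `LLMRequestCost` record, its field names, and the `roll_up` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LLMRequestCost:
    """One atomic cost unit, tagged with every level above it."""
    organization_id: str
    project_id: str
    sprint_id: str
    agent_id: str
    cost_usd: Decimal


def roll_up(requests: list[LLMRequestCost], level: str) -> dict[str, Decimal]:
    """Sum request costs by one hierarchy level, e.g. 'project_id'."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for request in requests:
        totals[getattr(request, level)] += request.cost_usd
    return dict(totals)


requests = [
    LLMRequestCost("org-1", "proj-a", "sprint-1", "agent-1", Decimal("0.30")),
    LLMRequestCost("org-1", "proj-a", "sprint-1", "agent-2", Decimal("0.12")),
    LLMRequestCost("org-1", "proj-b", "sprint-4", "agent-3", Decimal("0.05")),
]
print(roll_up(requests, "project_id"))
print(roll_up(requests, "organization_id"))
```

`Decimal` is used instead of `float` so that sums of small per-request costs stay exact.
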
### LiteLLM Callback

```python
from datetime import datetime, timezone

from litellm.integrations.custom_logger import CustomLogger


class SyndarixCostLogger(CustomLogger):
    """Records cost and token usage for every successful LLM call."""

    def __init__(self, budget_service, usage_queue):
        super().__init__()
        self.budget_service = budget_service
        self.usage_queue = usage_queue

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # LiteLLM normally nests request metadata under "litellm_params";
        # fall back to a top-level "metadata" key if it is absent.
        litellm_params = kwargs.get("litellm_params") or {}
        metadata = litellm_params.get("metadata") or kwargs.get("metadata") or {}
        agent_id = metadata.get("agent_id")
        project_id = metadata.get("project_id")
        model = kwargs.get("model")
        cost = kwargs.get("response_cost") or 0
        usage = response_obj.usage

        # Real-time budget counter (Redis)
        await self.budget_service.increment(
            project_id=project_id,
            cost=cost,
            tokens=usage.total_tokens,
        )

        # Persistent record (async queue to PostgreSQL)
        await self.usage_queue.enqueue({
            "agent_id": agent_id,
            "project_id": project_id,
            "model": model,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,  # NOT NULL in token_usage
            "cost_usd": cost,
            "timestamp": datetime.now(timezone.utc),
        })

        # Check budget status (BudgetStatus is defined under Budget Enforcement)
        budget_status = await self.budget_service.check_budget(project_id)
        if budget_status in (BudgetStatus.EXCEEDED, BudgetStatus.BLOCKED):
            await self.notify_budget_exceeded(project_id)
```

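The callback hands each usage record to an async queue rather than writing to PostgreSQL directly, which keeps database latency out of the LLM call path. The ADR does not specify how `usage_queue` works; the sketch below assumes an in-process `asyncio.Queue` that a background writer drains in batches. The `UsageQueue` class and its `drain_batch` method are hypothetical.

```python
import asyncio


class UsageQueue:
    """In-process buffer between the callback and the database writer."""

    def __init__(self, maxsize: int = 10_000):
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    async def enqueue(self, record: dict) -> None:
        await self._queue.put(record)

    async def drain_batch(self, max_items: int = 100) -> list[dict]:
        """Wait for one record, then take whatever else is ready, up to max_items."""
        batch = [await self._queue.get()]
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except asyncio.QueueEmpty:
                break
        return batch


async def demo() -> list[int]:
    queue = UsageQueue()
    for i in range(3):
        await queue.enqueue({"project_id": "proj-a", "cost_usd": 0.01 * i})
    first = await queue.drain_batch(max_items=2)
    second = await queue.drain_batch(max_items=2)
    return [len(first), len(second)]


batch_sizes = asyncio.run(demo())
print(batch_sizes)
```

A writer task could insert each batch into `token_usage` in a single statement. Note that an in-process queue loses buffered records if the worker crashes, so a durable queue may be preferable where billing accuracy matters.
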
### Budget Enforcement

```python
from enum import Enum


class BudgetStatus(str, Enum):
    OK = "ok"
    APPROACHING = "approaching"  # >= 50% of the daily limit
    WARNING = "warning"          # >= 80%
    EXCEEDED = "exceeded"        # >= 100%, soft enforcement
    BLOCKED = "blocked"          # >= 100%, hard enforcement


class BudgetExceededException(Exception):
    """Raised when a hard budget limit blocks a request."""


class BudgetService:
    async def check_budget(self, project_id: str) -> BudgetStatus:
        """Check current budget status."""
        budget = await self.get_budget(project_id)
        # Redis returns a string (or None before the first request of the day)
        raw_usage = await self.redis.get(f"cost:{project_id}:daily")
        usage = float(raw_usage or 0)

        percentage = (usage / float(budget.daily_limit)) * 100

        if percentage >= 100 and budget.enforcement == "hard":
            return BudgetStatus.BLOCKED
        elif percentage >= 100:
            return BudgetStatus.EXCEEDED
        elif percentage >= 80:
            return BudgetStatus.WARNING
        elif percentage >= 50:
            return BudgetStatus.APPROACHING
        else:
            return BudgetStatus.OK

    async def enforce(self, project_id: str) -> bool:
        """Returns True if request should proceed."""
        status = await self.check_budget(project_id)

        if status == BudgetStatus.BLOCKED:
            raise BudgetExceededException(project_id)

        if status in (BudgetStatus.EXCEEDED, BudgetStatus.WARNING):
            # Auto-downgrade to cheaper model
            await self.set_model_override(project_id, "cost-optimized")

        return True
```

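`check_budget` reads the counter at `cost:{project_id}:daily`, but the ADR does not say how that counter resets each day. One possible approach, assumed here purely for illustration, is to give the key a time-to-live that ends at the next UTC midnight. The helper name `seconds_until_utc_midnight` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now: datetime) -> int:
    """Seconds from `now` (timezone-aware) until the next 00:00 UTC, at least 1."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return max(1, int((next_midnight - now).total_seconds()))


now = datetime(2025, 12, 29, 18, 30, tzinfo=timezone.utc)
ttl = seconds_until_utc_midnight(now)
print(ttl)  # 5 h 30 min = 19800 seconds
```

With redis-py's asyncio client, `increment` could call `incrbyfloat(key, cost)` followed by `expire(key, ttl)`, so the daily total disappears on its own when the day rolls over.
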
### Database Schema

```sql
CREATE TABLE token_usage (
    id UUID PRIMARY KEY,
    agent_id UUID,
    project_id UUID NOT NULL,
    model VARCHAR(100) NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    cost_usd DECIMAL(10, 6) NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL
);

CREATE TABLE project_budgets (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL UNIQUE,
    daily_limit_usd DECIMAL(10, 2) DEFAULT 50.00,
    weekly_limit_usd DECIMAL(10, 2) DEFAULT 250.00,
    monthly_limit_usd DECIMAL(10, 2) DEFAULT 1000.00,
    enforcement VARCHAR(20) DEFAULT 'soft',  -- 'soft', 'hard'
    alert_thresholds JSONB DEFAULT '[50, 80, 100]'
);

-- Materialized view for analytics
CREATE MATERIALIZED VIEW daily_cost_summary AS
SELECT
    project_id,
    DATE(timestamp) as date,
    SUM(cost_usd) as total_cost,
    SUM(total_tokens) as total_tokens,
    COUNT(*) as request_count
FROM token_usage
GROUP BY project_id, DATE(timestamp);
```

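The `alert_thresholds` column stores the usage percentages at which alerts fire. To avoid sending the same alert on every request, an alerting component could compare the usage percentage before and after each increment and fire only for thresholds crossed in between. The helper below is a hypothetical sketch of that comparison, not part of the schema.

```python
def newly_crossed(thresholds: list[int], previous_pct: float, current_pct: float) -> list[int]:
    """Thresholds passed by moving from previous_pct to current_pct."""
    return [t for t in sorted(thresholds) if previous_pct < t <= current_pct]


alerts = newly_crossed([50, 80, 100], previous_pct=47.5, current_pct=82.0)
print(alerts)  # [50, 80]
```

A single expensive request can jump over more than one threshold, which is why the function returns a list.
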
### Cost Model Prices

| Model | Input ($/1M) | Output ($/1M) | Notes |
|-------|-------------|---------------|-------|
| Claude Opus 4.5 | $15.00 | $75.00 | Highest reasoning capability |
| GPT 5.1 Codex max | $12.00 | $60.00 | Code generation specialist |
| Gemini 3 Pro | $3.50 | $10.50 | Strong multimodal |
| Gemini 3 Flash | $0.35 | $1.05 | Fast inference |
| Qwen3-235B | $2.00 | $6.00 | Cost-effective (or $0 self-hosted) |
| DeepSeek V3.2 | $0.00 | $0.00 | Self-hosted, open weights |

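To make the per-million-token prices concrete, the sketch below works out the cost of a single request from its token counts, using the figures in the table above. The `PRICES_PER_MILLION` mapping and `request_cost_usd` helper are illustrative only; the callback itself relies on the `response_cost` value reported by LiteLLM.

```python
from decimal import Decimal

# (input $/1M tokens, output $/1M tokens), copied from the table above
PRICES_PER_MILLION = {
    "Claude Opus 4.5": (Decimal("15.00"), Decimal("75.00")),
    "GPT 5.1 Codex max": (Decimal("12.00"), Decimal("60.00")),
    "Gemini 3 Pro": (Decimal("3.50"), Decimal("10.50")),
    "Gemini 3 Flash": (Decimal("0.35"), Decimal("1.05")),
    "Qwen3-235B": (Decimal("2.00"), Decimal("6.00")),
    "DeepSeek V3.2": (Decimal("0.00"), Decimal("0.00")),
}


def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost of one request: tokens times price per token, input plus output."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = prompt_tokens * input_price + completion_tokens * output_price
    return total / Decimal(1_000_000)


# 10,000 prompt tokens and 2,000 completion tokens on two different models
opus_cost = request_cost_usd("Claude Opus 4.5", 10_000, 2_000)
flash_cost = request_cost_usd("Gemini 3 Flash", 10_000, 2_000)
print(opus_cost)   # 0.15 input + 0.15 output = 0.30 USD
print(flash_cost)  # 0.0035 input + 0.0021 output = 0.0056 USD
```

The same request costs roughly 50 times more on Claude Opus 4.5 than on Gemini 3 Flash at these prices, and both values fit the six decimal places of `cost_usd DECIMAL(10, 6)`.
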
### Cost Optimization Strategies

| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Semantic caching | 15-30% | Redis cache for repeated queries |
| Model cascading | 60-80% | Start with Gemini 3 Flash, escalate to Claude Opus 4.5 |
| Prompt compression | 10-20% | Remove redundant context |
| Self-hosted fallback | 100% of API cost for eligible tasks | DeepSeek V3.2/Qwen3-235B for non-critical tasks |
| Task-appropriate routing | 40-60% | Route code tasks to GPT 5.1 Codex max, simple tasks to Gemini 3 Flash |

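Model cascading means trying the cheapest model first and escalating only when it cannot produce an acceptable answer. The sketch below shows the control flow with stand-in functions in place of real LLM calls. The `cascade` helper, the model name strings, and the rule that a model returns `None` when it is not confident are all assumptions for illustration.

```python
from typing import Callable, Optional

# A model call returns an answer, or None when it cannot answer confidently.
ModelCall = Callable[[str], Optional[str]]


def cascade(prompt: str, tiers: list[tuple[str, ModelCall]]) -> tuple[str, str]:
    """Try each tier in order, cheapest first; return (model_name, answer)."""
    for name, call in tiers:
        answer = call(prompt)
        if answer is not None:
            return name, answer
    raise RuntimeError("no tier produced an answer")


# Stand-in model functions; real ones would call the LLM gateway.
def flash(prompt: str) -> Optional[str]:
    return "4" if len(prompt) < 20 else None  # pretend long prompts are hard


def opus(prompt: str) -> Optional[str]:
    return "detailed answer"


tiers = [("gemini-3-flash", flash), ("claude-opus-4.5", opus)]
print(cascade("2 + 2?", tiers))
print(cascade("Design a failover strategy for the queue workers", tiers))
```

The savings depend on how often the cheap tier succeeds. An escalated request pays for both calls, so a reliable acceptance check matters as much as the price gap.
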
## Consequences

### Positive
- Complete cost visibility at all levels
- Automatic budget enforcement
- Cost optimization reduces spend significantly
- Real-time dashboard updates

### Negative
- Redis dependency for real-time tracking
- Additional complexity in LLM gateway

### Mitigation
- Redis already required for other features
- Clear separation of concerns in cost tracking module

## Compliance

This decision aligns with:
- FR-801: Real-time cost tracking
- FR-802: Budget configuration (soft/hard limits)
- FR-803: Budget alerts
- FR-804: Cost analytics
- NFR-602: Logging and monitoring (cost observability)
- BR-002: Cost overruns from API usage (risk mitigation)

---

*This ADR establishes the cost tracking and budget management architecture for Syndarix.*