[SPIKE-010] Cost Tracking & Budget Management #10

Closed
opened 2025-12-29 03:51:02 +00:00 by cardosofelipe · 1 comment

Objective

Design a system for tracking LLM API costs and enforcing budget limits.

Key Questions

  1. How do we capture token usage per API call?
  2. How do we attribute costs to projects/agents?
  3. How do we set and enforce budget limits?
  4. What alerts/notifications should fire at budget thresholds?
  5. How do we handle cost optimization (model selection)?

Metrics to Track

  • Tokens (input/output) per call
  • Cost per call (based on model pricing)
  • Aggregated cost per: project, agent, sprint, day
  • Trend analysis

Research Areas

  • LLM provider pricing APIs
  • Token counting before/after calls
  • Time-series storage for cost data
  • Alert/notification patterns

Expected Deliverables

  • Cost tracking schema
  • Budget enforcement logic
  • Dashboard/reporting design
  • Alert configuration
  • ADR documenting the approach

Acceptance Criteria

  • All LLM calls tracked with cost
  • Costs attributable to project/agent
  • Budget limits enforced
  • Alerts trigger at thresholds

Labels

spike, architecture, observability

Author
Owner

SPIKE-010: Cost Tracking Research Completed

The comprehensive spike document has been created at docs/spikes/SPIKE-010-cost-tracking.md.

Executive Summary

Syndarix requires comprehensive LLM cost tracking to manage expenses across multiple providers (Anthropic, OpenAI, local Ollama). The research recommends a multi-layered cost tracking architecture:

  1. LiteLLM Callbacks for real-time usage capture at the gateway level
  2. PostgreSQL for persistent usage records with time-series aggregation
  3. Redis for real-time budget enforcement and rate limiting
  4. Celery Beat for scheduled budget checks and alert processing
  5. SSE Events for real-time dashboard updates
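
The gateway-level capture (layer 1) can be sketched as a LiteLLM success callback. LiteLLM invokes custom callbacks with `(kwargs, completion_response, start_time, end_time)` and exposes the computed price as `kwargs["response_cost"]`; the in-memory `usage_records` list here is a stand-in for the PostgreSQL writer, and the response is treated as a plain dict for illustration (the real object exposes `.usage`).

```python
# Sketch of a LiteLLM success callback that records token usage per call.
# The `usage_records` sink is a stand-in for the PostgreSQL writer; the
# attribution metadata key is an assumption for this sketch.
usage_records = []

def track_cost_callback(kwargs, completion_response, start_time, end_time):
    usage = completion_response["usage"]  # OpenAI-style usage shape
    usage_records.append({
        "model": kwargs.get("model"),
        "input_tokens": usage["prompt_tokens"],
        "output_tokens": usage["completion_tokens"],
        "cost_usd": kwargs.get("response_cost", 0.0),
        "latency_s": (end_time - start_time).total_seconds(),
        # attribution tags would come from request metadata in practice
        "project_id": (kwargs.get("metadata") or {}).get("project_id"),
    })

# Registration would be: litellm.success_callback = [track_cost_callback]
```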

Key Findings

| Area | Recommendation |
|------|----------------|
| Token Tracking | LiteLLM provides built-in `response.usage` and `kwargs["response_cost"]` |
| Cost Attribution | Hierarchical: Organization > Project > Sprint > Agent Instance > Request |
| Budget Enforcement | Soft limits with alerts for weekly/monthly; hard limits for daily budgets |
| Cost Optimization | 60-80% savings possible through caching, cascading, and compression |
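
The soft/hard split can be expressed as a small check run against the Redis counters. This is an illustrative stand-alone function, not the enforcement code itself; the 80% alert threshold and the string return values are assumptions for the sketch.

```python
# Illustrative budget check implementing the soft/hard split:
# weekly/monthly budgets only raise alerts, the daily budget blocks.
def check_budget(period: str, spent_usd: float, limit_usd: float) -> str:
    """Return 'ok', 'alert' (soft breach), or 'block' (hard breach)."""
    if spent_usd < 0.8 * limit_usd:
        return "ok"
    if spent_usd < limit_usd:
        return "alert"  # assumed threshold alert at 80% for every period
    # over the limit: daily budgets are hard, weekly/monthly stay soft
    return "block" if period == "daily" else "alert"
```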

Cost Optimization Strategies

  1. Semantic Caching (15-30% savings): Cache responses by semantic similarity using vector embeddings
  2. Model Cascading (up to 87% savings): Route 90% of queries to cheaper models, escalate only when needed
  3. Prompt Compression (up to 80% savings): Use LLMLingua for intelligent prompt compression with minimal quality loss
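
The cascading strategy can be sketched as a two-tier router: try the cheap model first and escalate only when a confidence check rejects its answer. The model names, confidence heuristic, and `call_model` hook here are placeholders, not the actual routing code.

```python
# Model-cascading sketch: route to the cheap model first, escalate to the
# expensive one only when a confidence heuristic rejects the cheap answer.
# `call_model` and `confident` are placeholders for real implementations.
def cascade(prompt, call_model, confident, cheap="claude-3-haiku",
            strong="claude-3-5-sonnet"):
    answer = call_model(cheap, prompt)
    if confident(answer):
        return cheap, answer  # the bulk of traffic should stop here
    return strong, call_model(strong, prompt)
```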

Database Schema Highlights

  • token_usage: Individual LLM request records with full attribution
  • budgets: Configurable daily/weekly/monthly budgets with soft/hard limits
  • budget_alerts: Alert tracking with severity levels and acknowledgment
  • daily_cost_summaries: Materialized aggregations for fast reporting

LLM Pricing Reference (per 1M tokens)

| Model | Input | Output |
|-------|-------|--------|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Local Ollama | $0.00 | $0.00 |
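
With per-1M-token prices, per-call cost is a straight linear computation; the model keys here are illustrative identifiers, not LiteLLM's canonical model names.

```python
# Per-call cost from per-1M-token prices: model -> (input_usd, output_usd).
PRICES_PER_MTOK = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o-mini": (0.15, 0.60),
    "ollama-local": (0.00, 0.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES_PER_MTOK[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000
```

For example, a Sonnet call with 2,000 input and 500 output tokens costs 2000 x 3/1M + 500 x 15/1M = $0.0135.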

Implementation Roadmap

  • Phase 1 (Week 1-2): Core infrastructure (schema, Redis, callbacks)
  • Phase 2 (Week 2-3): Budget management and alerts
  • Phase 3 (Week 3-4): Cost optimization (caching, cascading)
  • Phase 4 (Week 4-5): Reporting dashboard
  • Phase 5 (Week 5-6): Testing and documentation

Related Spikes

  • SPIKE-005: LLM Provider Abstraction (LiteLLM baseline)
  • SPIKE-003: Real-time Updates (SSE architecture)
  • SPIKE-004: Celery + Redis Integration (background tasks)

Next Steps:

  1. Create ADR-010 based on these findings
  2. Begin Phase 1 implementation
  3. Define specific budget limits for Syndarix projects

Full details available in SPIKE-010-cost-tracking.md.
