feat(mcp): Observability & Tracing Platform #66

Open
opened 2026-01-03 09:14:46 +00:00 by cardosofelipe · 0 comments

Overview

Implement a comprehensive observability platform designed specifically for AI/LLM systems. Traditional observability tools don't capture what matters for AI: we need to trace decisions, understand why agents chose certain actions, debug prompt/response patterns, and visualize agent behavior.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • AI failures are hard to debug - "why did it do that?"
  • No visibility into agent decision-making process
  • Token usage is invisible until the bill arrives
  • No way to trace a request across LLM calls, tools, and memory
  • Performance bottlenecks are hidden
  • No correlation between inputs and outputs for debugging

The Solution

An AI-native observability platform that:

  1. Traces every decision: From user input to final output
  2. Captures reasoning: Why did the agent choose X over Y?
  3. Tracks costs: Real-time token and dollar tracking
  4. Visualizes workflows: See agent behavior graphically
  5. Alerts on anomalies: Detect unusual patterns early

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/observability/ directory
  • Create __init__.py with public API exports
  • Create platform.py with ObservabilityPlatform class
  • Create config.py with Pydantic settings
  • Define observability standards
  • Integrate with OpenTelemetry
  • Write architecture decision record (ADR)

2. Distributed Tracing

  • Create tracing/tracer.py with tracing infrastructure
  • Implement OpenTelemetry integration
  • Create trace context propagation
  • Implement span creation for:
    • Agent session lifecycle
    • LLM calls (request → response)
    • Tool invocations
    • Memory operations
    • Context assembly
    • Safety checks
  • Add trace attributes (model, tokens, cost, etc.)
  • Implement trace sampling for high-volume workloads
  • Create trace export to Jaeger/Zipkin
  • Write tracing tests
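
As a rough sketch of the span instrumentation called for above, the snippet below wraps a single LLM call in an OpenTelemetry span and records model, token, and cost attributes. The attribute names and the response keys are placeholders, not a finalized semantic convention for this platform.

from typing import Callable

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("mcp_core.observability")


def traced_llm_call(call: Callable[[], dict], model: str) -> dict:
    """Run an LLM call inside a span and record model/token/cost attributes."""
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.model", model)
        try:
            response = call()  # caller-supplied function that performs the request
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
        # usage keys are illustrative; adapt to the gateway's actual response shape
        span.set_attribute("llm.tokens.input", response.get("input_tokens", 0))
        span.set_attribute("llm.tokens.output", response.get("output_tokens", 0))
        span.set_attribute("llm.cost.usd", response.get("cost_usd", 0.0))
        return response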

3. LLM Call Logging

  • Create logging/llm_logger.py with LLM logging
  • Log full request (prompt, context, tools)
  • Log full response (content, tool calls, usage)
  • Implement log redaction for sensitive data
  • Create log indexing for search
  • Implement log retention policies
  • Create log export functionality
  • Add log compression
  • Write logging tests
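
A minimal sketch of the logging and redaction steps, using only the standard library: each call becomes a single JSON record with naive pattern-based redaction. The redaction patterns and field names are illustrative; production rules would be configurable.

import json
import logging
import re
from datetime import datetime, timezone

logger = logging.getLogger("mcp_core.observability.llm")

# Illustrative redaction rules; real rules would come from configuration
_REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]


def _redact(text: str) -> str:
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def log_llm_call(prompt: str, response: str, model: str, usage: dict) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "llm_call",
        "model": model,
        "prompt": _redact(prompt),
        "response": _redact(response),
        "usage": usage,  # e.g. {"input_tokens": ..., "output_tokens": ...}
    }
    logger.info(json.dumps(record))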

4. Decision Tracing

  • Create decisions/tracer.py with decision tracing
  • Capture tool selection decisions (why this tool?)
  • Capture parameter generation decisions
  • Capture response formatting decisions
  • Create decision graph visualization
  • Implement decision replay (step through)
  • Add decision annotations for debugging
  • Create decision patterns analysis
  • Write decision tracing tests

5. Token & Cost Tracking

  • Create costs/tracker.py with cost tracking
  • Implement real-time token counting per request
  • Calculate cost per request (model-specific pricing)
  • Create cost aggregation by:
    • Agent
    • Project
    • User
    • Task type
    • Time period
  • Implement cost forecasting
  • Create cost anomaly detection
  • Add cost alerts
  • Write cost tracking tests
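
To make the per-request calculation concrete, here is a minimal sketch that looks up model-specific pricing and converts token counts to dollars. The model names and prices are placeholders, not real billing rates; production code would load pricing from configuration.

from dataclasses import dataclass

# USD per 1M tokens: (input, output). Values are placeholders only.
PRICING: dict[str, tuple[float, float]] = {
    "example-large-model": (15.00, 75.00),
    "example-small-model": (0.25, 1.25),
}


@dataclass
class RequestCost:
    tokens_input: int
    tokens_output: int
    total_cost_usd: float


def cost_for_request(model: str, tokens_input: int, tokens_output: int) -> RequestCost:
    input_price, output_price = PRICING[model]
    total = (tokens_input * input_price + tokens_output * output_price) / 1_000_000
    return RequestCost(tokens_input, tokens_output, round(total, 6))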

6. Performance Profiling

  • Create profiling/profiler.py with profiling
  • Profile LLM latency (time to first token, total time)
  • Profile tool execution time
  • Profile context assembly time
  • Profile memory operations
  • Profile embedding generation
  • Create latency histograms
  • Implement slow request detection
  • Add performance baselines
  • Write profiling tests
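
A minimal sketch of the two latencies that matter most for LLM calls, time to first token and total time, measured over any iterator of streamed text chunks; no particular client library is assumed.

import time
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LatencyProfile:
    time_to_first_token_s: float
    total_time_s: float
    chunks: int


def profile_stream(stream: Iterable[str]) -> tuple[str, LatencyProfile]:
    """Consume a streamed response, returning the full text and its latency profile."""
    start = time.perf_counter()
    first_token_at: float | None = None
    parts: list[str] = []
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    profile = LatencyProfile(
        time_to_first_token_s=(first_token_at if first_token_at is not None else end) - start,
        total_time_s=end - start,
        chunks=len(parts),
    )
    return "".join(parts), profile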

7. Agent Behavior Visualization

  • Create visualization/behavior.py with visualization
  • Generate agent interaction diagrams
  • Create tool usage heatmaps
  • Generate conversation flow diagrams
  • Create error pattern visualizations
  • Implement real-time dashboard widgets
  • Create exportable reports
  • Add interactive exploration
  • Write visualization tests

8. Metrics Collection

  • Create metrics/collector.py with metrics collection
  • Implement Prometheus metrics:
    • llm_requests_total (by model, status)
    • llm_tokens_total (input, output, by model)
    • llm_cost_dollars_total (by model)
    • llm_latency_seconds (histogram)
    • tool_invocations_total (by tool, status)
    • tool_latency_seconds (histogram)
    • agent_sessions_total (by type, status)
    • agent_task_duration_seconds (histogram)
    • memory_operations_total (by type)
    • context_tokens_used (histogram)
    • safety_checks_total (by result)
  • Create custom metric registration
  • Implement metric aggregation
  • Write metric tests
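
A minimal sketch of a few of the metrics above using the prometheus_client library; label sets and histogram buckets are illustrative defaults.

from prometheus_client import Counter, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
LLM_TOKENS = Counter("llm_tokens_total", "LLM tokens processed", ["model", "direction"])
LLM_COST = Counter("llm_cost_dollars_total", "LLM spend in USD", ["model"])
LLM_LATENCY = Histogram(
    "llm_latency_seconds", "LLM request latency",
    buckets=(0.25, 0.5, 1, 2, 5, 10, 30),
)


def observe_llm_call(model: str, status: str, tokens_in: int,
                     tokens_out: int, cost_usd: float, latency_s: float) -> None:
    LLM_REQUESTS.labels(model=model, status=status).inc()
    LLM_TOKENS.labels(model=model, direction="input").inc(tokens_in)
    LLM_TOKENS.labels(model=model, direction="output").inc(tokens_out)
    LLM_COST.labels(model=model).inc(cost_usd)
    LLM_LATENCY.observe(latency_s)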

9. Dashboards

  • Create dashboards/ directory with dashboard definitions
  • Create Grafana dashboard: Overview
  • Create Grafana dashboard: LLM Performance
  • Create Grafana dashboard: Cost Analysis
  • Create Grafana dashboard: Agent Behavior
  • Create Grafana dashboard: Tool Usage
  • Create Grafana dashboard: Errors & Alerts
  • Implement dashboard templates
  • Add dashboard provisioning
  • Write dashboard tests

10. Alerting

  • Create alerting/manager.py with alert management
  • Define alert rules:
    • High error rate (>5% failures)
    • High latency (P95 > threshold)
    • Cost spike (>2x normal)
    • Loop detection
    • Rate limit approaching
    • Agent stuck (no progress)
  • Implement alert routing (email, Slack, webhook)
  • Create alert silencing
  • Implement alert grouping
  • Add runbook links
  • Write alerting tests
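
A minimal sketch of threshold-style rule evaluation. In practice these rules may live in Prometheus/Alertmanager instead; the rule set, metric names, runbook URLs, and the notify() hook below are placeholders.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    condition: Callable[[dict], bool]  # receives a snapshot of current metrics
    runbook_url: str


RULES = [
    AlertRule("high_error_rate", lambda m: m.get("error_rate", 0.0) > 0.05,
              "https://example.internal/runbooks/high-error-rate"),
    AlertRule("cost_spike", lambda m: m.get("cost_ratio_vs_baseline", 1.0) > 2.0,
              "https://example.internal/runbooks/cost-spike"),
]


def evaluate_alerts(metrics: dict, notify: Callable[[str, str], None]) -> list[str]:
    """Return the names of rules that fired and route each one via notify()."""
    fired = []
    for rule in RULES:
        if rule.condition(metrics):
            fired.append(rule.name)
            notify(rule.name, rule.runbook_url)
    return fired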

11. Log Aggregation

  • Create logs/aggregator.py with log aggregation
  • Implement structured logging (JSON)
  • Create correlation IDs across services
  • Implement log levels with filtering
  • Create log search and query
  • Implement log streaming
  • Add log archival
  • Write log aggregation tests
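
A minimal sketch of structured JSON logging with a correlation ID carried in a contextvar, using only the standard library; field names are illustrative.

import contextvars
import json
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, tagged with the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def new_request_context() -> str:
    """Call at the start of each request so all of its logs share one ID."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid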

12. Request Replay

  • Create replay/player.py with request replay
  • Implement request capture with full context
  • Create replay mode with mocked dependencies
  • Implement step-by-step replay
  • Create replay diff (compare old vs new output)
  • Add replay annotations
  • Write replay tests
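
A minimal sketch of the replay-diff step: re-execute a captured request (for example against mocked dependencies) and return a unified diff of the old and new outputs. The capture format shown here is illustrative.

import difflib
from typing import Callable


def replay_diff(captured: dict, rerun: Callable[[dict], str]) -> str:
    """Compare the originally captured response with the output of a re-run."""
    original_output = captured["response"]
    new_output = rerun(captured["request"])
    diff = difflib.unified_diff(
        original_output.splitlines(keepends=True),
        new_output.splitlines(keepends=True),
        fromfile="captured",
        tofile="replayed",
    )
    return "".join(diff)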

13. Anomaly Detection

  • Create anomaly/detector.py with anomaly detection
  • Implement baseline learning
  • Detect anomalies in:
    • Response latency
    • Token usage patterns
    • Error rate patterns
    • Tool usage patterns
    • Cost patterns
  • Create anomaly alerts
  • Implement anomaly investigation tools
  • Write anomaly detection tests
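
A minimal sketch of baseline learning: keep a rolling window per signal (latency, tokens, cost, and so on) and flag values whose z-score exceeds a threshold. The window size and threshold are illustrative defaults.

from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Rolling-window baseline with simple z-score anomaly flagging."""

    def __init__(self, window: int = 500, z_threshold: float = 3.0):
        self.values: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Record a new observation; return True if it looks anomalous."""
        anomalous = False
        if len(self.values) >= 30:               # require some history first
            mu = mean(self.values)
            sigma = pstdev(self.values) or 1e-9  # avoid division by zero
            anomalous = abs(value - mu) / sigma > self.z_threshold
        self.values.append(value)
        return anomalous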

14. Debugging Tools

  • Create debug/tools.py with debugging utilities
  • Implement request inspector (view full context)
  • Create response analyzer (parse and explain)
  • Implement diff tool (compare requests/responses)
  • Create timeline view (chronological events)
  • Implement filter tools (find specific patterns)
  • Add export functionality
  • Write debugging tool tests

15. MCP Integration

  • Create get_trace tool - Retrieve trace by ID
  • Create search_traces tool - Search traces by criteria
  • Create get_agent_stats tool - Get agent statistics
  • Create get_cost_summary tool - Get cost breakdown
  • Create get_errors tool - Get recent errors
  • Create trigger_alert tool - Manually trigger alert
  • Write MCP tool tests
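
A minimal sketch of one of these tools (get_trace) as a JSON-schema definition plus a handler, without assuming a particular MCP SDK; the trace_store lookup is a hypothetical interface.

# Tool schema and handler only; wiring into the MCP server is out of scope here.
GET_TRACE_TOOL = {
    "name": "get_trace",
    "description": "Retrieve a trace by its ID.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "trace_id": {"type": "string", "description": "Trace identifier"},
        },
        "required": ["trace_id"],
    },
}


def handle_get_trace(arguments: dict, trace_store: dict[str, dict]) -> dict:
    """Resolve the tool call against a trace store (here just a dict)."""
    trace = trace_store.get(arguments["trace_id"])
    if trace is None:
        return {"error": f"trace {arguments['trace_id']} not found"}
    return trace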

16. Data Retention & Privacy

  • Create retention/manager.py with retention policies
  • Implement configurable retention periods
  • Create data anonymization
  • Implement PII redaction in logs/traces
  • Create data export for compliance
  • Add GDPR-compliant deletion
  • Write retention tests
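
A minimal sketch of retention enforcement: records older than a per-category retention period are deleted. The categories and defaults are illustrative, and the store interface (delete_older_than) is hypothetical.

from datetime import datetime, timedelta, timezone

# Illustrative defaults; the 30-day value matches the acceptance criteria below
RETENTION_DAYS = {
    "traces": 30,
    "llm_logs": 30,
    "cost_records": 365,
}


def purge_expired(store, category: str, now: datetime | None = None) -> int:
    """Delete records in `category` older than its retention period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS[category])
    # store is a hypothetical interface exposing delete_older_than(category, cutoff)
    return store.delete_older_than(category, cutoff)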

17. Testing

  • Write unit tests for all components
  • Write integration tests for full platform
  • Write performance tests (overhead measurement)
  • Create end-to-end observability tests
  • Achieve >90% code coverage
  • Create regression test suite

18. Documentation

  • Write README with platform overview
  • Document tracing concepts
  • Document metrics reference
  • Document dashboard usage
  • Document alerting configuration
  • Create debugging guide
  • Add troubleshooting guide
  • Create best practices

Technical Specifications

Trace Structure

from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class TraceEvent(BaseModel):
    # Minimal stand-in for the event type referenced below; exact fields are TBD
    timestamp: datetime
    name: str
    attributes: dict = {}


class Trace(BaseModel):
    trace_id: str
    span_id: str
    parent_span_id: str | None
    
    # Timing
    start_time: datetime
    end_time: datetime | None
    duration_ms: float | None
    
    # Identity
    service: str  # "agent", "llm-gateway", "knowledge-base"
    operation: str  # "llm_call", "tool_invoke", "memory_retrieve"
    
    # Context
    agent_id: str | None
    session_id: str | None
    project_id: str | None
    
    # LLM-specific (if applicable)
    llm_model: str | None
    llm_tokens_input: int | None
    llm_tokens_output: int | None
    llm_cost_usd: float | None
    
    # Tool-specific (if applicable)
    tool_name: str | None
    tool_success: bool | None
    
    # Status
    status: Literal["ok", "error"]
    error_message: str | None
    
    # Attributes
    attributes: dict
    events: list[TraceEvent]

Decision Log Structure

class Decision(BaseModel):
    decision_id: str
    trace_id: str
    timestamp: datetime
    
    # What was decided
    decision_type: Literal[
        "tool_selection",
        "parameter_generation",
        "response_formatting",
        "error_recovery",
        "memory_retrieval"
    ]
    
    # Options considered
    options: list[str]
    selected: str
    
    # Reasoning (if available)
    reasoning: str | None
    confidence: float | None
    
    # Context at decision time
    context_summary: str
    relevant_memory: list[str]
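
As a usage sketch, a tool-selection decision recorded with this model might look like the following; the IDs and values are illustrative, and persistence is left out.

import uuid
from datetime import datetime, timezone

decision = Decision(
    decision_id=str(uuid.uuid4()),
    trace_id="trace-123",
    timestamp=datetime.now(timezone.utc),
    decision_type="tool_selection",
    options=["search_docs", "run_query", "ask_user"],
    selected="search_docs",
    reasoning="The question references existing documentation.",
    confidence=0.82,
    context_summary="User asked where retention policies are documented.",
    relevant_memory=["memory://notes/retention-policy"],
)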

Cost Tracking Schema

class CostRecord(BaseModel):
    record_id: str
    timestamp: datetime
    
    # Identity
    agent_id: str
    project_id: str
    session_id: str
    
    # Tokens
    tokens_input: int
    tokens_output: int
    tokens_total: int
    
    # Cost
    model: str
    cost_per_input_token: float
    cost_per_output_token: float
    total_cost_usd: float
    
    # Aggregation
    daily_total_usd: float
    monthly_total_usd: float

Dashboard Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Syndarix Observability Dashboard                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐           │
│  │     Request Volume          │  │     Error Rate              │           │
│  │     ████████████████        │  │     ▂▂▂▃▂▂▂▂▂▂█▂▂           │           │
│  │     ████████████████        │  │                             │           │
│  └─────────────────────────────┘  └─────────────────────────────┘           │
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐           │
│  │     Latency P50/P95/P99     │  │     Cost Today              │           │
│  │     ▂▃▄▅▆▇█▇▆▅▄▃▂▃▄        │  │     $127.45 / $200          │           │
│  │     P50: 1.2s P99: 4.5s    │  │     [████████░░] 64%         │           │
│  └─────────────────────────────┘  └─────────────────────────────┘           │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     Token Usage by Model                             │    │
│  │  Claude-3-opus    ████████████████████████████  45%                 │    │
│  │  GPT-4            ██████████████  25%                                │    │
│  │  Claude-3-sonnet  ████████████  20%                                  │    │
│  │  Haiku           ██████  10%                                         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     Recent Traces                                    │    │
│  │  [OK]  agent-123  task_complete  2.3s  1,234 tokens  $0.12          │    │
│  │  [ERR] agent-456  tool_failed   0.5s    892 tokens  $0.08          │    │
│  │  [OK]  agent-789  task_complete  3.1s  2,456 tokens  $0.23          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Acceptance Criteria

  • All LLM calls are traced end-to-end
  • Decision points are captured with reasoning
  • Cost tracking is accurate within 1% of actual billing
  • Dashboards load in <2 seconds
  • Alerts fire within 30 seconds of threshold breach
  • Log search returns results in <1 second
  • Trace overhead <5% of request latency
  • 30-day data retention by default
  • >90% test coverage
  • Documentation complete

Labels

phase-2, mcp, backend, observability, monitoring

Milestone

Phase 2: MCP Integration
