feat(mcp): Observability & Tracing Platform #66

Open
opened 2026-01-03 09:14:46 +00:00 by cardosofelipe · 0 comments

Overview

Implement a comprehensive observability platform designed specifically for AI/LLM systems. Traditional observability tools don't capture what matters for AI: we need to trace decisions, understand why agents chose certain actions, debug prompt/response patterns, and visualize agent behavior.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • AI failures are hard to debug - "why did it do that?"
  • No visibility into agent decision-making process
  • Token usage is invisible until the bill arrives
  • No way to trace a request across LLM calls, tools, and memory
  • Performance bottlenecks are hidden
  • No correlation between inputs and outputs for debugging

The Solution

An AI-native observability platform that:

  1. Traces every decision: From user input to final output
  2. Captures reasoning: Why did the agent choose X over Y?
  3. Tracks costs: Real-time token and dollar tracking
  4. Visualizes workflows: See agent behavior graphically
  5. Alerts on anomalies: Detect unusual patterns early

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/observability/ directory
  • Create __init__.py with public API exports
  • Create platform.py with ObservabilityPlatform class
  • Create config.py with Pydantic settings
  • Define observability standards
  • Integrate with OpenTelemetry
  • Write architecture decision record (ADR)

2. Distributed Tracing

  • Create tracing/tracer.py with tracing infrastructure
  • Implement OpenTelemetry integration
  • Create trace context propagation
  • Implement span creation for:
    • Agent session lifecycle
    • LLM calls (request → response)
    • Tool invocations
    • Memory operations
    • Context assembly
    • Safety checks
  • Add trace attributes (model, tokens, cost, etc.)
  • Implement trace sampling for high-volume workloads
  • Create trace export to Jaeger/Zipkin
  • Write tracing tests
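
As a rough sketch of the span instrumentation called for above, the snippet below wraps a single LLM call in an OpenTelemetry span and records model, token, and cost attributes. The attribute names and the response keys are placeholders, not a finalized semantic convention for this platform.

from typing import Callable

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("mcp_core.observability")


def traced_llm_call(call: Callable[[], dict], model: str) -> dict:
    """Run an LLM call inside a span and record model/token/cost attributes."""
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.model", model)
        try:
            response = call()  # caller-supplied function that performs the request
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
        # usage keys are illustrative; adapt to the gateway's actual response shape
        span.set_attribute("llm.tokens.input", response.get("input_tokens", 0))
        span.set_attribute("llm.tokens.output", response.get("output_tokens", 0))
        span.set_attribute("llm.cost.usd", response.get("cost_usd", 0.0))
        return response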

3. LLM Call Logging

  • Create logging/llm_logger.py with LLM logging
  • Log full request (prompt, context, tools)
  • Log full response (content, tool calls, usage)
  • Implement log redaction for sensitive data
  • Create log indexing for search
  • Implement log retention policies
  • Create log export functionality
  • Add log compression
  • Write logging tests
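
A minimal sketch of the logging and redaction steps, using only the standard library: each call becomes a single JSON record with naive pattern-based redaction. The redaction patterns and field names are illustrative; production rules would be configurable.

import json
import logging
import re
from datetime import datetime, timezone

logger = logging.getLogger("mcp_core.observability.llm")

# Illustrative redaction rules; real rules would come from configuration
_REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]


def _redact(text: str) -> str:
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def log_llm_call(prompt: str, response: str, model: str, usage: dict) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "llm_call",
        "model": model,
        "prompt": _redact(prompt),
        "response": _redact(response),
        "usage": usage,  # e.g. {"input_tokens": ..., "output_tokens": ...}
    }
    logger.info(json.dumps(record))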

4. Decision Tracing

  • Create decisions/tracer.py with decision tracing
  • Capture tool selection decisions (why this tool?)
  • Capture parameter generation decisions
  • Capture response formatting decisions
  • Create decision graph visualization
  • Implement decision replay (step through)
  • Add decision annotations for debugging
  • Create decision patterns analysis
  • Write decision tracing tests

5. Token & Cost Tracking

  • Create costs/tracker.py with cost tracking
  • Implement real-time token counting per request
  • Calculate cost per request (model-specific pricing)
  • Create cost aggregation by:
    • Agent
    • Project
    • User
    • Task type
    • Time period
  • Implement cost forecasting
  • Create cost anomaly detection
  • Add cost alerts
  • Write cost tracking tests
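
To make the per-request calculation concrete, here is a minimal sketch that looks up model-specific pricing and converts token counts to dollars. The model names and prices are placeholders, not real billing rates; production code would load pricing from configuration.

from dataclasses import dataclass

# USD per 1M tokens: (input, output). Values are placeholders only.
PRICING: dict[str, tuple[float, float]] = {
    "example-large-model": (15.00, 75.00),
    "example-small-model": (0.25, 1.25),
}


@dataclass
class RequestCost:
    tokens_input: int
    tokens_output: int
    total_cost_usd: float


def cost_for_request(model: str, tokens_input: int, tokens_output: int) -> RequestCost:
    input_price, output_price = PRICING[model]
    total = (tokens_input * input_price + tokens_output * output_price) / 1_000_000
    return RequestCost(tokens_input, tokens_output, round(total, 6))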

6. Performance Profiling

  • Create profiling/profiler.py with profiling
  • Profile LLM latency (time to first token, total time)
  • Profile tool execution time
  • Profile context assembly time
  • Profile memory operations
  • Profile embedding generation
  • Create latency histograms
  • Implement slow request detection
  • Add performance baselines
  • Write profiling tests
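
A minimal sketch of the two latencies that matter most for LLM calls, time to first token and total time, measured over any iterator of streamed text chunks; no particular client library is assumed.

import time
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LatencyProfile:
    time_to_first_token_s: float
    total_time_s: float
    chunks: int


def profile_stream(stream: Iterable[str]) -> tuple[str, LatencyProfile]:
    """Consume a streamed response, returning the full text and its latency profile."""
    start = time.perf_counter()
    first_token_at: float | None = None
    parts: list[str] = []
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    profile = LatencyProfile(
        time_to_first_token_s=(first_token_at if first_token_at is not None else end) - start,
        total_time_s=end - start,
        chunks=len(parts),
    )
    return "".join(parts), profile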

7. Agent Behavior Visualization

  • Create visualization/behavior.py with visualization
  • Generate agent interaction diagrams
  • Create tool usage heatmaps
  • Generate conversation flow diagrams
  • Create error pattern visualizations
  • Implement real-time dashboard widgets
  • Create exportable reports
  • Add interactive exploration
  • Write visualization tests

8. Metrics Collection

  • Create metrics/collector.py with metrics collection
  • Implement Prometheus metrics:
    • llm_requests_total (by model, status)
    • llm_tokens_total (input, output, by model)
    • llm_cost_dollars_total (by model)
    • llm_latency_seconds (histogram)
    • tool_invocations_total (by tool, status)
    • tool_latency_seconds (histogram)
    • agent_sessions_total (by type, status)
    • agent_task_duration_seconds (histogram)
    • memory_operations_total (by type)
    • context_tokens_used (histogram)
    • safety_checks_total (by result)
  • Create custom metric registration
  • Implement metric aggregation
  • Write metric tests
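
A minimal sketch of a few of the metrics above using the prometheus_client library; label sets and histogram buckets are illustrative defaults.

from prometheus_client import Counter, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
LLM_TOKENS = Counter("llm_tokens_total", "LLM tokens processed", ["model", "direction"])
LLM_COST = Counter("llm_cost_dollars_total", "LLM spend in USD", ["model"])
LLM_LATENCY = Histogram(
    "llm_latency_seconds", "LLM request latency",
    buckets=(0.25, 0.5, 1, 2, 5, 10, 30),
)


def observe_llm_call(model: str, status: str, tokens_in: int,
                     tokens_out: int, cost_usd: float, latency_s: float) -> None:
    LLM_REQUESTS.labels(model=model, status=status).inc()
    LLM_TOKENS.labels(model=model, direction="input").inc(tokens_in)
    LLM_TOKENS.labels(model=model, direction="output").inc(tokens_out)
    LLM_COST.labels(model=model).inc(cost_usd)
    LLM_LATENCY.observe(latency_s)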

9. Dashboards

  • Create dashboards/ directory with dashboard definitions
  • Create Grafana dashboard: Overview
  • Create Grafana dashboard: LLM Performance
  • Create Grafana dashboard: Cost Analysis
  • Create Grafana dashboard: Agent Behavior
  • Create Grafana dashboard: Tool Usage
  • Create Grafana dashboard: Errors & Alerts
  • Implement dashboard templates
  • Add dashboard provisioning
  • Write dashboard tests

10. Alerting

  • Create alerting/manager.py with alert management
  • Define alert rules:
    • High error rate (>5% failures)
    • High latency (P95 > threshold)
    • Cost spike (>2x normal)
    • Loop detection
    • Rate limit approaching
    • Agent stuck (no progress)
  • Implement alert routing (email, Slack, webhook)
  • Create alert silencing
  • Implement alert grouping
  • Add runbook links
  • Write alerting tests
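
A minimal sketch of threshold-style rule evaluation. In practice these rules may live in Prometheus/Alertmanager instead; the rule set, metric names, runbook URLs, and the notify() hook below are placeholders.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    condition: Callable[[dict], bool]  # receives a snapshot of current metrics
    runbook_url: str


RULES = [
    AlertRule("high_error_rate", lambda m: m.get("error_rate", 0.0) > 0.05,
              "https://example.internal/runbooks/high-error-rate"),
    AlertRule("cost_spike", lambda m: m.get("cost_ratio_vs_baseline", 1.0) > 2.0,
              "https://example.internal/runbooks/cost-spike"),
]


def evaluate_alerts(metrics: dict, notify: Callable[[str, str], None]) -> list[str]:
    """Return the names of rules that fired and route each one via notify()."""
    fired = []
    for rule in RULES:
        if rule.condition(metrics):
            fired.append(rule.name)
            notify(rule.name, rule.runbook_url)
    return fired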

11. Log Aggregation

  • Create logs/aggregator.py with log aggregation
  • Implement structured logging (JSON)
  • Create correlation IDs across services
  • Implement log levels with filtering
  • Create log search and query
  • Implement log streaming
  • Add log archival
  • Write log aggregation tests
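
A minimal sketch of structured JSON logging with a correlation ID carried in a contextvar, using only the standard library; field names are illustrative.

import contextvars
import json
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, tagged with the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def new_request_context() -> str:
    """Call at the start of each request so all of its logs share one ID."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid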

12. Request Replay

  • Create replay/player.py with request replay
  • Implement request capture with full context
  • Create replay mode with mocked dependencies
  • Implement step-by-step replay
  • Create replay diff (compare old vs new output)
  • Add replay annotations
  • Write replay tests
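
A minimal sketch of the replay-diff step: re-execute a captured request (for example against mocked dependencies) and return a unified diff of the old and new outputs. The capture format shown here is illustrative.

import difflib
from typing import Callable


def replay_diff(captured: dict, rerun: Callable[[dict], str]) -> str:
    """Compare the originally captured response with the output of a re-run."""
    original_output = captured["response"]
    new_output = rerun(captured["request"])
    diff = difflib.unified_diff(
        original_output.splitlines(keepends=True),
        new_output.splitlines(keepends=True),
        fromfile="captured",
        tofile="replayed",
    )
    return "".join(diff)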

13. Anomaly Detection

  • Create anomaly/detector.py with anomaly detection
  • Implement baseline learning
  • Detect anomalies in:
    • Response latency
    • Token usage patterns
    • Error rate patterns
    • Tool usage patterns
    • Cost patterns
  • Create anomaly alerts
  • Implement anomaly investigation tools
  • Write anomaly detection tests
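
A minimal sketch of baseline learning: keep a rolling window per signal (latency, tokens, cost, and so on) and flag values whose z-score exceeds a threshold. The window size and threshold are illustrative defaults.

from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Rolling-window baseline with simple z-score anomaly flagging."""

    def __init__(self, window: int = 500, z_threshold: float = 3.0):
        self.values: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Record a new observation; return True if it looks anomalous."""
        anomalous = False
        if len(self.values) >= 30:               # require some history first
            mu = mean(self.values)
            sigma = pstdev(self.values) or 1e-9  # avoid division by zero
            anomalous = abs(value - mu) / sigma > self.z_threshold
        self.values.append(value)
        return anomalous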

14. Debugging Tools

  • Create debug/tools.py with debugging utilities
  • Implement request inspector (view full context)
  • Create response analyzer (parse and explain)
  • Implement diff tool (compare requests/responses)
  • Create timeline view (chronological events)
  • Implement filter tools (find specific patterns)
  • Add export functionality
  • Write debugging tool tests

15. MCP Integration

  • Create get_trace tool - Retrieve trace by ID
  • Create search_traces tool - Search traces by criteria
  • Create get_agent_stats tool - Get agent statistics
  • Create get_cost_summary tool - Get cost breakdown
  • Create get_errors tool - Get recent errors
  • Create trigger_alert tool - Manually trigger alert
  • Write MCP tool tests
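
A minimal sketch of one of these tools (get_trace) as a JSON-schema definition plus a handler, without assuming a particular MCP SDK; the trace_store lookup is a hypothetical interface.

# Tool schema and handler only; wiring into the MCP server is out of scope here.
GET_TRACE_TOOL = {
    "name": "get_trace",
    "description": "Retrieve a trace by its ID.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "trace_id": {"type": "string", "description": "Trace identifier"},
        },
        "required": ["trace_id"],
    },
}


def handle_get_trace(arguments: dict, trace_store: dict[str, dict]) -> dict:
    """Resolve the tool call against a trace store (here just a dict)."""
    trace = trace_store.get(arguments["trace_id"])
    if trace is None:
        return {"error": f"trace {arguments['trace_id']} not found"}
    return trace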

16. Data Retention & Privacy

  • Create retention/manager.py with retention policies
  • Implement configurable retention periods
  • Create data anonymization
  • Implement PII redaction in logs/traces
  • Create data export for compliance
  • Add GDPR-compliant deletion
  • Write retention tests
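
A minimal sketch of retention enforcement: records older than a per-category retention period are deleted. The categories and defaults are illustrative, and the store interface (delete_older_than) is hypothetical.

from datetime import datetime, timedelta, timezone

# Illustrative defaults; the 30-day value matches the acceptance criteria below
RETENTION_DAYS = {
    "traces": 30,
    "llm_logs": 30,
    "cost_records": 365,
}


def purge_expired(store, category: str, now: datetime | None = None) -> int:
    """Delete records in `category` older than its retention period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS[category])
    # store is a hypothetical interface exposing delete_older_than(category, cutoff)
    return store.delete_older_than(category, cutoff)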

17. Testing

  • Write unit tests for all components
  • Write integration tests for full platform
  • Write performance tests (overhead measurement)
  • Create end-to-end observability tests
  • Achieve >90% code coverage
  • Create regression test suite

18. Documentation

  • Write README with platform overview
  • Document tracing concepts
  • Document metrics reference
  • Document dashboard usage
  • Document alerting configuration
  • Create debugging guide
  • Add troubleshooting guide
  • Create best practices

Technical Specifications

Trace Structure

from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class TraceEvent(BaseModel):
    # Minimal stand-in for the event type referenced below; exact fields are TBD
    timestamp: datetime
    name: str
    attributes: dict = {}


class Trace(BaseModel):
    trace_id: str
    span_id: str
    parent_span_id: str | None
    
    # Timing
    start_time: datetime
    end_time: datetime | None
    duration_ms: float | None
    
    # Identity
    service: str  # "agent", "llm-gateway", "knowledge-base"
    operation: str  # "llm_call", "tool_invoke", "memory_retrieve"
    
    # Context
    agent_id: str | None
    session_id: str | None
    project_id: str | None
    
    # LLM-specific (if applicable)
    llm_model: str | None
    llm_tokens_input: int | None
    llm_tokens_output: int | None
    llm_cost_usd: float | None
    
    # Tool-specific (if applicable)
    tool_name: str | None
    tool_success: bool | None
    
    # Status
    status: Literal["ok", "error"]
    error_message: str | None
    
    # Attributes
    attributes: dict
    events: list[TraceEvent]

Decision Log Structure

class Decision(BaseModel):
    decision_id: str
    trace_id: str
    timestamp: datetime
    
    # What was decided
    decision_type: Literal[
        "tool_selection",
        "parameter_generation",
        "response_formatting",
        "error_recovery",
        "memory_retrieval"
    ]
    
    # Options considered
    options: list[str]
    selected: str
    
    # Reasoning (if available)
    reasoning: str | None
    confidence: float | None
    
    # Context at decision time
    context_summary: str
    relevant_memory: list[str]
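
As a usage sketch, a tool-selection decision recorded with this model might look like the following; the IDs and values are illustrative, and persistence is left out.

import uuid
from datetime import datetime, timezone

decision = Decision(
    decision_id=str(uuid.uuid4()),
    trace_id="trace-123",
    timestamp=datetime.now(timezone.utc),
    decision_type="tool_selection",
    options=["search_docs", "run_query", "ask_user"],
    selected="search_docs",
    reasoning="The question references existing documentation.",
    confidence=0.82,
    context_summary="User asked where retention policies are documented.",
    relevant_memory=["memory://notes/retention-policy"],
)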

Cost Tracking Schema

class CostRecord(BaseModel):
    record_id: str
    timestamp: datetime
    
    # Identity
    agent_id: str
    project_id: str
    session_id: str
    
    # Tokens
    tokens_input: int
    tokens_output: int
    tokens_total: int
    
    # Cost
    model: str
    cost_per_input_token: float
    cost_per_output_token: float
    total_cost_usd: float
    
    # Aggregation
    daily_total_usd: float
    monthly_total_usd: float

Dashboard Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Syndarix Observability Dashboard                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐           │
│  │     Request Volume          │  │     Error Rate              │           │
│  │     ████████████████        │  │     ▂▂▂▃▂▂▂▂▂▂█▂▂           │           │
│  │     ████████████████        │  │                             │           │
│  └─────────────────────────────┘  └─────────────────────────────┘           │
│                                                                              │
│  ┌─────────────────────────────┐  ┌─────────────────────────────┐           │
│  │     Latency P50/P95/P99     │  │     Cost Today              │           │
│  │     ▂▃▄▅▆▇█▇▆▅▄▃▂▃▄        │  │     $127.45 / $200          │           │
│  │     P50: 1.2s P99: 4.5s    │  │     [████████░░] 64%         │           │
│  └─────────────────────────────┘  └─────────────────────────────┘           │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     Token Usage by Model                             │    │
│  │  Claude-3-opus    ████████████████████████████  45%                 │    │
│  │  GPT-4            ██████████████  25%                                │    │
│  │  Claude-3-sonnet  ████████████  20%                                  │    │
│  │  Haiku           ██████  10%                                         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                     Recent Traces                                    │    │
│  │  [OK]  agent-123  task_complete  2.3s  1,234 tokens  $0.12          │    │
│  │  [ERR] agent-456  tool_failed   0.5s    892 tokens  $0.08          │    │
│  │  [OK]  agent-789  task_complete  3.1s  2,456 tokens  $0.23          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Acceptance Criteria

  • All LLM calls are traced end-to-end
  • Decision points are captured with reasoning
  • Cost tracking is accurate within 1% of actual billing
  • Dashboards load in <2 seconds
  • Alerts fire within 30 seconds of threshold breach
  • Log search returns results in <1 second
  • Trace overhead <5% of request latency
  • 30-day data retention by default
  • >90% test coverage
  • Documentation complete

Labels

phase-2, mcp, backend, observability, monitoring

Milestone

Phase 2: MCP Integration
