feat(mcp): AI Testing Infrastructure #65

Open
opened 2026-01-03 09:13:24 +00:00 by cardosofelipe · 0 comments

Overview

Implement specialized testing infrastructure for AI/LLM systems that handles the unique challenges of testing non-deterministic behavior. Traditional testing approaches don't work for AI: we need deterministic test modes, golden tests, regression suites for prompts, and benchmark frameworks.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • LLM outputs are non-deterministic: the same input can produce different outputs
  • Traditional unit tests fail randomly when testing AI behavior
  • Prompt changes can silently break functionality
  • No way to measure whether AI quality is improving or regressing
  • Hard to reproduce issues in development
  • No benchmarks for comparing model/prompt performance

The Solution

A specialized testing framework that:

  1. Deterministic mode: Seeded, reproducible LLM responses for unit tests
  2. Golden tests: Verified input/output pairs that must pass
  3. Semantic testing: Evaluate meaning, not exact text
  4. Regression testing: Detect when prompt changes break behavior
  5. Benchmarks: Measure quality and performance over time

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/testing/ directory
  • Create __init__.py with public API exports
  • Create framework.py with AITestFramework class
  • Create config.py with Pydantic settings
  • Define testing standards and conventions
  • Create pytest plugin for AI testing
  • Write architecture decision record (ADR)

2. Deterministic Mode

  • Create deterministic/mode.py with deterministic testing
  • Implement LLM response mocking infrastructure
  • Create response fixtures with seed support
  • Implement request/response recording
  • Create playback mode from recordings
  • Implement response stubbing by pattern
  • Create deterministic embedding generation
  • Add deterministic random seeds
  • Implement cross-test isolation
  • Write deterministic mode tests
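
For illustration, a minimal sketch of what the deterministic layer could look like, assuming a hypothetical DeterministicLLMClient that stands in for the real client in tests (the class and method names are placeholders, not the actual mcp_core API):

import hashlib
import json
import random


class DeterministicLLMClient:
    """Fake LLM client that returns reproducible responses.

    Responses come from pre-registered fixtures keyed by a fingerprint of
    the request; unmatched requests fall back to a seeded pseudo-random
    canned reply so repeated runs stay identical.
    """

    def __init__(self, seed: int = 0, fixtures: dict[str, str] | None = None):
        self._seed = seed
        self._fixtures = dict(fixtures or {})

    @staticmethod
    def _fingerprint(prompt: str, **params) -> str:
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def register(self, prompt: str, response: str, **params) -> None:
        """Stub a response for an exact (prompt, params) combination."""
        self._fixtures[self._fingerprint(prompt, **params)] = response

    def complete(self, prompt: str, **params) -> str:
        key = self._fingerprint(prompt, **params)
        if key in self._fixtures:
            return self._fixtures[key]
        # Deterministic fallback: same prompt + same seed -> same reply.
        rng = random.Random(f"{self._seed}:{key}")
        return f"[deterministic-stub:{rng.randint(0, 999999):06d}]"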

3. Response Recording & Playback

  • Create recording/recorder.py with recording logic
  • Implement automatic recording during development
  • Create recording storage (files, database)
  • Implement request fingerprinting for matching
  • Create recording versioning
  • Implement recording cleanup (remove stale recordings)
  • Create selective recording (by test, by module)
  • Add recording compression
  • Write recording tests
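
As a sketch of the fingerprint-and-store approach: a stable hash over the request fields selects one file per recording, and playback simply looks the file up again. The file layout and function names are assumptions, not the final design.

import hashlib
import json
from pathlib import Path


def request_fingerprint(prompt: str, model: str, params: dict) -> str:
    """Stable hash over the fields that should make two requests 'the same'."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class FileRecorder:
    """Stores one JSON file per fingerprint; playback looks the file up again."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def record(self, prompt: str, model: str, params: dict, response: str) -> None:
        fp = request_fingerprint(prompt, model, params)
        (self.root / f"{fp}.json").write_text(
            json.dumps({"prompt": prompt, "model": model, "response": response}, indent=2)
        )

    def playback(self, prompt: str, model: str, params: dict) -> str | None:
        fp = request_fingerprint(prompt, model, params)
        path = self.root / f"{fp}.json"
        if not path.exists():
            return None  # No recording: caller can fail the test or fall through.
        return json.loads(path.read_text())["response"]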

4. Golden Test Framework

  • Create golden/framework.py with golden testing
  • Define golden test schema (input, expected output, criteria)
  • Implement golden test loader from files
  • Create golden test runner
  • Implement exact match evaluation
  • Implement semantic match evaluation (embedding similarity)
  • Implement structured match evaluation (JSON keys/values)
  • Create golden test updating (regenerate expected outputs)
  • Add golden test versioning
  • Create golden test reports
  • Write golden test framework tests
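
A possible shape for the loader and evaluation dispatch, assuming one JSON file per golden case following the GoldenTest schema under Technical Specifications; similarity_fn is a placeholder for the semantic scorer from section 5.

import json
from pathlib import Path


def load_golden_tests(directory: Path) -> list[dict]:
    """Load golden test cases, one JSON file per case."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]


def evaluate_golden(case: dict, actual: str, similarity_fn) -> bool:
    """Dispatch on the case's evaluation mode (see the GoldenTest schema)."""
    mode = case["evaluation_mode"]
    if mode == "exact":
        return actual.strip() == case["expected_response"].strip()
    if mode == "semantic":
        score = similarity_fn(actual, " ".join(case["expected_key_points"]))
        return score >= case.get("similarity_threshold", 0.85)
    if mode == "structured":
        parsed = json.loads(actual)
        return all(field in parsed for field in case.get("required_fields", []))
    # "llm_judge" would delegate to the LLM-as-judge evaluator (section 6).
    raise ValueError(f"Unsupported evaluation mode: {mode}")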

5. Semantic Evaluation

  • Create evaluation/semantic.py with semantic evaluation
  • Implement embedding-based similarity scoring
  • Create semantic equivalence checking
  • Implement key point extraction and matching
  • Create fact consistency checking
  • Implement tone/style matching
  • Create format compliance checking
  • Implement customizable evaluation rubrics
  • Add multi-evaluator aggregation
  • Write semantic evaluation tests
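
The core of embedding-based scoring can be as small as a cosine similarity over whatever embedding callable the project exposes; embed below is an assumed text -> list[float] function, injected so the deterministic embeddings from section 2 can be used in tests.

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_match(actual: str, expected: str, embed, threshold: float = 0.85) -> bool:
    """Compare meaning rather than exact text."""
    return cosine_similarity(embed(actual), embed(expected)) >= threshold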

6. LLM-as-Judge Evaluation

  • Create evaluation/llm_judge.py with LLM evaluation
  • Implement evaluation prompt templates
  • Create scoring rubrics for different criteria
  • Implement pairwise comparison
  • Create quality dimensions:
    • Accuracy (factually correct)
    • Relevance (addresses the question)
    • Completeness (covers all aspects)
    • Clarity (easy to understand)
    • Formatting (proper structure)
  • Implement confidence scoring
  • Create explanation generation
  • Add evaluation caching
  • Write LLM judge tests
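
A hedged sketch of the judge flow: a prompt template asks for JSON scores on the quality dimensions above, and the reply is parsed into a dict. JUDGE_PROMPT and llm_complete are illustrative names, not the final API; in deterministic mode llm_complete would be the stubbed or recorded client, which also makes judge outputs cacheable.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Answer: {answer}

Score the answer from 1 to 5 on each dimension: accuracy, relevance,
completeness, clarity, formatting. Reply with JSON only, e.g.
{{"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4,
  "formatting": 5, "explanation": "..."}}"""


def judge(question: str, answer: str, llm_complete) -> dict:
    """Ask a grading model for per-dimension 1-5 scores plus an explanation."""
    raw = llm_complete(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # {"accuracy": 4, ..., "explanation": "..."}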

7. Regression Testing

  • Create regression/detector.py with regression detection
  • Implement baseline creation from passing tests
  • Create comparison with baseline
  • Implement statistical significance testing
  • Create regression alerts
  • Implement automatic baseline updates
  • Create regression history tracking
  • Add regression reports with diff
  • Implement rollback suggestions
  • Write regression tests
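
One way to keep the false-positive rate down is a statistical test on pass rates rather than alerting on any single failure; a sketch using a one-sided two-proportion z-test (the shape of the inputs is an assumption):

from statistics import NormalDist


def regression_detected(
    baseline_passes: int, baseline_total: int,
    current_passes: int, current_total: int,
    alpha: float = 0.05,
) -> bool:
    """True when the current pass rate is significantly lower than the baseline,
    so a single flaky failure does not fire an alert."""
    p1 = baseline_passes / baseline_total
    p2 = current_passes / current_total
    pooled = (baseline_passes + current_passes) / (baseline_total + current_total)
    se = (pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total)) ** 0.5
    if se == 0:
        return p2 < p1  # Degenerate case: everything passed or everything failed.
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)  # One-sided: current worse than baseline.
    return p_value < alpha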

8. Benchmark Framework

  • Create benchmark/framework.py with benchmarking
  • Define benchmark schemas (tasks, metrics, thresholds)
  • Create benchmark categories:
    • Tool selection accuracy
    • Parameter generation accuracy
    • Task completion rate
    • Error recovery rate
    • Response quality score
    • Latency benchmarks
    • Token efficiency
  • Implement benchmark runner
  • Create benchmark result storage
  • Implement trend analysis
  • Create benchmark comparison reports
  • Add performance thresholds and alerts
  • Write benchmark tests
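
A minimal runner sketch, assuming agent_run is the project's (prompt, context) -> str entry point and that task and threshold shapes follow the Benchmark schemas under Technical Specifications:

import time


def run_benchmark(tasks, agent_run, thresholds: dict[str, float]) -> dict[str, float]:
    """Run each task, collect simple accuracy/latency metrics, check thresholds."""
    correct, latencies = 0, []
    for task in tasks:
        start = time.perf_counter()
        outcome = agent_run(task.prompt, task.context)
        latencies.append(time.perf_counter() - start)
        correct += int(task.expected_outcome in outcome)

    metrics = {
        "accuracy": correct / len(tasks),
        "latency_p50_seconds": sorted(latencies)[len(latencies) // 2],
    }
    # "Higher is better" thresholds (accuracy, quality); latency thresholds
    # would be checked the other way around.
    metrics["passed"] = float(metrics["accuracy"] >= thresholds.get("accuracy", 0.0))
    return metrics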

9. Test Data Generation

  • Create data/generator.py with test data generation
  • Implement synthetic test case generation
  • Create edge case generation
  • Implement adversarial test case generation
  • Create test data variation (parameterized tests)
  • Implement data augmentation
  • Create realistic mock data generation
  • Add test data validation
  • Write data generation tests

10. Chaos Testing for AI

  • Create chaos/runner.py with chaos testing
  • Implement LLM failure injection
  • Create slow response simulation
  • Implement partial response simulation
  • Create malformed response injection
  • Implement token-limit exhaustion simulation
  • Create rate limit simulation
  • Add chaos test scenarios
  • Write chaos test framework tests
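
Failure injection can be a thin wrapper around the LLM client with a seeded RNG, so even chaos scenarios replay deterministically; ChaoticLLMClient is an illustrative name, not an existing class.

import random


class ChaoticLLMClient:
    """Wraps a real or mock LLM client and injects failures with a seeded RNG."""

    def __init__(self, inner, failure_rate: float = 0.1, truncate_rate: float = 0.1,
                 seed: int = 0):
        self.inner = inner
        self.failure_rate = failure_rate
        self.truncate_rate = truncate_rate
        self.rng = random.Random(seed)

    def complete(self, prompt: str, **params) -> str:
        roll = self.rng.random()
        if roll < self.failure_rate:
            raise TimeoutError("chaos: injected LLM timeout")
        response = self.inner.complete(prompt, **params)
        if roll < self.failure_rate + self.truncate_rate:
            return response[: len(response) // 2]  # Simulate a partial response.
        return response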

11. A/B Testing Framework

  • Create ab/framework.py with A/B testing
  • Implement variant definition (prompts, models)
  • Create traffic splitting
  • Implement metric collection per variant
  • Create statistical analysis
  • Implement winner detection
  • Create A/B test reporting
  • Add rollout recommendations
  • Write A/B testing tests
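
Traffic splitting can be made sticky and stateless by hashing the assignment unit; a sketch, where the variant names and shares are examples only:

import hashlib


def assign_variant(unit_id: str, variants: dict[str, float]) -> str:
    """Deterministically assign a unit (user, session, request) to a variant.

    `variants` maps variant name -> traffic share summing to 1.0, e.g.
    {"prompt_v1": 0.5, "prompt_v2": 0.5}. Hashing keeps assignments sticky
    across runs without storing state.
    """
    bucket = int(hashlib.sha256(unit_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    names = list(variants)
    cumulative = 0.0
    for name in names:
        cumulative += variants[name]
        if bucket < cumulative:
            return name
    return names[-1]  # Floating-point rounding at the upper edge.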

12. Test Fixtures & Utilities

  • Create fixtures/llm.py with LLM fixtures
  • Implement mock LLM client
  • Create mock tool responses
  • Implement mock memory/context
  • Create test agent factory
  • Implement test project factory
  • Create assertion helpers
  • Add timing utilities
  • Write fixture tests

13. Coverage Analysis

  • Create coverage/analyzer.py with coverage analysis
  • Implement prompt coverage (which prompts are tested)
  • Create tool coverage (which tools are tested)
  • Implement scenario coverage (which scenarios are tested)
  • Create coverage reports
  • Implement coverage thresholds
  • Add coverage badges
  • Write coverage tests

14. CI/CD Integration

  • Create ci/runner.py with CI integration
  • Implement parallel test execution
  • Create test result aggregation
  • Implement flaky test detection
  • Create test prioritization (run fast tests first)
  • Implement selective test runs (affected tests only)
  • Create CI reporting format
  • Add GitHub/Gitea integration
  • Write CI integration tests

15. Pytest Plugin

  • Create pytest_ai/plugin.py with pytest plugin
  • Implement @pytest.mark.ai_test marker
  • Create @pytest.mark.golden marker
  • Implement @pytest.mark.benchmark marker
  • Create --ai-deterministic flag
  • Implement --ai-record flag
  • Create --ai-playback flag
  • Add --ai-baseline-update flag
  • Implement custom assertions
  • Write plugin tests
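
A skeleton of the plugin wiring using standard pytest hooks; the option and marker names match the list above, everything else (the ai_mode fixture in particular) is a placeholder:

# pytest_ai/plugin.py -- minimal sketch of the options and markers listed above.
import pytest


def pytest_addoption(parser):
    group = parser.getgroup("ai")
    group.addoption("--ai-deterministic", action="store_true",
                    help="Force seeded, reproducible LLM responses.")
    group.addoption("--ai-record", action="store_true",
                    help="Record live LLM responses for later playback.")
    group.addoption("--ai-playback", action="store_true",
                    help="Serve responses from recordings only.")
    group.addoption("--ai-baseline-update", action="store_true",
                    help="Overwrite regression baselines with current results.")


def pytest_configure(config):
    config.addinivalue_line("markers", "ai_test: test exercising AI/LLM behavior")
    config.addinivalue_line("markers", "golden: golden (verified input/output) test")
    config.addinivalue_line("markers", "benchmark: AI benchmark test")


@pytest.fixture
def ai_mode(request):
    """Expose the selected mode to tests and fixtures."""
    if request.config.getoption("--ai-playback"):
        return "playback"
    if request.config.getoption("--ai-record"):
        return "record"
    return "deterministic" if request.config.getoption("--ai-deterministic") else "live"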

16. Metrics & Reporting

  • Add test metrics to Prometheus
  • Track ai_tests_passed_total counter
  • Track ai_tests_failed_total counter
  • Track benchmark_scores gauges
  • Track regression_detected_total counter
  • Create test dashboards
  • Implement test trend visualization
  • Add alerting for test failures
  • Write metrics tests
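
A sketch of the metric definitions with prometheus_client; the metric names mirror the list above, while the label names are assumptions:

from prometheus_client import Counter, Gauge

AI_TESTS_PASSED = Counter("ai_tests_passed_total", "AI tests that passed", ["suite"])
AI_TESTS_FAILED = Counter("ai_tests_failed_total", "AI tests that failed", ["suite"])
BENCHMARK_SCORE = Gauge("benchmark_scores", "Latest benchmark score", ["benchmark", "metric"])
REGRESSIONS = Counter("regression_detected_total", "Detected AI regressions", ["suite"])


def report_result(suite: str, passed: bool) -> None:
    (AI_TESTS_PASSED if passed else AI_TESTS_FAILED).labels(suite=suite).inc()


def report_benchmark(benchmark: str, metrics: dict[str, float]) -> None:
    for name, value in metrics.items():
        BENCHMARK_SCORE.labels(benchmark=benchmark, metric=name).set(value)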

17. Testing

  • Write unit tests for framework components
  • Write integration tests for full framework
  • Write meta-tests (tests that test the framework)
  • Create documentation tests
  • Achieve >90% code coverage
  • Create regression test suite for framework itself

18. Documentation

  • Write README with framework overview
  • Document testing strategies
  • Document golden test creation
  • Document benchmark creation
  • Document CI/CD integration
  • Create testing best practices guide
  • Add troubleshooting guide
  • Create example test suites

Technical Specifications

Deterministic Mode Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         AI Testing Infrastructure                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Deterministic Layer                              │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Recording   │  │  Playback    │  │  Stubbing    │               │    │
│  │  │  Mode        │  │  Mode        │  │  Mode        │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Evaluation Layer                                 │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Semantic    │  │  LLM Judge   │  │  Structured  │               │    │
│  │  │  Similarity  │  │  Evaluation  │  │  Matching    │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Test Types                                       │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Golden      │  │  Regression  │  │  Benchmark   │               │    │
│  │  │  Tests       │  │  Tests       │  │  Tests       │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Golden Test Schema

from datetime import datetime
from typing import Literal

from pydantic import BaseModel

# ToolCall is assumed to be the MCP tool-call model defined elsewhere in mcp_core.


class GoldenTest(BaseModel):
    id: str
    name: str
    description: str
    version: str
    
    # Input
    input_prompt: str
    input_context: dict
    input_tools: list[str]
    
    # Expected output
    expected_response: str | None  # For exact match
    expected_tool_calls: list[ToolCall] | None  # For tool use
    expected_key_points: list[str] | None  # For semantic match
    
    # Evaluation criteria
    evaluation_mode: Literal["exact", "semantic", "structured", "llm_judge"]
    similarity_threshold: float = 0.85  # For semantic
    required_fields: list[str] = []  # For structured
    rubric: dict | None = None  # For llm_judge
    
    # Metadata
    tags: list[str]
    created_at: datetime
    last_passed: datetime | None
    pass_rate: float  # Historical

Benchmark Schema

from datetime import datetime

from pydantic import BaseModel


class BenchmarkTask(BaseModel):
    id: str
    prompt: str
    context: dict
    expected_outcome: str
    max_tokens: int
    max_time_seconds: float


class BenchmarkResult(BaseModel):
    run_id: str
    timestamp: datetime
    model: str
    prompt_version: str
    metrics: dict[str, float]
    passed: bool
    details: dict


# BenchmarkTask and BenchmarkResult are defined first so the annotations
# below resolve without forward references.
class Benchmark(BaseModel):
    id: str
    name: str
    description: str

    # Task definition
    tasks: list[BenchmarkTask]

    # Metrics
    metrics: list[str]  # ["accuracy", "latency", "tokens"]

    # Thresholds
    thresholds: dict[str, float]  # {"accuracy": 0.9, "latency": 1000}

    # Results
    results: list[BenchmarkResult]

Evaluation Rubric Example

QUALITY_RUBRIC = {
    "accuracy": {
        "description": "Response is factually correct",
        "weight": 0.3,
        "criteria": [
            {"score": 5, "description": "Completely accurate, no errors"},
            {"score": 4, "description": "Minor inaccuracies, not material"},
            {"score": 3, "description": "Some inaccuracies, but mostly correct"},
            {"score": 2, "description": "Significant inaccuracies"},
            {"score": 1, "description": "Mostly incorrect"},
        ]
    },
    "relevance": {
        "description": "Response addresses the question",
        "weight": 0.25,
        # ...
    },
    # ...
}
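
Given such a rubric, per-dimension scores (for example the dict returned by the LLM judge in section 6) can be collapsed into a single 0-1 quality score by weighting; a sketch:

def rubric_score(scores: dict, rubric: dict) -> float:
    """Combine per-dimension 1-5 scores into a weighted 0-1 quality score.

    Only dimensions present in the rubric are counted, so extra keys such
    as "explanation" are ignored.
    """
    dims = [name for name in rubric if name in scores]
    total_weight = sum(rubric[name]["weight"] for name in dims)
    weighted = sum(rubric[name]["weight"] * scores[name] for name in dims)
    return weighted / (5 * total_weight) if total_weight else 0.0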

Acceptance Criteria

  • Deterministic mode produces identical results across runs
  • Golden tests catch 100% of prompt regressions
  • Semantic evaluation correlates with human evaluation (>0.8)
  • Benchmarks run in <10 minutes for full suite
  • CI integration works with Gitea Actions
  • >90% of AI functionality has test coverage
  • Regression detection has <5% false positive rate
  • Documentation complete with examples
  • Framework itself has >90% test coverage

Labels

phase-2, mcp, backend, testing, quality

Milestone

Phase 2: MCP Integration
