feat(mcp): AI Testing Infrastructure #65

Open
opened 2026-01-03 09:13:24 +00:00 by cardosofelipe · 0 comments

Overview

Implement specialized testing infrastructure for AI/LLM systems that handles the unique challenges of testing non-deterministic behavior. Traditional testing approaches don't work for AI: we need deterministic test modes, golden tests, regression suites for prompts, and benchmark frameworks.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • LLM outputs are non-deterministic: the same input can produce different outputs
  • Traditional unit tests fail randomly when testing AI behavior
  • Prompt changes can silently break functionality
  • No way to measure whether AI quality is improving or regressing
  • Hard to reproduce issues in development
  • No benchmarks for comparing model/prompt performance

The Solution

A specialized testing framework that:

  1. Deterministic mode: Seeded, reproducible LLM responses for unit tests
  2. Golden tests: Verified input/output pairs that must pass
  3. Semantic testing: Evaluate meaning, not exact text
  4. Regression testing: Detect when prompt changes break behavior
  5. Benchmarks: Measure quality and performance over time

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/testing/ directory
  • Create __init__.py with public API exports
  • Create framework.py with AITestFramework class
  • Create config.py with Pydantic settings
  • Define testing standards and conventions
  • Create pytest plugin for AI testing
  • Write architecture decision record (ADR)

2. Deterministic Mode

  • Create deterministic/mode.py with deterministic testing
  • Implement LLM response mocking infrastructure
  • Create response fixtures with seed support
  • Implement request/response recording
  • Create playback mode from recordings
  • Implement response stubbing by pattern
  • Create deterministic embedding generation
  • Add deterministic random seeds
  • Implement cross-test isolation
  • Write deterministic mode tests
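
For illustration, a minimal sketch of what the deterministic layer could look like, assuming a hypothetical DeterministicLLMClient that stands in for the real client in tests (the class and method names are placeholders, not the actual mcp_core API):

import hashlib
import json
import random


class DeterministicLLMClient:
    """Fake LLM client that returns reproducible responses.

    Responses come from pre-registered fixtures keyed by a fingerprint of
    the request; unmatched requests fall back to a seeded pseudo-random
    canned reply so repeated runs stay identical.
    """

    def __init__(self, seed: int = 0, fixtures: dict[str, str] | None = None):
        self._seed = seed
        self._fixtures = dict(fixtures or {})

    @staticmethod
    def _fingerprint(prompt: str, **params) -> str:
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def register(self, prompt: str, response: str, **params) -> None:
        """Stub a response for an exact (prompt, params) combination."""
        self._fixtures[self._fingerprint(prompt, **params)] = response

    def complete(self, prompt: str, **params) -> str:
        key = self._fingerprint(prompt, **params)
        if key in self._fixtures:
            return self._fixtures[key]
        # Deterministic fallback: same prompt + same seed -> same reply.
        rng = random.Random(f"{self._seed}:{key}")
        return f"[deterministic-stub:{rng.randint(0, 999999):06d}]"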

3. Response Recording & Playback

  • Create recording/recorder.py with recording logic
  • Implement automatic recording during development
  • Create recording storage (files, database)
  • Implement request fingerprinting for matching
  • Create recording versioning
  • Implement recording cleanup (remove stale recordings)
  • Create selective recording (by test, by module)
  • Add recording compression
  • Write recording tests
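
As a sketch of the fingerprint-and-store approach: a stable hash over the request fields selects one file per recording, and playback simply looks the file up again. The file layout and function names are assumptions, not the final design.

import hashlib
import json
from pathlib import Path


def request_fingerprint(prompt: str, model: str, params: dict) -> str:
    """Stable hash over the fields that should make two requests 'the same'."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class FileRecorder:
    """Stores one JSON file per fingerprint; playback looks the file up again."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def record(self, prompt: str, model: str, params: dict, response: str) -> None:
        fp = request_fingerprint(prompt, model, params)
        (self.root / f"{fp}.json").write_text(
            json.dumps({"prompt": prompt, "model": model, "response": response}, indent=2)
        )

    def playback(self, prompt: str, model: str, params: dict) -> str | None:
        fp = request_fingerprint(prompt, model, params)
        path = self.root / f"{fp}.json"
        if not path.exists():
            return None  # No recording: caller can fail the test or fall through.
        return json.loads(path.read_text())["response"]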

4. Golden Test Framework

  • Create golden/framework.py with golden testing
  • Define golden test schema (input, expected output, criteria)
  • Implement golden test loader from files
  • Create golden test runner
  • Implement exact match evaluation
  • Implement semantic match evaluation (embedding similarity)
  • Implement structured match evaluation (JSON keys/values)
  • Create golden test updating (regenerate expected outputs)
  • Add golden test versioning
  • Create golden test reports
  • Write golden test framework tests
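
A possible shape for the loader and evaluation dispatch, assuming one JSON file per golden case following the GoldenTest schema under Technical Specifications; similarity_fn is a placeholder for the semantic scorer from section 5.

import json
from pathlib import Path


def load_golden_tests(directory: Path) -> list[dict]:
    """Load golden test cases, one JSON file per case."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]


def evaluate_golden(case: dict, actual: str, similarity_fn) -> bool:
    """Dispatch on the case's evaluation mode (see the GoldenTest schema)."""
    mode = case["evaluation_mode"]
    if mode == "exact":
        return actual.strip() == case["expected_response"].strip()
    if mode == "semantic":
        score = similarity_fn(actual, " ".join(case["expected_key_points"]))
        return score >= case.get("similarity_threshold", 0.85)
    if mode == "structured":
        parsed = json.loads(actual)
        return all(field in parsed for field in case.get("required_fields", []))
    # "llm_judge" would delegate to the LLM-as-judge evaluator (section 6).
    raise ValueError(f"Unsupported evaluation mode: {mode}")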

5. Semantic Evaluation

  • Create evaluation/semantic.py with semantic evaluation
  • Implement embedding-based similarity scoring
  • Create semantic equivalence checking
  • Implement key point extraction and matching
  • Create fact consistency checking
  • Implement tone/style matching
  • Create format compliance checking
  • Implement customizable evaluation rubrics
  • Add multi-evaluator aggregation
  • Write semantic evaluation tests
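
The core of embedding-based scoring can be as small as a cosine similarity over whatever embedding callable the project exposes; embed below is an assumed text -> list[float] function, injected so the deterministic embeddings from section 2 can be used in tests.

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_match(actual: str, expected: str, embed, threshold: float = 0.85) -> bool:
    """Compare meaning rather than exact text."""
    return cosine_similarity(embed(actual), embed(expected)) >= threshold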

6. LLM-as-Judge Evaluation

  • Create evaluation/llm_judge.py with LLM evaluation
  • Implement evaluation prompt templates
  • Create scoring rubrics for different criteria
  • Implement pairwise comparison
  • Create quality dimensions:
    • Accuracy (factually correct)
    • Relevance (addresses the question)
    • Completeness (covers all aspects)
    • Clarity (easy to understand)
    • Formatting (proper structure)
  • Implement confidence scoring
  • Create explanation generation
  • Add evaluation caching
  • Write LLM judge tests
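
A hedged sketch of the judge flow: a prompt template asks for JSON scores on the quality dimensions above, and the reply is parsed into a dict. JUDGE_PROMPT and llm_complete are illustrative names, not the final API; in deterministic mode llm_complete would be the stubbed or recorded client, which also makes judge outputs cacheable.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Answer: {answer}

Score the answer from 1 to 5 on each dimension: accuracy, relevance,
completeness, clarity, formatting. Reply with JSON only, e.g.
{{"accuracy": 4, "relevance": 5, "completeness": 3, "clarity": 4,
  "formatting": 5, "explanation": "..."}}"""


def judge(question: str, answer: str, llm_complete) -> dict:
    """Ask a grading model for per-dimension 1-5 scores plus an explanation."""
    raw = llm_complete(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # {"accuracy": 4, ..., "explanation": "..."}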

7. Regression Testing

  • Create regression/detector.py with regression detection
  • Implement baseline creation from passing tests
  • Create comparison with baseline
  • Implement statistical significance testing
  • Create regression alerts
  • Implement automatic baseline updates
  • Create regression history tracking
  • Add regression reports with diff
  • Implement rollback suggestions
  • Write regression tests
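
One way to keep the false-positive rate down is a statistical test on pass rates rather than alerting on any single failure; a sketch using a one-sided two-proportion z-test (the shape of the inputs is an assumption):

from statistics import NormalDist


def regression_detected(
    baseline_passes: int, baseline_total: int,
    current_passes: int, current_total: int,
    alpha: float = 0.05,
) -> bool:
    """True when the current pass rate is significantly lower than the baseline,
    so a single flaky failure does not fire an alert."""
    p1 = baseline_passes / baseline_total
    p2 = current_passes / current_total
    pooled = (baseline_passes + current_passes) / (baseline_total + current_total)
    se = (pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total)) ** 0.5
    if se == 0:
        return p2 < p1  # Degenerate case: everything passed or everything failed.
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)  # One-sided: current worse than baseline.
    return p_value < alpha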

8. Benchmark Framework

  • Create benchmark/framework.py with benchmarking
  • Define benchmark schemas (tasks, metrics, thresholds)
  • Create benchmark categories:
    • Tool selection accuracy
    • Parameter generation accuracy
    • Task completion rate
    • Error recovery rate
    • Response quality score
    • Latency benchmarks
    • Token efficiency
  • Implement benchmark runner
  • Create benchmark result storage
  • Implement trend analysis
  • Create benchmark comparison reports
  • Add performance thresholds and alerts
  • Write benchmark tests
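
A minimal runner sketch, assuming agent_run is the project's (prompt, context) -> str entry point and that task and threshold shapes follow the Benchmark schemas under Technical Specifications:

import time


def run_benchmark(tasks, agent_run, thresholds: dict[str, float]) -> dict[str, float]:
    """Run each task, collect simple accuracy/latency metrics, check thresholds."""
    correct, latencies = 0, []
    for task in tasks:
        start = time.perf_counter()
        outcome = agent_run(task.prompt, task.context)
        latencies.append(time.perf_counter() - start)
        correct += int(task.expected_outcome in outcome)

    metrics = {
        "accuracy": correct / len(tasks),
        "latency_p50_seconds": sorted(latencies)[len(latencies) // 2],
    }
    # "Higher is better" thresholds (accuracy, quality); latency thresholds
    # would be checked the other way around.
    metrics["passed"] = float(metrics["accuracy"] >= thresholds.get("accuracy", 0.0))
    return metrics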

9. Test Data Generation

  • Create data/generator.py with test data generation
  • Implement synthetic test case generation
  • Create edge case generation
  • Implement adversarial test case generation
  • Create test data variation (parameterized tests)
  • Implement data augmentation
  • Create realistic mock data generation
  • Add test data validation
  • Write data generation tests

10. Chaos Testing for AI

  • Create chaos/runner.py with chaos testing
  • Implement LLM failure injection
  • Create slow response simulation
  • Implement partial response simulation
  • Create malformed response injection
  • Implement token-limit exhaustion simulation
  • Create rate limit simulation
  • Add chaos test scenarios
  • Write chaos test framework tests
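
Failure injection can be a thin wrapper around the LLM client with a seeded RNG, so even chaos scenarios replay deterministically; ChaoticLLMClient is an illustrative name, not an existing class.

import random


class ChaoticLLMClient:
    """Wraps a real or mock LLM client and injects failures with a seeded RNG."""

    def __init__(self, inner, failure_rate: float = 0.1, truncate_rate: float = 0.1,
                 seed: int = 0):
        self.inner = inner
        self.failure_rate = failure_rate
        self.truncate_rate = truncate_rate
        self.rng = random.Random(seed)

    def complete(self, prompt: str, **params) -> str:
        roll = self.rng.random()
        if roll < self.failure_rate:
            raise TimeoutError("chaos: injected LLM timeout")
        response = self.inner.complete(prompt, **params)
        if roll < self.failure_rate + self.truncate_rate:
            return response[: len(response) // 2]  # Simulate a partial response.
        return response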

11. A/B Testing Framework

  • Create ab/framework.py with A/B testing
  • Implement variant definition (prompts, models)
  • Create traffic splitting
  • Implement metric collection per variant
  • Create statistical analysis
  • Implement winner detection
  • Create A/B test reporting
  • Add rollout recommendations
  • Write A/B testing tests
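
Traffic splitting can be made sticky and stateless by hashing the assignment unit; a sketch, where the variant names and shares are examples only:

import hashlib


def assign_variant(unit_id: str, variants: dict[str, float]) -> str:
    """Deterministically assign a unit (user, session, request) to a variant.

    `variants` maps variant name -> traffic share summing to 1.0, e.g.
    {"prompt_v1": 0.5, "prompt_v2": 0.5}. Hashing keeps assignments sticky
    across runs without storing state.
    """
    bucket = int(hashlib.sha256(unit_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    names = list(variants)
    cumulative = 0.0
    for name in names:
        cumulative += variants[name]
        if bucket < cumulative:
            return name
    return names[-1]  # Floating-point rounding at the upper edge.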

12. Test Fixtures & Utilities

  • Create fixtures/llm.py with LLM fixtures
  • Implement mock LLM client
  • Create mock tool responses
  • Implement mock memory/context
  • Create test agent factory
  • Implement test project factory
  • Create assertion helpers
  • Add timing utilities
  • Write fixture tests

13. Coverage Analysis

  • Create coverage/analyzer.py with coverage analysis
  • Implement prompt coverage (which prompts are tested)
  • Create tool coverage (which tools are tested)
  • Implement scenario coverage (which scenarios are tested)
  • Create coverage reports
  • Implement coverage thresholds
  • Add coverage badges
  • Write coverage tests

14. CI/CD Integration

  • Create ci/runner.py with CI integration
  • Implement parallel test execution
  • Create test result aggregation
  • Implement flaky test detection
  • Create test prioritization (run fast tests first)
  • Implement selective test runs (affected tests only)
  • Create CI reporting format
  • Add GitHub/Gitea integration
  • Write CI integration tests

15. Pytest Plugin

  • Create pytest_ai/plugin.py with pytest plugin
  • Implement @pytest.mark.ai_test marker
  • Create @pytest.mark.golden marker
  • Implement @pytest.mark.benchmark marker
  • Create --ai-deterministic flag
  • Implement --ai-record flag
  • Create --ai-playback flag
  • Add --ai-baseline-update flag
  • Implement custom assertions
  • Write plugin tests
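
A skeleton of the plugin wiring using standard pytest hooks; the option and marker names match the list above, everything else (the ai_mode fixture in particular) is a placeholder:

# pytest_ai/plugin.py -- minimal sketch of the options and markers listed above.
import pytest


def pytest_addoption(parser):
    group = parser.getgroup("ai")
    group.addoption("--ai-deterministic", action="store_true",
                    help="Force seeded, reproducible LLM responses.")
    group.addoption("--ai-record", action="store_true",
                    help="Record live LLM responses for later playback.")
    group.addoption("--ai-playback", action="store_true",
                    help="Serve responses from recordings only.")
    group.addoption("--ai-baseline-update", action="store_true",
                    help="Overwrite regression baselines with current results.")


def pytest_configure(config):
    config.addinivalue_line("markers", "ai_test: test exercising AI/LLM behavior")
    config.addinivalue_line("markers", "golden: golden (verified input/output) test")
    config.addinivalue_line("markers", "benchmark: AI benchmark test")


@pytest.fixture
def ai_mode(request):
    """Expose the selected mode to tests and fixtures."""
    if request.config.getoption("--ai-playback"):
        return "playback"
    if request.config.getoption("--ai-record"):
        return "record"
    return "deterministic" if request.config.getoption("--ai-deterministic") else "live"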

16. Metrics & Reporting

  • Add test metrics to Prometheus
  • Track ai_tests_passed_total counter
  • Track ai_tests_failed_total counter
  • Track benchmark_scores gauges
  • Track regression_detected_total counter
  • Create test dashboards
  • Implement test trend visualization
  • Add alerting for test failures
  • Write metrics tests
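
A sketch of the metric definitions with prometheus_client; the metric names mirror the list above, while the label names are assumptions:

from prometheus_client import Counter, Gauge

AI_TESTS_PASSED = Counter("ai_tests_passed_total", "AI tests that passed", ["suite"])
AI_TESTS_FAILED = Counter("ai_tests_failed_total", "AI tests that failed", ["suite"])
BENCHMARK_SCORE = Gauge("benchmark_scores", "Latest benchmark score", ["benchmark", "metric"])
REGRESSIONS = Counter("regression_detected_total", "Detected AI regressions", ["suite"])


def report_result(suite: str, passed: bool) -> None:
    (AI_TESTS_PASSED if passed else AI_TESTS_FAILED).labels(suite=suite).inc()


def report_benchmark(benchmark: str, metrics: dict[str, float]) -> None:
    for name, value in metrics.items():
        BENCHMARK_SCORE.labels(benchmark=benchmark, metric=name).set(value)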

17. Testing

  • Write unit tests for framework components
  • Write integration tests for full framework
  • Write meta-tests (tests that test the framework)
  • Create documentation tests
  • Achieve >90% code coverage
  • Create regression test suite for framework itself

18. Documentation

  • Write README with framework overview
  • Document testing strategies
  • Document golden test creation
  • Document benchmark creation
  • Document CI/CD integration
  • Create testing best practices guide
  • Add troubleshooting guide
  • Create example test suites

Technical Specifications

Deterministic Mode Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         AI Testing Infrastructure                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Deterministic Layer                              │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Recording   │  │  Playback    │  │  Stubbing    │               │    │
│  │  │  Mode        │  │  Mode        │  │  Mode        │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Evaluation Layer                                 │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Semantic    │  │  LLM Judge   │  │  Structured  │               │    │
│  │  │  Similarity  │  │  Evaluation  │  │  Matching    │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Test Types                                       │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │    │
│  │  │  Golden      │  │  Regression  │  │  Benchmark   │               │    │
│  │  │  Tests       │  │  Tests       │  │  Tests       │               │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Golden Test Schema

from datetime import datetime
from typing import Literal

from pydantic import BaseModel

# ToolCall is assumed to be the MCP tool-call model defined elsewhere in mcp_core.


class GoldenTest(BaseModel):
    id: str
    name: str
    description: str
    version: str
    
    # Input
    input_prompt: str
    input_context: dict
    input_tools: list[str]
    
    # Expected output
    expected_response: str | None  # For exact match
    expected_tool_calls: list[ToolCall] | None  # For tool use
    expected_key_points: list[str] | None  # For semantic match
    
    # Evaluation criteria
    evaluation_mode: Literal["exact", "semantic", "structured", "llm_judge"]
    similarity_threshold: float = 0.85  # For semantic
    required_fields: list[str] = []  # For structured
    rubric: dict | None = None  # For llm_judge
    
    # Metadata
    tags: list[str]
    created_at: datetime
    last_passed: datetime | None
    pass_rate: float  # Historical

Benchmark Schema

from datetime import datetime

from pydantic import BaseModel


class BenchmarkTask(BaseModel):
    id: str
    prompt: str
    context: dict
    expected_outcome: str
    max_tokens: int
    max_time_seconds: float


class BenchmarkResult(BaseModel):
    run_id: str
    timestamp: datetime
    model: str
    prompt_version: str
    metrics: dict[str, float]
    passed: bool
    details: dict


# BenchmarkTask and BenchmarkResult are defined first so the annotations
# below resolve without forward references.
class Benchmark(BaseModel):
    id: str
    name: str
    description: str

    # Task definition
    tasks: list[BenchmarkTask]

    # Metrics
    metrics: list[str]  # ["accuracy", "latency", "tokens"]

    # Thresholds
    thresholds: dict[str, float]  # {"accuracy": 0.9, "latency": 1000}

    # Results
    results: list[BenchmarkResult]

Evaluation Rubric Example

QUALITY_RUBRIC = {
    "accuracy": {
        "description": "Response is factually correct",
        "weight": 0.3,
        "criteria": [
            {"score": 5, "description": "Completely accurate, no errors"},
            {"score": 4, "description": "Minor inaccuracies, not material"},
            {"score": 3, "description": "Some inaccuracies, but mostly correct"},
            {"score": 2, "description": "Significant inaccuracies"},
            {"score": 1, "description": "Mostly incorrect"},
        ]
    },
    "relevance": {
        "description": "Response addresses the question",
        "weight": 0.25,
        # ...
    },
    # ...
}
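
Given such a rubric, per-dimension scores (for example the dict returned by the LLM judge in section 6) can be collapsed into a single 0-1 quality score by weighting; a sketch:

def rubric_score(scores: dict, rubric: dict) -> float:
    """Combine per-dimension 1-5 scores into a weighted 0-1 quality score.

    Only dimensions present in the rubric are counted, so extra keys such
    as "explanation" are ignored.
    """
    dims = [name for name in rubric if name in scores]
    total_weight = sum(rubric[name]["weight"] for name in dims)
    weighted = sum(rubric[name]["weight"] * scores[name] for name in dims)
    return weighted / (5 * total_weight) if total_weight else 0.0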

Acceptance Criteria

  • Deterministic mode produces identical results across runs
  • Golden tests catch 100% of prompt regressions
  • Semantic evaluation correlates with human evaluation (>0.8)
  • Benchmarks run in <10 minutes for full suite
  • CI integration works with Gitea Actions
  • >90% of AI functionality has test coverage
  • Regression detection has <5% false positive rate
  • Documentation complete with examples
  • Framework itself has >90% test coverage

Labels

phase-2, mcp, backend, testing, quality

Milestone

Phase 2: MCP Integration
