feat(mcp): Tool Quality Framework #64

Open
opened 2026-01-03 09:12:00 +00:00 by cardosofelipe · 0 comments

Overview

Implement a framework that ensures all MCP tools are high-quality, well-documented, and optimized for LLM consumption. This is critical for enabling smaller, cheaper models to use tools effectively: they need crystal-clear descriptions, rich examples, and helpful error messages.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • Tool descriptions are often vague or ambiguous
  • Smaller models struggle with poorly documented tools
  • Error messages don't help LLMs understand what went wrong
  • No standardization across MCP servers
  • Missing examples lead to incorrect tool usage
  • No way to measure tool quality or effectiveness

The Solution

A comprehensive framework that:

  1. Standardizes tool schemas with rich metadata
  2. Generates examples automatically and validates them
  3. Provides helpful errors that guide LLM recovery
  4. Measures tool quality and usage patterns
  5. Validates inputs with clear error messages
  6. Tests tool effectiveness with LLMs

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/tools/ directory
  • Create __init__.py with public API exports
  • Create framework.py with ToolFramework class
  • Create config.py with Pydantic settings
  • Define tool quality standards
  • Create tool registration pattern
  • Write architecture decision record (ADR)
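
The registration pattern could look roughly like the sketch below: a decorator-based ToolFramework registry. The register/schema method names and the search_notes example tool are illustrative, not a final API.

from collections.abc import Callable
from typing import Any


class ToolFramework:
    """Minimal registry sketch: maps tool names to handlers and their enhanced schemas."""

    def __init__(self) -> None:
        self._handlers: dict[str, Callable[..., Any]] = {}
        self._schemas: dict[str, dict[str, Any]] = {}

    def register(self, name: str, schema: dict[str, Any]) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that registers a handler together with its enhanced schema."""
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[name] = func
            self._schemas[name] = schema
            return func
        return decorator

    def schema(self, name: str) -> dict[str, Any]:
        return self._schemas[name]


framework = ToolFramework()


@framework.register("search_notes", schema={"description_short": "Search notes by keyword"})
def search_notes(query: str) -> list[str]:
    return []  # placeholder handler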

2. Enhanced Tool Schema

  • Create schema/tool_schema.py with enhanced schema
  • Extend JSON Schema with custom metadata fields
  • Add description_short (one-liner for tool lists)
  • Add description_detailed (full explanation)
  • Add when_to_use (guidance for LLMs)
  • Add when_not_to_use (common mistakes)
  • Add prerequisites (what must be done first)
  • Add related_tools (similar or complementary tools)
  • Add examples (see Examples section)
  • Add common_errors (error patterns and solutions)
  • Add performance_hints (expected latency, costs)
  • Create schema validation
  • Write schema tests

3. Example Generation & Validation

  • Create examples/generator.py for example generation
  • Define example schema (input, output, description)
  • Implement manual example registration
  • Implement automatic example generation from usage logs
  • Create example validation (run and verify output)
  • Implement example categorization (basic, advanced, edge case)
  • Create example selection for context (most relevant)
  • Add negative examples (what NOT to do)
  • Create example refresh (update stale examples)
  • Write example validation tests
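
Example validation ("run and verify output") can start as a naive equality check, as in this sketch; it assumes examples follow the ToolExample shape from the Technical Specifications and that the tool handler is directly callable with the example input.

from collections.abc import Callable
from typing import Any


def validate_example(handler: Callable[..., dict[str, Any]],
                     example_input: dict[str, Any],
                     expected_output: dict[str, Any]) -> bool:
    """Run a documented example against the real handler and flag drift."""
    actual = handler(**example_input)
    # Naive check: every documented output key must be present and equal.
    # A real implementation would likely tolerate volatile fields (timestamps, ids).
    return all(actual.get(key) == value for key, value in expected_output.items())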

4. Input Validation Framework

  • Create validation/input_validator.py with validation logic
  • Implement JSON Schema validation with clear errors
  • Implement custom validators per field type
  • Create validation for:
    • Required fields with helpful "missing X" messages
    • Type checking with "expected X, got Y" messages
    • Range validation with "X must be between A and B"
    • Pattern validation with "X must match format Y"
    • Enum validation with "X must be one of [A, B, C]"
    • Dependency validation with "X requires Y to be set"
  • Implement validation hints (did you mean X?)
  • Create validation caching for performance
  • Write validation tests with edge cases
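
One way to produce these messages is to post-process violations reported by the jsonschema package; a rough sketch (the exact message wording is illustrative):

from typing import Any

from jsonschema import Draft202012Validator


def validate_input(schema: dict[str, Any], params: dict[str, Any]) -> list[str]:
    """Return LLM-readable validation messages instead of raw validation errors."""
    messages: list[str] = []
    for error in Draft202012Validator(schema).iter_errors(params):
        field = ".".join(str(p) for p in error.path) or "<input>"
        if error.validator == "required":
            messages.append(error.message)  # e.g. "'query' is a required property"
        elif error.validator == "type":
            messages.append(f"{field}: expected {error.validator_value}, got {type(error.instance).__name__}")
        elif error.validator == "enum":
            messages.append(f"{field} must be one of {error.validator_value}")
        else:
            messages.append(f"{field}: {error.message}")
    return messages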

5. Error Response Framework

  • Create errors/framework.py with error handling
  • Define error response schema for LLMs
  • Create error categories:
    • InputError - Problem with input parameters
    • StateError - Prerequisites not met
    • ResourceError - Resource not found/accessible
    • PermissionError - Insufficient permissions
    • RateLimitError - Rate/quota exceeded
    • ExternalError - External service failure
    • InternalError - Unexpected internal error
  • Implement error enrichment:
    • Add what_went_wrong - Clear explanation
    • Add why_it_happened - Root cause
    • Add how_to_fix - Actionable steps
    • Add related_docs - Links to documentation
    • Add retry_after - When to retry (if applicable)
  • Create error logging and metrics
  • Write error handling tests
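
Error enrichment could wrap raw exceptions into the ToolError shape defined under Technical Specifications; a hedged sketch in which the category mapping and fix suggestions are illustrative:

def enrich_error(exc: Exception) -> dict:
    """Map a raw exception to an LLM-friendly error payload (see ToolError below)."""
    if isinstance(exc, ValueError):
        return {
            "error_type": "input_error",
            "what_went_wrong": str(exc),
            "why_it_happened": "One or more input parameters failed validation.",
            "how_to_fix": ["Check parameter types and ranges against the tool schema",
                           "Fetch a working invocation via get_tool_examples"],
            "retry_recommended": True,
            "retry_after_seconds": None,
        }
    return {
        "error_type": "internal_error",
        "what_went_wrong": "The tool failed unexpectedly.",
        "why_it_happened": exc.__class__.__name__,
        "how_to_fix": ["Retry once; if it persists, report it via report_tool_issue"],
        "retry_recommended": True,
        "retry_after_seconds": 5,
    }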

6. Tool Documentation Generator

  • Create docs/generator.py with documentation generation
  • Generate Markdown documentation from schema
  • Generate OpenAPI-style documentation
  • Generate LLM-optimized tool cards
  • Create documentation versioning
  • Implement documentation validation
  • Create documentation search index
  • Generate usage examples section
  • Write documentation tests
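
Rendering an LLM-optimized tool card from the enhanced schema might look roughly like this (field names follow the ToolSchema sketch in Technical Specifications; the card layout is illustrative):

def render_tool_card(schema: "ToolSchema") -> str:
    """Render a compact Markdown card an LLM can read before calling the tool."""
    lines = [
        f"## {schema.name} (v{schema.version})",
        schema.description_short,
        f"**When to use:** {schema.when_to_use}",
        f"**When NOT to use:** {schema.when_not_to_use}",
        "**Examples:**",
    ]
    for example in schema.examples[:3]:  # keep the card small for tight context windows
        lines.append(f"- {example.description}: input={example.input}")
    return "\n".join(lines)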

7. Tool Discovery & Search

  • Create discovery/service.py with discovery logic
  • Implement tool registry with metadata
  • Create tool search by keyword
  • Create tool search by capability
  • Create tool recommendation based on context
  • Implement tool similarity scoring
  • Create tool categories and tags
  • Implement tool versioning and deprecation
  • Write discovery tests
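
Keyword search over the registry can start as simple token-overlap scoring before anything fancier; a minimal sketch (the scoring heuristic is illustrative):

def search_tools(registry: dict[str, dict], query: str, limit: int = 5) -> list[str]:
    """Rank registered tools by naive keyword overlap with descriptions and tags."""
    terms = set(query.lower().split())

    def score(meta: dict) -> int:
        haystack = " ".join([
            meta.get("description_short", ""),
            meta.get("description_detailed", ""),
            " ".join(meta.get("tags", [])),
        ]).lower()
        return sum(1 for term in terms if term in haystack)

    ranked = sorted(registry, key=lambda name: score(registry[name]), reverse=True)
    return [name for name in ranked if score(registry[name]) > 0][:limit]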

8. Tool Quality Metrics

  • Create metrics/quality.py with quality scoring
  • Define quality dimensions:
    • Description completeness (0-100%)
    • Example coverage (0-100%)
    • Error clarity (0-100%)
    • Usage success rate (0-100%)
    • LLM understanding rate (measured via tests)
  • Implement automated quality scoring
  • Create quality reports per tool
  • Create quality dashboards
  • Implement quality alerts (low-quality tools)
  • Write quality metric tests

9. Tool Usage Analytics

  • Create analytics/usage.py with usage tracking
  • Track tool invocation counts
  • Track success/failure rates
  • Track common error patterns
  • Track input parameter distributions
  • Track usage by agent type
  • Track usage by task type
  • Create usage insights (which tools go together)
  • Implement usage-based optimization hints
  • Write analytics tests
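
Usage tracking can start as in-process counters before being wired to a persistent metrics backend; a minimal sketch (the storage backend is left open):

from collections import Counter, defaultdict


class UsageTracker:
    """In-memory sketch of per-tool invocation and success/failure counters."""

    def __init__(self) -> None:
        self.invocations: Counter[str] = Counter()
        self.failures: Counter[str] = Counter()
        self.error_patterns: dict[str, Counter[str]] = defaultdict(Counter)

    def record(self, tool: str, success: bool, error_type: str | None = None) -> None:
        self.invocations[tool] += 1
        if not success:
            self.failures[tool] += 1
            if error_type:
                self.error_patterns[tool][error_type] += 1

    def success_rate(self, tool: str) -> float:
        total = self.invocations[tool]
        return 1.0 if total == 0 else 1 - self.failures[tool] / total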

10. LLM Effectiveness Testing

  • Create testing/llm_tests.py with LLM testing
  • Define test scenarios (tasks requiring tools)
  • Implement test runner with LLM
  • Measure tool selection accuracy
  • Measure parameter generation accuracy
  • Measure error recovery success
  • Create A/B testing for descriptions
  • Implement regression testing for descriptions
  • Create test reports with recommendations
  • Write meta-tests for testing framework
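
Tool selection accuracy could be measured over a small scenario set as sketched below; ask_llm_to_pick_tool is a hypothetical stand-in for whatever LLM call the test runner ends up using.

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Scenario:
    task: str            # natural-language task given to the model
    expected_tool: str   # tool a well-documented registry should lead it to


def tool_selection_accuracy(scenarios: list[Scenario],
                            ask_llm_to_pick_tool: Callable[[str], str]) -> float:
    """Fraction of scenarios where the model picked the expected tool."""
    if not scenarios:
        return 0.0
    hits = sum(1 for s in scenarios if ask_llm_to_pick_tool(s.task) == s.expected_tool)
    return hits / len(scenarios)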

11. Tool Optimization

  • Create optimization/optimizer.py with optimization logic
  • Analyze usage patterns for optimization opportunities
  • Suggest description improvements based on errors
  • Suggest example additions based on failures
  • Implement automated description refinement (with human review)
  • Create tool composition recommendations
  • Implement caching recommendations
  • Write optimization tests

12. Idempotency Framework

  • Create idempotency/manager.py with idempotency logic
  • Define idempotency keys for tools
  • Implement idempotency checking
  • Create idempotency storage (Redis)
  • Handle duplicate requests gracefully
  • Implement idempotency TTL
  • Add idempotency metrics
  • Document idempotent vs non-idempotent tools
  • Write idempotency tests
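
Duplicate-request handling with Redis could lean on SET ... NX EX semantics; a hedged sketch using redis-py (the key prefix and default TTL are illustrative):

import json

import redis


class IdempotencyManager:
    """Store the first result for an idempotency key; replay it for duplicates."""

    def __init__(self, client: redis.Redis, ttl_seconds: int = 3600) -> None:
        self.client = client
        self.ttl = ttl_seconds

    def get_cached(self, key: str) -> dict | None:
        cached = self.client.get(f"idem:{key}")
        return json.loads(cached) if cached else None

    def store(self, key: str, result: dict) -> bool:
        # NX: only the first writer wins, so duplicates keep the original result.
        return bool(self.client.set(f"idem:{key}", json.dumps(result), nx=True, ex=self.ttl))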

13. Tool Versioning

  • Create versioning/manager.py with version management
  • Implement semantic versioning for tools
  • Create version compatibility matrix
  • Implement graceful deprecation
  • Create migration guides for breaking changes
  • Implement version negotiation
  • Add version metrics
  • Write versioning tests
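
Version negotiation could start from a simple semver rule (same major version, served minor not behind the requested minor); a sketch using the packaging library:

from packaging.version import Version


def is_compatible(requested: str, served: str) -> bool:
    """Semver-style check: same major, and the served minor is at least the requested one."""
    req, srv = Version(requested), Version(served)
    return req.major == srv.major and srv.minor >= req.minor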

14. MCP Integration

  • Create list_tools tool with enhanced metadata
  • Create get_tool_details tool with full schema
  • Create get_tool_examples tool for specific tool
  • Create search_tools tool with query
  • Create recommend_tools tool based on task
  • Create report_tool_issue tool for feedback
  • Integrate with all MCP servers
  • Write MCP tool tests
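
These meta-tools could be exposed through the MCP Python SDK's FastMCP server, roughly as sketched below; the in-memory TOOL_REGISTRY is a placeholder for the framework registry described earlier.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tool-quality")

# Placeholder registry; in practice this is backed by the ToolFramework registry.
TOOL_REGISTRY: dict[str, dict] = {}


@mcp.tool()
def get_tool_details(name: str) -> dict:
    """Return the full enhanced schema for a single tool."""
    return TOOL_REGISTRY.get(name, {
        "error_type": "resource_error",
        "what_went_wrong": f"No tool named '{name}' is registered.",
    })


@mcp.tool()
def search_tools(query: str) -> list[str]:
    """Return tool names whose short description mentions the query."""
    q = query.lower()
    return [name for name, meta in TOOL_REGISTRY.items()
            if q in meta.get("description_short", "").lower()]


if __name__ == "__main__":
    mcp.run()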

15. Testing

  • Write unit tests for schema validation
  • Write unit tests for input validation
  • Write unit tests for error handling
  • Write integration tests for full framework
  • Write LLM-based tool effectiveness tests
  • Write performance benchmarks
  • Achieve >90% code coverage
  • Create regression test suite

16. Documentation

  • Write README with framework overview
  • Document tool schema requirements
  • Document quality standards
  • Document example requirements
  • Document error response guidelines
  • Create tool creation guide
  • Add troubleshooting guide
  • Create best practices checklist

Technical Specifications

Enhanced Tool Schema

from typing import Any

from pydantic import BaseModel

# Assumed alias: tool parameters are described by a plain JSON Schema document.
JSONSchema = dict[str, Any]

# Note: ToolExample, CommonError, and ToolPerformance are defined after ToolSchema
# below for readability; in a real module they would be declared (or imported) first,
# or referenced as forward references.

class ToolSchema(BaseModel):
    # Basic info
    name: str
    version: str
    
    # Descriptions (optimized for LLMs)
    description_short: str  # Max 100 chars, for tool lists
    description_detailed: str  # Full explanation
    when_to_use: str  # "Use this tool when..."
    when_not_to_use: str  # "Do NOT use this tool when..."
    
    # Parameters
    parameters: JSONSchema
    required_params: list[str]
    optional_params: list[str]
    
    # Guidance
    prerequisites: list[str]  # What must be done first
    related_tools: list[str]  # Similar or complementary
    common_workflows: list[str]  # Common tool sequences
    
    # Examples
    examples: list[ToolExample]
    
    # Error guidance
    common_errors: list[CommonError]
    
    # Metadata
    performance: ToolPerformance
    category: str
    tags: list[str]
    
class ToolExample(BaseModel):
    description: str  # What this example demonstrates
    input: dict  # Example input
    output: dict  # Expected output
    context: str  # When to use this approach
    
class CommonError(BaseModel):
    error_type: str
    symptom: str  # What you'll see
    cause: str  # Why it happens
    solution: str  # How to fix it
    
class ToolPerformance(BaseModel):
    typical_latency_ms: int
    max_latency_ms: int
    token_cost_estimate: int
    is_idempotent: bool
    is_reversible: bool

Error Response Schema

from typing import Literal

from pydantic import BaseModel


class ToolError(BaseModel):
    error_type: Literal[
        "input_error",
        "state_error", 
        "resource_error",
        "permission_error",
        "rate_limit_error",
        "external_error",
        "internal_error"
    ]
    
    # For LLM understanding
    what_went_wrong: str  # Plain English explanation
    why_it_happened: str  # Root cause
    how_to_fix: list[str]  # Actionable steps
    
    # Technical details
    error_code: str
    details: dict
    
    # Recovery hints
    retry_recommended: bool
    retry_after_seconds: int | None
    alternative_tools: list[str]
    related_docs: list[str]

Quality Score Calculation

Quality Score = (
    0.25 * description_completeness +
    0.20 * example_coverage +
    0.20 * error_clarity +
    0.20 * usage_success_rate +
    0.15 * llm_understanding_rate
)

Thresholds:
- Production Ready: >= 80%
- Needs Improvement: 60-79%
- Not Ready: < 60%
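
The same weighted score as a small function over per-dimension values in [0, 1] (weights copied from the formula above):

WEIGHTS = {
    "description_completeness": 0.25,
    "example_coverage": 0.20,
    "error_clarity": 0.20,
    "usage_success_rate": 0.20,
    "llm_understanding_rate": 0.15,
}


def quality_score(dimensions: dict[str, float]) -> float:
    """Weighted quality score in [0, 1]; 0.80 and above counts as production ready."""
    return sum(weight * dimensions.get(name, 0.0) for name, weight in WEIGHTS.items())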

Acceptance Criteria

  • All tools have complete enhanced schemas
  • All tools have ≥3 examples (basic, advanced, edge case)
  • Error responses are LLM-friendly with recovery hints
  • Tool quality scores are ≥80% for all production tools
  • LLM effectiveness tests pass for all tools
  • Input validation catches 100% of invalid inputs
  • Documentation is auto-generated and up-to-date
  • Usage analytics are tracked for all tools
  • >90% test coverage
  • Tool creation guide is complete

Labels

phase-2, mcp, backend, tools, quality

Milestone

Phase 2: MCP Integration
