feat(mcp): Tool Quality Framework #64

Open
opened 2026-01-03 09:12:00 +00:00 by cardosofelipe · 0 comments

Overview

Implement a framework that ensures all MCP tools are high-quality, well-documented, and optimized for LLM consumption. This is critical for enabling smaller, cheaper models to use tools effectively: they need crystal-clear descriptions, rich examples, and helpful error messages.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • Tool descriptions are often vague or ambiguous
  • Smaller models struggle with poorly documented tools
  • Error messages don't help LLMs understand what went wrong
  • No standardization across MCP servers
  • Missing examples lead to incorrect tool usage
  • No way to measure tool quality or effectiveness

The Solution

A comprehensive framework that:

  1. Standardizes tool schemas with rich metadata
  2. Generates examples automatically and validates them
  3. Provides helpful errors that guide LLM recovery
  4. Measures tool quality and usage patterns
  5. Validates inputs with clear error messages
  6. Tests tool effectiveness with LLMs

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/tools/ directory
  • Create __init__.py with public API exports
  • Create framework.py with ToolFramework class
  • Create config.py with Pydantic settings
  • Define tool quality standards
  • Create tool registration pattern
  • Write architecture decision record (ADR)
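
The registration pattern could look roughly like the sketch below: a decorator-based ToolFramework registry. The register/schema method names and the search_notes example tool are illustrative, not a final API.

from collections.abc import Callable
from typing import Any


class ToolFramework:
    """Minimal registry sketch: maps tool names to handlers and their enhanced schemas."""

    def __init__(self) -> None:
        self._handlers: dict[str, Callable[..., Any]] = {}
        self._schemas: dict[str, dict[str, Any]] = {}

    def register(self, name: str, schema: dict[str, Any]) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that registers a handler together with its enhanced schema."""
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[name] = func
            self._schemas[name] = schema
            return func
        return decorator

    def schema(self, name: str) -> dict[str, Any]:
        return self._schemas[name]


framework = ToolFramework()


@framework.register("search_notes", schema={"description_short": "Search notes by keyword"})
def search_notes(query: str) -> list[str]:
    return []  # placeholder handler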

2. Enhanced Tool Schema

  • Create schema/tool_schema.py with enhanced schema
  • Extend JSON Schema with custom metadata fields
  • Add description_short (one-liner for tool lists)
  • Add description_detailed (full explanation)
  • Add when_to_use (guidance for LLMs)
  • Add when_not_to_use (common mistakes)
  • Add prerequisites (what must be done first)
  • Add related_tools (similar or complementary tools)
  • Add examples (see Examples section)
  • Add common_errors (error patterns and solutions)
  • Add performance_hints (expected latency, costs)
  • Create schema validation
  • Write schema tests

3. Example Generation & Validation

  • Create examples/generator.py for example generation
  • Define example schema (input, output, description)
  • Implement manual example registration
  • Implement automatic example generation from usage logs
  • Create example validation (run and verify output)
  • Implement example categorization (basic, advanced, edge case)
  • Create example selection for context (most relevant)
  • Add negative examples (what NOT to do)
  • Create example refresh (update stale examples)
  • Write example validation tests
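
Example validation ("run and verify output") can start as a naive equality check, as in this sketch; it assumes examples follow the ToolExample shape from the Technical Specifications and that the tool handler is directly callable with the example input.

from collections.abc import Callable
from typing import Any


def validate_example(handler: Callable[..., dict[str, Any]],
                     example_input: dict[str, Any],
                     expected_output: dict[str, Any]) -> bool:
    """Run a documented example against the real handler and flag drift."""
    actual = handler(**example_input)
    # Naive check: every documented output key must be present and equal.
    # A real implementation would likely tolerate volatile fields (timestamps, ids).
    return all(actual.get(key) == value for key, value in expected_output.items())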

4. Input Validation Framework

  • Create validation/input_validator.py with validation logic
  • Implement JSON Schema validation with clear errors
  • Implement custom validators per field type
  • Create validation for:
    • Required fields with helpful "missing X" messages
    • Type checking with "expected X, got Y" messages
    • Range validation with "X must be between A and B"
    • Pattern validation with "X must match format Y"
    • Enum validation with "X must be one of [A, B, C]"
    • Dependency validation with "X requires Y to be set"
  • Implement validation hints (did you mean X?)
  • Create validation caching for performance
  • Write validation tests with edge cases
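
One way to produce these messages is to post-process violations reported by the jsonschema package; a rough sketch (the exact message wording is illustrative):

from typing import Any

from jsonschema import Draft202012Validator


def validate_input(schema: dict[str, Any], params: dict[str, Any]) -> list[str]:
    """Return LLM-readable validation messages instead of raw validation errors."""
    messages: list[str] = []
    for error in Draft202012Validator(schema).iter_errors(params):
        field = ".".join(str(p) for p in error.path) or "<input>"
        if error.validator == "required":
            messages.append(error.message)  # e.g. "'query' is a required property"
        elif error.validator == "type":
            messages.append(f"{field}: expected {error.validator_value}, got {type(error.instance).__name__}")
        elif error.validator == "enum":
            messages.append(f"{field} must be one of {error.validator_value}")
        else:
            messages.append(f"{field}: {error.message}")
    return messages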

5. Error Response Framework

  • Create errors/framework.py with error handling
  • Define error response schema for LLMs
  • Create error categories:
    • InputError - Problem with input parameters
    • StateError - Prerequisites not met
    • ResourceError - Resource not found/accessible
    • PermissionError - Insufficient permissions
    • RateLimitError - Rate/quota exceeded
    • ExternalError - External service failure
    • InternalError - Unexpected internal error
  • Implement error enrichment:
    • Add what_went_wrong - Clear explanation
    • Add why_it_happened - Root cause
    • Add how_to_fix - Actionable steps
    • Add related_docs - Links to documentation
    • Add retry_after - When to retry (if applicable)
  • Create error logging and metrics
  • Write error handling tests
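
Error enrichment could wrap raw exceptions into the ToolError shape defined under Technical Specifications; a hedged sketch in which the category mapping and fix suggestions are illustrative:

def enrich_error(exc: Exception) -> dict:
    """Map a raw exception to an LLM-friendly error payload (see ToolError below)."""
    if isinstance(exc, ValueError):
        return {
            "error_type": "input_error",
            "what_went_wrong": str(exc),
            "why_it_happened": "One or more input parameters failed validation.",
            "how_to_fix": ["Check parameter types and ranges against the tool schema",
                           "Fetch a working invocation via get_tool_examples"],
            "retry_recommended": True,
            "retry_after_seconds": None,
        }
    return {
        "error_type": "internal_error",
        "what_went_wrong": "The tool failed unexpectedly.",
        "why_it_happened": exc.__class__.__name__,
        "how_to_fix": ["Retry once; if it persists, report it via report_tool_issue"],
        "retry_recommended": True,
        "retry_after_seconds": 5,
    }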

6. Tool Documentation Generator

  • Create docs/generator.py with documentation generation
  • Generate Markdown documentation from schema
  • Generate OpenAPI-style documentation
  • Generate LLM-optimized tool cards
  • Create documentation versioning
  • Implement documentation validation
  • Create documentation search index
  • Generate usage examples section
  • Write documentation tests
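
Rendering an LLM-optimized tool card from the enhanced schema might look roughly like this (field names follow the ToolSchema sketch in Technical Specifications; the card layout is illustrative):

def render_tool_card(schema: "ToolSchema") -> str:
    """Render a compact Markdown card an LLM can read before calling the tool."""
    lines = [
        f"## {schema.name} (v{schema.version})",
        schema.description_short,
        f"**When to use:** {schema.when_to_use}",
        f"**When NOT to use:** {schema.when_not_to_use}",
        "**Examples:**",
    ]
    for example in schema.examples[:3]:  # keep the card small for tight context windows
        lines.append(f"- {example.description}: input={example.input}")
    return "\n".join(lines)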

7. Tool Discovery & Search

  • Create discovery/service.py with discovery logic
  • Implement tool registry with metadata
  • Create tool search by keyword
  • Create tool search by capability
  • Create tool recommendation based on context
  • Implement tool similarity scoring
  • Create tool categories and tags
  • Implement tool versioning and deprecation
  • Write discovery tests
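
Keyword search over the registry can start as simple token-overlap scoring before anything fancier; a minimal sketch (the scoring heuristic is illustrative):

def search_tools(registry: dict[str, dict], query: str, limit: int = 5) -> list[str]:
    """Rank registered tools by naive keyword overlap with descriptions and tags."""
    terms = set(query.lower().split())

    def score(meta: dict) -> int:
        haystack = " ".join([
            meta.get("description_short", ""),
            meta.get("description_detailed", ""),
            " ".join(meta.get("tags", [])),
        ]).lower()
        return sum(1 for term in terms if term in haystack)

    ranked = sorted(registry, key=lambda name: score(registry[name]), reverse=True)
    return [name for name in ranked if score(registry[name]) > 0][:limit]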

8. Tool Quality Metrics

  • Create metrics/quality.py with quality scoring
  • Define quality dimensions:
    • Description completeness (0-100%)
    • Example coverage (0-100%)
    • Error clarity (0-100%)
    • Usage success rate (0-100%)
    • LLM understanding rate (measured via tests)
  • Implement automated quality scoring
  • Create quality reports per tool
  • Create quality dashboards
  • Implement quality alerts (low-quality tools)
  • Write quality metric tests

9. Tool Usage Analytics

  • Create analytics/usage.py with usage tracking
  • Track tool invocation counts
  • Track success/failure rates
  • Track common error patterns
  • Track input parameter distributions
  • Track usage by agent type
  • Track usage by task type
  • Create usage insights (which tools go together)
  • Implement usage-based optimization hints
  • Write analytics tests
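
Usage tracking can start as in-process counters before being wired to a persistent metrics backend; a minimal sketch (the storage backend is left open):

from collections import Counter, defaultdict


class UsageTracker:
    """In-memory sketch of per-tool invocation and success/failure counters."""

    def __init__(self) -> None:
        self.invocations: Counter[str] = Counter()
        self.failures: Counter[str] = Counter()
        self.error_patterns: dict[str, Counter[str]] = defaultdict(Counter)

    def record(self, tool: str, success: bool, error_type: str | None = None) -> None:
        self.invocations[tool] += 1
        if not success:
            self.failures[tool] += 1
            if error_type:
                self.error_patterns[tool][error_type] += 1

    def success_rate(self, tool: str) -> float:
        total = self.invocations[tool]
        return 1.0 if total == 0 else 1 - self.failures[tool] / total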

10. LLM Effectiveness Testing

  • Create testing/llm_tests.py with LLM testing
  • Define test scenarios (tasks requiring tools)
  • Implement test runner with LLM
  • Measure tool selection accuracy
  • Measure parameter generation accuracy
  • Measure error recovery success
  • Create A/B testing for descriptions
  • Implement regression testing for descriptions
  • Create test reports with recommendations
  • Write meta-tests for testing framework
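
Tool selection accuracy could be measured over a small scenario set as sketched below; ask_llm_to_pick_tool is a hypothetical stand-in for whatever LLM call the test runner ends up using.

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Scenario:
    task: str            # natural-language task given to the model
    expected_tool: str   # tool a well-documented registry should lead it to


def tool_selection_accuracy(scenarios: list[Scenario],
                            ask_llm_to_pick_tool: Callable[[str], str]) -> float:
    """Fraction of scenarios where the model picked the expected tool."""
    if not scenarios:
        return 0.0
    hits = sum(1 for s in scenarios if ask_llm_to_pick_tool(s.task) == s.expected_tool)
    return hits / len(scenarios)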

11. Tool Optimization

  • Create optimization/optimizer.py with optimization logic
  • Analyze usage patterns for optimization opportunities
  • Suggest description improvements based on errors
  • Suggest example additions based on failures
  • Implement automated description refinement (with human review)
  • Create tool composition recommendations
  • Implement caching recommendations
  • Write optimization tests

12. Idempotency Framework

  • Create idempotency/manager.py with idempotency logic
  • Define idempotency keys for tools
  • Implement idempotency checking
  • Create idempotency storage (Redis)
  • Handle duplicate requests gracefully
  • Implement idempotency TTL
  • Add idempotency metrics
  • Document idempotent vs non-idempotent tools
  • Write idempotency tests
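
Duplicate-request handling with Redis could lean on SET ... NX EX semantics; a hedged sketch using redis-py (the key prefix and default TTL are illustrative):

import json

import redis


class IdempotencyManager:
    """Store the first result for an idempotency key; replay it for duplicates."""

    def __init__(self, client: redis.Redis, ttl_seconds: int = 3600) -> None:
        self.client = client
        self.ttl = ttl_seconds

    def get_cached(self, key: str) -> dict | None:
        cached = self.client.get(f"idem:{key}")
        return json.loads(cached) if cached else None

    def store(self, key: str, result: dict) -> bool:
        # NX: only the first writer wins, so duplicates keep the original result.
        return bool(self.client.set(f"idem:{key}", json.dumps(result), nx=True, ex=self.ttl))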

13. Tool Versioning

  • Create versioning/manager.py with version management
  • Implement semantic versioning for tools
  • Create version compatibility matrix
  • Implement graceful deprecation
  • Create migration guides for breaking changes
  • Implement version negotiation
  • Add version metrics
  • Write versioning tests
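
Version negotiation could start from a simple semver rule (same major version, served minor not behind the requested minor); a sketch using the packaging library:

from packaging.version import Version


def is_compatible(requested: str, served: str) -> bool:
    """Semver-style check: same major, and the served minor is at least the requested one."""
    req, srv = Version(requested), Version(served)
    return req.major == srv.major and srv.minor >= req.minor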

14. MCP Integration

  • Create list_tools tool with enhanced metadata
  • Create get_tool_details tool with full schema
  • Create get_tool_examples tool for specific tool
  • Create search_tools tool with query
  • Create recommend_tools tool based on task
  • Create report_tool_issue tool for feedback
  • Integrate with all MCP servers
  • Write MCP tool tests
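
These meta-tools could be exposed through the MCP Python SDK's FastMCP server, roughly as sketched below; the in-memory TOOL_REGISTRY is a placeholder for the framework registry described earlier.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tool-quality")

# Placeholder registry; in practice this is backed by the ToolFramework registry.
TOOL_REGISTRY: dict[str, dict] = {}


@mcp.tool()
def get_tool_details(name: str) -> dict:
    """Return the full enhanced schema for a single tool."""
    return TOOL_REGISTRY.get(name, {
        "error_type": "resource_error",
        "what_went_wrong": f"No tool named '{name}' is registered.",
    })


@mcp.tool()
def search_tools(query: str) -> list[str]:
    """Return tool names whose short description mentions the query."""
    q = query.lower()
    return [name for name, meta in TOOL_REGISTRY.items()
            if q in meta.get("description_short", "").lower()]


if __name__ == "__main__":
    mcp.run()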

15. Testing

  • Write unit tests for schema validation
  • Write unit tests for input validation
  • Write unit tests for error handling
  • Write integration tests for full framework
  • Write LLM-based tool effectiveness tests
  • Write performance benchmarks
  • Achieve >90% code coverage
  • Create regression test suite

16. Documentation

  • Write README with framework overview
  • Document tool schema requirements
  • Document quality standards
  • Document example requirements
  • Document error response guidelines
  • Create tool creation guide
  • Add troubleshooting guide
  • Create best practices checklist

Technical Specifications

Enhanced Tool Schema

from typing import Any

from pydantic import BaseModel

# Assumed alias: tool parameters are described by a plain JSON Schema document.
JSONSchema = dict[str, Any]

# Note: ToolExample, CommonError, and ToolPerformance are defined after ToolSchema
# below for readability; in a real module they would be declared (or imported) first,
# or referenced as forward references.

class ToolSchema(BaseModel):
    # Basic info
    name: str
    version: str
    
    # Descriptions (optimized for LLMs)
    description_short: str  # Max 100 chars, for tool lists
    description_detailed: str  # Full explanation
    when_to_use: str  # "Use this tool when..."
    when_not_to_use: str  # "Do NOT use this tool when..."
    
    # Parameters
    parameters: JSONSchema
    required_params: list[str]
    optional_params: list[str]
    
    # Guidance
    prerequisites: list[str]  # What must be done first
    related_tools: list[str]  # Similar or complementary
    common_workflows: list[str]  # Common tool sequences
    
    # Examples
    examples: list[ToolExample]
    
    # Error guidance
    common_errors: list[CommonError]
    
    # Metadata
    performance: ToolPerformance
    category: str
    tags: list[str]
    
class ToolExample(BaseModel):
    description: str  # What this example demonstrates
    input: dict  # Example input
    output: dict  # Expected output
    context: str  # When to use this approach
    
class CommonError(BaseModel):
    error_type: str
    symptom: str  # What you'll see
    cause: str  # Why it happens
    solution: str  # How to fix it
    
class ToolPerformance(BaseModel):
    typical_latency_ms: int
    max_latency_ms: int
    token_cost_estimate: int
    is_idempotent: bool
    is_reversible: bool

Error Response Schema

from typing import Literal

from pydantic import BaseModel


class ToolError(BaseModel):
    error_type: Literal[
        "input_error",
        "state_error", 
        "resource_error",
        "permission_error",
        "rate_limit_error",
        "external_error",
        "internal_error"
    ]
    
    # For LLM understanding
    what_went_wrong: str  # Plain English explanation
    why_it_happened: str  # Root cause
    how_to_fix: list[str]  # Actionable steps
    
    # Technical details
    error_code: str
    details: dict
    
    # Recovery hints
    retry_recommended: bool
    retry_after_seconds: int | None
    alternative_tools: list[str]
    related_docs: list[str]

Quality Score Calculation

Quality Score = (
    0.25 * description_completeness +
    0.20 * example_coverage +
    0.20 * error_clarity +
    0.20 * usage_success_rate +
    0.15 * llm_understanding_rate
)

Thresholds:
- Production Ready: >= 80%
- Needs Improvement: 60-79%
- Not Ready: < 60%
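
The same weighted score as a small function over per-dimension values in [0, 1] (weights copied from the formula above):

WEIGHTS = {
    "description_completeness": 0.25,
    "example_coverage": 0.20,
    "error_clarity": 0.20,
    "usage_success_rate": 0.20,
    "llm_understanding_rate": 0.15,
}


def quality_score(dimensions: dict[str, float]) -> float:
    """Weighted quality score in [0, 1]; 0.80 and above counts as production ready."""
    return sum(weight * dimensions.get(name, 0.0) for name, weight in WEIGHTS.items())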

Acceptance Criteria

  • All tools have complete enhanced schemas
  • All tools have ≥3 examples (basic, advanced, edge case)
  • Error responses are LLM-friendly with recovery hints
  • Tool quality scores are ≥80% for all production tools
  • LLM effectiveness tests pass for all tools
  • Input validation catches 100% of invalid inputs
  • Documentation is auto-generated and up-to-date
  • Usage analytics are tracked for all tools
  • >90% test coverage
  • Tool creation guide is complete

Labels

phase-2, mcp, backend, tools, quality

Milestone

Phase 2: MCP Integration
