feat(mcp): Error Recovery & Self-Healing #68

Open
opened 2026-01-03 09:17:07 +00:00 by cardosofelipe · 0 comments

Overview

Implement a comprehensive error recovery and self-healing system that enables agents to gracefully handle failures, automatically retry with alternative strategies, and recover to a known-good state when things go wrong. This is essential for autonomous operation: agents must be resilient.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • Agents fail catastrophically on first error
  • No automatic retry with different approaches
  • Lost work when something goes wrong mid-task
  • No learning from failures (same errors repeat)
  • External service failures cascade to agent failures
  • No graceful degradation (all or nothing)

The Solution

A resilient error handling system that:

  1. Classifies errors to determine appropriate response
  2. Retries intelligently with backoff and alternatives
  3. Recovers state when things go wrong
  4. Degrades gracefully when full functionality is unavailable
  5. Learns from errors to prevent recurrence

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/recovery/ directory
  • Create __init__.py with public API exports
  • Create manager.py with RecoveryManager class
  • Create config.py with Pydantic settings
  • Define error classification schema
  • Design recovery strategy patterns
  • Write architecture decision record (ADR)

2. Error Classification

  • Create classification/classifier.py with error classification
  • Define error taxonomy:
    • TransientError - Temporary, will likely succeed on retry
    • RateLimitError - Need to wait and retry
    • ResourceError - Resource unavailable
    • ValidationError - Input/output validation failed
    • AuthenticationError - Auth issues
    • ExternalServiceError - Third-party service down
    • InternalError - Bug in our code
    • FatalError - Unrecoverable, must abort
  • Implement error pattern matching
  • Create error enrichment (add context)
  • Implement error history tracking
  • Write classification tests
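
A minimal sketch of what classification/classifier.py could look like. The category names come from the taxonomy above; the ErrorClassifier API, the rule table, and the ClassifiedError shape are illustrative assumptions, not a settled design.

# classification/classifier.py (illustrative sketch)
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ClassifiedError:
    """An error enriched with its category and surrounding context."""
    original: Exception
    category: str                      # e.g. "TransientError", "RateLimitError"
    recoverable: bool
    context: dict = field(default_factory=dict)
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ErrorClassifier:
    # Ordered rules: exception type or message substring -> category.
    RULES = [
        (TimeoutError, "TransientError"),
        (ConnectionError, "TransientError"),
        ("rate limit", "RateLimitError"),
        ("unauthorized", "AuthenticationError"),
    ]

    def classify(self, exc: Exception, **context) -> ClassifiedError:
        message = str(exc).lower()
        for rule, category in self.RULES:
            if isinstance(rule, type) and isinstance(exc, rule):
                return ClassifiedError(exc, category, recoverable=True, context=context)
            if isinstance(rule, str) and rule in message:
                recoverable = category not in {"AuthenticationError", "FatalError"}
                return ClassifiedError(exc, category, recoverable=recoverable, context=context)
        # Unknown errors default to InternalError and are treated as non-recoverable.
        return ClassifiedError(exc, "InternalError", recoverable=False, context=context)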

3. Retry Framework

  • Create retry/engine.py with retry logic
  • Implement exponential backoff
  • Implement jitter for thundering herd prevention
  • Create retry budgets (max retries per operation)
  • Implement retry with alternatives:
    • Different LLM model
    • Different tool parameters
    • Different approach entirely
  • Create retry hooks (before, after, on_exhaust)
  • Implement cross-request retry limits
  • Add retry metrics
  • Write retry tests
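
The core backoff math is small; below is a sketch of exponential backoff with full jitter and a per-operation retry budget, reusing the field names from RetryConfig in the Technical Specifications. The retry_with_backoff signature is an assumption.

# retry/engine.py (illustrative sketch)
import asyncio
import random


async def retry_with_backoff(
    operation,                      # async callable to retry
    *,
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
    max_delay_seconds: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
):
    last_error = None
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception as exc:   # in practice, only retry classified-recoverable errors
            last_error = exc
            if attempt == max_attempts - 1:
                break
            delay = min(base_delay_seconds * exponential_base ** attempt, max_delay_seconds)
            if jitter:
                delay = random.uniform(0, delay)   # "full jitter" to avoid thundering herds
            await asyncio.sleep(delay)
    raise last_error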

4. Circuit Breaker

  • Create circuit/breaker.py with circuit breaker pattern
  • Implement states: CLOSED, OPEN, HALF_OPEN
  • Create failure threshold configuration
  • Implement reset timeout
  • Create per-service circuit breakers
  • Implement circuit breaker dashboard
  • Add circuit breaker events
  • Write circuit breaker tests
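
A minimal circuit breaker sketch with the three states above. The thresholds, the monotonic clock, and the allow_request/record_* method names are assumptions.

# circuit/breaker.py (illustrative sketch)
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at >= self.reset_timeout_seconds:
                self.state = "HALF_OPEN"      # let one trial request through
                return True
            return False                      # fail fast while open
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = time.monotonic()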

5. Fallback Strategies

  • Create fallback/strategies.py with fallback patterns
  • Implement model fallback chain (Opus → Sonnet → Haiku)
  • Implement provider fallback (Anthropic → OpenAI → Ollama)
  • Implement tool fallback (alternative tools)
  • Create cached response fallback
  • Implement degraded mode responses
  • Create fallback selection logic
  • Add fallback metrics
  • Write fallback tests
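
Walking the fallback chain could look like the sketch below, with the chain contents taken from the fallback_chain example in the Technical Specifications. FallbackOption, run_with_fallbacks, and the option keyword argument are assumed names.

# fallback/strategies.py (illustrative sketch)
from dataclasses import dataclass


@dataclass
class FallbackOption:
    type: str    # "model" | "provider" | "tool" | "cache"
    value: str


DEFAULT_CHAIN = [
    FallbackOption("model", "claude-3-sonnet"),
    FallbackOption("model", "claude-3-haiku"),
    FallbackOption("provider", "ollama"),
    FallbackOption("cache", "recent"),
]


async def run_with_fallbacks(operation, chain=DEFAULT_CHAIN):
    """Try the primary operation, then each fallback option in order."""
    errors = []
    try:
        return await operation(option=None)           # primary attempt
    except Exception as exc:
        errors.append(exc)
    for option in chain:
        try:
            return await operation(option=option)     # retry with the fallback applied
        except Exception as exc:
            errors.append(exc)
    raise errors[-1]                                  # all fallbacks exhausted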

6. State Recovery

  • Create state/recovery.py with state recovery
  • Implement checkpoint creation
  • Create checkpoint restoration
  • Implement partial progress recovery
  • Create session state serialization
  • Implement working memory recovery
  • Create task queue recovery
  • Add recovery verification
  • Write state recovery tests
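
Checkpointing can start as simple serialization of session state; the Checkpoint file layout and storage path below are assumptions for illustration.

# state/recovery.py (illustrative sketch)
import json
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/mcp_recovery/checkpoints")   # assumed location


def save_checkpoint(session_id: str, state: dict) -> Path:
    """Persist a known-good snapshot of session state."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{session_id}.json"
    path.write_text(json.dumps(state, default=str))
    return path


def restore_checkpoint(session_id: str) -> dict | None:
    """Return the last checkpoint for a session, or None if none exists."""
    path = CHECKPOINT_DIR / f"{session_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())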

7. Transaction Management

  • Create transactions/manager.py with transaction support
  • Implement begin/commit/rollback pattern
  • Create savepoints within transactions
  • Implement compensating transactions
  • Create transaction timeout
  • Implement nested transactions
  • Add transaction logging
  • Write transaction tests
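
A sketch of the begin/commit/rollback pattern built on compensating actions registered alongside each completed step; the context-manager shape and method names are assumptions.

# transactions/manager.py (illustrative sketch)
from contextlib import asynccontextmanager


class Transaction:
    def __init__(self):
        self._compensations = []   # undo actions, run in reverse order on rollback

    def did(self, description: str, compensate) -> None:
        """Record a completed step together with its compensating action."""
        self._compensations.append((description, compensate))

    async def rollback(self) -> None:
        for description, compensate in reversed(self._compensations):
            await compensate()     # best-effort undo; a real implementation would log failures


@asynccontextmanager
async def transaction():
    txn = Transaction()
    try:
        yield txn                  # caller performs steps, registering compensations via txn.did()
    except Exception:
        await txn.rollback()
        raise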

8. Self-Healing Actions

  • Create healing/actions.py with healing actions
  • Implement automatic service restart
  • Create connection pool refresh
  • Implement cache invalidation
  • Create token refresh on auth errors
  • Implement rate limit backoff
  • Create resource cleanup
  • Add healing triggers
  • Write healing tests
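
Healing actions could be a registry keyed by error category so the orchestrator can look up the right repair; the decorator, registry, and action names below are assumptions.

# healing/actions.py (illustrative sketch)
HEALING_ACTIONS = {}


def healing_action(error_category: str):
    """Register a coroutine as the healing action for an error category."""
    def register(fn):
        HEALING_ACTIONS[error_category] = fn
        return fn
    return register


@healing_action("AuthenticationError")
async def refresh_token(context):
    # e.g. re-authenticate and store a fresh token before the retry
    ...


@healing_action("ResourceError")
async def refresh_connection_pool(context):
    # e.g. dispose of and rebuild the shared connection pool
    ...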

9. Error Recovery Workflows

  • Create workflows/recovery.py with recovery workflows
  • Implement LLM error recovery:
    • Rate limit → wait and retry with different model
    • Context too long → compress and retry
    • Invalid response → re-prompt with clarification
  • Implement tool error recovery:
    • Tool timeout → retry with longer timeout
    • Tool failure → try alternative tool
    • Tool not found → suggest similar tool
  • Implement external service recovery:
    • API down → use cache or fallback
    • Slow response → timeout and retry
  • Write workflow tests
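
One of the LLM recovery paths above, sketched end to end. The call_llm and compress_context callables, the exception names, and retry_after attribute are placeholders.

# workflows/recovery.py (illustrative sketch of one LLM recovery path)
import asyncio


class ContextTooLongError(Exception): ...
class RateLimitError(Exception): ...
class RecoveryExhaustedError(Exception): ...


async def call_llm_with_recovery(prompt, context, *, call_llm, compress_context, max_attempts=3):
    """Retry an LLM call, compressing the context or waiting out rate limits as needed."""
    for attempt in range(max_attempts):
        try:
            return await call_llm(prompt, context)
        except ContextTooLongError:
            context = compress_context(context)                            # shrink and retry
        except RateLimitError as exc:
            await asyncio.sleep(getattr(exc, "retry_after", None) or 2 ** attempt)
    raise RecoveryExhaustedError("LLM call failed after compression and rate-limit waits")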

10. Graceful Degradation

  • Create degradation/manager.py with degradation logic
  • Define degradation levels (full, reduced, minimal, offline)
  • Implement feature flags for degradation
  • Create degraded mode behaviors
  • Implement degradation notification
  • Create automatic recovery from degraded mode
  • Add degradation metrics
  • Write degradation tests
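
The degradation levels could be an ordered enum so the manager can step between them; the enum values and the behaviours noted per level are assumptions.

# degradation/manager.py (illustrative sketch)
from enum import IntEnum


class DegradationLevel(IntEnum):
    FULL = 0       # all features available
    REDUCED = 1    # e.g. cheaper models only, non-essential tools disabled
    MINIMAL = 2    # e.g. cached or templated responses only
    OFFLINE = 3    # queue work and report the system as unavailable


class DegradationManager:
    def __init__(self):
        self.level = DegradationLevel.FULL

    def degrade(self) -> DegradationLevel:
        """Step one level down when recovery fails; never past OFFLINE."""
        self.level = DegradationLevel(min(self.level + 1, DegradationLevel.OFFLINE))
        return self.level

    def recover(self) -> DegradationLevel:
        """Step one level back up once health checks pass again."""
        self.level = DegradationLevel(max(self.level - 1, DegradationLevel.FULL))
        return self.level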

11. Error Learning

  • Create learning/analyzer.py with error learning
  • Track error patterns over time
  • Identify recurring errors
  • Create error prevention suggestions
  • Implement adaptive retry strategies
  • Create error trend alerts
  • Implement root cause analysis
  • Write learning tests
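
Recurring-error detection can start as frequency counting over a sliding window; the window size, threshold, and signature format below are illustrative.

# learning/analyzer.py (illustrative sketch)
import time
from collections import defaultdict, deque


class ErrorPatternTracker:
    def __init__(self, window_seconds: float = 3600, recurrence_threshold: int = 5):
        self.window_seconds = window_seconds
        self.recurrence_threshold = recurrence_threshold
        self._events = defaultdict(deque)   # error signature -> timestamps

    def record(self, signature: str) -> None:
        now = time.time()
        events = self._events[signature]
        events.append(now)
        while events and now - events[0] > self.window_seconds:
            events.popleft()                # drop events outside the window

    def recurring(self) -> list[str]:
        """Signatures seen at least recurrence_threshold times within the window."""
        return [s for s, ev in self._events.items() if len(ev) >= self.recurrence_threshold]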

12. Recovery Orchestration

  • Create orchestration/orchestrator.py with recovery orchestration
  • Implement recovery decision tree
  • Create recovery priority levels
  • Implement parallel recovery attempts
  • Create recovery chaining
  • Implement recovery timeout
  • Add orchestration logging
  • Write orchestration tests

13. Health Checks

  • Create health/checker.py with health checking
  • Implement service health checks
  • Create dependency health checks
  • Implement proactive health monitoring
  • Create health-based routing
  • Implement health recovery triggers
  • Add health dashboards
  • Write health check tests
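
Health checks could be registered callables that each report a boolean status per dependency; the HealthChecker API below is an assumption.

# health/checker.py (illustrative sketch)
import asyncio


class HealthChecker:
    def __init__(self):
        self._checks = {}    # name -> async callable returning True when healthy

    def register(self, name: str, check) -> None:
        self._checks[name] = check

    async def run_all(self) -> dict[str, bool]:
        names = list(self._checks)
        results = await asyncio.gather(
            *(self._checks[n]() for n in names), return_exceptions=True
        )
        # An exception raised by a check counts as unhealthy.
        return {n: (r is True) for n, r in zip(names, results)}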

14. MCP Integration

  • Create get_error_status tool - Get current error state
  • Create retry_operation tool - Manually trigger retry
  • Create get_recovery_options tool - List recovery options
  • Create trigger_healing tool - Trigger healing action
  • Create get_fallback_status tool - Get fallback availability
  • Create force_degradation tool - Force degradation mode
  • Write MCP tool tests
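
Registering the recovery tools might look like the sketch below, assuming the FastMCP helper from the MCP Python SDK; the import path and the RecoveryManager methods the tools delegate to are assumptions.

# MCP tool registration (illustrative sketch)
from mcp.server.fastmcp import FastMCP

from mcp_core.recovery.manager import RecoveryManager   # assumed import path

mcp = FastMCP("recovery")
recovery_manager = RecoveryManager()                     # assumed manager API


@mcp.tool()
def get_error_status() -> dict:
    """Get the current error state for the active session."""
    return recovery_manager.error_status()


@mcp.tool()
def retry_operation(operation_id: str) -> dict:
    """Manually trigger a retry of a failed operation."""
    return recovery_manager.retry(operation_id)


@mcp.tool()
def force_degradation(level: str) -> dict:
    """Force a degradation mode (full, reduced, minimal, offline)."""
    return recovery_manager.force_degradation(level)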

15. Metrics & Observability

  • Add Prometheus metrics for recovery
  • Track errors_total by type and severity
  • Track retries_total by operation
  • Track recovery_success_rate gauge
  • Track circuit_breaker_state by service
  • Track fallback_invocations_total counter
  • Track degradation_events_total counter
  • Create recovery dashboards
  • Add alerting rules
  • Write metrics tests
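
A sketch of the metric definitions using prometheus_client; the label sets follow the list above and the exact names may change.

# metrics (illustrative sketch using prometheus_client)
from prometheus_client import Counter, Gauge

# prometheus_client appends "_total" to counters when exposing them,
# so these surface as errors_total, retries_total, and so on.
ERRORS = Counter("errors", "Errors by type and severity", ["type", "severity"])
RETRIES = Counter("retries", "Retries by operation", ["operation"])
FALLBACK_INVOCATIONS = Counter("fallback_invocations", "Fallback chain invocations", ["option"])
DEGRADATION_EVENTS = Counter("degradation_events", "Degradation level changes", ["level"])

RECOVERY_SUCCESS_RATE = Gauge("recovery_success_rate", "Fraction of recoveries that succeeded")
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=half_open, 2=open", ["service"]
)

# Usage example:
# ERRORS.labels(type="TransientError", severity="warning").inc()
# CIRCUIT_BREAKER_STATE.labels(service="anthropic").set(2)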

16. Testing

  • Write unit tests for all components
  • Write integration tests for recovery flows
  • Create chaos tests (inject failures)
  • Write resilience tests (verify recovery)
  • Create load tests with failure injection
  • Achieve >90% code coverage
  • Create regression test suite

17. Documentation

  • Write README with recovery architecture
  • Document error classification
  • Document retry strategies
  • Document fallback configuration
  • Document recovery workflows
  • Create troubleshooting guide
  • Add runbooks for common errors
  • Create best practices

Technical Specifications

Error Classification Hierarchy

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Error Classification                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  RecoverableError                                                            │
│  ├── TransientError (retry immediately or soon)                             │
│  │   ├── NetworkTimeoutError                                                 │
│  │   ├── TemporaryServiceError                                               │
│  │   └── ConcurrencyError                                                    │
│  │                                                                           │
│  ├── RateLimitError (wait then retry)                                        │
│  │   ├── ProviderRateLimitError                                              │
│  │   ├── TokenBudgetExceededError                                            │
│  │   └── APIQuotaExceededError                                               │
│  │                                                                           │
│  ├── ResourceError (try alternative)                                         │
│  │   ├── ModelUnavailableError                                               │
│  │   ├── ToolUnavailableError                                                │
│  │   └── MemoryLimitError                                                    │
│  │                                                                           │
│  └── ValidationError (fix and retry)                                         │
│      ├── InputValidationError                                                │
│      ├── OutputParsingError                                                  │
│      └── ContextTooLongError                                                 │
│                                                                              │
│  NonRecoverableError                                                         │
│  ├── AuthenticationError                                                     │
│  ├── AuthorizationError                                                      │
│  ├── FatalInternalError                                                      │
│  └── PermanentResourceError                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Retry Strategy Configuration

class RetryConfig(BaseModel):
    # Basic retry settings
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 60.0
    exponential_base: float = 2.0
    jitter: bool = True
    
    # Error-specific overrides
    error_overrides: dict[str, RetryOverride] = {
        "RateLimitError": RetryOverride(
            max_attempts=5,
            use_retry_after_header=True,
            min_delay_seconds=60.0
        ),
        "TransientError": RetryOverride(
            max_attempts=5,
            base_delay_seconds=0.5
        ),
        "ValidationError": RetryOverride(
            max_attempts=2,  # Fewer retries, likely won't help
            base_delay_seconds=0.1
        )
    }
    
    # Fallback chain
    fallback_chain: list[FallbackOption] = [
        FallbackOption(type="model", value="claude-3-sonnet"),
        FallbackOption(type="model", value="gpt-4"),
        FallbackOption(type="provider", value="ollama"),
        FallbackOption(type="cache", value="recent"),
    ]

Circuit Breaker States

┌────────┐   failure_threshold exceeded    ┌────────┐
│ CLOSED │ ──────────────────────────────▶│  OPEN  │
│ Normal │                                │ Fail   │
│ traffic│                                │ fast   │
└────────┘                                └────────┘
     ▲                                        │
     │ success                                │ reset_timeout
     │                                        ▼
     │                                  ┌───────────┐
     └──────────────────────────────────│ HALF_OPEN │
                                        │ Test one  │
                                        │ request   │──failure──▶ back to OPEN
                                        └───────────┘

Recovery Decision Tree

def decide_recovery(error: Error) -> RecoveryAction:
    # 1. Classify the error
    error_type = classify_error(error)
    
    # 2. Check if retries exhausted
    if error.retry_count >= get_max_retries(error_type):
        # Move to fallback
        return select_fallback(error)
    
    # 3. Check circuit breaker
    if circuit_breaker.is_open(error.service):
        # Fast fail, try alternative
        return RecoveryAction.USE_ALTERNATIVE
    
    # 4. Select retry strategy
    if error_type == "RateLimitError":
        delay = error.retry_after or calculate_backoff(error)
        return RecoveryAction.RETRY_AFTER(delay)
    
    if error_type == "ValidationError":
        # Try to fix the input
        fixed = attempt_fix(error)
        if fixed:
            return RecoveryAction.RETRY_WITH(fixed)
        return RecoveryAction.FALLBACK
    
    if error_type == "TransientError":
        delay = calculate_backoff(error)
        return RecoveryAction.RETRY_AFTER(delay)
    
    # 5. Non-recoverable
    return RecoveryAction.ABORT

Acceptance Criteria

  • 100% of errors are classified appropriately
  • Transient errors succeed within retry budget 95% of time
  • Circuit breakers prevent cascade failures
  • Fallback chain works for all critical operations
  • State recovery restores >95% of work in progress
  • Graceful degradation maintains core functionality
  • Error learning reduces repeat errors by 20%
  • Recovery adds <100ms to normal operation latency
  • >90% test coverage including chaos tests
  • Documentation complete with runbooks

Labels

phase-2, mcp, backend, recovery, resilience

Milestone

Phase 2: MCP Integration
