feat(mcp): Error Recovery & Self-Healing #68

Open
opened 2026-01-03 09:17:07 +00:00 by cardosofelipe · 0 comments

Overview

Implement a comprehensive error recovery and self-healing system that enables agents to gracefully handle failures, automatically retry with alternative strategies, and recover to a known-good state when things go wrong. This is essential for autonomous operation: agents must be resilient.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Problem

  • Agents fail catastrophically on first error
  • No automatic retry with different approaches
  • Lost work when something goes wrong mid-task
  • No learning from failures (same errors repeat)
  • External service failures cascade to agent failures
  • No graceful degradation (all or nothing)

The Solution

A resilient error handling system that:

  1. Classifies errors to determine appropriate response
  2. Retries intelligently with backoff and alternatives
  3. Recovers state when things go wrong
  4. Degrades gracefully when full functionality is unavailable
  5. Learns from errors to prevent recurrence

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/recovery/ directory
  • Create __init__.py with public API exports
  • Create manager.py with RecoveryManager class
  • Create config.py with Pydantic settings
  • Define error classification schema
  • Design recovery strategy patterns
  • Write architecture decision record (ADR)

2. Error Classification

  • Create classification/classifier.py with error classification
  • Define error taxonomy:
    • TransientError - Temporary, will likely succeed on retry
    • RateLimitError - Need to wait and retry
    • ResourceError - Resource unavailable
    • ValidationError - Input/output validation failed
    • AuthenticationError - Auth issues
    • ExternalServiceError - Third-party service down
    • InternalError - Bug in our code
    • FatalError - Unrecoverable, must abort
  • Implement error pattern matching
  • Create error enrichment (add context)
  • Implement error history tracking
  • Write classification tests
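
A minimal sketch of what classification/classifier.py could look like. The category names come from the taxonomy above; the ErrorClassifier API, the rule table, and the ClassifiedError shape are illustrative assumptions, not a settled design.

# classification/classifier.py (illustrative sketch)
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ClassifiedError:
    """An error enriched with its category and surrounding context."""
    original: Exception
    category: str                      # e.g. "TransientError", "RateLimitError"
    recoverable: bool
    context: dict = field(default_factory=dict)
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ErrorClassifier:
    # Ordered rules: exception type or message substring -> category.
    RULES = [
        (TimeoutError, "TransientError"),
        (ConnectionError, "TransientError"),
        ("rate limit", "RateLimitError"),
        ("unauthorized", "AuthenticationError"),
    ]

    def classify(self, exc: Exception, **context) -> ClassifiedError:
        message = str(exc).lower()
        for rule, category in self.RULES:
            if isinstance(rule, type) and isinstance(exc, rule):
                return ClassifiedError(exc, category, recoverable=True, context=context)
            if isinstance(rule, str) and rule in message:
                recoverable = category not in {"AuthenticationError", "FatalError"}
                return ClassifiedError(exc, category, recoverable=recoverable, context=context)
        # Unknown errors default to InternalError and are treated as non-recoverable.
        return ClassifiedError(exc, "InternalError", recoverable=False, context=context)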

3. Retry Framework

  • Create retry/engine.py with retry logic
  • Implement exponential backoff
  • Implement jitter for thundering herd prevention
  • Create retry budgets (max retries per operation)
  • Implement retry with alternatives:
    • Different LLM model
    • Different tool parameters
    • Different approach entirely
  • Create retry hooks (before, after, on_exhaust)
  • Implement cross-request retry limits
  • Add retry metrics
  • Write retry tests
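
The core backoff math is small; below is a sketch of exponential backoff with full jitter and a per-operation retry budget, reusing the field names from RetryConfig in the Technical Specifications. The retry_with_backoff signature is an assumption.

# retry/engine.py (illustrative sketch)
import asyncio
import random


async def retry_with_backoff(
    operation,                      # async callable to retry
    *,
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
    max_delay_seconds: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
):
    last_error = None
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception as exc:   # in practice, only retry classified-recoverable errors
            last_error = exc
            if attempt == max_attempts - 1:
                break
            delay = min(base_delay_seconds * exponential_base ** attempt, max_delay_seconds)
            if jitter:
                delay = random.uniform(0, delay)   # "full jitter" to avoid thundering herds
            await asyncio.sleep(delay)
    raise last_error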

4. Circuit Breaker

  • Create circuit/breaker.py with circuit breaker pattern
  • Implement states: CLOSED, OPEN, HALF_OPEN
  • Create failure threshold configuration
  • Implement reset timeout
  • Create per-service circuit breakers
  • Implement circuit breaker dashboard
  • Add circuit breaker events
  • Write circuit breaker tests
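
A minimal circuit breaker sketch with the three states above. The thresholds, the monotonic clock, and the allow_request/record_* method names are assumptions.

# circuit/breaker.py (illustrative sketch)
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self._opened_at >= self.reset_timeout_seconds:
                self.state = "HALF_OPEN"      # let one trial request through
                return True
            return False                      # fail fast while open
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = time.monotonic()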

5. Fallback Strategies

  • Create fallback/strategies.py with fallback patterns
  • Implement model fallback chain (Opus → Sonnet → Haiku)
  • Implement provider fallback (Anthropic → OpenAI → Ollama)
  • Implement tool fallback (alternative tools)
  • Create cached response fallback
  • Implement degraded mode responses
  • Create fallback selection logic
  • Add fallback metrics
  • Write fallback tests
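
Walking the fallback chain could look like the sketch below, with the chain contents taken from the fallback_chain example in the Technical Specifications. FallbackOption, run_with_fallbacks, and the option keyword argument are assumed names.

# fallback/strategies.py (illustrative sketch)
from dataclasses import dataclass


@dataclass
class FallbackOption:
    type: str    # "model" | "provider" | "tool" | "cache"
    value: str


DEFAULT_CHAIN = [
    FallbackOption("model", "claude-3-sonnet"),
    FallbackOption("model", "claude-3-haiku"),
    FallbackOption("provider", "ollama"),
    FallbackOption("cache", "recent"),
]


async def run_with_fallbacks(operation, chain=DEFAULT_CHAIN):
    """Try the primary operation, then each fallback option in order."""
    errors = []
    try:
        return await operation(option=None)           # primary attempt
    except Exception as exc:
        errors.append(exc)
    for option in chain:
        try:
            return await operation(option=option)     # retry with the fallback applied
        except Exception as exc:
            errors.append(exc)
    raise errors[-1]                                  # all fallbacks exhausted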

6. State Recovery

  • Create state/recovery.py with state recovery
  • Implement checkpoint creation
  • Create checkpoint restoration
  • Implement partial progress recovery
  • Create session state serialization
  • Implement working memory recovery
  • Create task queue recovery
  • Add recovery verification
  • Write state recovery tests
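
Checkpointing can start as simple serialization of session state; the Checkpoint file layout and storage path below are assumptions for illustration.

# state/recovery.py (illustrative sketch)
import json
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/mcp_recovery/checkpoints")   # assumed location


def save_checkpoint(session_id: str, state: dict) -> Path:
    """Persist a known-good snapshot of session state."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{session_id}.json"
    path.write_text(json.dumps(state, default=str))
    return path


def restore_checkpoint(session_id: str) -> dict | None:
    """Return the last checkpoint for a session, or None if none exists."""
    path = CHECKPOINT_DIR / f"{session_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())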

7. Transaction Management

  • Create transactions/manager.py with transaction support
  • Implement begin/commit/rollback pattern
  • Create savepoints within transactions
  • Implement compensating transactions
  • Create transaction timeout
  • Implement nested transactions
  • Add transaction logging
  • Write transaction tests
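
A sketch of the begin/commit/rollback pattern built on compensating actions registered alongside each completed step; the context-manager shape and method names are assumptions.

# transactions/manager.py (illustrative sketch)
from contextlib import asynccontextmanager


class Transaction:
    def __init__(self):
        self._compensations = []   # undo actions, run in reverse order on rollback

    def did(self, description: str, compensate) -> None:
        """Record a completed step together with its compensating action."""
        self._compensations.append((description, compensate))

    async def rollback(self) -> None:
        for description, compensate in reversed(self._compensations):
            await compensate()     # best-effort undo; a real implementation would log failures


@asynccontextmanager
async def transaction():
    txn = Transaction()
    try:
        yield txn                  # caller performs steps, registering compensations via txn.did()
    except Exception:
        await txn.rollback()
        raise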

8. Self-Healing Actions

  • Create healing/actions.py with healing actions
  • Implement automatic service restart
  • Create connection pool refresh
  • Implement cache invalidation
  • Create token refresh on auth errors
  • Implement rate limit backoff
  • Create resource cleanup
  • Add healing triggers
  • Write healing tests
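
Healing actions could be a registry keyed by error category so the orchestrator can look up the right repair; the decorator, registry, and action names below are assumptions.

# healing/actions.py (illustrative sketch)
HEALING_ACTIONS = {}


def healing_action(error_category: str):
    """Register a coroutine as the healing action for an error category."""
    def register(fn):
        HEALING_ACTIONS[error_category] = fn
        return fn
    return register


@healing_action("AuthenticationError")
async def refresh_token(context):
    # e.g. re-authenticate and store a fresh token before the retry
    ...


@healing_action("ResourceError")
async def refresh_connection_pool(context):
    # e.g. dispose of and rebuild the shared connection pool
    ...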

9. Error Recovery Workflows

  • Create workflows/recovery.py with recovery workflows
  • Implement LLM error recovery:
    • Rate limit → wait and retry with different model
    • Context too long → compress and retry
    • Invalid response → re-prompt with clarification
  • Implement tool error recovery:
    • Tool timeout → retry with longer timeout
    • Tool failure → try alternative tool
    • Tool not found → suggest similar tool
  • Implement external service recovery:
    • API down → use cache or fallback
    • Slow response → timeout and retry
  • Write workflow tests
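
One of the LLM recovery paths above, sketched end to end. The call_llm and compress_context callables, the exception names, and retry_after attribute are placeholders.

# workflows/recovery.py (illustrative sketch of one LLM recovery path)
import asyncio


class ContextTooLongError(Exception): ...
class RateLimitError(Exception): ...
class RecoveryExhaustedError(Exception): ...


async def call_llm_with_recovery(prompt, context, *, call_llm, compress_context, max_attempts=3):
    """Retry an LLM call, compressing the context or waiting out rate limits as needed."""
    for attempt in range(max_attempts):
        try:
            return await call_llm(prompt, context)
        except ContextTooLongError:
            context = compress_context(context)                            # shrink and retry
        except RateLimitError as exc:
            await asyncio.sleep(getattr(exc, "retry_after", None) or 2 ** attempt)
    raise RecoveryExhaustedError("LLM call failed after compression and rate-limit waits")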

10. Graceful Degradation

  • Create degradation/manager.py with degradation logic
  • Define degradation levels (full, reduced, minimal, offline)
  • Implement feature flags for degradation
  • Create degraded mode behaviors
  • Implement degradation notification
  • Create automatic recovery from degraded mode
  • Add degradation metrics
  • Write degradation tests
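
The degradation levels could be an ordered enum so the manager can step between them; the enum values and the behaviours noted per level are assumptions.

# degradation/manager.py (illustrative sketch)
from enum import IntEnum


class DegradationLevel(IntEnum):
    FULL = 0       # all features available
    REDUCED = 1    # e.g. cheaper models only, non-essential tools disabled
    MINIMAL = 2    # e.g. cached or templated responses only
    OFFLINE = 3    # queue work and report the system as unavailable


class DegradationManager:
    def __init__(self):
        self.level = DegradationLevel.FULL

    def degrade(self) -> DegradationLevel:
        """Step one level down when recovery fails; never past OFFLINE."""
        self.level = DegradationLevel(min(self.level + 1, DegradationLevel.OFFLINE))
        return self.level

    def recover(self) -> DegradationLevel:
        """Step one level back up once health checks pass again."""
        self.level = DegradationLevel(max(self.level - 1, DegradationLevel.FULL))
        return self.level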

11. Error Learning

  • Create learning/analyzer.py with error learning
  • Track error patterns over time
  • Identify recurring errors
  • Create error prevention suggestions
  • Implement adaptive retry strategies
  • Create error trend alerts
  • Implement root cause analysis
  • Write learning tests
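
Recurring-error detection can start as frequency counting over a sliding window; the window size, threshold, and signature format below are illustrative.

# learning/analyzer.py (illustrative sketch)
import time
from collections import defaultdict, deque


class ErrorPatternTracker:
    def __init__(self, window_seconds: float = 3600, recurrence_threshold: int = 5):
        self.window_seconds = window_seconds
        self.recurrence_threshold = recurrence_threshold
        self._events = defaultdict(deque)   # error signature -> timestamps

    def record(self, signature: str) -> None:
        now = time.time()
        events = self._events[signature]
        events.append(now)
        while events and now - events[0] > self.window_seconds:
            events.popleft()                # drop events outside the window

    def recurring(self) -> list[str]:
        """Signatures seen at least recurrence_threshold times within the window."""
        return [s for s, ev in self._events.items() if len(ev) >= self.recurrence_threshold]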

12. Recovery Orchestration

  • Create orchestration/orchestrator.py with recovery orchestration
  • Implement recovery decision tree
  • Create recovery priority levels
  • Implement parallel recovery attempts
  • Create recovery chaining
  • Implement recovery timeout
  • Add orchestration logging
  • Write orchestration tests

13. Health Checks

  • Create health/checker.py with health checking
  • Implement service health checks
  • Create dependency health checks
  • Implement proactive health monitoring
  • Create health-based routing
  • Implement health recovery triggers
  • Add health dashboards
  • Write health check tests
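
Health checks could be registered callables that each report a boolean status per dependency; the HealthChecker API below is an assumption.

# health/checker.py (illustrative sketch)
import asyncio


class HealthChecker:
    def __init__(self):
        self._checks = {}    # name -> async callable returning True when healthy

    def register(self, name: str, check) -> None:
        self._checks[name] = check

    async def run_all(self) -> dict[str, bool]:
        names = list(self._checks)
        results = await asyncio.gather(
            *(self._checks[n]() for n in names), return_exceptions=True
        )
        # An exception raised by a check counts as unhealthy.
        return {n: (r is True) for n, r in zip(names, results)}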

14. MCP Integration

  • Create get_error_status tool - Get current error state
  • Create retry_operation tool - Manually trigger retry
  • Create get_recovery_options tool - List recovery options
  • Create trigger_healing tool - Trigger healing action
  • Create get_fallback_status tool - Get fallback availability
  • Create force_degradation tool - Force degradation mode
  • Write MCP tool tests
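
Registering the recovery tools might look like the sketch below, assuming the FastMCP helper from the MCP Python SDK; the import path and the RecoveryManager methods the tools delegate to are assumptions.

# MCP tool registration (illustrative sketch)
from mcp.server.fastmcp import FastMCP

from mcp_core.recovery.manager import RecoveryManager   # assumed import path

mcp = FastMCP("recovery")
recovery_manager = RecoveryManager()                     # assumed manager API


@mcp.tool()
def get_error_status() -> dict:
    """Get the current error state for the active session."""
    return recovery_manager.error_status()


@mcp.tool()
def retry_operation(operation_id: str) -> dict:
    """Manually trigger a retry of a failed operation."""
    return recovery_manager.retry(operation_id)


@mcp.tool()
def force_degradation(level: str) -> dict:
    """Force a degradation mode (full, reduced, minimal, offline)."""
    return recovery_manager.force_degradation(level)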

15. Metrics & Observability

  • Add Prometheus metrics for recovery
  • Track errors_total by type and severity
  • Track retries_total by operation
  • Track recovery_success_rate gauge
  • Track circuit_breaker_state by service
  • Track fallback_invocations_total counter
  • Track degradation_events_total counter
  • Create recovery dashboards
  • Add alerting rules
  • Write metrics tests
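
A sketch of the metric definitions using prometheus_client; the label sets follow the list above and the exact names may change.

# metrics (illustrative sketch using prometheus_client)
from prometheus_client import Counter, Gauge

# prometheus_client appends "_total" to counters when exposing them,
# so these surface as errors_total, retries_total, and so on.
ERRORS = Counter("errors", "Errors by type and severity", ["type", "severity"])
RETRIES = Counter("retries", "Retries by operation", ["operation"])
FALLBACK_INVOCATIONS = Counter("fallback_invocations", "Fallback chain invocations", ["option"])
DEGRADATION_EVENTS = Counter("degradation_events", "Degradation level changes", ["level"])

RECOVERY_SUCCESS_RATE = Gauge("recovery_success_rate", "Fraction of recoveries that succeeded")
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=half_open, 2=open", ["service"]
)

# Usage example:
# ERRORS.labels(type="TransientError", severity="warning").inc()
# CIRCUIT_BREAKER_STATE.labels(service="anthropic").set(2)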

16. Testing

  • Write unit tests for all components
  • Write integration tests for recovery flows
  • Create chaos tests (inject failures)
  • Write resilience tests (verify recovery)
  • Create load tests with failure injection
  • Achieve >90% code coverage
  • Create regression test suite

17. Documentation

  • Write README with recovery architecture
  • Document error classification
  • Document retry strategies
  • Document fallback configuration
  • Document recovery workflows
  • Create troubleshooting guide
  • Add runbooks for common errors
  • Create best practices

Technical Specifications

Error Classification Hierarchy

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Error Classification                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  RecoverableError                                                            │
│  ├── TransientError (retry immediately or soon)                             │
│  │   ├── NetworkTimeoutError                                                 │
│  │   ├── TemporaryServiceError                                               │
│  │   └── ConcurrencyError                                                    │
│  │                                                                           │
│  ├── RateLimitError (wait then retry)                                        │
│  │   ├── ProviderRateLimitError                                              │
│  │   ├── TokenBudgetExceededError                                            │
│  │   └── APIQuotaExceededError                                               │
│  │                                                                           │
│  ├── ResourceError (try alternative)                                         │
│  │   ├── ModelUnavailableError                                               │
│  │   ├── ToolUnavailableError                                                │
│  │   └── MemoryLimitError                                                    │
│  │                                                                           │
│  └── ValidationError (fix and retry)                                         │
│      ├── InputValidationError                                                │
│      ├── OutputParsingError                                                  │
│      └── ContextTooLongError                                                 │
│                                                                              │
│  NonRecoverableError                                                         │
│  ├── AuthenticationError                                                     │
│  ├── AuthorizationError                                                      │
│  ├── FatalInternalError                                                      │
│  └── PermanentResourceError                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Retry Strategy Configuration

class RetryConfig(BaseModel):
    # Basic retry settings
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 60.0
    exponential_base: float = 2.0
    jitter: bool = True
    
    # Error-specific overrides
    error_overrides: dict[str, RetryOverride] = {
        "RateLimitError": RetryOverride(
            max_attempts=5,
            use_retry_after_header=True,
            min_delay_seconds=60.0
        ),
        "TransientError": RetryOverride(
            max_attempts=5,
            base_delay_seconds=0.5
        ),
        "ValidationError": RetryOverride(
            max_attempts=2,  # Fewer retries, likely won't help
            base_delay_seconds=0.1
        )
    }
    
    # Fallback chain
    fallback_chain: list[FallbackOption] = [
        FallbackOption(type="model", value="claude-3-sonnet"),
        FallbackOption(type="model", value="gpt-4"),
        FallbackOption(type="provider", value="ollama"),
        FallbackOption(type="cache", value="recent"),
    ]

Circuit Breaker States

┌────────┐   failure_threshold exceeded    ┌────────┐
│ CLOSED │ ──────────────────────────────▶│  OPEN  │
│ Normal │                                │ Fail   │
│ traffic│                                │ fast   │
└────────┘                                └────────┘
     ▲                                        │
     │ success                                │ reset_timeout
     │                                        ▼
     │                                  ┌───────────┐
     └──────────────────────────────────│ HALF_OPEN │
                                        │ Test one  │
                                        │ request   │──failure──▶ back to OPEN
                                        └───────────┘

Recovery Decision Tree

def decide_recovery(error: Error) -> RecoveryAction:
    # 1. Classify the error
    error_type = classify_error(error)
    
    # 2. Check if retries exhausted
    if error.retry_count >= get_max_retries(error_type):
        # Move to fallback
        return select_fallback(error)
    
    # 3. Check circuit breaker
    if circuit_breaker.is_open(error.service):
        # Fast fail, try alternative
        return RecoveryAction.USE_ALTERNATIVE
    
    # 4. Select retry strategy
    if error_type == "RateLimitError":
        delay = error.retry_after or calculate_backoff(error)
        return RecoveryAction.RETRY_AFTER(delay)
    
    if error_type == "ValidationError":
        # Try to fix the input
        fixed = attempt_fix(error)
        if fixed:
            return RecoveryAction.RETRY_WITH(fixed)
        return RecoveryAction.FALLBACK
    
    if error_type == "TransientError":
        delay = calculate_backoff(error)
        return RecoveryAction.RETRY_AFTER(delay)
    
    # 5. Non-recoverable
    return RecoveryAction.ABORT

Acceptance Criteria

  • 100% of errors are classified appropriately
  • Transient errors succeed within retry budget 95% of time
  • Circuit breakers prevent cascade failures
  • Fallback chain works for all critical operations
  • State recovery restores >95% of work in progress
  • Graceful degradation maintains core functionality
  • Error learning reduces repeat errors by 20%
  • Recovery adds <100ms to normal operation latency
  • >90% test coverage including chaos tests
  • Documentation complete with runbooks

Labels

phase-2, mcp, backend, recovery, resilience

Milestone

Phase 2: MCP Integration
