feat(mcp): Guardrails & Safety Framework #63

Closed
opened 2026-01-03 09:10:46 +00:00 by cardosofelipe · 0 comments

Overview

Implement a comprehensive safety framework that prevents runaway agents, validates actions before execution, enables rollback of mistakes, and controls costs. This is NON-NEGOTIABLE for autonomous agent operation.

Parent Epic

  • Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Risks

  • Runaway costs: Agent keeps calling LLMs without limits
  • Destructive actions: Agent deletes critical files or data
  • Infinite loops: Agent gets stuck repeating the same action
  • Security breaches: Agent accesses unauthorized resources
  • Data loss: No way to undo agent mistakes
  • Resource exhaustion: Agent consumes all available compute

The Solution

A multi-layered safety system:

  1. Pre-execution validation: Check if action is safe BEFORE executing
  2. Cost controls: Budget limits and usage tracking
  3. Rate limiting: Prevent runaway behavior
  4. Rollback capability: Undo what went wrong
  5. Human-in-the-loop: Confirmation for critical actions
  6. Audit trail: Complete history for debugging and compliance

Implementation Sub-Tasks

1. Project Setup & Architecture

  • Create backend/src/mcp_core/safety/ directory
  • Create __init__.py with public API exports
  • Create guardian.py with SafetyGuardian class
  • Create config.py with Pydantic settings
  • Define safety policy schema
  • Create middleware integration pattern
  • Write architecture decision record (ADR)

2. Action Validation Framework

  • Create validation/validator.py with ActionValidator class
  • Define action schema with required metadata
  • Implement pre-execution hook system
  • Create validation rule engine
  • Implement rule types:
    • Allow rules (whitelist specific actions)
    • Deny rules (blacklist specific actions)
    • Require approval rules (human confirmation)
    • Condition rules (allow if condition met)
  • Create rule priority and conflict resolution
  • Implement validation caching for performance
  • Add validation bypass for emergencies (with audit)
  • Write validation rule tests
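
One way the rule engine with priority and conflict resolution could look is sketched below. The `Rule` and `Decision` names, the glob matching on action names, the deny-by-default fallback, and the "most restrictive wins on a tie" policy are all assumptions made for illustration, not decisions taken in this issue.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatchcase


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


@dataclass(frozen=True)
class Rule:
    pattern: str       # glob matched against the action name (hypothetical format)
    decision: Decision
    priority: int = 0  # higher priority wins


# Assumed tie-break when priorities are equal: the most restrictive decision wins.
_STRICTNESS = {Decision.DENY: 2, Decision.REQUIRE_APPROVAL: 1, Decision.ALLOW: 0}


def evaluate(action: str, rules: list[Rule]) -> Decision:
    """Return the decision of the best-matching rule, denying by default."""
    matches = [r for r in rules if fnmatchcase(action, r.pattern)]
    if not matches:
        return Decision.DENY  # assumption: unmatched actions are denied
    best = max(matches, key=lambda r: (r.priority, _STRICTNESS[r.decision]))
    return best.decision


rules = [
    Rule("*", Decision.ALLOW),
    Rule("delete_*", Decision.REQUIRE_APPROVAL, priority=10),
    Rule("delete_database", Decision.DENY, priority=10),
]
```

With these example rules, `read_file` is allowed, `delete_file` needs approval, and `delete_database` is denied because the deny rule wins the tie at priority 10.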

3. Cost Control System

  • Create costs/controller.py with CostController class
  • Implement per-agent token budgets
  • Implement per-project token budgets
  • Implement per-session token budgets
  • Create budget allocation across agents
  • Implement real-time cost tracking
  • Create cost alerts (warn at 80%, block at 100%)
  • Implement cost prediction for planned actions
  • Create cost reporting and analytics
  • Add cost override for privileged operations
  • Implement cost rollover policies (daily/weekly/monthly)
  • Write cost control tests
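
The "warn at 80%, block at 100%" behaviour could be sketched as follows. `TokenBudget`, `BudgetStatus`, and the choice to evaluate the projected usage (current usage plus the planned action) are illustrative assumptions; here an action that lands exactly on the limit is still permitted and anything beyond it is blocked.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    WARN = "warn"
    BLOCK = "block"


@dataclass
class TokenBudget:
    limit: int               # maximum tokens for this scope (agent, project, or session)
    used: int = 0
    warn_ratio: float = 0.8  # warn once projected usage reaches 80% of the limit

    def check(self, planned_tokens: int) -> BudgetStatus:
        """Classify a planned action before it runs; does not record usage."""
        projected = self.used + planned_tokens
        if projected > self.limit:
            return BudgetStatus.BLOCK
        if projected >= self.limit * self.warn_ratio:
            return BudgetStatus.WARN
        return BudgetStatus.OK

    def record(self, tokens: int) -> None:
        """Record actual usage after the action has executed."""
        self.used += tokens
```

Separating `check` from `record` keeps the pre-execution decision free of side effects, so a blocked action never consumes budget.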

4. Rate Limiting

  • Create limits/limiter.py with RateLimiter class
  • Implement per-tool rate limits
  • Implement per-agent rate limits
  • Implement per-resource rate limits (API, file, etc.)
  • Create sliding window rate limiting
  • Implement burst allowance with recovery
  • Create rate limit escalation (slow down before block)
  • Implement rate limit exceptions for priority tasks
  • Add rate limit metrics and monitoring
  • Write rate limiting tests
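
A minimal sliding-window limiter, as one possible shape for `RateLimiter`, is sketched here. The class name `SlidingWindowLimiter` and the injectable `clock` parameter are assumptions added for testability; burst allowance and escalation are left out.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so tests can control time
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

One limiter instance per key (tool, agent, or resource) would cover the per-tool, per-agent, and per-resource limits listed above.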

5. Loop Detection & Prevention

  • Create loops/detector.py with LoopDetector class
  • Implement action history tracking
  • Detect exact repetition (same action, same args)
  • Detect semantic repetition (similar actions)
  • Detect oscillation patterns (A → B → A → B)
  • Create configurable loop thresholds
  • Implement loop breaking strategies
  • Add loop detection alerts
  • Create loop analysis for debugging
  • Write loop detection tests
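
Exact-repetition and oscillation detection could be sketched as below. The fingerprinting via sorted JSON, the `min_cycles` parameter, and the returned reason strings are assumptions; semantic (similarity-based) repetition is not covered.

```python
import json
from collections import deque
from typing import Optional


def fingerprint(action: str, args: dict) -> str:
    """Stable key for an action call; argument order does not matter."""
    return action + ":" + json.dumps(args, sort_keys=True, default=str)


class LoopDetector:
    def __init__(self, max_repeated: int = 5, min_cycles: int = 2, history_size: int = 50):
        self.max_repeated = max_repeated  # identical calls in a row that count as a loop
        self.min_cycles = min_cycles      # A -> B repetitions that count as oscillation
        self.history = deque(maxlen=history_size)

    def record(self, action: str, args: dict) -> Optional[str]:
        """Record a call and return a loop reason, or None if no loop is seen."""
        self.history.append(fingerprint(action, args))
        recent = list(self.history)

        # Exact repetition: the last max_repeated calls are all identical.
        tail = recent[-self.max_repeated:]
        if len(tail) == self.max_repeated and len(set(tail)) == 1:
            return "exact_repetition"

        # Oscillation: the last 2 * min_cycles calls alternate between two calls.
        span = 2 * self.min_cycles
        tail = recent[-span:]
        if (len(tail) == span and tail[0] != tail[1]
                and all(item == tail[i % 2] for i, item in enumerate(tail))):
            return "oscillation"
        return None
```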

6. Permission System

  • Create permissions/manager.py with PermissionManager class
  • Define permission types (read, write, execute, delete)
  • Define resource types (file, api, database, external)
  • Implement permission checking per action
  • Create permission inheritance (project → agent → action)
  • Implement least-privilege defaults
  • Create permission request workflow
  • Add permission escalation logging
  • Implement temporary permissions (time-limited)
  • Write permission tests
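
The inheritance chain (project → agent → action) could be read as "each level can only narrow what its parent grants". That reading is an assumption, as are the `Permission` flag enum and the helper names in this sketch.

```python
from enum import Flag, auto
from typing import Optional


class Permission(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()
    DELETE = auto()


def effective_permissions(project: Permission, agent: Permission,
                          action: Optional[Permission] = None) -> Permission:
    """Intersect the grants at each level so a child never exceeds its parent."""
    granted = project & agent
    if action is not None:
        granted &= action
    return granted


def is_allowed(required: Permission, granted: Permission) -> bool:
    """True only if every required permission bit is granted."""
    return (required & granted) == required
```

Under this model, least privilege falls out naturally: a level that grants nothing (`Permission.NONE`) yields no effective permissions regardless of what the project allows.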

7. Rollback System

  • Create rollback/manager.py with RollbackManager class
  • Implement transaction wrapping for actions
  • Create checkpoints before destructive actions
  • Implement file system rollback (store original files)
  • Implement database rollback (transaction savepoints)
  • Implement git rollback (branch before changes)
  • Create rollback triggers (automatic on failure)
  • Implement partial rollback (undo specific actions)
  • Add rollback confirmation for critical operations
  • Create rollback history and audit
  • Write rollback integration tests
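
File system rollback by storing original files could be sketched like this. `FileCheckpoint` is a hypothetical helper that copies files into a temporary directory; it handles only regular files and does not address database or git rollback.

```python
import shutil
import tempfile
from pathlib import Path


class FileCheckpoint:
    """Snapshot of a set of files that can be restored later."""

    def __init__(self, paths):
        self._backup_dir = Path(tempfile.mkdtemp(prefix="checkpoint-"))
        # Each entry is (original path, backup path or None if the file did not exist).
        self._entries = []
        for i, path in enumerate(map(Path, paths)):
            if path.exists():
                backup = self._backup_dir / f"{i}-{path.name}"
                shutil.copy2(path, backup)
                self._entries.append((path, backup))
            else:
                self._entries.append((path, None))

    def rollback(self) -> None:
        """Restore every tracked file to its state at checkpoint time."""
        for path, backup in self._entries:
            if backup is None:
                path.unlink(missing_ok=True)  # file was created after the checkpoint
            else:
                path.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(backup, path)

    def discard(self) -> None:
        """Delete the stored copies once the action has succeeded."""
        shutil.rmtree(self._backup_dir, ignore_errors=True)
```

A caller would create the checkpoint before a destructive action, call `rollback()` on failure, and call `discard()` on success.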

8. Human-in-the-Loop (HITL)

  • Create hitl/manager.py with HITLManager class
  • Define confirmation-required action patterns
  • Implement confirmation request queue
  • Create timeout handling (default deny)
  • Implement approval delegation
  • Create batch approval for similar actions
  • Implement approval with modifications
  • Add approval audit trail
  • Create notification channels (email, Slack, webhook)
  • Write HITL workflow tests
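
The "timeout handling (default deny)" item could be sketched with `asyncio` as below. `ApprovalRequest` and its methods are hypothetical names; queueing, delegation, and notifications are out of scope here.

```python
import asyncio


class ApprovalRequest:
    """A pending human decision; create it inside a running event loop."""

    def __init__(self, action: str):
        self.action = action
        self._decided = asyncio.Event()
        self._approved = False

    def resolve(self, approved: bool) -> None:
        """Called when a human approves or rejects the action."""
        self._approved = approved
        self._decided.set()

    async def wait(self, timeout_seconds: float) -> bool:
        """Return the human decision, or False if nobody answers in time."""
        try:
            await asyncio.wait_for(self._decided.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return False  # default deny on timeout
        return self._approved
```

Treating a timeout the same as a rejection means an unattended approval queue can never silently let a critical action through.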

9. Content Filtering

  • Create content/filter.py with ContentFilter class
  • Implement PII detection and masking
  • Implement secret detection (API keys, passwords)
  • Create prohibited content patterns
  • Implement content sanitization
  • Create output filtering (before showing to user)
  • Add content filter bypass for privileged operations
  • Write content filter tests
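
Secret and PII masking could start from a small set of regular expressions, as in this sketch. The three patterns and the `[REDACTED:...]` placeholder format are illustrative assumptions; a production filter would need a much broader and carefully tested pattern set.

```python
import re

# Illustrative patterns only; names and coverage are assumptions.
_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "credential_assignment": re.compile(
        r"(?i)\b(password|passwd|secret|api_key)\s*[=:]\s*\S+"
    ),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def mask(text: str) -> tuple[str, list[str]]:
    """Replace every match with a placeholder and report which patterns fired."""
    found = []
    for name, pattern in _PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            found.append(name)
    return text, found
```

Returning the list of triggered pattern names alongside the masked text lets the caller write an audit event without logging the secret itself.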

10. Sandbox Execution

  • Create sandbox/executor.py with SandboxExecutor class
  • Implement Docker-based sandboxing for untrusted code
  • Implement file system isolation
  • Implement network isolation
  • Create resource limits (CPU, memory, disk)
  • Implement timeout enforcement
  • Create sandbox escape detection
  • Add sandbox cleanup after execution
  • Write sandbox security tests

11. Audit System

  • Create audit/logger.py with AuditLogger class
  • Define audit event schema
  • Implement action audit logging (all actions)
  • Implement decision audit logging (why allowed/denied)
  • Implement outcome audit logging (what happened)
  • Create audit retention policies
  • Implement audit search and query
  • Create audit export for compliance
  • Add tamper detection for audit logs
  • Implement audit alerts for suspicious patterns
  • Write audit tests
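
Tamper detection for audit logs is often done with a hash chain, where each entry's hash covers the previous entry's hash. The sketch below assumes that technique and an in-memory list; the issue does not prescribe either, and `AuditLog` is a hypothetical name.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> str:
        """Add an event and chain it to the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry = {"event": event, "prev": prev, "hash": _digest(prev, event)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True
```

A chain like this makes modification evident but does not prevent truncation of the tail, so the latest hash would still need to be anchored somewhere outside the log.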

12. Emergency Controls

  • Create emergency/controls.py with emergency handlers
  • Implement kill switch (stop all agents immediately)
  • Implement pause (stop accepting new tasks)
  • Implement resource lockdown (prevent access to resources)
  • Create emergency escalation procedures
  • Implement graceful shutdown with state preservation
  • Add emergency recovery procedures
  • Create post-mortem data collection
  • Write emergency control tests
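
The difference between kill (stop everything) and pause (stop accepting new tasks) could be modelled with two flags that every agent loop consults. This is a cooperative, in-process sketch using `threading.Event`; the class and method names are assumptions, and stopping agents across processes would need a shared signal instead.

```python
import threading


class EmergencyControls:
    """Process-wide flags that agent loops check between steps."""

    def __init__(self):
        self._killed = threading.Event()
        self._paused = threading.Event()

    def kill(self) -> None:
        """Stop all agents; cannot be undone by resume()."""
        self._killed.set()

    def pause(self) -> None:
        """Stop accepting new tasks; running tasks may finish."""
        self._paused.set()

    def resume(self) -> None:
        self._paused.clear()

    def should_stop(self) -> bool:
        """Agents check this between steps and exit when it is True."""
        return self._killed.is_set()

    def can_accept_task(self) -> bool:
        return not (self._killed.is_set() or self._paused.is_set())
```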

13. Safety Policies

  • Create policies/schema.py with policy definitions
  • Define project-level safety policies
  • Define agent-type safety policies
  • Define autonomy-level based policies
  • Create policy templates for common scenarios
  • Implement policy validation
  • Create policy versioning
  • Implement policy inheritance and override
  • Write policy application tests

14. MCP Integration

  • Create check_permission tool - Verify action is allowed
  • Create request_approval tool - Request human approval
  • Create report_issue tool - Report safety concern
  • Create get_budget_status tool - Check remaining budget
  • Create create_checkpoint tool - Create rollback point
  • Create rollback_to_checkpoint tool - Undo to checkpoint
  • Integrate with all MCP servers for pre/post hooks
  • Write MCP tool tests

15. Metrics & Observability

  • Add Prometheus metrics for safety events
  • Track actions_blocked_total counter by reason
  • Track approvals_requested_total counter
  • Track rollbacks_performed_total counter
  • Track cost_budget_remaining gauge by scope
  • Track rate_limit_hits_total counter
  • Track loop_detections_total counter
  • Add structured logging for all safety decisions
  • Create Grafana dashboard for safety metrics
  • Add alerting for safety incidents

16. Testing

  • Write unit tests for each safety component
  • Write integration tests for safety workflows
  • Create adversarial tests (try to bypass safety)
  • Write performance tests (safety overhead)
  • Create chaos tests (failure scenarios)
  • Write compliance tests (audit completeness)
  • Achieve >90% code coverage
  • Create security penetration tests

17. Documentation

  • Write README with safety architecture
  • Document policy configuration
  • Document validation rules
  • Document cost control settings
  • Document HITL workflow
  • Document rollback procedures
  • Document emergency procedures
  • Create incident response playbook

Technical Specifications

Safety Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Safety Guardian Flow                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Agent Action Request                                                        │
│         │                                                                    │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Permission   │──▶ Denied ──▶ Block + Audit                              │
│  │ Check        │                                                           │
│  └──────┬───────┘                                                           │
│         │ Allowed                                                           │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Cost Check   │──▶ Over Budget ──▶ Block + Alert                         │
│  └──────┬───────┘                                                           │
│         │ Within Budget                                                      │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Rate Limit   │──▶ Rate Limited ──▶ Delay/Block                          │
│  │ Check        │                                                           │
│  └──────┬───────┘                                                           │
│         │ OK                                                                 │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Loop         │──▶ Loop Detected ──▶ Break + Alert                       │
│  │ Detection    │                                                           │
│  └──────┬───────┘                                                           │
│         │ No Loop                                                            │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ HITL Check   │──▶ Approval Required ──▶ Wait for Human                  │
│  └──────┬───────┘                                                           │
│         │ Auto-approve                                                       │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Create       │                                                           │
│  │ Checkpoint   │ (if destructive)                                          │
│  └──────┬───────┘                                                           │
│         │                                                                    │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Execute in   │──▶ Failure ──▶ Rollback + Audit                          │
│  │ Sandbox      │                                                           │
│  └──────┬───────┘                                                           │
│         │ Success                                                            │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Audit Log    │                                                           │
│  │ + Metrics    │                                                           │
│  └──────────────┘                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
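
The short-circuiting order in the diagram above could be expressed as an ordered list of checks, where the first failing stage blocks the action. `Action`, `Check`, `run_checks`, and the two example checks are hypothetical stand-ins; the real `SafetyGuardian` would plug in the permission, cost, rate limit, loop, and HITL components.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str
    estimated_tokens: int = 0


# A check returns None to pass, or a reason string to block.
Check = Callable[[Action], Optional[str]]


def run_checks(action: Action, checks: list[tuple[str, Check]]) -> tuple[bool, str]:
    """Run checks in order and stop at the first one that blocks."""
    for stage, check in checks:
        reason = check(action)
        if reason is not None:
            return False, f"{stage}: {reason}"
    return True, "allowed"


# Example stages in the same order as the diagram (stub logic only).
checks = [
    ("permission", lambda a: "tool denied" if a.name == "drop_db" else None),
    ("cost", lambda a: "over budget" if a.estimated_tokens > 1000 else None),
]
```

Because evaluation stops at the first failure, cheap checks such as permissions run before anything that might need I/O, which helps keep the per-action overhead low.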

Safety Policy Schema

from pydantic import BaseModel


class SafetyPolicy(BaseModel):
    """Safety limits and rules applied to agent actions."""
    name: str
    description: str
    
    # Cost controls
    max_tokens_per_session: int = 100_000
    max_tokens_per_day: int = 1_000_000
    max_cost_per_session_usd: float = 10.0
    max_cost_per_day_usd: float = 100.0
    
    # Rate limits
    max_actions_per_minute: int = 60
    max_llm_calls_per_minute: int = 20
    max_file_operations_per_minute: int = 100
    
    # Permissions
    allowed_tools: list[str] = ["*"]  # or specific tool names
    denied_tools: list[str] = []
    allowed_file_patterns: list[str] = ["**/*"]
    denied_file_patterns: list[str] = ["**/.env", "**/secrets/**"]
    
    # HITL
    require_approval_for: list[str] = [
        "delete_file",
        "push_to_remote", 
        "deploy_to_production",
        "modify_critical_config"
    ]
    
    # Loop detection
    max_repeated_actions: int = 5
    max_similar_actions: int = 10
    
    # Sandbox
    require_sandbox: bool = False
    sandbox_timeout_seconds: int = 300
    sandbox_memory_mb: int = 1024
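
How the allow and deny lists in the schema above might be evaluated is sketched below, assuming deny patterns take precedence over allow patterns. The helper names are hypothetical. Note that `fnmatch`-style matching treats `**/` as requiring at least one directory, so the sketch also tries the pattern without that prefix so that root-level files such as `.env` are covered.

```python
from fnmatch import fnmatchcase


def _glob_match(path: str, pattern: str) -> bool:
    # fnmatch's "*" crosses "/" but "**/" still needs one "/" in the path,
    # so also try the pattern without the prefix to cover root-level files.
    if fnmatchcase(path, pattern):
        return True
    return pattern.startswith("**/") and fnmatchcase(path, pattern[3:])


def is_tool_allowed(tool: str, allowed: list[str], denied: list[str]) -> bool:
    """Deny patterns win; otherwise the tool must match an allow pattern."""
    if any(fnmatchcase(tool, p) for p in denied):
        return False
    return any(fnmatchcase(tool, p) for p in allowed)


def is_path_allowed(path: str, allowed: list[str], denied: list[str]) -> bool:
    """Same precedence as tools, using path-aware glob matching."""
    if any(_glob_match(path, p) for p in denied):
        return False
    return any(_glob_match(path, p) for p in allowed)
```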

Autonomy Level Mapping

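| Autonomy Level | Pre-approval  | Cost Limit   | Destructive Actions |
|----------------|---------------|--------------|---------------------|
| FULL_CONTROL   | All actions   | $1/session   | Block               |
| MILESTONE      | Critical only | $10/session  | Require approval    |
| AUTONOMOUS     | None          | $100/session | Auto-checkpoint     |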
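
The mapping above could be encoded as a lookup table keyed by autonomy level. The values come from the table; the `AutonomyProfile` type, its field names, and the string encodings are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AutonomyLevel(Enum):
    FULL_CONTROL = "full_control"
    MILESTONE = "milestone"
    AUTONOMOUS = "autonomous"


@dataclass(frozen=True)
class AutonomyProfile:
    pre_approval: str                # "all", "critical", or "none"
    max_cost_per_session_usd: float
    destructive_actions: str         # "block", "require_approval", or "auto_checkpoint"


PROFILES = {
    AutonomyLevel.FULL_CONTROL: AutonomyProfile("all", 1.0, "block"),
    AutonomyLevel.MILESTONE: AutonomyProfile("critical", 10.0, "require_approval"),
    AutonomyLevel.AUTONOMOUS: AutonomyProfile("none", 100.0, "auto_checkpoint"),
}


def needs_pre_approval(level: AutonomyLevel, is_critical: bool) -> bool:
    """Decide whether an action must be approved before it runs."""
    mode = PROFILES[level].pre_approval
    return mode == "all" or (mode == "critical" and is_critical)
```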

Acceptance Criteria

  • No action executes without passing safety checks
  • Cost limits are enforced (0% over-budget executions)
  • Loop detection catches 100% of exact loops
  • Rollback works for all supported action types
  • Audit trail is complete and tamper-evident
  • HITL workflow works with <5 second latency
  • Emergency kill switch stops all agents within 1 second
  • Safety overhead <50ms per action
  • >90% test coverage including adversarial tests
  • Documentation complete with incident playbooks

Labels

phase-2, mcp, backend, safety, critical

Milestone

Phase 2: MCP Integration
