feat(mcp): Guardrails & Safety Framework #63

cardosofelipe · 2026-01-03T09:10:46Z

Overview

Implement a comprehensive safety framework that prevents runaway agents, validates actions before execution, enables rollback of mistakes, and controls costs. This is NON-NEGOTIABLE for autonomous agent operation.

Parent Epic

Epic #60: [EPIC] Phase 2: MCP Integration

Why This Is Critical

The Risks

Runaway costs: Agent keeps calling LLMs without limits
Destructive actions: Agent deletes critical files or data
Infinite loops: Agent gets stuck repeating the same action
Security breaches: Agent accesses unauthorized resources
Data loss: No way to undo agent mistakes
Resource exhaustion: Agent consumes all available compute

The Solution

A multi-layered safety system:

Pre-execution validation: Check if action is safe BEFORE executing
Cost controls: Budget limits and usage tracking
Rate limiting: Prevent runaway behavior
Rollback capability: Undo what went wrong
Human-in-the-loop: Confirmation for critical actions
Audit trail: Complete history for debugging and compliance

Implementation Sub-Tasks

1. Project Setup & Architecture

Create backend/src/mcp_core/safety/ directory
Create __init__.py with public API exports
Create guardian.py with SafetyGuardian class
Create config.py with Pydantic settings
Define safety policy schema
Create middleware integration pattern
Write architecture decision record (ADR)

2. Action Validation Framework

Create validation/validator.py with ActionValidator class
Define action schema with required metadata
Implement pre-execution hook system
Create validation rule engine
Implement rule types:
- Allow rules (whitelist specific actions)
- Deny rules (blacklist specific actions)
- Require approval rules (human confirmation)
- Condition rules (allow if condition met)
Create rule priority and conflict resolution
Implement validation caching for performance
Add validation bypass for emergencies (with audit)
Write validation rule tests
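
A minimal sketch of how the rule types and priority-based conflict resolution above could fit together. The names `Decision`, `Rule`, and `evaluate` are hypothetical, and two behaviours are assumptions rather than decisions made in this issue: on equal priority the most restrictive decision wins, and an action with no matching rule is denied.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatch
from typing import Callable, Optional


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


@dataclass
class Rule:
    pattern: str                  # glob matched against the action name
    decision: Decision
    priority: int = 0             # higher priority wins
    # Optional predicate on the action arguments (a "condition rule").
    condition: Optional[Callable[[dict], bool]] = None


# Assumed tie-break when priorities are equal: most restrictive decision wins.
_STRICTNESS = {Decision.DENY: 2, Decision.REQUIRE_APPROVAL: 1, Decision.ALLOW: 0}


def evaluate(rules: list[Rule], action: str, args: dict) -> Decision:
    """Return the decision of the highest-priority rule matching the action."""
    matching = [
        r for r in rules
        if fnmatch(action, r.pattern) and (r.condition is None or r.condition(args))
    ]
    if not matching:
        return Decision.DENY  # assumed least-privilege default
    best = max(matching, key=lambda r: (r.priority, _STRICTNESS[r.decision]))
    return best.decision
```

With a low-priority `Rule("*", Decision.ALLOW)` as a catch-all, more specific deny or approval rules only need a higher `priority` to override it.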

3. Cost Control System

Create costs/controller.py with CostController class
Implement per-agent token budgets
Implement per-project token budgets
Implement per-session token budgets
Create budget allocation across agents
Implement real-time cost tracking
Create cost alerts (warn at 80%, block at 100%)
Implement cost prediction for planned actions
Create cost reporting and analytics
Add cost override for privileged operations
Implement cost rollover policies (daily/weekly/monthly)
Write cost control tests
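
The "warn at 80%, block at 100%" alert thresholds could look roughly like this. `TokenBudget` and `BudgetStatus` are hypothetical names. The sketch assumes the check runs against projected usage (current usage plus the planned action) and that an action landing exactly on the limit is still allowed.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    WARNING = "warning"
    BLOCKED = "blocked"


@dataclass
class TokenBudget:
    limit: int                 # total tokens available in this scope
    used: int = 0              # tokens consumed so far
    warn_ratio: float = 0.8    # warn once projected usage reaches 80%

    def check(self, planned_tokens: int) -> BudgetStatus:
        """Classify a planned action before it runs, without recording it."""
        projected = self.used + planned_tokens
        if projected > self.limit:
            return BudgetStatus.BLOCKED
        if projected >= self.limit * self.warn_ratio:
            return BudgetStatus.WARNING
        return BudgetStatus.OK

    def record(self, tokens: int) -> None:
        """Record actual consumption after the action has run."""
        self.used += tokens
```

Keeping `check` and `record` separate lets cost prediction run before execution while tracking stays based on actual usage.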

4. Rate Limiting

Create limits/limiter.py with RateLimiter class
Implement per-tool rate limits
Implement per-agent rate limits
Implement per-resource rate limits (API, file, etc.)
Create sliding window rate limiting
Implement burst allowance with recovery
Create rate limit escalation (slow down before block)
Implement rate limit exceptions for priority tasks
Add rate limit metrics and monitoring
Write rate limiting tests
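
For the sliding window item, a small in-memory sketch is shown below. `SlidingWindowLimiter` is a hypothetical name, the clock is injectable so tests do not need to sleep, and burst allowance and escalation are left out.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds-long window."""

    def __init__(
        self,
        max_calls: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

One limiter instance per key (tool, agent, or resource) would cover the three per-scope limits listed above.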

5. Loop Detection & Prevention

Create loops/detector.py with LoopDetector class
Implement action history tracking
Detect exact repetition (same action, same args)
Detect semantic repetition (similar actions)
Detect oscillation patterns (A → B → A → B)
Create configurable loop thresholds
Implement loop breaking strategies
Add loop detection alerts
Create loop analysis for debugging
Write loop detection tests
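
Exact repetition and A → B → A → B oscillation can both be detected from a bounded action history. The sketch below is one possible shape, with `LoopDetector` methods and `fingerprint` as hypothetical names. It assumes a loop is reported as soon as the last `max_repeated` actions are identical, and it does not attempt semantic similarity.

```python
import json
from collections import deque
from typing import Optional


def fingerprint(action: str, args: dict) -> str:
    """Stable key for an action: same name and same args give the same key."""
    return action + ":" + json.dumps(args, sort_keys=True, default=str)


class LoopDetector:
    def __init__(self, max_repeated: int = 5, min_cycles: int = 2,
                 history_size: int = 50):
        self.max_repeated = max_repeated
        self.min_cycles = min_cycles          # 2 cycles means A, B, A, B
        self.history = deque(maxlen=history_size)

    def record(self, action: str, args: dict) -> Optional[str]:
        """Record an action and return the kind of loop detected, if any."""
        self.history.append(fingerprint(action, args))
        items = list(self.history)

        # Exact repetition: the last max_repeated entries are identical.
        tail = items[-self.max_repeated:]
        if len(tail) == self.max_repeated and len(set(tail)) == 1:
            return "exact_repetition"

        # Oscillation: the last 2 * min_cycles entries alternate between two
        # different actions.
        span = 2 * self.min_cycles
        tail = items[-span:]
        if (len(tail) == span and tail[0] != tail[1]
                and all(tail[i] == tail[i % 2] for i in range(span))):
            return "oscillation"
        return None
```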

6. Permission System

Create permissions/manager.py with PermissionManager class
Define permission types (read, write, execute, delete)
Define resource types (file, api, database, external)
Implement permission checking per action
Create permission inheritance (project → agent → action)
Implement least-privilege defaults
Create permission request workflow
Add permission escalation logging
Implement temporary permissions (time-limited)
Write permission tests

7. Rollback System

Create rollback/manager.py with RollbackManager class
Implement transaction wrapping for actions
Create checkpoints before destructive actions
Implement file system rollback (store original files)
Implement database rollback (transaction savepoints)
Implement git rollback (branch before changes)
Create rollback triggers (automatic on failure)
Implement partial rollback (undo specific actions)
Add rollback confirmation for critical operations
Create rollback history and audit
Write rollback integration tests
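
For the file system rollback item ("store original files"), a checkpoint could be as simple as copying the affected files aside before a destructive action. `FileCheckpoint` is a hypothetical name. The sketch covers regular files only and assumes that a file which did not exist at checkpoint time should be removed on rollback.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Optional


class FileCheckpoint:
    """Snapshot a set of files so they can be restored later."""

    def __init__(self, paths: Iterable[str]):
        self._backup_dir = Path(tempfile.mkdtemp(prefix="checkpoint_"))
        self._saved: dict[Path, Optional[Path]] = {}
        for i, path in enumerate(map(Path, paths)):
            if path.exists():
                backup = self._backup_dir / f"{i}_{path.name}"
                shutil.copy2(path, backup)       # keeps content and metadata
                self._saved[path] = backup
            else:
                self._saved[path] = None         # created later: delete on rollback

    def rollback(self) -> None:
        """Restore every tracked file to its state at checkpoint time."""
        for original, backup in self._saved.items():
            if backup is None:
                original.unlink(missing_ok=True)
            else:
                original.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(backup, original)

    def discard(self) -> None:
        """Delete the stored copies once the action is known to be good."""
        shutil.rmtree(self._backup_dir, ignore_errors=True)
```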

8. Human-in-the-Loop (HITL)

Create hitl/manager.py with HITLManager class
Define confirmation-required action patterns
Implement confirmation request queue
Create timeout handling (default deny)
Implement approval delegation
Create batch approval for similar actions
Implement approval with modifications
Add approval audit trail
Create notification channels (email, Slack, webhook)
Write HITL workflow tests
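
The "timeout handling (default deny)" item could be sketched with `asyncio` as follows. `ApprovalRequest` is a hypothetical name and the queue, delegation, and notification parts are omitted. The request should be created inside a running event loop.

```python
import asyncio


class ApprovalRequest:
    """A pending human decision that is denied if nobody answers in time."""

    def __init__(self, action: str):
        self.action = action
        self.approved = False
        self._decided = asyncio.Event()

    def resolve(self, approved: bool) -> None:
        """Called by the approval channel when a human answers."""
        self.approved = approved
        self._decided.set()

    async def wait(self, timeout_seconds: float) -> bool:
        """Return the human decision, or False if the timeout expires."""
        try:
            await asyncio.wait_for(self._decided.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return False  # default deny
        return self.approved
```

Treating the timeout as a denial keeps an unattended approval queue from turning into an implicit auto-approve.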

9. Content Filtering

Create content/filter.py with ContentFilter class
Implement PII detection and masking
Implement secret detection (API keys, passwords)
Create prohibited content patterns
Implement content sanitization
Create output filtering (before showing to user)
Add content filter bypass for privileged operations
Write content filter tests
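
Secret and PII masking could start from a small table of regular expressions. The patterns below are illustrative assumptions only (an AWS-style access key ID, a `password=`-style assignment, and a simple email match) and are nowhere near a complete detector. `PATTERNS` and `mask` are hypothetical names.

```python
import re

# Illustrative patterns only; a real filter would need a much broader set.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "credential_assignment": re.compile(
        r"(?i)\b(?:password|passwd|secret|api_key)\s*[=:]\s*\S+"
    ),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def mask(text: str) -> tuple[str, list[str]]:
    """Replace every match with a placeholder and report which kinds were found."""
    found = []
    for name, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            found.append(name)
    return text, found
```

Returning the list of detected kinds alongside the masked text gives the audit log something to record without storing the secret itself.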

10. Sandbox Execution

Create sandbox/executor.py with SandboxExecutor class
Implement Docker-based sandboxing for untrusted code
Implement file system isolation
Implement network isolation
Create resource limits (CPU, memory, disk)
Implement timeout enforcement
Create sandbox escape detection
Add sandbox cleanup after execution
Write sandbox security tests

11. Audit System

Create audit/logger.py with AuditLogger class
Define audit event schema
Implement action audit logging (all actions)
Implement decision audit logging (why allowed/denied)
Implement outcome audit logging (what happened)
Create audit retention policies
Implement audit search and query
Create audit export for compliance
Add tamper detection for audit logs
Implement audit alerts for suspicious patterns
Write audit tests
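
One common way to make an audit log tamper-evident is a hash chain, where each entry stores the hash of the previous one. This is a sketch of that technique, not necessarily the approach this issue will adopt. `AuditChain` and `GENESIS` are hypothetical names, and events are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditChain:
    def __init__(self):
        self.entries: list[dict] = []

    def append(self, event: dict) -> str:
        """Append an event linked to the hash of the previous entry."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry_hash = _digest(prev, event)
        self.entries.append({"event": event, "prev": prev, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed entry breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True
```

A chain like this detects modification but does not prevent it, so the latest hash would still need to be stored somewhere the agent cannot write.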

12. Emergency Controls

Create emergency/controls.py with emergency handlers
Implement kill switch (stop all agents immediately)
Implement pause (stop accepting new tasks)
Implement resource lockdown (prevent access to resources)
Create emergency escalation procedures
Implement graceful shutdown with state preservation
Add emergency recovery procedures
Create post-mortem data collection
Write emergency control tests

13. Safety Policies

Create policies/schema.py with policy definitions
Define project-level safety policies
Define agent-type safety policies
Define autonomy-level-based policies
Create policy templates for common scenarios
Implement policy validation
Create policy versioning
Implement policy inheritance and override
Write policy application tests

14. MCP Integration

Create check_permission tool - Verify action is allowed
Create request_approval tool - Request human approval
Create report_issue tool - Report safety concern
Create get_budget_status tool - Check remaining budget
Create create_checkpoint tool - Create rollback point
Create rollback_to_checkpoint tool - Undo to checkpoint
Integrate with all MCP servers for pre/post hooks
Write MCP tool tests

15. Metrics & Observability

Add Prometheus metrics for safety events
Track actions_blocked_total counter by reason
Track approvals_requested_total counter
Track rollbacks_performed_total counter
Track cost_budget_remaining gauge by scope
Track rate_limit_hits_total counter
Track loop_detections_total counter
Add structured logging for all safety decisions
Create Grafana dashboard for safety metrics
Add alerting for safety incidents

16. Testing

Write unit tests for each safety component
Write integration tests for safety workflows
Create adversarial tests (try to bypass safety)
Write performance tests (safety overhead)
Create chaos tests (failure scenarios)
Write compliance tests (audit completeness)
Achieve >90% code coverage
Create security penetration tests

17. Documentation

Write README with safety architecture
Document policy configuration
Document validation rules
Document cost control settings
Document HITL workflow
Document rollback procedures
Document emergency procedures
Create incident response playbook

Technical Specifications

Safety Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Safety Guardian Flow                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Agent Action Request                                                        │
│         │                                                                    │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Permission   │──▶ Denied ──▶ Block + Audit                              │
│  │ Check        │                                                           │
│  └──────┬───────┘                                                           │
│         │ Allowed                                                           │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Cost Check   │──▶ Over Budget ──▶ Block + Alert                         │
│  └──────┬───────┘                                                           │
│         │ Within Budget                                                      │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Rate Limit   │──▶ Rate Limited ──▶ Delay/Block                          │
│  │ Check        │                                                           │
│  └──────┬───────┘                                                           │
│         │ OK                                                                 │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Loop         │──▶ Loop Detected ──▶ Break + Alert                       │
│  │ Detection    │                                                           │
│  └──────┬───────┘                                                           │
│         │ No Loop                                                            │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ HITL Check   │──▶ Approval Required ──▶ Wait for Human                  │
│  └──────┬───────┘                                                           │
│         │ Auto-approve                                                       │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Create       │                                                           │
│  │ Checkpoint   │ (if destructive)                                          │
│  └──────┬───────┘                                                           │
│         │                                                                    │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Execute in   │──▶ Failure ──▶ Rollback + Audit                          │
│  │ Sandbox      │                                                           │
│  └──────┬───────┘                                                           │
│         │ Success                                                            │
│         ▼                                                                    │
│  ┌──────────────┐                                                           │
│  │ Audit Log    │                                                           │
│  │ + Metrics    │                                                           │
│  └──────────────┘                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
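
The flow above is essentially an ordered chain of checks where the first failing stage blocks the action and everything is audited. A reduced sketch follows, with `Action`, `Check`, and `run_pipeline` as hypothetical names. Checkpoint creation, rollback, sandboxing, and the HITL wait are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


# A check returns None to pass, or a string explaining why it blocked.
Check = Callable[[Action], Optional[str]]


def run_pipeline(
    action: Action,
    checks: list[tuple[str, Check]],
    execute: Callable[[Action], None],
    audit: list[dict],
) -> bool:
    """Run checks in order, execute only if all pass, and audit the outcome."""
    for stage, check in checks:
        reason = check(action)
        if reason is not None:
            audit.append({"action": action.name, "stage": stage,
                          "outcome": "blocked", "reason": reason})
            return False
    try:
        execute(action)
    except Exception as exc:
        # In the full flow this is where a rollback would be triggered.
        audit.append({"action": action.name, "stage": "execute",
                      "outcome": "failed", "reason": str(exc)})
        return False
    audit.append({"action": action.name, "stage": "execute",
                  "outcome": "success", "reason": None})
    return True
```

The order of the `checks` list mirrors the diagram: permission, cost, rate limit, loop detection, then HITL.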

Safety Policy Schema

from pydantic import BaseModel


class SafetyPolicy(BaseModel):
    name: str
    description: str

    # Cost controls
    max_tokens_per_session: int = 100_000
    max_tokens_per_day: int = 1_000_000
    max_cost_per_session_usd: float = 10.0
    max_cost_per_day_usd: float = 100.0

    # Rate limits
    max_actions_per_minute: int = 60
    max_llm_calls_per_minute: int = 20
    max_file_operations_per_minute: int = 100

    # Permissions
    allowed_tools: list[str] = ["*"]  # or specific tool names
    denied_tools: list[str] = []
    allowed_file_patterns: list[str] = ["**/*"]
    denied_file_patterns: list[str] = ["**/.env", "**/secrets/**"]

    # HITL
    require_approval_for: list[str] = [
        "delete_file",
        "push_to_remote",
        "deploy_to_production",
        "modify_critical_config",
    ]

    # Loop detection
    max_repeated_actions: int = 5
    max_similar_actions: int = 10

    # Sandbox
    require_sandbox: bool = False
    sandbox_timeout_seconds: int = 300
    sandbox_memory_mb: int = 1024
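
The schema does not say how `allowed_file_patterns` and `denied_file_patterns` are evaluated. One possible reading is sketched below with the standard-library `fnmatch` module, where deny patterns take precedence over allow patterns. `matches_any` and `is_file_allowed` are hypothetical helpers. Because `fnmatch` lets `*` match `/`, a pattern such as `**/.env` would otherwise miss a `.env` file at the repository root, so the sketch also tries the pattern without its leading `**/`.

```python
from fnmatch import fnmatchcase


def matches_any(path: str, patterns: list[str]) -> bool:
    """True if a POSIX-style relative path matches at least one glob pattern."""
    for pattern in patterns:
        if fnmatchcase(path, pattern):
            return True
        # "**/x" requires a "/" before x under fnmatch; retry without the
        # prefix so top-level files and directories are covered as well.
        if pattern.startswith("**/") and fnmatchcase(path, pattern[3:]):
            return True
    return False


def is_file_allowed(path: str, allowed: list[str], denied: list[str]) -> bool:
    """Deny patterns win; otherwise the path must match an allow pattern."""
    if matches_any(path, denied):
        return False
    return matches_any(path, allowed)
```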

Autonomy Level Mapping

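| Autonomy Level | Pre-approval | Cost Limit | Destructive Actions |
|----------------|--------------|------------|---------------------|
| FULL_CONTROL | All actions | $1/session | Block |
| MILESTONE | Critical only | $10/session | Require approval |
| AUTONOMOUS | None | $100/session | Auto-checkpoint |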
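
The mapping could be encoded as a lookup table keyed by an enum, so each autonomy level resolves to one set of policy values. The values are taken from the table above, while `AutonomyLevel`, `AutonomyProfile`, and `AUTONOMY_PROFILES` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class AutonomyLevel(Enum):
    FULL_CONTROL = "FULL_CONTROL"
    MILESTONE = "MILESTONE"
    AUTONOMOUS = "AUTONOMOUS"


@dataclass(frozen=True)
class AutonomyProfile:
    pre_approval: str                 # "all", "critical", or "none"
    max_cost_per_session_usd: float
    destructive_actions: str          # "block", "require_approval", "auto_checkpoint"


AUTONOMY_PROFILES = {
    AutonomyLevel.FULL_CONTROL: AutonomyProfile("all", 1.0, "block"),
    AutonomyLevel.MILESTONE: AutonomyProfile("critical", 10.0, "require_approval"),
    AutonomyLevel.AUTONOMOUS: AutonomyProfile("none", 100.0, "auto_checkpoint"),
}
```

Note that the `SafetyPolicy` default of `max_cost_per_session_usd = 10.0` lines up with the MILESTONE row.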

Acceptance Criteria

No action executes without passing safety checks
Cost limits are enforced (0% over-budget executions)
Loop detection catches 100% of exact loops
Rollback works for all supported action types
Audit trail is complete and tamper-evident
HITL workflow works with <5 second latency
Emergency kill switch stops all agents within 1 second
Safety overhead <50ms per action
>90% test coverage including adversarial tests
Documentation complete with incident playbooks

Labels

phase-2, mcp, backend, safety, critical

Milestone

Phase 2: MCP Integration

cardosofelipe referenced this issue

2026-01-03 09:18:19 +00:00

[EPIC] Phase 2: MCP Integration #60

cardosofelipe referenced this issue from a commit

2026-01-03 10:22:40 +00:00

feat(backend): add safety framework foundation (Phase A) (#63)

cardosofelipe referenced this issue from a commit

2026-01-03 10:28:11 +00:00

feat(backend): add Phase B safety subsystems (#63)

cardosofelipe referenced this issue from a commit

2026-01-03 10:36:35 +00:00

feat(safety): add Phase C advanced controls