[SPIKE-011] Audit Logging & Decision Tracking #11

Closed
opened 2025-12-29 03:51:02 +00:00 by cardosofelipe · 1 comment

Objective

Design comprehensive audit logging for all agent actions and decisions.

What to Log

  1. All agent actions (tool calls, file changes, issue updates)
  2. Agent decisions and reasoning
  3. Approval requests and responses
  4. State transitions
  5. Configuration changes

Key Questions

  1. What's the log storage strategy? (database, file, external service)
  2. How do we structure log entries for searchability?
  3. How long do we retain logs?
  4. How do we make logs useful for debugging agent behavior?
  5. How do we protect sensitive data in logs?

Research Areas

  • Structured logging patterns
  • Log aggregation options
  • PII/sensitive data handling
  • Log retention policies

Expected Deliverables

  • Audit log schema
  • Logging middleware/decorators
  • Log query API
  • Retention policy
  • ADR documenting the approach

Acceptance Criteria

  • All agent actions logged
  • Logs are searchable by project/agent/time
  • Sensitive data protected
  • Retention policy implemented

Labels

spike, architecture, observability

Author
Owner

SPIKE-011: Audit Logging - Research Completed

The spike document has been created at docs/spikes/SPIKE-011-audit-logging.md.

Executive Summary

Recommendation: Implement a structured, OpenTelemetry-compatible audit logging system using:

  • Structlog for structured JSON logging with contextual enrichment
  • PostgreSQL + TimescaleDB for hot and warm storage (0-90 days)
  • S3-compatible object storage (MinIO) for cold archival (90+ days)
  • Cryptographic hash chaining for immutability verification
  • OpenTelemetry integration for trace/span correlation

Key Findings

What to Log

The spike defines comprehensive logging for:

  • Agent Actions: spawned, action.started, action.completed, action.failed, decision, terminated
  • LLM Interactions: request, response, error, tool_call (with prompt/response capture)
  • MCP Tool Invocations: invoked, result, error
  • Human Approvals: requested, granted, rejected, timeout
  • Git Operations: commit, branch.created, pr.created, pr.merged
  • Project Lifecycle: created, sprint.started, milestone.completed, checkpoint

Storage Architecture

HOT (0-30 days)     -> PostgreSQL + TimescaleDB (full detail, fast queries)
WARM (30-90 days)   -> TimescaleDB compressed chunks + aggregates
COLD (90+ days)     -> S3/MinIO Parquet archives (7-year retention)
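The tier boundaries above can be expressed as a small routing function. This is only a sketch: the function name storage_tier and the constants are hypothetical, and it assumes an event that is exactly 30 or 90 days old belongs to the older tier, which the tier ranges above do not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_DAYS = 30   # upper bound of the hot tier, from the tiers above
WARM_DAYS = 90  # upper bound of the warm tier, from the tiers above


def storage_tier(event_time: datetime, now: Optional[datetime] = None) -> str:
    """Return "hot", "warm", or "cold" based on the age of an event."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    if age < timedelta(days=HOT_DAYS):
        return "hot"
    if age < timedelta(days=WARM_DAYS):
        return "warm"
    return "cold"
```

In a real deployment the move between tiers would more likely be driven by TimescaleDB policies and a scheduled archival job than by per-event checks; the function only makes the boundaries explicit.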

Immutability

  • SHA-256 hash chaining (blockchain-like) for tamper evidence
  • Each event includes previous_hash and event_hash
  • Verification API to audit chain integrity

Compliance

  • SOC2: Tamper-evident logs, access controls, documented retention
  • GDPR: PII redaction, pseudonymization in archives, right-to-deletion support
  • 7-year retention for financial/legal compliance
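To make the redaction and pseudonymization points more concrete, here is a small sketch of how a payload might be scrubbed before it is written to the log. The key list, the email pattern, the placeholder strings, and both function names are assumptions for illustration; real PII detection would need to cover far more than email addresses.

```python
import hashlib
import hmac
import re

# Assumed examples of keys whose values should never reach the log.
SENSITIVE_KEYS = {"password", "api_key", "token", "secret"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a stable keyed hash (HMAC-SHA-256, truncated)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def redact(value):
    """Recursively mask sensitive keys and email addresses in a payload."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]"
            if isinstance(k, str) and k.lower() in SENSITIVE_KEYS
            else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value
```

A keyed hash keeps the same user correlatable across events without storing the identifier itself, and destroying the key is one possible way to approach right-to-deletion in otherwise immutable archives. Whether that satisfies GDPR in a given context is a legal question this sketch does not answer.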

Implementation Phases

  1. Week 1-2: Foundation (TimescaleDB, base schema, agent action decorator)
  2. Week 3-4: LLM & MCP logging with OpenTelemetry integration
  3. Week 5-6: Immutability, compliance, cold storage archival
  4. Week 7-8: Query APIs, full-text search, audit dashboard

Code Examples Included

  • AuditEvent Pydantic schema with all required fields
  • TimescaleDB table schema with hypertables and compression
  • @audit_agent_action, @audit_llm_call, @audit_mcp_tool decorators
  • Hash chaining implementation for immutability
  • OpenTelemetry integration for trace correlation
  • Common query patterns (timeline, search, LLM usage summary)
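For readers without the spike document at hand, the following is a rough sketch of what a decorator like @audit_agent_action might look like. It is not the code from the spike: the in-memory AUDIT_SINK list stands in for the real Structlog/TimescaleDB pipeline, the record fields are assumed, and only synchronous functions are handled.

```python
import functools
import time

# Assumed in-memory stand-in for the real audit log writer.
AUDIT_SINK: list = []


def audit_agent_action(action_name: str):
    """Emit started/completed/failed audit records around the wrapped function."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            AUDIT_SINK.append(
                {"event_type": "agent.action.started", "action": action_name}
            )
            start = time.monotonic()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                AUDIT_SINK.append({
                    "event_type": "agent.action.failed",
                    "action": action_name,
                    "error": repr(exc),
                    "duration_s": time.monotonic() - start,
                })
                raise  # auditing must not swallow the original error
            AUDIT_SINK.append({
                "event_type": "agent.action.completed",
                "action": action_name,
                "duration_s": time.monotonic() - start,
            })
            return result

        return wrapper

    return decorator
```

Applied as `@audit_agent_action("write_file")` above a function definition, each call would leave a started record followed by either a completed or a failed record. Agent code that is async would need an `async def` wrapper variant.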

Next Steps

  • Review spike findings with team
  • Create ADR-007: Audit Logging Architecture
  • Begin Phase 1 implementation

Spike document: docs/spikes/SPIKE-011-audit-logging.md
