syndarix/docs/adrs/ADR-008-knowledge-base-rag.md
Felipe Cardoso f138417486 fix: Resolve ADR/Requirements inconsistencies from comprehensive review
2025-12-29 14:13:26 +01:00


ADR-008: Knowledge Base and RAG Architecture

Status: Accepted
Date: 2025-12-29
Deciders: Architecture Team
Related Spikes: SPIKE-006


Context

Syndarix agents require access to project-specific knowledge bases for Retrieval-Augmented Generation (RAG). This enables agents to reference requirements, codebase context, documentation, and past decisions when performing tasks.

Decision Drivers

  • Operational Simplicity: Minimize infrastructure complexity
  • Performance: Sub-100ms query latency
  • Isolation: Per-project knowledge separation
  • Cost: Avoid expensive dedicated vector databases
  • Flexibility: Support multiple content types (code, docs, conversations)

Considered Options

Option 1: Dedicated Vector Database (Pinecone, Qdrant)

Pros:

  • Purpose-built for vector search
  • Excellent query performance at scale
  • Managed offerings available

Cons:

  • Additional infrastructure
  • Cost at scale ($27-$70/month per 1M vectors)
  • Data sync complexity with PostgreSQL

Option 2: pgvector Extension (Selected)

Pros:

  • Already using PostgreSQL (zero additional infrastructure)
  • ACID transactions with application data
  • Row-level security for multi-tenant isolation
  • Handles 10-100M vectors effectively
  • Hybrid search with PostgreSQL full-text

Cons:

  • Less performant than dedicated solutions at billion-scale
  • Requires PostgreSQL 15+

Option 3: Weaviate (Self-hosted)

Pros:

  • Multi-modal support
  • Knowledge graph features

Cons:

  • Additional service to manage
  • Overkill for our scale

Decision

Adopt pgvector as the vector store for RAG functionality.

Syndarix's per-project isolation means knowledge bases remain in the thousands to millions of vectors per tenant, well within pgvector's optimal range. The operational simplicity of using existing PostgreSQL infrastructure outweighs the performance benefits of dedicated vector databases.
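
As a rough sanity check on the scale claim, raw vector storage can be estimated from pgvector's documented size of 4 bytes per dimension plus 8 bytes per value. The sketch below is illustrative only: the function names and the tenant sizes are assumptions, and the estimate ignores row and index overhead.

```python
def vector_bytes(dimensions: int) -> int:
    """Approximate size of one pgvector value: 4 bytes per dimension plus 8 bytes."""
    return 4 * dimensions + 8


def raw_vector_storage_gib(num_vectors: int, dimensions: int) -> float:
    """Raw embedding storage in GiB, excluding row and index overhead."""
    return num_vectors * vector_bytes(dimensions) / 2**30


if __name__ == "__main__":
    # Hypothetical tenant sizes spanning "thousands to millions" of vectors.
    for count in (10_000, 1_000_000):
        size = raw_vector_storage_gib(count, 1536)
        print(f"{count:>9} vectors x 1536 dims = {size:.2f} GiB")
```

Under these assumptions, a tenant with one million 1536-dimension vectors needs a few GiB of raw embedding storage, which a single PostgreSQL instance can hold comfortably.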

Implementation

Embedding Model Strategy

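| Content Type  | Embedding Model        | Dimensions | Rationale                           |
|---------------|------------------------|------------|-------------------------------------|
| Code files    | voyage-code-3          | 1024       | State of the art for code retrieval |
| Documentation | text-embedding-3-small | 1536       | Good balance of cost and quality    |
| Conversations | text-embedding-3-small | 1536       | General purpose                     |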
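
One way to keep this mapping explicit in application code is a small lookup keyed by source type. This is a minimal sketch, not the project's implementation: `EmbeddingSpec`, `spec_for`, and `validate_embedding` are hypothetical names, and the keys assume the `source_type` values used in the database schema ('code', 'doc', 'conversation').

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


# Mirrors the Embedding Model Strategy table; keys are assumed source_type values.
EMBEDDING_SPECS = {
    "code": EmbeddingSpec("voyage-code-3", 1024),
    "doc": EmbeddingSpec("text-embedding-3-small", 1536),
    "conversation": EmbeddingSpec("text-embedding-3-small", 1536),
}


def spec_for(source_type: str) -> EmbeddingSpec:
    """Return the embedding model and dimensions for a source type."""
    try:
        return EMBEDDING_SPECS[source_type]
    except KeyError:
        raise ValueError(f"unknown source_type: {source_type!r}") from None


def validate_embedding(source_type: str, embedding: list[float]) -> None:
    """Raise if the embedding length does not match the model for this source type."""
    expected = spec_for(source_type).dimensions
    if len(embedding) != expected:
        raise ValueError(
            f"{source_type} embeddings must have {expected} dimensions, got {len(embedding)}"
        )
```

Checking dimensions before insert surfaces model mix-ups early, which matters because the two models in the table produce vectors of different lengths.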

Chunking Strategy

| Content Type   | Strategy                   | Chunk Size   |
|----------------|----------------------------|--------------|
| Python/JS code | AST-based (function/class) | Per function |
| Markdown docs  | Heading-based              | Per section  |
| PDF specs      | Page-level + semantic      | 1000 tokens  |
| Conversations  | Turn-based                 | Per exchange |
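
The first two rows can be illustrated with the standard library alone. The sketch below is an assumption about how such chunkers might look, not the project's code: `chunk_python_source` splits on top-level functions and classes only (decorators are not included in the extracted text), and `chunk_markdown` splits on ATX headings without special handling for `#` lines inside fenced code blocks.

```python
import ast
import re


def chunk_python_source(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_markdown(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, starting a new chunk at every heading."""
    chunks = []
    heading, body = "", []

    def flush() -> None:
        # Skip the empty preamble that exists before the first heading.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Text before the first heading is kept as a chunk with an empty heading so that no content is dropped.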

Database Schema

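-- Requires the pgvector extension (TC-006).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE knowledge_chunks (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL REFERENCES projects(id),
    source_type VARCHAR(50) NOT NULL,  -- 'code', 'doc', 'conversation'
    source_path VARCHAR(500),
    content TEXT NOT NULL,
    -- Sized for text-embedding-3-small (1536 dimensions). pgvector rejects
    -- vectors of any other length, so the 1024-dimension voyage-code-3
    -- embeddings cannot be stored in this column as declared.
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Approximate nearest-neighbor index for cosine distance (the <=> operator).
CREATE INDEX ON knowledge_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
-- Supports the per-project filter applied to every query.
CREATE INDEX ON knowledge_chunks (project_id);
-- Supports containment queries on metadata.
CREATE INDEX ON knowledge_chunks USING gin (metadata);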

Hybrid Search

async def hybrid_search(
    project_id: str,
    query: str,
    top_k: int = 10,
    vector_weight: float = 0.7
) -> list[Chunk]:
    """Combine vector similarity with keyword matching."""
    # embed, db, and Chunk are defined elsewhere in the application.
    query_embedding = await embed(query)

    results = await db.execute("""
        WITH vector_results AS (
            SELECT id, content, metadata,
                   1 - (embedding <=> $1) AS vector_score  -- cosine similarity
            FROM knowledge_chunks
            WHERE project_id = $2
            ORDER BY embedding <=> $1
            LIMIT $3 * 2
        ),
        keyword_results AS (
            SELECT id, content, metadata,
                   ts_rank(to_tsvector(content), plainto_tsquery($4)) AS text_score
            FROM knowledge_chunks
            WHERE project_id = $2
              AND to_tsvector(content) @@ plainto_tsquery($4)
            ORDER BY text_score DESC  -- keep the best matches, not arbitrary rows
            LIMIT $3 * 2
        )
        -- id is unique within each CTE, so joining on id alone yields one row
        -- per chunk; DISTINCT ON (id) would conflict with the ORDER BY below.
        SELECT id,
               COALESCE(v.content, k.content) AS content,
               COALESCE(v.metadata, k.metadata) AS metadata,
               COALESCE(v.vector_score, 0) * $5 +
               COALESCE(k.text_score, 0) * (1 - $5) AS combined_score
        FROM vector_results v
        FULL OUTER JOIN keyword_results k USING (id)
        ORDER BY combined_score DESC
        LIMIT $3
    """, query_embedding, project_id, top_k, query, vector_weight)

    return results
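
The score fusion performed by the query can be reproduced in plain Python, which is convenient for unit-testing the weighting without a database. This is a sketch under assumptions: `combine_scores` is a hypothetical helper, and the two input dictionaries stand in for the vector and keyword result sets keyed by chunk id.

```python
def combine_scores(
    vector_scores: dict[str, float],
    text_scores: dict[str, float],
    vector_weight: float = 0.7,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Weighted sum of both scores; a chunk missing from one side scores 0 there."""
    chunk_ids = vector_scores.keys() | text_scores.keys()  # full outer join on id
    combined = {
        chunk_id: vector_scores.get(chunk_id, 0.0) * vector_weight
        + text_scores.get(chunk_id, 0.0) * (1 - vector_weight)
        for chunk_id in chunk_ids
    }
    # Highest score first; ties are broken by id so the ordering is deterministic.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    ranked = combine_scores({"a": 0.9, "b": 0.5}, {"b": 0.8, "c": 0.4})
    for chunk_id, score in ranked:
        print(chunk_id, round(score, 3))
```

Note that cosine similarity and `ts_rank` are on different scales, so the effective balance between the two signals depends on the data as well as on `vector_weight`.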

Consequences

Positive

  • Zero additional infrastructure
  • Transactional consistency with application data
  • Unified backup/restore
  • Row-level security for tenant isolation
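
A minimal sketch of how row-level security could enforce the per-project isolation is shown below. The policy name `project_isolation`, the setting name `app.current_project_id`, and the helper `project_scope_call` are all assumptions made for illustration; the SQL uses standard PostgreSQL statements and the built-in `set_config` and `current_setting` functions.

```python
import uuid

# One-time setup. FORCE applies the policy to the table owner as well;
# superusers and roles with BYPASSRLS are still exempt.
RLS_SETUP_STATEMENTS = [
    "ALTER TABLE knowledge_chunks ENABLE ROW LEVEL SECURITY",
    "ALTER TABLE knowledge_chunks FORCE ROW LEVEL SECURITY",
    """CREATE POLICY project_isolation ON knowledge_chunks
       USING (project_id = current_setting('app.current_project_id')::uuid)""",
]

# Run at the start of each transaction; the third argument (true) limits the
# setting to the current transaction.
SET_PROJECT_SQL = "SELECT set_config('app.current_project_id', $1, true)"


def project_scope_call(project_id: str) -> tuple[str, str]:
    """Return the statement and argument that scope a transaction to one project.

    The id is validated as a UUID first, so malformed input fails early.
    """
    return SET_PROJECT_SQL, str(uuid.UUID(project_id))
```

With a policy like this in place, a query that omits the `project_id` filter still returns only the current project's rows.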

Negative

  • May need migration to dedicated vector DB if scaling beyond 100M vectors
  • Index tuning required for optimal performance
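
For the IVFFlat index, pgvector's documentation suggests starting with `lists = rows / 1000` for up to one million rows and `sqrt(rows)` above that, and querying with `probes = sqrt(lists)`. The helper names below are hypothetical, and the values are starting points to be measured against real recall and latency, not tuned settings.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Starting point for the ivfflat `lists` parameter."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


def suggested_probes(lists: int) -> int:
    """Starting point for the query-time `ivfflat.probes` setting."""
    return max(1, round(math.sqrt(lists)))


if __name__ == "__main__":
    for rows in (100_000, 1_000_000, 4_000_000):
        lists = suggested_lists(rows)
        print(f"{rows:>9} rows: lists={lists}, probes={suggested_probes(lists)}")
```

By this heuristic, the `lists = 100` used in the schema corresponds to a table of roughly 100,000 rows, with `SET ivfflat.probes = 10` as a matching query-time setting.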

Migration Path

If scale requires it, migrate to Qdrant (self-hosted, open source) and keep the same embedding models, so existing vectors can be exported and reused without re-embedding.
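
Because the embeddings themselves are preserved, such a migration is mostly a reshaping of rows. The sketch below assumes each exported row is a dictionary with the `knowledge_chunks` columns and maps it to the id, vector, and payload structure that Qdrant points use. `row_to_point` is a hypothetical helper, and the upload step is intentionally left out.

```python
def row_to_point(row: dict) -> dict:
    """Map one knowledge_chunks row to an id/vector/payload point."""
    return {
        "id": str(row["id"]),  # UUIDs are valid point ids
        "vector": [float(x) for x in row["embedding"]],
        "payload": {
            # project_id stays in the payload so searches can filter per project.
            "project_id": str(row["project_id"]),
            "source_type": row["source_type"],
            "source_path": row["source_path"],
            "content": row["content"],
            "metadata": row["metadata"] or {},
        },
    }
```

Keeping `project_id` in the payload preserves the per-project isolation as a query-time filter in the target store.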

Compliance

This decision aligns with:

  • FR-103: Agent domain knowledge (RAG)
  • TC-001: PostgreSQL as primary database
  • TC-006: pgvector extension required
  • Core Principle: Self-hostability (pgvector is open source)

This ADR establishes the knowledge base and RAG architecture for Syndarix.