[SPIKE-006] Knowledge Base with pgvector (RAG) #6

Closed
opened 2025-12-29 03:50:15 +00:00 by cardosofelipe · 1 comment

Objective

Implement RAG-based knowledge base using PostgreSQL with pgvector extension.

Key Questions

  1. How do we structure the knowledge documents table?
  2. Which embedding model should we use? (OpenAI, or a local model?)
  3. How do we handle document chunking?
  4. How do we filter by project_id efficiently with vector search?
  5. How do we update embeddings incrementally?

Research Areas

  • pgvector installation and configuration
  • Embedding model options (OpenAI ada-002, local alternatives)
  • Document chunking strategies
  • Hybrid search (vector + keyword)
  • SQLAlchemy integration with pgvector

Expected Deliverables

  • Database schema with pgvector
  • Embedding pipeline for documents
  • Search API with project scoping
  • Integration with agent context
  • ADR documenting the approach

Acceptance Criteria

  • Can index documents with embeddings
  • Semantic search returns relevant results
  • Search respects project_id scoping
  • Performance acceptable (<500ms for search)
  • Incremental updates work

Labels

spike, architecture, knowledge-base

Author
Owner

SPIKE-006 Completed: Knowledge Base with pgvector for RAG System

The comprehensive spike document has been created at docs/spikes/SPIKE-006-knowledge-base-pgvector.md.

Executive Summary

Recommendation: Use pgvector with hybrid search as the primary solution for Syndarix RAG:

  • Why pgvector? Already using PostgreSQL, handles 10-100M vectors, transactional consistency, zero additional infrastructure
  • Migration path: Qdrant or Pinecone if scaling beyond 100M vectors per tenant

Key Findings

1. Vector Database Comparison

Database   Max Scale   Best For
pgvector   10-100M     PostgreSQL users, moderate scale
Pinecone   Billions    Zero-ops, enterprise
Qdrant     Billions    Open-source, high-performance filtering

2. Embedding Models Recommended

  • Code: voyage/voyage-code-3 (state-of-the-art for code retrieval)
  • Documentation: text-embedding-3-small (cost-effective, good quality)
  • General: text-embedding-3-small or nomic-embed-text (local fallback)

All are integrated via LiteLLM for a unified API.
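The per-content-type recommendations above can be captured in a small routing helper. This is a hypothetical sketch (the function name and `local_only` flag are my own); the returned model IDs are the ones named in the spike and would be passed to LiteLLM unchanged.

```python
# Illustrative mapping of content types to the embedding models
# recommended in this spike; model IDs are LiteLLM-style strings.
EMBEDDING_MODELS = {
    "code": "voyage/voyage-code-3",
    "documentation": "text-embedding-3-small",
    "general": "text-embedding-3-small",
}
LOCAL_FALLBACK = "nomic-embed-text"


def pick_embedding_model(content_type: str, local_only: bool = False) -> str:
    """Return the embedding model ID for a content type, with a local fallback."""
    if local_only:
        return LOCAL_FALLBACK
    return EMBEDDING_MODELS.get(content_type, EMBEDDING_MODELS["general"])
```

Keeping the routing in one place means a model swap is a one-line change rather than a migration of call sites.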

3. Chunking Strategies

Content Type    Strategy                        Chunk Size
Code            AST-based (functions/classes)   Per unit
Markdown        Heading-based                   Per section
PDF             Page-level + semantic           ~1000 tokens
Conversations   Turn-based                      Per exchange
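The heading-based strategy for Markdown can be sketched in a few lines. This is a minimal illustration, not the spike's chunker: it splits on ATX headings only and ignores the token budget and overlap handling a production pipeline would add.

```python
import re


def chunk_markdown(text: str) -> list[str]:
    """Split a Markdown document into one chunk per heading-delimited section.

    Minimal sketch: every line starting with '#'..'######' opens a new
    chunk; any prose before the first heading becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each chunk keeps its heading, which later doubles as citation metadata when the chunk is surfaced to an agent.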

4. Hybrid Search Architecture

  • Semantic: pgvector HNSW for vector similarity
  • Keyword: PostgreSQL tsvector for BM25-style matching
  • Fusion: Reciprocal Rank Fusion (RRF) combining both

Research shows 2-3x reduction in hallucinations vs single-route retrieval.
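The RRF fusion step is small enough to show in full. The sketch below uses the standard formulation (score = Σ 1/(k + rank) across rankers, with k=60 from the original RRF paper); input lists are assumed to be chunk IDs ordered best-first, one list from the vector route and one from the keyword route.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the rankings that
    contain it, so items ranked well by both routes rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem of normalizing cosine distances against BM25-style weights.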

5. Multi-Tenant Isolation

  • Shared knowledge_chunks table with project_id and agent_id columns
  • Row-Level Security (RLS) for additional isolation
  • Partial indexes for high-traffic projects
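The tenant-scoping rule can be modeled in plain Python to make the invariant explicit. This sketch (dict-based chunks and a brute-force cosine scan are stand-ins for the real table and HNSW index) mirrors the SQL plan `WHERE project_id = :pid ORDER BY embedding <=> :q`: the tenant filter is applied before ranking, so results can never leak across projects.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def scoped_search(chunks: list[dict], query_vec: list[float],
                  project_id: str, top_k: int = 5) -> list[dict]:
    """Return the top-k chunks for one project, ranked by cosine similarity."""
    candidates = [c for c in chunks if c["project_id"] == project_id]
    candidates.sort(key=lambda c: cosine(c["embedding"], query_vec), reverse=True)
    return candidates[:top_k]
```

In pgvector the same filter rides along in the index scan; the partial indexes mentioned above exist so that hot projects get an index built over exactly their rows.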

6. HNSW Index Configuration

```sql
-- Default: m=16, ef_construction=64
-- Large collections: m=32, ef_construction=128
-- Query tuning: SET hnsw.ef_search = 100
```
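The two build profiles above can be encoded as a tiny helper so migrations and backfills pick consistent parameters. The 1M-row cutoff for "large" is my own assumption, not from the spike; tune it against measured recall and build time.

```python
def hnsw_params(vector_count: int) -> dict:
    """Pick HNSW build parameters from the guidance above.

    Illustrative threshold: collections over 1M vectors get the
    larger m / ef_construction profile (cutoff is an assumption).
    """
    if vector_count > 1_000_000:
        return {"m": 32, "ef_construction": 128}
    return {"m": 16, "ef_construction": 64}
```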

Schema Design Included

  • knowledge_chunks table with vector embedding, full-text search
  • knowledge_collections for organizing chunks
  • Complete Alembic migration script
  • SQLAlchemy models with pgvector integration

Code Examples Provided

  • KnowledgeBaseService - Main service with ingest/search methods
  • HybridSearchService - RRF-based hybrid search
  • Content-type aware chunkers (Code, Markdown, Semantic)
  • MCP tool for agent knowledge access
  • Agent context builder for RAG prompts
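The "agent context builder" deliverable amounts to packing retrieved chunks into a grounded prompt. A hedged sketch, assuming chunks carry `source` and `text` keys and that a flat character budget is an acceptable stand-in for token counting:

```python
def build_rag_context(chunks: list[dict], question: str,
                      max_chars: int = 4000) -> str:
    """Assemble retrieved chunks into a grounded prompt for an agent.

    Chunks are included in retrieval order until the character budget
    is exhausted; each is tagged with its source for citation.
    """
    parts: list[str] = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['source']}]\n{chunk['text']}\n"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n".join(parts)
    return (
        "Answer using only the context below; say so if it is insufficient.\n\n"
        f"Context:\n{context}\nQuestion: {question}"
    )
```

The instruction line at the top is what buys the hallucination reduction the hybrid-search section cites: the agent is told to refuse rather than invent.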

Performance Targets

Vector Count   Target Latency   Memory (1536-dim)
<100K          <20ms            ~600MB
1M             <50ms            ~9.2GB
10M            <100ms           ~92GB
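The memory column follows from simple arithmetic: float32 vectors cost 4 bytes per dimension. The 1M and 10M rows are consistent with an assumed ~1.5x HNSW index overhead on top of raw storage, while the <100K row matches raw vectors alone (1536 × 4 × 100K ≈ 614MB); the overhead factor below is that assumption, not a measurement.

```python
def vector_memory_bytes(n_vectors: int, dims: int = 1536,
                        index_overhead: float = 1.5) -> int:
    """Rough memory estimate: 4-byte float32 per dimension, times an
    assumed ~1.5x HNSW index overhead (measure on real data)."""
    return int(n_vectors * dims * 4 * index_overhead)
```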

Next Steps

  1. Review spike document and provide feedback
  2. Create ADR-006: Knowledge Base Architecture
  3. Implement database migration
  4. Build ingestion pipeline with Celery
  5. Create MCP knowledge tools

Spike document: docs/spikes/SPIKE-006-knowledge-base-pgvector.md
