[SPIKE-006] Knowledge Base with pgvector (RAG) #6

Closed
opened 2025-12-29 03:50:15 +00:00 by cardosofelipe · 1 comment

Objective

Implement RAG-based knowledge base using PostgreSQL with pgvector extension.

Key Questions

  1. How do we structure the knowledge documents table?
  2. Which embedding model should we use? (OpenAI, or a local model?)
  3. How do we handle document chunking?
  4. How do we filter by project_id efficiently with vector search?
  5. How do we update embeddings incrementally?

Research Areas

  • pgvector installation and configuration
  • Embedding model options (OpenAI ada-002, local alternatives)
  • Document chunking strategies
  • Hybrid search (vector + keyword)
  • SQLAlchemy integration with pgvector

Expected Deliverables

  • Database schema with pgvector
  • Embedding pipeline for documents
  • Search API with project scoping
  • Integration with agent context
  • ADR documenting the approach

Acceptance Criteria

  • Can index documents with embeddings
  • Semantic search returns relevant results
  • Search respects project_id scoping
  • Performance acceptable (<500ms for search)
  • Incremental updates work

Labels

spike, architecture, knowledge-base

Author
Owner

SPIKE-006 Completed: Knowledge Base with pgvector for RAG System

The comprehensive spike document has been created at docs/spikes/SPIKE-006-knowledge-base-pgvector.md.

Executive Summary

Recommendation: Use pgvector with hybrid search as the primary solution for Syndarix RAG:

  • Why pgvector? Already using PostgreSQL, handles 10-100M vectors, transactional consistency, zero additional infrastructure
  • Migration path: Qdrant or Pinecone if scaling beyond 100M vectors per tenant

Key Findings

1. Vector Database Comparison

Database   Max Scale   Best For
pgvector   10-100M     PostgreSQL users, moderate scale
Pinecone   Billions    Zero-ops, enterprise
Qdrant     Billions    Open-source, high-performance filtering

2. Embedding Models Recommended

  • Code: voyage/voyage-code-3 (state-of-the-art for code retrieval)
  • Documentation: text-embedding-3-small (cost-effective, good quality)
  • General: text-embedding-3-small or nomic-embed-text (local fallback)

All are integrated via LiteLLM for a unified API.
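The per-content-type recommendations above can be captured in a small routing helper. This is a hypothetical sketch (the function name and `local_only` flag are my own); the returned model IDs are the ones named in the spike and would be passed to LiteLLM unchanged.

```python
# Illustrative mapping of content types to the embedding models
# recommended in this spike; model IDs are LiteLLM-style strings.
EMBEDDING_MODELS = {
    "code": "voyage/voyage-code-3",
    "documentation": "text-embedding-3-small",
    "general": "text-embedding-3-small",
}
LOCAL_FALLBACK = "nomic-embed-text"


def pick_embedding_model(content_type: str, local_only: bool = False) -> str:
    """Return the embedding model ID for a content type, with a local fallback."""
    if local_only:
        return LOCAL_FALLBACK
    return EMBEDDING_MODELS.get(content_type, EMBEDDING_MODELS["general"])
```

Keeping the routing in one place means a model swap is a one-line change rather than a migration of call sites.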

3. Chunking Strategies

Content Type    Strategy                        Chunk Size
Code            AST-based (functions/classes)   Per unit
Markdown        Heading-based                   Per section
PDF             Page-level + semantic           ~1000 tokens
Conversations   Turn-based                      Per exchange
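The heading-based strategy for Markdown can be sketched in a few lines. This is a minimal illustration, not the spike's chunker: it splits on ATX headings only and ignores the token budget and overlap handling a production pipeline would add.

```python
import re


def chunk_markdown(text: str) -> list[str]:
    """Split a Markdown document into one chunk per heading-delimited section.

    Minimal sketch: every line starting with '#'..'######' opens a new
    chunk; any prose before the first heading becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each chunk keeps its heading, which later doubles as citation metadata when the chunk is surfaced to an agent.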

4. Hybrid Search Architecture

  • Semantic: pgvector HNSW for vector similarity
  • Keyword: PostgreSQL tsvector for BM25-style matching
  • Fusion: Reciprocal Rank Fusion (RRF) combining both

Research shows 2-3x reduction in hallucinations vs single-route retrieval.
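The RRF fusion step is small enough to show in full. The sketch below uses the standard formulation (score = Σ 1/(k + rank) across rankers, with k=60 from the original RRF paper); input lists are assumed to be chunk IDs ordered best-first, one list from the vector route and one from the keyword route.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the rankings that
    contain it, so items ranked well by both routes rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem of normalizing cosine distances against BM25-style weights.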

5. Multi-Tenant Isolation

  • Shared knowledge_chunks table with project_id and agent_id columns
  • Row-Level Security (RLS) for additional isolation
  • Partial indexes for high-traffic projects
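The tenant-scoping rule can be modeled in plain Python to make the invariant explicit. This sketch (dict-based chunks and a brute-force cosine scan are stand-ins for the real table and HNSW index) mirrors the SQL plan `WHERE project_id = :pid ORDER BY embedding <=> :q`: the tenant filter is applied before ranking, so results can never leak across projects.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def scoped_search(chunks: list[dict], query_vec: list[float],
                  project_id: str, top_k: int = 5) -> list[dict]:
    """Return the top-k chunks for one project, ranked by cosine similarity."""
    candidates = [c for c in chunks if c["project_id"] == project_id]
    candidates.sort(key=lambda c: cosine(c["embedding"], query_vec), reverse=True)
    return candidates[:top_k]
```

In pgvector the same filter rides along in the index scan; the partial indexes mentioned above exist so that hot projects get an index built over exactly their rows.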

6. HNSW Index Configuration

```sql
-- Default: m=16, ef_construction=64
-- Large collections: m=32, ef_construction=128
-- Query tuning: SET hnsw.ef_search = 100
```
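The two build profiles above can be encoded as a tiny helper so migrations and backfills pick consistent parameters. The 1M-row cutoff for "large" is my own assumption, not from the spike; tune it against measured recall and build time.

```python
def hnsw_params(vector_count: int) -> dict:
    """Pick HNSW build parameters from the guidance above.

    Illustrative threshold: collections over 1M vectors get the
    larger m / ef_construction profile (cutoff is an assumption).
    """
    if vector_count > 1_000_000:
        return {"m": 32, "ef_construction": 128}
    return {"m": 16, "ef_construction": 64}
```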

Schema Design Included

  • knowledge_chunks table with vector embedding, full-text search
  • knowledge_collections for organizing chunks
  • Complete Alembic migration script
  • SQLAlchemy models with pgvector integration

Code Examples Provided

  • KnowledgeBaseService - Main service with ingest/search methods
  • HybridSearchService - RRF-based hybrid search
  • Content-type aware chunkers (Code, Markdown, Semantic)
  • MCP tool for agent knowledge access
  • Agent context builder for RAG prompts
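The "agent context builder" deliverable amounts to packing retrieved chunks into a grounded prompt. A hedged sketch, assuming chunks carry `source` and `text` keys and that a flat character budget is an acceptable stand-in for token counting:

```python
def build_rag_context(chunks: list[dict], question: str,
                      max_chars: int = 4000) -> str:
    """Assemble retrieved chunks into a grounded prompt for an agent.

    Chunks are included in retrieval order until the character budget
    is exhausted; each is tagged with its source for citation.
    """
    parts: list[str] = []
    used = 0
    for chunk in chunks:
        block = f"[{chunk['source']}]\n{chunk['text']}\n"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n".join(parts)
    return (
        "Answer using only the context below; say so if it is insufficient.\n\n"
        f"Context:\n{context}\nQuestion: {question}"
    )
```

The instruction line at the top is what buys the hallucination reduction the hybrid-search section cites: the agent is told to refuse rather than invent.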

Performance Targets

Vector Count   Target Latency   Memory (1536-dim)
<100K          <20ms            ~600MB
1M             <50ms            ~9.2GB
10M            <100ms           ~92GB
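The memory column follows from simple arithmetic: float32 vectors cost 4 bytes per dimension. The 1M and 10M rows are consistent with an assumed ~1.5x HNSW index overhead on top of raw storage, while the <100K row matches raw vectors alone (1536 × 4 × 100K ≈ 614MB); the overhead factor below is that assumption, not a measurement.

```python
def vector_memory_bytes(n_vectors: int, dims: int = 1536,
                        index_overhead: float = 1.5) -> int:
    """Rough memory estimate: 4-byte float32 per dimension, times an
    assumed ~1.5x HNSW index overhead (measure on real data)."""
    return int(n_vectors * dims * 4 * index_overhead)
```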

Next Steps

  1. Review spike document and provide feedback
  2. Create ADR-006: Knowledge Base Architecture
  3. Implement database migration
  4. Build ingestion pipeline with Celery
  5. Create MCP knowledge tools

Spike document: docs/spikes/SPIKE-006-knowledge-base-pgvector.md
