feat(mcp): Implement Knowledge Base MCP Server #57

Closed
opened 2026-01-03 01:24:54 +00:00 by cardosofelipe · 0 comments

Summary

Implement the Knowledge Base MCP server that provides RAG (Retrieval-Augmented Generation) capabilities using pgvector for semantic search. This enables agents to search and retrieve project-specific knowledge.

Sub-Tasks

1. Project Setup

  • Initialize FastMCP project in mcp-servers/knowledge-base/
  • Create pyproject.toml with dependencies
  • Add fastmcp>=0.4.0, asyncpg>=0.29.0, pgvector>=0.2.0
  • Add tiktoken>=0.5.0 for token counting
  • Create Docker configuration (Dockerfile, .dockerignore)
  • Add to docker-compose.dev.yml
  • Create README.md with setup instructions

2. Database Schema (migrations/)

  • Create Alembic migration for knowledge_embeddings table
  • Add id (UUID, primary key)
  • Add project_id (UUID, foreign key to projects)
  • Add collection (VARCHAR(100), e.g., 'code', 'docs', 'conversations')
  • Add source_path (VARCHAR(500), file path or URL)
  • Add source_type (VARCHAR(50), 'file', 'url', 'conversation')
  • Add chunk_index (INTEGER, position in source)
  • Add chunk_hash (VARCHAR(64), SHA256 for dedup)
  • Add content (TEXT, the actual text)
  • Add embedding (vector(1536), OpenAI ada-002 dimension)
  • Add metadata (JSONB, custom attributes)
  • Add token_count (INTEGER, cached token count)
  • Add created_at, updated_at (TIMESTAMPTZ)
  • Create IVFFlat index on embedding column
  • Create composite index on (project_id, collection)
  • Create index on source_path for lookups
  • Create GIN index on content for full-text search

3. Embedding Generation (embeddings.py)

  • Create EmbeddingService class (sketched after this list)
  • Integrate with LLM Gateway MCP for embeddings
  • Implement batch embedding (up to 100 texts)
  • Add caching for repeated texts (Redis)
  • Handle embedding dimension validation
  • Add retry logic for API failures
  • Track embedding costs
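
A minimal sketch of how these pieces could fit together, assuming a hypothetical `gateway.embed(texts)` method on the LLM Gateway MCP client (#56) and the `redis` package for caching; cost tracking is omitted:

```python
import asyncio
import hashlib
import json

import redis.asyncio as aioredis


class EmbeddingService:
    """Embeds texts via the LLM Gateway, caching vectors in Redis."""

    BATCH_SIZE = 100  # cap per gateway call, per the task above

    def __init__(self, gateway, redis_url: str = "redis://localhost:6379/0"):
        self._gateway = gateway  # hypothetical LLM Gateway MCP client
        self._redis = aioredis.from_url(redis_url)

    async def embed(self, texts: list[str]) -> list[list[float]]:
        keys = ["emb:" + hashlib.sha256(t.encode()).hexdigest() for t in texts]
        cached = await self._redis.mget(keys)
        vectors = [json.loads(c) if c is not None else None for c in cached]
        misses = [i for i, v in enumerate(vectors) if v is None]
        # Embed cache misses in batches, with retries, then backfill the cache.
        for start in range(0, len(misses), self.BATCH_SIZE):
            batch = misses[start:start + self.BATCH_SIZE]
            embedded = await self._embed_with_retry([texts[i] for i in batch])
            for i, vec in zip(batch, embedded):
                if len(vec) != 1536:  # must match the vector(1536) column
                    raise ValueError(f"unexpected embedding dim {len(vec)}")
                vectors[i] = vec
                await self._redis.set(keys[i], json.dumps(vec), ex=86400)
        return vectors

    async def _embed_with_retry(self, batch: list[str], attempts: int = 3):
        for attempt in range(attempts):
            try:
                return await self._gateway.embed(batch)  # assumed interface
            except Exception:
                if attempt == attempts - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # exponential backoff
```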

4. Code Chunking Strategy (chunking/code.py)

  • Create CodeChunker class
  • Parse Python files using AST (see the sketch below)
  • Parse TypeScript/JavaScript using tree-sitter
  • Extract function/class level chunks
  • Include imports and decorators in context
  • Target ~500 tokens per chunk
  • Add a 50-token overlap between chunks
  • Preserve syntax highlighting hints in metadata
  • Handle edge cases (minified code, very long functions)
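
A sketch of the Python path using the stdlib `ast` module (the TypeScript/JavaScript path via tree-sitter follows the same shape); the ~500-token budget, overlap, and decorator handling are left to the real implementation:

```python
import ast


def chunk_python(source: str, path: str) -> list[dict]:
    """One chunk per top-level function/class, with imports as shared context."""
    tree = ast.parse(source)  # a real version would catch SyntaxError
    imports = [
        ast.get_source_segment(source, n)
        for n in tree.body
        if isinstance(n, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(i for i in imports if i)
    definitions = [
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    chunks = []
    for index, node in enumerate(definitions):
        # NOTE: decorators sit above node.lineno, so a full version would
        # also prepend the segments of node.decorator_list.
        segment = ast.get_source_segment(source, node) or ""
        chunks.append({
            "content": f"{header}\n\n{segment}" if header else segment,
            "chunk_index": index,
            "metadata": {"language": "python", "symbol": node.name,
                         "source_path": path},
        })
    return chunks
```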

5. Markdown Chunking Strategy (chunking/markdown.py)

  • Create MarkdownChunker class
  • Parse markdown structure (headers, lists, code blocks)
  • Split on header boundaries (h1, h2, h3; see the sketch below)
  • Keep code blocks intact
  • Target ~800 tokens per chunk
  • Add a 100-token overlap
  • Preserve header hierarchy in metadata
  • Handle tables and special formatting
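
A sketch of header-boundary splitting that never cuts inside a fenced code block; token budgeting (~800 tokens, 100-token overlap) and table handling are omitted:

```python
import re


def split_markdown(text: str) -> list[dict]:
    """Split on h1/h2/h3 boundaries, recording the header hierarchy."""
    sections: list[dict] = []
    heading_path: list[str] = []  # current h1/h2/h3 path
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # keep code blocks intact
        match = None if in_fence else re.match(r"^(#{1,3})\s+(.*)", line)
        if match and current:
            sections.append({"content": "\n".join(current),
                             "metadata": {"headers": list(heading_path)}})
            current = []
        if match:
            level, title = len(match.group(1)), match.group(2)
            heading_path = heading_path[:level - 1] + [title]
        current.append(line)
    if current:
        sections.append({"content": "\n".join(current),
                         "metadata": {"headers": list(heading_path)}})
    return sections
```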

6. Text Chunking Strategy (chunking/text.py)

  • Create TextChunker class
  • Implement sentence-based splitting (sketched below)
  • Use NLTK or spaCy for sentence detection
  • Target ~400 tokens per chunk
  • Add a 50-token overlap
  • Handle edge cases (no periods, very long sentences)
  • Create recursive splitter for oversized chunks
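
A sketch of the greedy sentence packer; a regex split stands in for the NLTK/spaCy sentence detector, and `tiktoken` (already a dependency) does the counting with `cl100k_base`, the encoding used by ada-002:

```python
import re

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, target: int = 400, overlap: int = 50) -> list[str]:
    """Pack sentences up to ~target tokens, carrying ~overlap tokens forward."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(ENC.encode(sentence))
        if count + n > target and current:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of this one as overlap.
            tail: list[str] = []
            t = 0
            for s in reversed(current):
                t += len(ENC.encode(s))
                tail.insert(0, s)
                if t >= overlap:
                    break
            current, count = tail, t
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    # Sentences longer than `target` still need the recursive splitter.
    return chunks
```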

7. Document Ingestion Pipeline (ingestion.py)

  • Create IngestionPipeline class
  • Detect file type from extension/content
  • Route to appropriate chunker
  • Generate embeddings for chunks
  • Compute chunk hashes for deduplication
  • Store in PostgreSQL with pgvector
  • Handle update vs insert logic (upsert sketch below)
  • Implement batch processing for efficiency
  • Add progress reporting
  • Handle encoding issues (UTF-8, etc.)
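
One way to express the update-vs-insert logic is a single upsert keyed on the table's unique constraint, skipping rewrites whose hash is unchanged. A sketch with asyncpg and the pgvector adapter, assuming chunk dicts carry `content`, `metadata`, and `token_count`; note that the conflict target only fires for non-NULL `source_path` (SQL NULL semantics), so inline documents need a separate dedup path:

```python
import hashlib
import json

import asyncpg
from pgvector.asyncpg import register_vector

UPSERT = """
INSERT INTO knowledge_embeddings
    (project_id, collection, source_path, source_type,
     chunk_index, chunk_hash, content, embedding, metadata, token_count)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
ON CONFLICT (project_id, source_path, chunk_index) DO UPDATE SET
    content     = EXCLUDED.content,
    chunk_hash  = EXCLUDED.chunk_hash,
    embedding   = EXCLUDED.embedding,
    metadata    = EXCLUDED.metadata,
    token_count = EXCLUDED.token_count,
    updated_at  = NOW()
WHERE knowledge_embeddings.chunk_hash <> EXCLUDED.chunk_hash
"""


async def store_chunks(conn: asyncpg.Connection, project_id: str,
                       collection: str, source_path: str, source_type: str,
                       chunks: list[dict], embeddings: list[list[float]]) -> None:
    """Upsert chunks; chunks with an unchanged hash are left untouched."""
    await register_vector(conn)  # map Python lists to the `vector` type
    rows = [
        (project_id, collection, source_path, source_type, i,
         hashlib.sha256(c["content"].encode("utf-8")).hexdigest(),
         c["content"], emb, json.dumps(c.get("metadata", {})),
         c["token_count"])
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ]
    await conn.executemany(UPSERT, rows)
```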

8. Semantic Search (search/semantic.py)

  • Create SemanticSearch class
  • Generate query embedding
  • Perform cosine similarity search via pgvector (query sketched below)
  • Implement configurable threshold (default 0.7)
  • Return ranked results with scores
  • Include metadata in results
  • Add query expansion (optional)
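
The core retrieval query, assuming `register_vector` from `pgvector.asyncpg` has been run on the connection; `<=>` is pgvector's cosine-distance operator, so similarity is `1 - distance`:

```python
SEMANTIC_SQL = """
SELECT id, source_path, chunk_index, content, metadata,
       1 - (embedding <=> $2) AS score   -- cosine similarity
FROM knowledge_embeddings
WHERE project_id = $1
  AND ($3::varchar IS NULL OR collection = $3)
ORDER BY embedding <=> $2                -- cosine distance, IVFFlat-indexed
LIMIT $4
"""


async def semantic_search(conn, project_id: str, query_embedding: list[float],
                          collection: str | None = None,
                          limit: int = 10, threshold: float = 0.7) -> list[dict]:
    rows = await conn.fetch(SEMANTIC_SQL, project_id, query_embedding,
                            collection, limit)
    # Apply the similarity threshold after retrieval so the LIMIT scan
    # can still use the IVFFlat index.
    return [dict(r) for r in rows if r["score"] >= threshold]
```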

9. Keyword Search (search/keyword.py)

  • Create KeywordSearch class
  • Use PostgreSQL full-text search (tsvector; see the query below)
  • Support phrase matching
  • Implement relevance ranking
  • Handle special characters
  • Support prefix matching
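
A sketch of the ranking query; `plainto_tsquery` sanitizes special characters, while phrase matching would use `phraseto_tsquery` and prefix matching `to_tsquery` with `:*` terms. The `to_tsvector('english', content)` expression matches the GIN index in the schema spec so the planner can use it:

```python
KEYWORD_SQL = """
SELECT id, source_path, chunk_index, content, metadata,
       ts_rank(to_tsvector('english', content), q) AS score
FROM knowledge_embeddings,
     plainto_tsquery('english', $2) AS q   -- sanitizes user input
WHERE project_id = $1
  AND to_tsvector('english', content) @@ q -- uses the GIN expression index
ORDER BY score DESC
LIMIT $3
"""


async def keyword_search(conn, project_id: str, query: str,
                         limit: int = 10) -> list[dict]:
    rows = await conn.fetch(KEYWORD_SQL, project_id, query, limit)
    return [dict(r) for r in rows]
```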

10. Hybrid Search (search/hybrid.py)

  • Create HybridSearch class
  • Combine semantic and keyword results
  • Implement Reciprocal Rank Fusion (RRF; sketched below)
  • Configurable weight between semantic/keyword
  • Deduplicate overlapping results
  • Boost exact matches
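
RRF itself is only a few lines; a sketch over two ranked lists of result dicts, with `k = 60` as the conventional constant and a configurable semantic/keyword weight:

```python
def reciprocal_rank_fusion(semantic: list[dict], keyword: list[dict],
                           k: int = 60,
                           semantic_weight: float = 0.5) -> list[dict]:
    """Fuse ranked lists: score(d) = sum over lists of w / (k + rank(d))."""
    fused: dict = {}
    for weight, results in ((semantic_weight, semantic),
                            (1.0 - semantic_weight, keyword)):
        for rank, item in enumerate(results, start=1):
            entry = fused.setdefault(item["id"], {"item": item, "score": 0.0})
            entry["score"] += weight / (k + rank)
    ranked = sorted(fused.values(), key=lambda e: e["score"], reverse=True)
    # Overlapping results accumulate score from both lists, which both
    # deduplicates them and boosts documents found by both searches.
    return [{**e["item"], "rrf_score": e["score"]} for e in ranked]
```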

11. Collection Management (collections.py)

  • Create CollectionManager class
  • List collections for a project
  • Get collection statistics (document count, size)
  • Delete collection contents
  • Rename collection
  • Export collection to file

12. MCP Tools Implementation (server.py)

  • Implement ingest_document tool
    • Accept project_id, collection, content, source_path
    • Accept metadata as optional dict
    • Return chunk count and status
  • Implement ingest_file tool
    • Accept file path or URL
    • Auto-detect file type
    • Stream large files
  • Implement search tool
    • Accept project_id, query, collection filter
    • Accept limit, threshold, search_type parameters
    • Return ranked results with snippets
  • Implement delete_document tool
    • Delete by source_path
    • Delete by collection
    • Return deleted count
  • Implement get_collections tool
    • Return list with stats
  • Implement get_document tool
    • Retrieve full document by source_path
    • Return all chunks ordered

13. Project Isolation

  • Enforce project_id in all queries
  • Add RLS (Row Level Security) policies if needed (migration sketch below)
  • Validate project access in API layer
  • Prevent cross-project data leakage
  • Add project-level quotas (optional)
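
If RLS is adopted, one possible shape as an Alembic migration; the `app.current_project_id` session setting is an assumption, and the API layer would run `SET LOCAL app.current_project_id = '<uuid>'` inside each transaction:

```python
from alembic import op


def upgrade() -> None:
    op.execute("ALTER TABLE knowledge_embeddings ENABLE ROW LEVEL SECURITY")
    # FORCE makes the policy apply even to the table owner, which matters
    # if the application role owns the table.
    op.execute("ALTER TABLE knowledge_embeddings FORCE ROW LEVEL SECURITY")
    op.execute("""
        CREATE POLICY project_isolation ON knowledge_embeddings
        USING (project_id = current_setting('app.current_project_id')::uuid)
    """)


def downgrade() -> None:
    op.execute("DROP POLICY project_isolation ON knowledge_embeddings")
    op.execute("ALTER TABLE knowledge_embeddings DISABLE ROW LEVEL SECURITY")
```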

14. Performance Optimization

  • Implement connection pooling for PostgreSQL
  • Add query result caching (Redis)
  • Optimize IVFFlat parameters (lists, probes); see the pooling sketch below
  • Batch embedding requests
  • Add async ingestion queue
  • Monitor query latency
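
Connection pooling and IVFFlat probe tuning can share one per-connection init hook; a sketch with asyncpg (`probes = 10` is just a starting point to tune against the index's `lists = 100`):

```python
import asyncpg
from pgvector.asyncpg import register_vector


async def init_connection(conn: asyncpg.Connection) -> None:
    await register_vector(conn)  # map the pgvector type once per connection
    # More probes = better recall but slower queries; tune together with
    # the index's `lists` parameter.
    await conn.execute("SET ivfflat.probes = 10")


async def create_pool(dsn: str) -> asyncpg.Pool:
    # `init` runs on every new connection the pool opens.
    return await asyncpg.create_pool(dsn, min_size=2, max_size=10,
                                     init=init_connection)
```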

15. Docker & Deployment

  • Create optimized Dockerfile
  • Add health check endpoint
  • Configure database connection pooling
  • Add resource limits
  • Create initialization script for indexes

16. Testing

  • Create unit tests for each chunker
  • Create unit tests for search implementations
  • Create integration tests with test PostgreSQL
  • Test embedding generation with mocks
  • Test hybrid search ranking
  • Test project isolation (example test below)
  • Achieve >90% code coverage
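
The isolation test is the one that must never regress; a sketch assuming pytest-asyncio and a hypothetical `kb` fixture wired to a test PostgreSQL with pgvector:

```python
import uuid

import pytest


@pytest.mark.asyncio
async def test_search_never_leaks_across_projects(kb):
    """`kb` is an assumed fixture exposing the MCP tools against a test DB."""
    project_a, project_b = str(uuid.uuid4()), str(uuid.uuid4())
    await kb.ingest_document(project_id=project_a, collection="docs",
                             content="the launch code is swordfish")
    results = await kb.search(project_id=project_b, query="launch code")
    assert results == []  # project B must never see project A's chunks
```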

17. Documentation

  • Document all MCP tools with examples
  • Create ingestion guide
  • Document search parameters
  • Add performance tuning guide
  • Create troubleshooting guide

Technical Specifications

Database Schema

```sql
CREATE TABLE knowledge_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    collection VARCHAR(100) NOT NULL,
    source_path VARCHAR(500),
    source_type VARCHAR(50) NOT NULL,
    chunk_index INTEGER NOT NULL DEFAULT 0,
    chunk_hash VARCHAR(64) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB DEFAULT '{}',
    token_count INTEGER NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(project_id, source_path, chunk_index)
);

CREATE INDEX idx_embeddings_project_collection ON knowledge_embeddings(project_id, collection);
CREATE INDEX idx_embeddings_source ON knowledge_embeddings(source_path);
CREATE INDEX idx_embeddings_vector ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_embeddings_content_fts ON knowledge_embeddings USING GIN (to_tsvector('english', content));
```

MCP Tools

```python
@mcp.tool()
async def ingest_document(
    project_id: str,
    collection: str,
    content: str,
    source_path: str | None = None,
    source_type: str = "text",
    metadata: dict | None = None,
) -> IngestResult:
    """Ingest a document into the knowledge base."""

@mcp.tool()
async def search(
    project_id: str,
    query: str,
    collection: str | None = None,
    limit: int = 10,
    threshold: float = 0.7,
    search_type: str = "hybrid",
) -> list[SearchResult]:
    """Search the knowledge base."""

@mcp.tool()
async def delete_document(
    project_id: str,
    source_path: str | None = None,
    collection: str | None = None,
) -> DeleteResult:
    """Delete documents from the knowledge base."""

@mcp.tool()
async def get_collections(
    project_id: str,
) -> list[CollectionInfo]:
    """List all collections for a project."""
```

Acceptance Criteria

  • pgvector schema created via migration
  • Code chunking works for Python and TypeScript
  • Markdown chunking preserves structure
  • Text chunking handles various formats
  • Semantic search returns relevant results
  • Keyword search works with full-text
  • Hybrid search combines both effectively
  • Per-project collection isolation enforced
  • All MCP tools documented and working
  • Unit tests >90% coverage
  • Integration tests with real pgvector
  • Docker container builds and runs
  • Documentation complete

Dependencies

  • Depends on: #55 (MCP Client Infrastructure), #56 (LLM Gateway - for embeddings)
  • Blocks: Phase 3 Agent Orchestration (knowledge retrieval)

Assignable To

backend-engineer agent

cardosofelipe added the database, mcp, phase-2 labels 2026-01-03 01:25:45 +00:00