feat(mcp): Implement Knowledge Base MCP Server #57

Closed
opened 2026-01-03 01:24:54 +00:00 by cardosofelipe · 0 comments

Summary

Implement the Knowledge Base MCP server that provides RAG (Retrieval-Augmented Generation) capabilities using pgvector for semantic search. This enables agents to search and retrieve project-specific knowledge.

Sub-Tasks

1. Project Setup

  • Initialize FastMCP project in mcp-servers/knowledge-base/
  • Create pyproject.toml with dependencies
  • Add fastmcp>=0.4.0, asyncpg>=0.29.0, pgvector>=0.2.0
  • Add tiktoken>=0.5.0 for token counting
  • Create Docker configuration (Dockerfile, .dockerignore)
  • Add to docker-compose.dev.yml
  • Create README.md with setup instructions

2. Database Schema (migrations/)

  • Create Alembic migration for knowledge_embeddings table
  • Add id (UUID, primary key)
  • Add project_id (UUID, foreign key to projects)
  • Add collection (VARCHAR(100), e.g., 'code', 'docs', 'conversations')
  • Add source_path (VARCHAR(500), file path or URL)
  • Add source_type (VARCHAR(50), 'file', 'url', 'conversation')
  • Add chunk_index (INTEGER, position in source)
  • Add chunk_hash (VARCHAR(64), SHA256 for dedup)
  • Add content (TEXT, the actual text)
  • Add embedding (vector(1536), OpenAI ada-002 dimension)
  • Add metadata (JSONB, custom attributes)
  • Add token_count (INTEGER, cached token count)
  • Add created_at, updated_at (TIMESTAMPTZ)
  • Create IVFFlat index on embedding column
  • Create composite index on (project_id, collection)
  • Create index on source_path for lookups
  • Create GIN index on content for full-text search

3. Embedding Generation (embeddings.py)

  • Create EmbeddingService class (sketched after this list)
  • Integrate with LLM Gateway MCP for embeddings
  • Implement batch embedding (up to 100 texts)
  • Add caching for repeated texts (Redis)
  • Handle embedding dimension validation
  • Add retry logic for API failures
  • Track embedding costs
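
A minimal sketch of how these pieces could fit together, assuming a hypothetical `gateway.embed(texts)` method on the LLM Gateway MCP client (#56) and the `redis` package for caching; cost tracking is omitted:

```python
import asyncio
import hashlib
import json

import redis.asyncio as aioredis


class EmbeddingService:
    """Embeds texts via the LLM Gateway, caching vectors in Redis."""

    BATCH_SIZE = 100  # cap per gateway call, per the task above

    def __init__(self, gateway, redis_url: str = "redis://localhost:6379/0"):
        self._gateway = gateway  # hypothetical LLM Gateway MCP client
        self._redis = aioredis.from_url(redis_url)

    async def embed(self, texts: list[str]) -> list[list[float]]:
        keys = ["emb:" + hashlib.sha256(t.encode()).hexdigest() for t in texts]
        cached = await self._redis.mget(keys)
        vectors = [json.loads(c) if c is not None else None for c in cached]
        misses = [i for i, v in enumerate(vectors) if v is None]
        # Embed cache misses in batches, with retries, then backfill the cache.
        for start in range(0, len(misses), self.BATCH_SIZE):
            batch = misses[start:start + self.BATCH_SIZE]
            embedded = await self._embed_with_retry([texts[i] for i in batch])
            for i, vec in zip(batch, embedded):
                if len(vec) != 1536:  # must match the vector(1536) column
                    raise ValueError(f"unexpected embedding dim {len(vec)}")
                vectors[i] = vec
                await self._redis.set(keys[i], json.dumps(vec), ex=86400)
        return vectors

    async def _embed_with_retry(self, batch: list[str], attempts: int = 3):
        for attempt in range(attempts):
            try:
                return await self._gateway.embed(batch)  # assumed interface
            except Exception:
                if attempt == attempts - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # exponential backoff
```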

4. Code Chunking Strategy (chunking/code.py)

  • Create CodeChunker class
  • Parse Python files using AST (see the sketch below)
  • Parse TypeScript/JavaScript using tree-sitter
  • Extract function/class level chunks
  • Include imports and decorators in context
  • Target ~500 tokens per chunk
  • Add a 50-token overlap between chunks
  • Preserve syntax highlighting hints in metadata
  • Handle edge cases (minified code, very long functions)
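
A sketch of the Python path using the stdlib `ast` module (the TypeScript/JavaScript path via tree-sitter follows the same shape); the ~500-token budget, overlap, and decorator handling are left to the real implementation:

```python
import ast


def chunk_python(source: str, path: str) -> list[dict]:
    """One chunk per top-level function/class, with imports as shared context."""
    tree = ast.parse(source)  # a real version would catch SyntaxError
    imports = [
        ast.get_source_segment(source, n)
        for n in tree.body
        if isinstance(n, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(i for i in imports if i)
    definitions = [
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    chunks = []
    for index, node in enumerate(definitions):
        # NOTE: decorators sit above node.lineno, so a full version would
        # also prepend the segments of node.decorator_list.
        segment = ast.get_source_segment(source, node) or ""
        chunks.append({
            "content": f"{header}\n\n{segment}" if header else segment,
            "chunk_index": index,
            "metadata": {"language": "python", "symbol": node.name,
                         "source_path": path},
        })
    return chunks
```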

5. Markdown Chunking Strategy (chunking/markdown.py)

  • Create MarkdownChunker class
  • Parse markdown structure (headers, lists, code blocks)
  • Split on header boundaries (h1, h2, h3; see the sketch below)
  • Keep code blocks intact
  • Target ~800 tokens per chunk
  • Add a 100-token overlap
  • Preserve header hierarchy in metadata
  • Handle tables and special formatting
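
A sketch of header-boundary splitting that never cuts inside a fenced code block; token budgeting (~800 tokens, 100-token overlap) and table handling are omitted:

```python
import re


def split_markdown(text: str) -> list[dict]:
    """Split on h1/h2/h3 boundaries, recording the header hierarchy."""
    sections: list[dict] = []
    heading_path: list[str] = []  # current h1/h2/h3 path
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # keep code blocks intact
        match = None if in_fence else re.match(r"^(#{1,3})\s+(.*)", line)
        if match and current:
            sections.append({"content": "\n".join(current),
                             "metadata": {"headers": list(heading_path)}})
            current = []
        if match:
            level, title = len(match.group(1)), match.group(2)
            heading_path = heading_path[:level - 1] + [title]
        current.append(line)
    if current:
        sections.append({"content": "\n".join(current),
                         "metadata": {"headers": list(heading_path)}})
    return sections
```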

6. Text Chunking Strategy (chunking/text.py)

  • Create TextChunker class
  • Implement sentence-based splitting (sketched below)
  • Use NLTK or spaCy for sentence detection
  • Target ~400 tokens per chunk
  • Add a 50-token overlap
  • Handle edge cases (no periods, very long sentences)
  • Create recursive splitter for oversized chunks
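
A sketch of the greedy sentence packer; a regex split stands in for the NLTK/spaCy sentence detector, and `tiktoken` (already a dependency) does the counting with `cl100k_base`, the encoding used by ada-002:

```python
import re

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, target: int = 400, overlap: int = 50) -> list[str]:
    """Pack sentences up to ~target tokens, carrying ~overlap tokens forward."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(ENC.encode(sentence))
        if count + n > target and current:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of this one as overlap.
            tail: list[str] = []
            t = 0
            for s in reversed(current):
                t += len(ENC.encode(s))
                tail.insert(0, s)
                if t >= overlap:
                    break
            current, count = tail, t
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    # Sentences longer than `target` still need the recursive splitter.
    return chunks
```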

7. Document Ingestion Pipeline (ingestion.py)

  • Create IngestionPipeline class
  • Detect file type from extension/content
  • Route to appropriate chunker
  • Generate embeddings for chunks
  • Compute chunk hashes for deduplication
  • Store in PostgreSQL with pgvector
  • Handle update vs insert logic (upsert sketch below)
  • Implement batch processing for efficiency
  • Add progress reporting
  • Handle encoding issues (UTF-8, etc.)
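
One way to express the update-vs-insert logic is a single upsert keyed on the table's unique constraint, skipping rewrites whose hash is unchanged. A sketch with asyncpg and the pgvector adapter, assuming chunk dicts carry `content`, `metadata`, and `token_count`; note that the conflict target only fires for non-NULL `source_path` (SQL NULL semantics), so inline documents need a separate dedup path:

```python
import hashlib
import json

import asyncpg
from pgvector.asyncpg import register_vector

UPSERT = """
INSERT INTO knowledge_embeddings
    (project_id, collection, source_path, source_type,
     chunk_index, chunk_hash, content, embedding, metadata, token_count)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
ON CONFLICT (project_id, source_path, chunk_index) DO UPDATE SET
    content     = EXCLUDED.content,
    chunk_hash  = EXCLUDED.chunk_hash,
    embedding   = EXCLUDED.embedding,
    metadata    = EXCLUDED.metadata,
    token_count = EXCLUDED.token_count,
    updated_at  = NOW()
WHERE knowledge_embeddings.chunk_hash <> EXCLUDED.chunk_hash
"""


async def store_chunks(conn: asyncpg.Connection, project_id: str,
                       collection: str, source_path: str, source_type: str,
                       chunks: list[dict], embeddings: list[list[float]]) -> None:
    """Upsert chunks; chunks with an unchanged hash are left untouched."""
    await register_vector(conn)  # map Python lists to the `vector` type
    rows = [
        (project_id, collection, source_path, source_type, i,
         hashlib.sha256(c["content"].encode("utf-8")).hexdigest(),
         c["content"], emb, json.dumps(c.get("metadata", {})),
         c["token_count"])
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ]
    await conn.executemany(UPSERT, rows)
```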

8. Semantic Search (search/semantic.py)

  • Create SemanticSearch class
  • Generate query embedding
  • Perform cosine similarity search via pgvector (query sketched below)
  • Implement configurable threshold (default 0.7)
  • Return ranked results with scores
  • Include metadata in results
  • Add query expansion (optional)
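
The core retrieval query, assuming `register_vector` from `pgvector.asyncpg` has been run on the connection; `<=>` is pgvector's cosine-distance operator, so similarity is `1 - distance`:

```python
SEMANTIC_SQL = """
SELECT id, source_path, chunk_index, content, metadata,
       1 - (embedding <=> $2) AS score   -- cosine similarity
FROM knowledge_embeddings
WHERE project_id = $1
  AND ($3::varchar IS NULL OR collection = $3)
ORDER BY embedding <=> $2                -- cosine distance, IVFFlat-indexed
LIMIT $4
"""


async def semantic_search(conn, project_id: str, query_embedding: list[float],
                          collection: str | None = None,
                          limit: int = 10, threshold: float = 0.7) -> list[dict]:
    rows = await conn.fetch(SEMANTIC_SQL, project_id, query_embedding,
                            collection, limit)
    # Apply the similarity threshold after retrieval so the LIMIT scan
    # can still use the IVFFlat index.
    return [dict(r) for r in rows if r["score"] >= threshold]
```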

9. Keyword Search (search/keyword.py)

  • Create KeywordSearch class
  • Use PostgreSQL full-text search (tsvector; see the query below)
  • Support phrase matching
  • Implement relevance ranking
  • Handle special characters
  • Support prefix matching
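
A sketch of the ranking query; `plainto_tsquery` sanitizes special characters, while phrase matching would use `phraseto_tsquery` and prefix matching `to_tsquery` with `:*` terms. The `to_tsvector('english', content)` expression matches the GIN index in the schema spec so the planner can use it:

```python
KEYWORD_SQL = """
SELECT id, source_path, chunk_index, content, metadata,
       ts_rank(to_tsvector('english', content), q) AS score
FROM knowledge_embeddings,
     plainto_tsquery('english', $2) AS q   -- sanitizes user input
WHERE project_id = $1
  AND to_tsvector('english', content) @@ q -- uses the GIN expression index
ORDER BY score DESC
LIMIT $3
"""


async def keyword_search(conn, project_id: str, query: str,
                         limit: int = 10) -> list[dict]:
    rows = await conn.fetch(KEYWORD_SQL, project_id, query, limit)
    return [dict(r) for r in rows]
```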

10. Hybrid Search (search/hybrid.py)

  • Create HybridSearch class
  • Combine semantic and keyword results
  • Implement Reciprocal Rank Fusion (RRF; sketched below)
  • Configurable weight between semantic/keyword
  • Deduplicate overlapping results
  • Boost exact matches
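
RRF itself is only a few lines; a sketch over two ranked lists of result dicts, with `k = 60` as the conventional constant and a configurable semantic/keyword weight:

```python
def reciprocal_rank_fusion(semantic: list[dict], keyword: list[dict],
                           k: int = 60,
                           semantic_weight: float = 0.5) -> list[dict]:
    """Fuse ranked lists: score(d) = sum over lists of w / (k + rank(d))."""
    fused: dict = {}
    for weight, results in ((semantic_weight, semantic),
                            (1.0 - semantic_weight, keyword)):
        for rank, item in enumerate(results, start=1):
            entry = fused.setdefault(item["id"], {"item": item, "score": 0.0})
            entry["score"] += weight / (k + rank)
    ranked = sorted(fused.values(), key=lambda e: e["score"], reverse=True)
    # Overlapping results accumulate score from both lists, which both
    # deduplicates them and boosts documents found by both searches.
    return [{**e["item"], "rrf_score": e["score"]} for e in ranked]
```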

11. Collection Management (collections.py)

  • Create CollectionManager class
  • List collections for a project
  • Get collection statistics (document count, size)
  • Delete collection contents
  • Rename collection
  • Export collection to file

12. MCP Tools Implementation (server.py)

  • Implement ingest_document tool
    • Accept project_id, collection, content, source_path
    • Accept metadata as optional dict
    • Return chunk count and status
  • Implement ingest_file tool
    • Accept file path or URL
    • Auto-detect file type
    • Stream large files
  • Implement search tool
    • Accept project_id, query, collection filter
    • Accept limit, threshold, search_type parameters
    • Return ranked results with snippets
  • Implement delete_document tool
    • Delete by source_path
    • Delete by collection
    • Return deleted count
  • Implement get_collections tool
    • Return list with stats
  • Implement get_document tool
    • Retrieve full document by source_path
    • Return all chunks ordered

13. Project Isolation

  • Enforce project_id in all queries
  • Add RLS (Row Level Security) policies if needed (migration sketch below)
  • Validate project access in API layer
  • Prevent cross-project data leakage
  • Add project-level quotas (optional)
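
If RLS is adopted, one possible shape as an Alembic migration; the `app.current_project_id` session setting is an assumption, and the API layer would run `SET LOCAL app.current_project_id = '<uuid>'` inside each transaction:

```python
from alembic import op


def upgrade() -> None:
    op.execute("ALTER TABLE knowledge_embeddings ENABLE ROW LEVEL SECURITY")
    # FORCE makes the policy apply even to the table owner, which matters
    # if the application role owns the table.
    op.execute("ALTER TABLE knowledge_embeddings FORCE ROW LEVEL SECURITY")
    op.execute("""
        CREATE POLICY project_isolation ON knowledge_embeddings
        USING (project_id = current_setting('app.current_project_id')::uuid)
    """)


def downgrade() -> None:
    op.execute("DROP POLICY project_isolation ON knowledge_embeddings")
    op.execute("ALTER TABLE knowledge_embeddings DISABLE ROW LEVEL SECURITY")
```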

14. Performance Optimization

  • Implement connection pooling for PostgreSQL
  • Add query result caching (Redis)
  • Optimize IVFFlat parameters (lists, probes); see the pooling sketch below
  • Batch embedding requests
  • Add async ingestion queue
  • Monitor query latency
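
Connection pooling and IVFFlat probe tuning can share one per-connection init hook; a sketch with asyncpg (`probes = 10` is just a starting point to tune against the index's `lists = 100`):

```python
import asyncpg
from pgvector.asyncpg import register_vector


async def init_connection(conn: asyncpg.Connection) -> None:
    await register_vector(conn)  # map the pgvector type once per connection
    # More probes = better recall but slower queries; tune together with
    # the index's `lists` parameter.
    await conn.execute("SET ivfflat.probes = 10")


async def create_pool(dsn: str) -> asyncpg.Pool:
    # `init` runs on every new connection the pool opens.
    return await asyncpg.create_pool(dsn, min_size=2, max_size=10,
                                     init=init_connection)
```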

15. Docker & Deployment

  • Create optimized Dockerfile
  • Add health check endpoint
  • Configure database connection pooling
  • Add resource limits
  • Create initialization script for indexes

16. Testing

  • Create unit tests for each chunker
  • Create unit tests for search implementations
  • Create integration tests with test PostgreSQL
  • Test embedding generation with mocks
  • Test hybrid search ranking
  • Test project isolation (example test below)
  • Achieve >90% code coverage
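
The isolation test is the one that must never regress; a sketch assuming pytest-asyncio and a hypothetical `kb` fixture wired to a test PostgreSQL with pgvector:

```python
import uuid

import pytest


@pytest.mark.asyncio
async def test_search_never_leaks_across_projects(kb):
    """`kb` is an assumed fixture exposing the MCP tools against a test DB."""
    project_a, project_b = str(uuid.uuid4()), str(uuid.uuid4())
    await kb.ingest_document(project_id=project_a, collection="docs",
                             content="the launch code is swordfish")
    results = await kb.search(project_id=project_b, query="launch code")
    assert results == []  # project B must never see project A's chunks
```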

17. Documentation

  • Document all MCP tools with examples
  • Create ingestion guide
  • Document search parameters
  • Add performance tuning guide
  • Create troubleshooting guide

Technical Specifications

Database Schema

```sql
CREATE TABLE knowledge_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    collection VARCHAR(100) NOT NULL,
    source_path VARCHAR(500),
    source_type VARCHAR(50) NOT NULL,
    chunk_index INTEGER NOT NULL DEFAULT 0,
    chunk_hash VARCHAR(64) NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB DEFAULT '{}',
    token_count INTEGER NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(project_id, source_path, chunk_index)
);

CREATE INDEX idx_embeddings_project_collection ON knowledge_embeddings(project_id, collection);
CREATE INDEX idx_embeddings_source ON knowledge_embeddings(source_path);
CREATE INDEX idx_embeddings_vector ON knowledge_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_embeddings_content_fts ON knowledge_embeddings USING GIN (to_tsvector('english', content));
```

MCP Tools

```python
@mcp.tool()
async def ingest_document(
    project_id: str,
    collection: str,
    content: str,
    source_path: str | None = None,
    source_type: str = "text",
    metadata: dict | None = None,
) -> IngestResult:
    """Ingest a document into the knowledge base."""

@mcp.tool()
async def search(
    project_id: str,
    query: str,
    collection: str | None = None,
    limit: int = 10,
    threshold: float = 0.7,
    search_type: str = "hybrid",
) -> list[SearchResult]:
    """Search the knowledge base."""

@mcp.tool()
async def delete_document(
    project_id: str,
    source_path: str | None = None,
    collection: str | None = None,
) -> DeleteResult:
    """Delete documents from the knowledge base."""

@mcp.tool()
async def get_collections(
    project_id: str,
) -> list[CollectionInfo]:
    """List all collections for a project."""
```

Acceptance Criteria

  • pgvector schema created via migration
  • Code chunking works for Python and TypeScript
  • Markdown chunking preserves structure
  • Text chunking handles various formats
  • Semantic search returns relevant results
  • Keyword search works with full-text
  • Hybrid search combines both effectively
  • Per-project collection isolation enforced
  • All MCP tools documented and working
  • Unit tests >90% coverage
  • Integration tests with real pgvector
  • Docker container builds and runs
  • Documentation complete

Dependencies

  • Depends on: #55 (MCP Client Infrastructure), #56 (LLM Gateway - for embeddings)
  • Blocks: Phase 3 Agent Orchestration (knowledge retrieval)

Assignable To

backend-engineer agent

cardosofelipe added the database, mcp, phase-2 labels 2026-01-03 01:25:45 +00:00