# Knowledge Base MCP Server

An MCP server that provides retrieval-augmented generation (RAG) capabilities: semantic search backed by pgvector, intelligent chunking, and collection management.

## Features

- **Semantic Search**: pgvector cosine similarity with HNSW indexing
- **Keyword Search**: PostgreSQL full-text search
- **Hybrid Search**: Reciprocal Rank Fusion (RRF) over semantic and keyword results
- **Intelligent Chunking**: Code-aware, markdown-aware, and text chunking
- **Collection Management**: Per-project knowledge organization
- **Embedding Caching**: Redis-backed deduplication so identical content is not embedded twice

## Quick Start

```bash
# Install dependencies
uv sync

# Run tests
IS_TEST=True uv run pytest -v

# Start server
uv run python server.py
```

## Configuration

Environment variables:

```bash
KB_HOST=0.0.0.0
KB_PORT=8002
KB_DEBUG=false
KB_DATABASE_URL=postgresql://user:pass@localhost:5432/syndarix
KB_REDIS_URL=redis://localhost:6379/2
KB_LLM_GATEWAY_URL=http://localhost:8001
```
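
For illustration, the variables above could be read into a typed settings object as follows. This is a minimal sketch, not the server's actual configuration code: `Settings` and `load_settings` are hypothetical names, and the fallback values simply mirror the example values shown above, which may differ from the real defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the KB_* environment variables."""
    host: str
    port: int
    debug: bool
    database_url: str
    redis_url: str
    llm_gateway_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read KB_* variables; fallbacks mirror the README example (assumed)."""
    return Settings(
        host=env.get("KB_HOST", "0.0.0.0"),
        port=int(env.get("KB_PORT", "8002")),
        # Treat common truthy spellings as True; anything else is False.
        debug=env.get("KB_DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        database_url=env.get(
            "KB_DATABASE_URL", "postgresql://user:pass@localhost:5432/syndarix"
        ),
        redis_url=env.get("KB_REDIS_URL", "redis://localhost:6379/2"),
        llm_gateway_url=env.get("KB_LLM_GATEWAY_URL", "http://localhost:8001"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.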

## MCP Tools

### search_knowledge

Search the knowledge base.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456",
  "query": "authentication flow",
  "search_type": "hybrid",
  "collection": "code",
  "limit": 10,
  "threshold": 0.7,
  "file_types": ["python", "typescript"]
}
```

### ingest_content

Add content to the knowledge base.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456",
  "content": "def authenticate(user): ...",
  "source_path": "/src/auth.py",
  "collection": "code",
  "chunk_type": "code",
  "file_type": "python"
}
```

### delete_content

Remove content from the knowledge base.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456",
  "source_path": "/src/old_file.py"
}
```

### list_collections

List all collections in a project.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456"
}
```

### get_collection_stats

Get detailed collection statistics.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456",
  "collection": "code"
}
```

### update_document

Atomically replace the content stored for a document, identified by its `source_path`.

```json
{
  "project_id": "proj-123",
  "agent_id": "agent-456",
  "source_path": "/src/auth.py",
  "content": "def authenticate_v2(user): ...",
  "collection": "code",
  "chunk_type": "code",
  "file_type": "python"
}
```

## Chunking Strategies

### Code Chunking
- **Python**: AST-based (functions, classes, methods)
- **JavaScript/TypeScript**: Tree-sitter based
- **Go/Rust**: Tree-sitter based
- Target: ~500 tokens, 50 token overlap
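
The AST-based approach for Python can be sketched with the standard `ast` module. This is an illustrative sketch only: `chunk_python_source` is a hypothetical name, and the token target and overlap listed above are not applied here. It only shows how functions, classes, and methods can be located as chunk boundaries.

```python
import ast
from typing import List, Tuple


def chunk_python_source(source: str) -> List[Tuple[str, str]]:
    """Return (qualified_name, source_text) for each function, class, and method."""
    tree = ast.parse(source)
    chunks: List[Tuple[str, str]] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}{child.name}"
                # Exact source text of the definition (decorators are not included).
                segment = ast.get_source_segment(source, child)
                if segment is not None:
                    chunks.append((name, segment))
                # Descend into classes so methods also become their own chunks.
                if isinstance(child, ast.ClassDef):
                    visit(child, f"{name}.")

    visit(tree, "")
    return chunks
```

Note that in this sketch a class is emitted as one chunk and each of its methods as a further chunk, so method text appears twice; a real chunker would need to decide which granularity to keep.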

### Markdown Chunking
- Heading-hierarchy aware
- Preserves code blocks
- Target: ~800 tokens, 100 token overlap
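
A heading-aware split that leaves fenced code blocks intact might look like the sketch below. It is an assumption-laden illustration, not the server's chunker: `split_markdown_by_heading` is a hypothetical name, only ATX (`#`) headings are recognized, and the token target and overlap are not enforced.

```python
from typing import List, Tuple

FENCE = "`" * 3  # fenced code block marker


def split_markdown_by_heading(text: str) -> List[Tuple[List[str], str]]:
    """Split on ATX headings; return (heading_path, body) per section."""
    sections: List[Tuple[List[str], str]] = []
    path: List[str] = []   # current heading hierarchy, e.g. ["Guide", "Install"]
    buf: List[str] = []    # lines of the section being collected
    in_fence = False

    def flush() -> None:
        body = "\n".join(buf).strip()
        if body:
            sections.append((list(path), body))
        buf.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
            buf.append(line)
            continue
        if not in_fence and line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            # A heading needs 1-6 hashes followed by a space (or nothing).
            if 1 <= level <= 6 and line[level:level + 1] in (" ", ""):
                flush()
                del path[level - 1:]          # drop deeper or sibling headings
                path.append(line[level:].strip())
                continue
        buf.append(line)  # lines inside a fence are never treated as headings
    flush()
    return sections
```

Keeping the heading path with each section lets a chunk carry its context (for example "Guide > Install") into the index.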

### Text Chunking
- Sentence-based splitting
- Target: ~400 tokens, 50 token overlap
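
Sentence-based splitting with a size target and overlap can be sketched as follows. This is a rough illustration under stated assumptions: `chunk_sentences` is a hypothetical name, tokens are approximated by whitespace-separated words (the server's tokenizer is not specified here), and sentences are split with a simple punctuation regex.

```python
import re
from typing import List


def chunk_sentences(
    text: str, target_tokens: int = 400, overlap_tokens: int = 50
) -> List[str]:
    """Group sentences into chunks of roughly target_tokens with overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for sentence in sentences:
        n = len(sentence.split())  # word count as a stand-in for tokens
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap is taken in whole sentences here, so the carried text may be shorter than `overlap_tokens` when the trailing sentence is long.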

## Search Types

### Semantic Search
Uses pgvector cosine similarity with HNSW indexing for fast approximate nearest neighbor search.
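
To make the scoring concrete, here is a brute-force version of cosine-similarity search in plain Python. It is a sketch for intuition only: pgvector with an HNSW index performs an approximate search inside PostgreSQL instead of scanning every vector, and the names `cosine_similarity` and `brute_force_search` are hypothetical. Treating `threshold` as a minimum cosine similarity is also an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    limit: int = 10,
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Exact nearest-neighbour search: score everything, filter, sort, truncate."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])[:limit]
```

pgvector's `<=>` operator returns cosine distance, which is 1 minus the similarity computed here.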

### Keyword Search
Uses PostgreSQL full-text search with ts_rank scoring.

### Hybrid Search
Combines semantic and keyword results using Reciprocal Rank Fusion (RRF):
- Default weights: 70% semantic, 30% keyword
- Configurable via settings
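
A weighted form of RRF can be sketched as below. Several details are assumptions rather than confirmed behavior: the function name `weighted_rrf` is hypothetical, the smoothing constant `k = 60` is the value commonly used with RRF, and the weights are applied by scaling each list's reciprocal-rank term.

```python
from typing import Dict, List


def weighted_rrf(
    semantic: List[str],
    keyword: List[str],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60,
) -> List[str]:
    """Fuse two ranked ID lists: score(d) = sum of weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((semantic_weight, semantic), (keyword_weight, keyword)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF works on ranks rather than raw scores, the cosine similarities and `ts_rank` values never need to be normalized against each other.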

## Security

- Input validation for all IDs and paths
- Path traversal prevention
- Content size limits (default 10 MB)
- Per-project data isolation
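
The checks listed above could take roughly the following shape. This is a hedged sketch, not the server's validation code: the function names, the allowed ID character set, and the reading of the 10 MB default as `10 * 1024 * 1024` bytes are all assumptions.

```python
import posixpath
import re

# Assumed limit: one reading of the documented 10 MB default.
MAX_CONTENT_BYTES = 10 * 1024 * 1024
# Assumed ID format; the real server may accept a different character set.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,128}")


def validate_id(value: str) -> str:
    """Accept only short IDs made of letters, digits, '_' and '-'."""
    if not ID_PATTERN.fullmatch(value):
        raise ValueError(f"invalid id: {value!r}")
    return value


def validate_source_path(path: str) -> str:
    """Reject NUL bytes and any '..' segment, then normalize the path."""
    normalized = path.replace("\\", "/")
    if "\x00" in normalized or ".." in normalized.split("/"):
        raise ValueError(f"unsafe path: {path!r}")
    return posixpath.normpath(normalized)


def validate_content(content: str) -> str:
    """Enforce the size limit on the UTF-8 encoded content."""
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        raise ValueError("content exceeds size limit")
    return content
```

Rejecting any `..` segment outright is stricter than resolving the path first, which is a deliberate choice for a sketch: it avoids depending on a base directory that is not described here.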

## Testing

```bash
# Full test suite with coverage
IS_TEST=True uv run pytest -v --cov=. --cov-report=term-missing

# Specific test file
IS_TEST=True uv run pytest tests/test_server.py -v
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check with dependency status |
| `/mcp/tools` | GET | List available tools |
| `/mcp` | POST | JSON-RPC 2.0 tool execution |
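
A tool call against the `/mcp` endpoint can be sketched with the standard library. The JSON-RPC 2.0 envelope is standard, but the method name `tools/call` and the `name`/`arguments` parameter layout follow the general MCP convention and are assumed here rather than confirmed for this server. `build_tool_request` and `call_tool` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict


def build_tool_request(
    tool: str, arguments: Dict[str, Any], request_id: int = 1
) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request; the method name is an assumption."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def call_tool(base_url: str, tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """POST the request to {base_url}/mcp and return the decoded response."""
    body = json.dumps(build_tool_request(tool, arguments)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/mcp",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running on the configured port, a call such as `call_tool("http://localhost:8002", "list_collections", {"project_id": "proj-123", "agent_id": "agent-456"})` would send the `list_collections` payload shown earlier.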