# ADR-008: Knowledge Base and RAG Architecture

**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-006

---

## Context

Syndarix agents require access to project-specific knowledge bases for Retrieval-Augmented Generation (RAG). This enables agents to reference requirements, codebase context, documentation, and past decisions when performing tasks.

## Decision Drivers

- **Operational Simplicity:** Minimize infrastructure complexity
- **Performance:** Sub-100ms query latency
- **Isolation:** Per-project knowledge separation
- **Cost:** Avoid expensive dedicated vector databases
- **Flexibility:** Support multiple content types (code, docs, conversations)

## Considered Options

### Option 1: Dedicated Vector Database (Pinecone, Qdrant)

**Pros:**
- Purpose-built for vector search
- Excellent query performance at scale
- Managed offerings available

**Cons:**
- Additional infrastructure
- Cost at scale ($27-$70/month per 1M vectors)
- Data sync complexity with PostgreSQL

### Option 2: pgvector Extension (Selected)

**Pros:**
- Already using PostgreSQL (zero additional infrastructure)
- ACID transactions with application data
- Row-level security for multi-tenant isolation
- Handles 10-100M vectors effectively
- Hybrid search with PostgreSQL full-text

**Cons:**
- Less performant than dedicated solutions at billion-scale
- Requires PostgreSQL 15+

### Option 3: Weaviate (Self-hosted)

**Pros:**
- Multi-modal support
- Knowledge graph features

**Cons:**
- Additional service to manage
- Overkill for our scale

## Decision

**Adopt pgvector** as the vector store for RAG functionality.

Syndarix's per-project isolation means knowledge bases remain in the thousands to millions of vectors per tenant, well within pgvector's optimal range. The operational simplicity of using existing PostgreSQL infrastructure outweighs the performance benefits of dedicated vector databases.

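To put the "thousands to millions of vectors per tenant" range in concrete terms, the sketch below estimates raw vector storage. It relies on the pgvector documentation's figure of `4 * dimensions + 8` bytes per vector; the tenant sizes are hypothetical and the estimate excludes row and index overhead.

```python
def vector_storage_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for pgvector `vector` values, excluding row and index overhead.

    pgvector documents each vector as taking 4 * dimensions + 8 bytes.
    """
    return num_vectors * (4 * dimensions + 8)


# Hypothetical tenant sizes, chosen only to illustrate the arithmetic.
for count in (10_000, 1_000_000):
    gib = vector_storage_bytes(count, 1536) / 2**30
    print(f"{count:>9,} vectors x 1536 dims ~ {gib:.2f} GiB")
```

Under these assumptions, even a tenant with a million 1536-dimension vectors needs only a few GiB for the vectors themselves.
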
## Implementation

### Embedding Model Strategy

| Content Type | Embedding Model | Dimensions | Rationale |
|-------------|-----------------|------------|-----------|
| Code files | voyage-code-3 | 1024 | State-of-the-art for code retrieval |
| Documentation | text-embedding-3-small | 1536 | Good balance of cost and quality |
| Conversations | text-embedding-3-small | 1536 | General purpose |

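One way to keep the model choice and the expected dimension in a single place is a small lookup keyed by content type. This is a minimal sketch, not the project's implementation: `EmbeddingSpec`, `spec_for`, and `check_dimensions` are hypothetical names, and the keys reuse the `source_type` values (`'code'`, `'doc'`, `'conversation'`) from the database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


# Mirrors the embedding model table; keys are the schema's source_type values.
EMBEDDING_SPECS = {
    "code": EmbeddingSpec("voyage-code-3", 1024),
    "doc": EmbeddingSpec("text-embedding-3-small", 1536),
    "conversation": EmbeddingSpec("text-embedding-3-small", 1536),
}


def spec_for(source_type: str) -> EmbeddingSpec:
    """Return the embedding model and dimension for a content type."""
    try:
        return EMBEDDING_SPECS[source_type]
    except KeyError:
        raise ValueError(f"unknown source_type: {source_type!r}") from None


def check_dimensions(source_type: str, embedding: list[float]) -> None:
    """Reject vectors whose length does not match the configured model."""
    expected = spec_for(source_type).dimensions
    if len(embedding) != expected:
        raise ValueError(
            f"{source_type}: expected {expected} dimensions, got {len(embedding)}"
        )
```

Checking the length before insertion surfaces a model or configuration mix-up early, since vectors of different dimensions cannot be compared with each other.
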
### Chunking Strategy

| Content Type | Strategy | Chunk Size |
|--------------|----------|------------|
| Python/JS code | AST-based (function/class) | Per function |
| Markdown docs | Heading-based | Per section |
| PDF specs | Page-level + semantic | 1000 tokens |
| Conversations | Turn-based | Per exchange |

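The first two rows of the table can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the project's chunker: `chunk_python_source` and `chunk_markdown` are hypothetical names, only top-level Python definitions are chunked (decorator lines are not included in the extracted text), and the Markdown splitter does not recognize fenced code blocks, so a `# comment` line inside a fence would be treated as a heading.

```python
import ast
import re


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "content": ast.get_source_segment(source, node),
            })
    return chunks


# An ATX heading: one to six '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def chunk_markdown(text: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]
```

Each chunk keeps a unit that is meaningful on its own (a function, a class, a section), which tends to retrieve better than fixed-size windows that cut through definitions.
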
### Database Schema

```sql
-- pgvector must be enabled once per database.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE knowledge_chunks (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL REFERENCES projects(id),
    source_type VARCHAR(50) NOT NULL,  -- 'code', 'doc', 'conversation'
    source_path VARCHAR(500),
    content TEXT NOT NULL,
    -- 1536 matches text-embedding-3-small; the 1024-dimension voyage-code-3
    -- vectors listed above do not fit this column as declared.
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON knowledge_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
CREATE INDEX ON knowledge_chunks (project_id);
CREATE INDEX ON knowledge_chunks USING gin (metadata);
```

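Row-level security is named as the tenant isolation mechanism, so a rough sketch of how it could be wired to `knowledge_chunks` follows. Everything here is an assumption for illustration: the policy name `project_isolation`, the session setting `app.current_project_id`, and the helper `project_scope` are hypothetical. The SQL uses standard PostgreSQL features (`ENABLE`/`FORCE ROW LEVEL SECURITY`, `CREATE POLICY`, `current_setting`, `set_config`).

```python
import uuid

# Illustrative DDL, run once by a migration. FORCE makes the policy apply to
# the table owner as well; superusers and BYPASSRLS roles still bypass it.
RLS_STATEMENTS = [
    "ALTER TABLE knowledge_chunks ENABLE ROW LEVEL SECURITY",
    "ALTER TABLE knowledge_chunks FORCE ROW LEVEL SECURITY",
    """
    CREATE POLICY project_isolation ON knowledge_chunks
        USING (project_id = current_setting('app.current_project_id')::uuid)
    """,
]

# Run at the start of each transaction; the third argument (true) limits the
# setting to the current transaction.
SET_PROJECT_SQL = "SELECT set_config('app.current_project_id', $1, true)"


def project_scope(project_id: str) -> tuple[str, str]:
    """Return the statement and validated argument that scope a transaction.

    Raises ValueError if project_id is not a well-formed UUID.
    """
    return SET_PROJECT_SQL, str(uuid.UUID(project_id))
```

With a policy like this in place, queries that omit `WHERE project_id = ...` would still return only the current project's rows, which gives a second layer of protection behind the explicit filter.
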
### Hybrid Search

```python
async def hybrid_search(
    project_id: str,
    query: str,
    top_k: int = 10,
    vector_weight: float = 0.7
) -> list[Chunk]:
    """Combine vector similarity with keyword matching."""
    # `embed`, `db`, and `Chunk` are assumed to be provided by the application.
    query_embedding = await embed(query)

    results = await db.execute("""
        WITH vector_results AS (
            SELECT id, content, metadata,
                   1 - (embedding <=> $1) AS vector_score
            FROM knowledge_chunks
            WHERE project_id = $2
            ORDER BY embedding <=> $1
            LIMIT $3 * 2
        ),
        keyword_results AS (
            SELECT id, content, metadata,
                   ts_rank(to_tsvector(content), plainto_tsquery($4)) AS text_score
            FROM knowledge_chunks
            WHERE project_id = $2
              AND to_tsvector(content) @@ plainto_tsquery($4)
            ORDER BY text_score DESC  -- keep the best keyword matches
            LIMIT $3 * 2
        )
        -- Join on id only: each id appears at most once per CTE, so no
        -- DISTINCT ON is needed (it would also conflict with the ORDER BY).
        SELECT id,
               COALESCE(v.content, k.content) AS content,
               COALESCE(v.metadata, k.metadata) AS metadata,
               COALESCE(v.vector_score, 0) * $5 +
               COALESCE(k.text_score, 0) * (1 - $5) AS combined_score
        FROM vector_results v
        FULL OUTER JOIN keyword_results k USING (id)
        ORDER BY combined_score DESC
        LIMIT $3
    """, query_embedding, project_id, top_k, query, vector_weight)

    return results
```

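The ranking formula in the final `SELECT` can be checked by hand. The sketch below reproduces it in plain Python with made-up scores; `combined_score` and the candidate values are hypothetical and only illustrate how the default `vector_weight` of 0.7 orders results found by one or both searches.

```python
from typing import Optional


def combined_score(
    vector_score: Optional[float],
    text_score: Optional[float],
    vector_weight: float = 0.7,
) -> float:
    """Weighted sum used to rank merged results; a missing score counts as 0."""
    v = 0.0 if vector_score is None else vector_score
    t = 0.0 if text_score is None else text_score
    return v * vector_weight + t * (1 - vector_weight)


# Hypothetical candidates: (id, vector_score, text_score).
candidates = [
    ("a", 0.82, None),   # found only by vector search
    ("b", 0.60, 0.09),   # found by both searches
    ("c", None, 0.12),   # found only by keyword search
]
ranked = sorted(candidates, key=lambda c: combined_score(c[1], c[2]), reverse=True)
print([c[0] for c in ranked])
```

Note that `ts_rank` is not normalized to the same range as cosine similarity, so the two scores are not directly comparable; if keyword-only matches rank too low in practice, normalizing each score set before weighting is one option to evaluate.
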
## Consequences

### Positive
- Zero additional infrastructure
- Transactional consistency with application data
- Unified backup/restore
- Row-level security for tenant isolation

### Negative
- May need migration to dedicated vector DB if scaling beyond 100M vectors
- Index tuning required for optimal performance

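For the index tuning noted above, the pgvector README suggests starting points for IVFFlat: `lists = rows / 1000` for up to 1M rows and `sqrt(rows)` beyond that, with `probes = sqrt(lists)` at query time. The helpers below encode those heuristics; the function names are hypothetical, and the values are starting points to validate against measured recall and latency, not tuned settings.

```python
import math


def suggested_lists(row_count: int) -> int:
    """IVFFlat `lists` starting point: rows / 1000 up to 1M rows, sqrt(rows) beyond."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


def suggested_probes(lists: int) -> int:
    """IVFFlat `probes` starting point: sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))
```

By this heuristic, `lists = 100` in the schema corresponds to a table of roughly 100,000 rows, with a query-time setting of about `SET ivfflat.probes = 10`.
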
### Migration Path
If scale requires it, migrate to Qdrant (self-hosted, open source) and keep the same embedding models, so existing vectors can be carried over without re-embedding.

## Compliance

This decision aligns with:
- FR-103: Agent domain knowledge (RAG)
- NFR-501: Self-hostability requirement
- TC-001: PostgreSQL as primary database

---

*This ADR establishes the knowledge base and RAG architecture for Syndarix.*