4 Commits

Author SHA1 Message Date
Felipe Cardoso
6e3cdebbfb docs: add architecture decision records (ADRs) for key technical choices
- Added the following ADRs to `docs/adrs/` directory:
  - ADR-001: MCP Integration Architecture
  - ADR-002: Real-time Communication Architecture
  - ADR-003: Background Task Architecture
  - ADR-004: LLM Provider Abstraction
  - ADR-005: Technology Stack Selection
- Each ADR details the context, decision drivers, considered options, final decisions, and implementation plans.
- Documentation aligns technical choices with architecture principles and system requirements for Syndarix.
2025-12-29 13:16:02 +01:00
Felipe Cardoso
a6a336b66e docs: add spike findings for LLM abstraction, MCP integration, and real-time updates
- Added research findings and recommendations as separate SPIKE documents in `docs/spikes/`:
  - `SPIKE-005-llm-provider-abstraction.md`: Research on unified abstraction for LLM providers with failover, cost tracking, and caching strategies.
  - `SPIKE-001-mcp-integration-pattern.md`: Optimal pattern for integrating MCP with project/agent scoping and authentication strategies.
  - `SPIKE-003-realtime-updates.md`: Evaluation of SSE vs WebSocket for real-time updates, aligned with use-case needs.
- Focused on aligning implementation architectures with scalability, efficiency, and user needs.
- Documentation intended to inform upcoming ADRs.
2025-12-29 13:15:50 +01:00
Felipe Cardoso
9901dc7f51 docs: add Syndarix Requirements Document (v2.0)
- Created `SYNDARIX_REQUIREMENTS.md` in `docs/requirements/`.
- Document outlines Syndarix vision, objectives, functional/non-functional requirements, system architecture, user stories, and success metrics.
- Includes detailed descriptions of agent roles, workflows, autonomy levels, and configuration models.
- Approved by the Product Team, targeting enhanced transparency and structured development processes.
2025-12-29 13:14:53 +01:00
Felipe Cardoso
ac64d9505e chore: rebrand to Syndarix and set up initial structure
- Update README.md with Syndarix vision, features, and architecture
- Update CLAUDE.md with Syndarix-specific context
- Create documentation directory structure:
  - docs/requirements/ for requirements documents
  - docs/architecture/ for architecture documentation
  - docs/adrs/ for Architecture Decision Records
  - docs/spikes/ for spike research documents

Built on PragmaStack template.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-29 04:48:25 +01:00
18 changed files with 5717 additions and 606 deletions


@@ -1,8 +1,71 @@
# CLAUDE.md
Claude Code context for **Syndarix** - AI-Powered Software Consulting Agency.
**Built on PragmaStack.** See [AGENTS.md](./AGENTS.md) for base template context.
---
## Syndarix Project Context
### Vision
Syndarix is an autonomous platform that orchestrates specialized AI agents to deliver complete software solutions with minimal human intervention. It acts as a virtual consulting agency with AI agents playing roles like Product Owner, Architect, Engineers, QA, etc.
### Repository
- **URL:** https://gitea.pragmazest.com/cardosofelipe/syndarix
- **Issue Tracker:** Gitea Issues (primary)
- **CI/CD:** Gitea Actions
### Core Concepts
**Agent Types & Instances:**
- Agent Type = Template (base model, failover, expertise, personality)
- Agent Instance = Spawned from type, assigned to project
- Multiple instances of same type can work together
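The type/instance split above can be sketched as plain data structures — a minimal illustration, with all class, field, and instance names invented for this example:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentType:
    """Template an instance is spawned from (illustrative shape)."""
    name: str                                   # e.g. "software_developer"
    base_model: str                             # primary LLM
    failover_models: list[str] = field(default_factory=list)
    expertise: list[str] = field(default_factory=list)
    personality: str = ""

@dataclass
class AgentInstance:
    """Concrete agent spawned from a type and assigned to one project."""
    instance_name: str                          # e.g. "Dave"
    agent_type: AgentType
    project_id: str

dev_type = AgentType(
    name="software_developer",
    base_model="claude-opus",
    failover_models=["gpt-4o"],
    expertise=["python", "fastapi"],
)
# Multiple instances of the same type can work on one project together
dave = AgentInstance("Dave", dev_type, "proj-123")
ellis = AgentInstance("Ellis", dev_type, "proj-123")
```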
**Project Workflow:**
1. Requirements discovery with Product Owner agent
2. Architecture spike (PO + BA + Architect brainstorm)
3. Implementation planning and backlog creation
4. Autonomous sprint execution with checkpoints
5. Demo and client feedback
**Autonomy Levels:**
- `FULL_CONTROL`: Approve every action
- `MILESTONE`: Approve sprint boundaries
- `AUTONOMOUS`: Only major decisions
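A minimal sketch of how these levels could gate actions — the event names and the gating rules are assumptions for illustration, not part of the spec:

```python
from enum import Enum

class AutonomyLevel(str, Enum):
    FULL_CONTROL = "FULL_CONTROL"  # client approves every action
    MILESTONE = "MILESTONE"        # client approves at sprint boundaries
    AUTONOMOUS = "AUTONOMOUS"      # client approves only major decisions

def requires_approval(level: AutonomyLevel, event: str) -> bool:
    """Illustrative gate: decide which events pause for client sign-off."""
    if level is AutonomyLevel.FULL_CONTROL:
        return True
    if level is AutonomyLevel.MILESTONE:
        return event in {"sprint_start", "sprint_end", "major_decision"}
    return event == "major_decision"

# e.g. a MILESTONE-level project pauses at sprint boundaries but not per-commit
needs_signoff = requires_approval(AutonomyLevel.MILESTONE, "sprint_end")
```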
**MCP-First Architecture:**
All integrations via Model Context Protocol servers with explicit scoping:
```python
# All tools take project_id for scoping
search_knowledge(project_id="proj-123", query="auth flow")
create_issue(project_id="proj-123", title="Add login")
```
### Syndarix-Specific Directories
```
docs/
├── requirements/ # Requirements documents
├── architecture/ # Architecture documentation
├── adrs/ # Architecture Decision Records
└── spikes/ # Spike research documents
```
### Current Phase
**Architecture Spikes** - Validating key decisions before implementation.
### Key Extensions to Add (from PragmaStack base)
- Celery + Redis for agent job queue
- WebSocket/SSE for real-time updates
- pgvector for RAG knowledge base
- MCP server integration layer
---
## PragmaStack Development Guidelines
*The following guidelines are inherited from PragmaStack and remain applicable.*
## Claude Code-Specific Guidance

README.md

@@ -1,659 +1,175 @@
# Syndarix
> **Your AI-Powered Software Consulting Agency**
>
> An autonomous platform that orchestrates specialized AI agents to deliver complete software solutions with minimal human intervention.

[![Built on PragmaStack](https://img.shields.io/badge/Built_on-PragmaStack-blue)](https://gitea.pragmazest.com/cardosofelipe/fast-next-template)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
---
## Vision

Syndarix transforms the software development lifecycle by providing a **virtual consulting team** of AI agents that collaboratively plan, design, implement, test, and deliver complete software solutions.

**The Problem:** Even with AI coding assistants, developers spend as much time managing AI as doing the work themselves. Context switching, babysitting, and knowledge fragmentation limit productivity.

**The Solution:** A structured, autonomous agency where specialized AI agents handle different roles (Product Owner, Architect, Engineers, QA, etc.) with proper workflows, reviews, and quality gates.
---
## Key Features

### Multi-Agent Orchestration
- Configurable agent **types** with base model, failover, expertise, and personality
- Spawn multiple **instances** from the same type (e.g., Dave, Ellis, Kate as Software Developers)
- Agent-to-agent communication and collaboration
- Per-instance customization with domain-specific knowledge

### Complete SDLC Support
- **Requirements Discovery** → **Architecture Spike** → **Implementation Planning**
- **Sprint Management** with automated ceremonies
- **Issue Tracking** with Epic/Story/Task hierarchy
- **Git Integration** with proper branch/PR workflows
- **CI/CD Pipelines** with automated testing
### Configurable Autonomy
- From `FULL_CONTROL` (approve everything) to `AUTONOMOUS` (only major milestones)
- Client can intervene at any point
- Transparent progress visibility

### MCP-First Architecture
- All integrations via **Model Context Protocol (MCP)** servers
- Unified Knowledge Base with project/agent scoping
- Git providers (Gitea, GitHub, GitLab) via MCP
- Extensible through custom MCP tools

### Project Complexity Wizard
- **Script** → Minimal process, no repo needed
- **Simple** → Single sprint, basic backlog
- **Medium/Complex** → Full AGILE workflow with multiple sprints
---
## Technology Stack

Built on [PragmaStack](https://gitea.pragmazest.com/cardosofelipe/fast-next-template):

| Component | Technology |
|-----------|------------|
| Backend | FastAPI 0.115+ (Python 3.11+) |
| Frontend | Next.js 16 (React 19) |
| Database | PostgreSQL 15+ with pgvector |
| ORM | SQLAlchemy 2.0 |
| State Management | Zustand + TanStack Query |
| UI | shadcn/ui + Tailwind 4 |
| Auth | JWT dual-token + OAuth 2.0 |
| Testing | pytest + Jest + Playwright |

### Syndarix Extensions

| Component | Technology |
|-----------|------------|
| Task Queue | Celery + Redis |
| Real-time | FastAPI WebSocket / SSE |
| Vector DB | pgvector (PostgreSQL extension) |
| MCP SDK | Anthropic MCP SDK |
---
## Project Status

**Phase:** Architecture & Planning

See [docs/requirements/](./docs/requirements/) for the comprehensive requirements document.
### Current Milestones
- [x] Fork PragmaStack as foundation
- [x] Create requirements document
- [ ] Execute architecture spikes
- [ ] Create ADRs for key decisions
- [ ] Begin MVP implementation
---
## Documentation
- [Requirements Document](./docs/requirements/SYNDARIX_REQUIREMENTS.md)
- [Architecture Decisions](./docs/adrs/) (coming soon)
- [Spike Research](./docs/spikes/) (coming soon)
- [Architecture Overview](./docs/architecture/) (coming soon)
---
## Getting Started
### Prerequisites
- Docker & Docker Compose
- Node.js 20+
- Python 3.11+
- PostgreSQL 15+ (or use Docker)
### Quick Start
```bash
# Clone the repository
git clone https://gitea.pragmazest.com/cardosofelipe/syndarix.git
cd syndarix

# Copy environment template
cp .env.template .env

# Start development environment
docker-compose -f docker-compose.dev.yml up -d

# Run database migrations
make migrate

# Start the development servers
make dev
```
---
## Architecture Overview

```
+====================================================================+
|                           SYNDARIX CORE                            |
+====================================================================+
|  +-------------------+  +-----------------+  +-----------------+   |
|  | Agent Orchestrator|  | Project Manager |  | Workflow Engine |   |
|  +-------------------+  +-----------------+  +-----------------+   |
+====================================================================+
                                  |
                                  v
+====================================================================+
|                      MCP ORCHESTRATION LAYER                       |
|   All integrations via unified MCP servers with project scoping    |
+====================================================================+
                                  |
       +-------------+-----------+-----------+-------------+
       |             |           |           |             |
  +----v----+  +----v----+  +----v----+  +----v----+  +----v----+
  |   LLM   |  |   Git   |  |Knowledge|  |  File   |  |  Code   |
  |Providers|  |   MCP   |  |Base MCP |  |Sys. MCP |  |Analysis |
  +---------+  +---------+  +---------+  +---------+  +---------+
```
---
## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.

---
## License

MIT License - see [LICENSE](./LICENSE) for details.

---
## Acknowledgments

- Built on [PragmaStack](https://gitea.pragmazest.com/cardosofelipe/fast-next-template)
- Powered by Claude and the Anthropic API

docs/adrs/.gitkeep Normal file

@@ -0,0 +1,134 @@
# ADR-001: MCP Integration Architecture
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-001
---
## Context
Syndarix requires integration with multiple external services (LLM providers, Git, issue trackers, file systems, CI/CD). The Model Context Protocol (MCP) was identified as the standard for tool integration in AI applications. We need to decide on:
1. The MCP framework to use
2. Server deployment pattern (singleton vs per-project)
3. Scoping mechanism for multi-project/multi-agent access
## Decision Drivers
- **Simplicity:** Minimize operational complexity
- **Resource Efficiency:** Avoid spawning redundant processes
- **Consistency:** Unified interface across all integrations
- **Scalability:** Support 10+ concurrent projects
- **Maintainability:** Easy to add new MCP servers
## Considered Options
### Option 1: Per-Project MCP Servers
Spawn dedicated MCP server instances for each project.
**Pros:**
- Complete isolation between projects
- Simple access control (project owns server)
**Cons:**
- Resource heavy (7 servers × N projects)
- Complex orchestration
- Difficult to share cross-project resources
### Option 2: Unified Singleton MCP Servers (Selected)
Single instance of each MCP server type, with explicit project/agent scoping.
**Pros:**
- Resource efficient (7 total servers)
- Simpler deployment
- Enables cross-project learning (if desired)
- Consistent management
**Cons:**
- Requires explicit scoping in all tools
- Shared state requires careful design
### Option 3: Hybrid (MCP Proxy)
Single proxy that routes to per-project backends.
**Pros:**
- Balance of isolation and efficiency
**Cons:**
- Added complexity
- Routing overhead
## Decision
**Adopt Option 2: Unified Singleton MCP Servers with explicit scoping.**
All MCP servers are deployed as singletons. Every tool accepts `project_id` and `agent_id` parameters for:
- Access control validation
- Audit logging
- Context filtering
## Implementation
### MCP Server Registry
| Server | Port | Purpose |
|--------|------|---------|
| LLM Gateway | 9001 | Route LLM requests with failover |
| Git MCP | 9002 | Git operations across providers |
| Knowledge Base MCP | 9003 | RAG and document search |
| Issues MCP | 9004 | Issue tracking operations |
| File System MCP | 9005 | Workspace file operations |
| Code Analysis MCP | 9006 | Static analysis, linting |
| CI/CD MCP | 9007 | Pipeline operations |
### Framework Selection
Use **FastMCP 2.0** for all MCP server implementations:
- Decorator-based tool registration
- Built-in async support
- Compatible with SSE transport
- Type-safe with Pydantic
### Tool Signature Pattern
```python
@mcp.tool()
def tool_name(
    project_id: str,  # Required: project scope
    agent_id: str,    # Required: calling agent
    # ... tool-specific params
) -> Result:
    validate_access(agent_id, project_id)
    log_tool_usage(agent_id, project_id, "tool_name")
    # ... implementation
```
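The `validate_access` and `log_tool_usage` helpers referenced above are not defined in this ADR; a minimal in-memory sketch (the registry shape, exception name, and audit sink are all hypothetical — a real server would back these with the database) could look like:

```python
# Hypothetical in-memory access registry; production would query the database.
ACCESS_GRANTS: dict[str, set[str]] = {}  # agent_id -> set of allowed project_ids


class AccessDenied(Exception):
    """Raised when an agent is not scoped to the requested project."""


def validate_access(agent_id: str, project_id: str) -> None:
    # Reject any call whose agent has not been granted access to the project.
    if project_id not in ACCESS_GRANTS.get(agent_id, set()):
        raise AccessDenied(f"{agent_id} has no access to {project_id}")


def log_tool_usage(agent_id: str, project_id: str, tool_name: str) -> None:
    # Central audit entry (stdout here; a real server would persist it).
    print(f"audit: {agent_id} used {tool_name} in {project_id}")
```

Because every tool calls these first, access control and audit logging stay centralized even though the servers themselves are shared singletons.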
## Consequences
### Positive
- Single deployment per MCP type simplifies operations
- Consistent interface across all tools
- Easy to add monitoring/logging centrally
- Cross-project analytics possible
### Negative
- All tools must include scoping parameters
- Shared state requires careful design
- Single point of failure per MCP type (mitigated by multiple instances)
### Neutral
- Requires MCP client manager in FastAPI backend
- Authentication handled internally (service tokens for v1)
## Compliance
This decision aligns with:
- FR-802: MCP-first architecture requirement
- NFR-201: Horizontal scalability requirement
- NFR-602: Centralized logging requirement
---
*This ADR supersedes any previous decisions regarding MCP architecture.*

# ADR-002: Real-time Communication Architecture
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-003
---
## Context
Syndarix requires real-time communication for:
- Agent activity streams
- Project progress updates
- Build/pipeline status
- Client approval requests
- Issue change notifications
- Interactive chat with agents
We need to decide between WebSocket and Server-Sent Events (SSE) for real-time data delivery.
## Decision Drivers
- **Simplicity:** Minimize implementation complexity
- **Reliability:** Built-in reconnection handling
- **Scalability:** Support 200+ concurrent connections
- **Compatibility:** Work through proxies and load balancers
- **Use Case Fit:** Match communication patterns
## Considered Options
### Option 1: WebSocket Only
Use WebSocket for all real-time communication.
**Pros:**
- Bidirectional communication
- Single protocol to manage
- Well-supported in FastAPI
**Cons:**
- Manual reconnection logic required
- More complex through proxies
- Overkill for server-to-client streams
### Option 2: SSE Only
Use Server-Sent Events for all real-time communication.
**Pros:**
- Built-in automatic reconnection
- Native HTTP (proxy-friendly)
- Simpler implementation
**Cons:**
- Unidirectional only
- Browser connection limits per domain
### Option 3: SSE Primary + WebSocket for Chat (Selected)
Use SSE for server-to-client events, WebSocket for bidirectional chat.
**Pros:**
- Best tool for each use case
- SSE simplicity for 90% of needs
- WebSocket only where truly needed
**Cons:**
- Two protocols to manage
## Decision
**Adopt Option 3: SSE as primary transport, WebSocket for interactive chat.**
### SSE Use Cases (90%)
- Agent activity streams
- Project progress updates
- Build/pipeline status
- Approval request notifications
- Issue change notifications
### WebSocket Use Cases (10%)
- Interactive chat with agents
- Real-time debugging sessions
- Future collaboration features
## Implementation
### Event Bus with Redis Pub/Sub
```
FastAPI Backend ──publish──> Redis Pub/Sub ──subscribe──> SSE Endpoints
                                  └─────────subscribe──> Other Backend Instances
```
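The bus above fans events out to every subscribed backend instance. As a runnable illustration of the pattern, here is an in-process stand-in with the same `publish`/`subscribe` interface (the production bus would call `redis.publish` and `pubsub.subscribe` instead; the class and method names mirror the sketch above but are otherwise assumptions):

```python
import asyncio
import json
from collections import defaultdict


class EventBus:
    """In-process stand-in for the Redis Pub/Sub event bus (same interface).

    Each channel fans out to local asyncio queues, so the publish/subscribe
    flow is runnable as-is without a Redis server.
    """

    def __init__(self) -> None:
        self._channels: dict[str, list[asyncio.Queue]] = defaultdict(list)

    async def publish(self, channel: str, event: dict) -> None:
        payload = json.dumps(event)
        for queue in self._channels[channel]:
            await queue.put(payload)

    async def subscribe(self, channel: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._channels[channel].append(queue)
        return queue


async def demo() -> dict:
    bus = EventBus()
    sub = await bus.subscribe("project:p1")
    await bus.publish("project:p1", {"type": "agent_started", "agent": "Dave"})
    return json.loads(await sub.get())
```

Swapping the queue internals for `redis.asyncio` pub/sub calls gives the cross-instance behavior shown in the diagram without changing the SSE endpoints.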
### SSE Endpoint Pattern
```python
import asyncio

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()


@router.get("/projects/{project_id}/events")
async def project_events(project_id: str, request: Request):
    async def event_generator():
        subscriber = await event_bus.subscribe(f"project:{project_id}")
        try:
            while not await request.is_disconnected():
                try:
                    event = await asyncio.wait_for(
                        subscriber.get_event(), timeout=30.0
                    )
                except asyncio.TimeoutError:
                    # No event within 30s: emit an SSE comment as a keep-alive
                    # instead of letting the timeout kill the stream.
                    yield ": keep-alive\n\n"
                    continue
                yield f"event: {event.type}\ndata: {event.json()}\n\n"
        finally:
            await subscriber.unsubscribe()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )
```
### Event Types
| Category | Event Types |
|----------|-------------|
| Agent | `agent_started`, `agent_activity`, `agent_completed`, `agent_error` |
| Project | `issue_created`, `issue_updated`, `issue_closed` |
| Git | `branch_created`, `commit_pushed`, `pr_created`, `pr_merged` |
| Workflow | `approval_required`, `sprint_started`, `sprint_completed` |
| Pipeline | `pipeline_started`, `pipeline_completed`, `pipeline_failed` |
### Client Implementation
- Single SSE connection per project
- Event multiplexing through event types
- Exponential backoff on reconnection
- Native `EventSource` API with automatic reconnect
## Consequences
### Positive
- Simpler implementation for server-to-client streams
- Automatic reconnection reduces client complexity
- Works through all HTTP proxies
- Reduced server resource usage vs WebSocket
### Negative
- Two protocols to maintain
- WebSocket requires manual reconnect logic
- SSE limited to ~6 connections per domain (HTTP/1.1)
### Mitigation
- Use HTTP/2 where possible (higher connection limits)
- Multiplex all project events on single connection
- WebSocket only for interactive chat sessions
## Compliance
This decision aligns with:
- FR-105: Real-time agent activity monitoring
- NFR-102: 200+ concurrent connections requirement
- NFR-501: Responsive UI updates
---
*This ADR supersedes any previous decisions regarding real-time communication.*

# ADR-003: Background Task Architecture
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-004
---
## Context
Syndarix requires background task processing for:
- Agent actions (LLM calls, code generation)
- Git operations (clone, commit, push, PR creation)
- External synchronization (issue sync with Gitea/GitHub/GitLab)
- CI/CD pipeline triggers
- Long-running workflows (sprints, story implementation)
These tasks are too slow for synchronous API responses and need proper queuing, retry, and monitoring.
## Decision Drivers
- **Reliability:** Tasks must complete even if workers restart
- **Visibility:** Progress tracking for long-running operations
- **Scalability:** Handle concurrent agent operations
- **Rate Limiting:** Respect LLM API rate limits
- **Async Compatibility:** Work with async FastAPI
## Considered Options
### Option 1: FastAPI BackgroundTasks
Use FastAPI's built-in background tasks.
**Pros:**
- Simple, no additional infrastructure
- Direct async integration
**Cons:**
- No persistence (lost on restart)
- No retry mechanism
- No distributed workers
### Option 2: Celery + Redis (Selected)
Use Celery for task queue with Redis as broker/backend.
**Pros:**
- Mature, battle-tested
- Persistent task queue
- Built-in retry with backoff
- Distributed workers
- Task chaining and workflows
- Monitoring with Flower
**Cons:**
- Additional infrastructure
- Sync-only task execution (bridge needed for async)
### Option 3: Dramatiq + Redis
Use Dramatiq as a simpler Celery alternative.
**Pros:**
- Simpler API than Celery
- Good async support
**Cons:**
- Less mature ecosystem
- Fewer monitoring tools
### Option 4: ARQ (Async Redis Queue)
Use ARQ for native async task processing.
**Pros:**
- Native async
- Simple API
**Cons:**
- Less feature-rich
- Smaller community
## Decision
**Adopt Option 2: Celery + Redis.**
Celery provides the reliability, monitoring, and ecosystem maturity needed for production workloads. Redis serves as both broker and result backend.
## Implementation
### Queue Architecture
```
┌─────────────────────────────────────────────────┐
│            Redis (Broker + Backend)             │
├─────────────┬─────────────┬─────────────────────┤
│ agent_queue │  git_queue  │      sync_queue     │
│ (prefetch=1)│ (prefetch=4)│     (prefetch=4)    │
└──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │                 │
       ▼             ▼                 ▼
  ┌─────────┐   ┌─────────┐       ┌─────────┐
  │  Agent  │   │   Git   │       │  Sync   │
  │ Workers │   │ Workers │       │ Workers │
  └─────────┘   └─────────┘       └─────────┘
```
### Queue Configuration
| Queue | Prefetch | Concurrency | Purpose |
|-------|----------|-------------|---------|
| `agent_queue` | 1 | 4 | LLM-based tasks (rate limited) |
| `git_queue` | 4 | 8 | Git operations |
| `sync_queue` | 4 | 4 | External sync |
| `cicd_queue` | 4 | 4 | Pipeline operations |
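The queue layout above can be expressed in Celery configuration. This is a sketch under stated assumptions: the app name, broker URLs, and task module paths are hypothetical, and since prefetch is a per-worker setting in Celery, the per-queue prefetch values are applied by running a dedicated worker pool per queue:

```python
from celery import Celery
from kombu import Queue

# Hypothetical app name and Redis URLs; real values come from configuration.
celery_app = Celery(
    "syndarix",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

# Declare one queue per workload class.
celery_app.conf.task_queues = (
    Queue("agent_queue"),
    Queue("git_queue"),
    Queue("sync_queue"),
    Queue("cicd_queue"),
)

# Route tasks to queues by module path (paths hypothetical).
celery_app.conf.task_routes = {
    "tasks.agents.*": {"queue": "agent_queue"},
    "tasks.git.*": {"queue": "git_queue"},
    "tasks.sync.*": {"queue": "sync_queue"},
    "tasks.cicd.*": {"queue": "cicd_queue"},
}

# Prefetch/concurrency are worker-level settings, so each queue gets its own
# worker pool, matching the table above, e.g.:
#   celery -A app worker -Q agent_queue --prefetch-multiplier=1 --concurrency=4
#   celery -A app worker -Q git_queue   --prefetch-multiplier=4 --concurrency=8
```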
### Task Patterns
**Progress Reporting:**
```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    for i, step in enumerate(steps):
        self.update_state(
            state="PROGRESS",
            meta={"current": i + 1, "total": len(steps)},
        )
        # Publish SSE event for real-time UI update
        event_bus.publish(f"project:{project_id}", {
            "type": "agent_progress",
            "step": i + 1,
            "total": len(steps),
        })
        execute_step(step)
```
**Task Chaining:**
```python
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s(),
)
```
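The automatic retry with exponential backoff mentioned in the decision drivers is declared per task. A sketch using Celery's `autoretry_for`/`retry_backoff` options (the app bootstrap, task name, and exception choices here are illustrative assumptions, not part of this ADR):

```python
from celery import Celery

celery_app = Celery("syndarix", broker="redis://localhost:6379/0")


@celery_app.task(
    bind=True,
    autoretry_for=(ConnectionError, TimeoutError),
    retry_backoff=True,      # exponential backoff: 1s, 2s, 4s, ...
    retry_backoff_max=300,   # never wait more than 5 minutes between tries
    retry_jitter=True,       # randomize delays to avoid thundering herds
    max_retries=5,
)
def push_branch(self, project_id: str, branch: str) -> None:
    """Git push that retries transient network failures automatically."""
    ...  # actual git work elided
```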
### Monitoring
- **Flower:** Web UI for task monitoring (port 5555)
- **Prometheus:** Metrics export for alerting
- **Dead Letter Queue:** Failed tasks for investigation
## Consequences
### Positive
- Reliable task execution with persistence
- Automatic retry with exponential backoff
- Progress tracking for long operations
- Distributed workers for scalability
- Rich monitoring and debugging tools
### Negative
- Additional infrastructure (Redis, workers)
- Celery task bodies run synchronously (an event-loop bridge is needed for async calls)
- Learning curve for task patterns
### Mitigation
- Use existing Redis instance (already needed for SSE)
- Wrap async calls with `asyncio.run()` or `sync_to_async`
- Document common task patterns
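The `asyncio.run()` bridge mentioned above looks like this inside a task body. A minimal runnable sketch — the coroutine stands in for an async call such as `event_bus.publish`, and the function names are hypothetical:

```python
import asyncio


async def _notify_event_bus(project_id: str, payload: dict) -> str:
    # Stand-in for an async call (e.g. event_bus.publish) awaited from a task.
    await asyncio.sleep(0)
    return f"published to project:{project_id}"


def sync_task_body(project_id: str) -> str:
    """Pattern for calling async code from a synchronous Celery task body.

    Celery workers execute tasks as plain sync functions, so we enter the
    event loop explicitly for each async call.
    """
    return asyncio.run(_notify_event_bus(project_id, {"type": "demo"}))
```

Each `asyncio.run()` spins up and tears down a fresh event loop, which is acceptable for coarse-grained tasks; long-lived loops per worker are an optimization if the overhead ever matters.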
## Compliance
This decision aligns with:
- FR-304: Long-running implementation workflow
- NFR-102: 500+ background jobs per minute
- NFR-402: Task reliability and fault tolerance
---
*This ADR supersedes any previous decisions regarding background task processing.*

# ADR-004: LLM Provider Abstraction
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-005
---
## Context
Syndarix agents require access to large language models (LLMs) from multiple providers:
- **Anthropic** (Claude) - Primary provider
- **OpenAI** (GPT-4) - Fallback provider
- **Local models** (Ollama/Llama) - Cost optimization, privacy
We need a unified abstraction layer that provides:
- Consistent API across providers
- Automatic failover on errors
- Usage tracking and cost management
- Rate limiting compliance
## Decision Drivers
- **Reliability:** Automatic failover on provider outages
- **Cost Control:** Track and limit API spending
- **Flexibility:** Easy to add/swap providers
- **Consistency:** Single interface for all agents
- **Async Support:** Compatible with async FastAPI
## Considered Options
### Option 1: Direct Provider SDKs
Use Anthropic and OpenAI SDKs directly with custom abstraction.
**Pros:**
- Full control over implementation
- No external dependencies
**Cons:**
- Significant development effort
- Must maintain failover logic
- Must track token costs manually
### Option 2: LiteLLM (Selected)
Use LiteLLM as unified abstraction layer.
**Pros:**
- Unified API for 100+ providers
- Built-in failover and routing
- Automatic token counting
- Cost tracking built-in
- Redis caching support
- Active community
**Cons:**
- External dependency
- May lag behind provider SDK updates
### Option 3: LangChain
Use LangChain's LLM abstraction.
**Pros:**
- Large ecosystem
- Many integrations
**Cons:**
- Heavy dependency
- Overkill for just LLM abstraction
- Complexity overhead
## Decision
**Adopt Option 2: LiteLLM for unified LLM provider abstraction.**
LiteLLM provides the reliability, monitoring, and multi-provider support needed with minimal overhead.
## Implementation
### Model Groups
| Group Name | Use Case | Primary Model | Fallback |
|------------|----------|---------------|----------|
| `high-reasoning` | Complex analysis, architecture | Claude 3.5 Sonnet | GPT-4 Turbo |
| `fast-response` | Quick tasks, simple queries | Claude 3 Haiku | GPT-4o Mini |
| `cost-optimized` | High-volume, non-critical | Local Llama 3 | Claude 3 Haiku |
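In LiteLLM, a model group is expressed as multiple `model_list` entries sharing one `model_name`; the Router then load-balances and fails over within the group. A sketch of the groups above (provider model identifiers and the Ollama base URL are assumptions to be replaced by real configuration):

```python
# Hypothetical LiteLLM model_list backing the groups above.
# Entries sharing a model_name form one routable group.
model_list = [
    {
        "model_name": "high-reasoning",
        "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20241022"},
    },
    {
        "model_name": "high-reasoning",  # fallback deployment in same group
        "litellm_params": {"model": "openai/gpt-4-turbo"},
    },
    {
        "model_name": "fast-response",
        "litellm_params": {"model": "anthropic/claude-3-haiku-20240307"},
    },
    {
        "model_name": "cost-optimized",
        "litellm_params": {
            "model": "ollama/llama3",
            "api_base": "http://localhost:11434",
        },
    },
]
```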
### Failover Chain
```
Claude 3.5 Sonnet (Anthropic)
▼ (on failure)
GPT-4 Turbo (OpenAI)
▼ (on failure)
Llama 3 (Ollama/Local)
▼ (on failure)
Error with retry
```
### LLM Gateway Service
```python
from litellm import Router


class LLMGateway:
    def __init__(self):
        self.router = Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
    ) -> dict:
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
        )
        await self._track_usage(agent_id, project_id, response)
        return response
```
### Cost Tracking
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Ollama (local) | $0.00 | $0.00 |
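Per-request cost follows directly from the table: tokens times the per-million rate. A small sketch of that arithmetic (the price dictionary keys are illustrative; the gateway would read rates from configuration or LiteLLM's own cost map):

```python
# Per-million-token prices from the table above (USD): (input, output).
PRICES: dict[str, tuple[float, float]] = {
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o-mini": (0.15, 0.60),
    "ollama-local": (0.00, 0.00),
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-1M-token rates."""
    input_rate, output_rate = PRICES[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000
```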
### Agent Type Mapping
| Agent Type | Model Preference | Rationale |
|------------|------------------|-----------|
| Product Owner | high-reasoning | Complex requirements analysis |
| Software Architect | high-reasoning | Architecture decisions |
| Software Engineer | high-reasoning | Code generation |
| QA Engineer | fast-response | Test case generation |
| DevOps Engineer | fast-response | Config generation |
| Project Manager | fast-response | Status updates |
### Caching Strategy
- **Redis-backed cache** for repeated queries
- **TTL:** 1 hour for general queries
- **Skip cache:** For context-dependent generation
- **Cache key:** Hash of (model, messages, temperature)
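The cache key described above can be sketched as a deterministic hash over the request's identity (function name is hypothetical; `sort_keys` keeps logically equal message dicts hashing identically regardless of key order):

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Deterministic cache key: SHA-256 of (model, messages, temperature)."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,  # identical key for logically equal message dicts
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any parameter that changes generation output (temperature here; top_p, tools, etc. in practice) must be part of the key, or the cache will serve stale completions across distinct requests.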
## Consequences
### Positive
- Single interface for all LLM operations
- Automatic failover improves reliability
- Built-in cost tracking and budgeting
- Easy to add new providers
- Caching reduces API costs
### Negative
- Dependency on LiteLLM library
- May lag behind provider SDK features
- Additional abstraction layer
### Mitigation
- Pin LiteLLM version, test before upgrades
- Direct SDK access available if needed
- Monitor LiteLLM updates for breaking changes
## Compliance
This decision aligns with:
- FR-101: Agent type model configuration
- NFR-103: Agent response time targets
- NFR-402: Failover requirements
- TR-001: LLM API unavailability mitigation
---
*This ADR supersedes any previous decisions regarding LLM integration.*

# ADR-005: Technology Stack Selection
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
---
## Context
Syndarix needs a robust, modern technology stack that can support:
- Multi-agent orchestration with real-time communication
- Full-stack web application with API backend
- Background task processing for long-running operations
- Vector search for RAG (Retrieval-Augmented Generation)
- Multiple external integrations via MCP
The decision was made to build upon **PragmaStack** as the foundation, extending it with Syndarix-specific components.
## Decision Drivers
- **Productivity:** Rapid development with modern frameworks
- **Type Safety:** Minimize runtime errors
- **Async Performance:** Handle concurrent agent operations
- **Ecosystem:** Rich library support
- **Familiarity:** Team expertise with selected technologies
- **Production-Ready:** Proven technologies for production workloads
## Decision
**Adopt PragmaStack as foundation with Syndarix-specific extensions.**
### Core Stack (from PragmaStack)
| Layer | Technology | Version | Rationale |
|-------|------------|---------|-----------|
| **Backend** | FastAPI | 0.115+ | Async, OpenAPI, type hints |
| **Backend Language** | Python | 3.11+ | Type hints, async/await, ecosystem |
| **Frontend** | Next.js | 16 | React 19, server components, App Router |
| **Frontend Language** | TypeScript | 5.0+ | Type safety, IDE support |
| **Database** | PostgreSQL | 15+ | Robust, extensible, pgvector |
| **ORM** | SQLAlchemy | 2.0+ | Async support, type hints |
| **Validation** | Pydantic | 2.0+ | Data validation, serialization |
| **State Management** | Zustand | 4.0+ | Simple, performant |
| **Data Fetching** | TanStack Query | 5.0+ | Caching, invalidation |
| **UI Components** | shadcn/ui | Latest | Accessible, customizable |
| **CSS** | Tailwind CSS | 4.0+ | Utility-first, fast styling |
| **Auth** | JWT | - | Dual-token (access + refresh) |
### Syndarix Extensions
| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Task Queue** | Celery | 5.3+ | Background job processing |
| **Message Broker** | Redis | 7.0+ | Celery broker, caching, pub/sub |
| **Vector Store** | pgvector | Latest | Embeddings for RAG |
| **MCP Framework** | FastMCP | 2.0+ | MCP server development |
| **LLM Abstraction** | LiteLLM | Latest | Multi-provider LLM access |
| **Real-time** | SSE + WebSocket | - | Event streaming, chat |
### Testing Stack
| Type | Technology | Version | Purpose |
|------|------------|---------|---------|
| **Backend Unit** | pytest | 8.0+ | Python testing |
| **Backend Async** | pytest-asyncio | - | Async test support |
| **Backend Coverage** | coverage.py | - | Code coverage |
| **Frontend Unit** | Jest | 29+ | React testing |
| **Frontend Components** | React Testing Library | - | Component testing |
| **E2E** | Playwright | 1.40+ | Browser automation |
### DevOps Stack
| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| **Containerization** | Docker | 24+ | Application packaging |
| **Orchestration** | Docker Compose | - | Local development |
| **CI/CD** | Gitea Actions | - | Automated pipelines |
| **Database Migrations** | Alembic | - | Schema versioning |
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                      Frontend (Next.js 16)                      │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│    │    Pages    │    │ Components  │    │   Stores    │        │
│    │ (App Router)│    │ (shadcn/ui) │    │  (Zustand)  │        │
│    └─────────────┘    └─────────────┘    └─────────────┘        │
└────────────────────────────┬────────────────────────────────────┘
                             │ REST + SSE + WebSocket
┌────────────────────────────┴────────────────────────────────────┐
│                    Backend (FastAPI 0.115+)                     │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│    │     API     │    │  Services   │    │    CRUD     │        │
│    │   Routes    │    │    Layer    │    │    Layer    │        │
│    └─────────────┘    └─────────────┘    └─────────────┘        │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│    │ LLM Gateway │    │ MCP Client  │    │  Event Bus  │        │
│    │  (LiteLLM)  │    │   Manager   │    │   (Redis)   │        │
│    └─────────────┘    └─────────────┘    └─────────────┘        │
└────────────────────────────┬────────────────────────────────────┘
        ┌────────────────────┼─────────────────────────┐
        ▼                    ▼                         ▼
┌───────────────┐    ┌───────────────┐   ┌───────────────────────────┐
│  PostgreSQL   │    │     Redis     │   │        MCP Servers        │
│  + pgvector   │    │ (Cache/Queue) │   │ (LLM, Git, KB, Issues...) │
└───────────────┘    └───────────────┘   └───────────────────────────┘
                     ┌───────────────┐
                     │    Celery     │
                     │    Workers    │
                     └───────────────┘
```
## Consequences
### Positive
- Proven, production-ready stack
- Strong typing throughout (Python + TypeScript)
- Excellent async performance
- Rich ecosystem for extensions
- Team familiarity reduces learning curve
### Negative
- Python GIL limits CPU-bound concurrency (mitigated by Celery)
- Multiple languages (Python + TypeScript) to maintain
- PostgreSQL requires management (vs serverless options)
### Neutral
- PragmaStack provides solid foundation but may include unused features
- Stack is opinionated, limiting some technology choices
## Version Pinning Strategy
| Component | Strategy | Rationale |
|-----------|----------|-----------|
| Python | 3.11+ (specific minor) | Stability |
| Node.js | 20 LTS | Long-term support |
| FastAPI | 0.115+ | Latest stable |
| Next.js | 16 | Current major |
| PostgreSQL | 15+ | Required for features |
## Compliance
This decision aligns with:
- NFR-601: Code quality standards (TypeScript, type hints)
- NFR-603: Docker containerization requirement
- TC-001 through TC-006: Technical constraints
---
*This ADR establishes the foundational technology choices for Syndarix.*

# ADR-006: Agent Orchestration Architecture
**Status:** Accepted
**Date:** 2025-12-29
**Deciders:** Architecture Team
**Related Spikes:** SPIKE-002
---
## Context
Syndarix requires an agent orchestration system that can:
- Define reusable agent types with specific capabilities
- Spawn multiple instances of the same type with unique identities
- Manage agent state, context, and conversation history
- Route messages between agents
- Handle agent failover and recovery
- Track resource usage per agent
## Decision Drivers
- **Flexibility:** Support diverse agent roles and capabilities
- **Scalability:** Handle 50+ concurrent agent instances
- **Isolation:** Each instance maintains separate state
- **Observability:** Full visibility into agent activities
- **Reliability:** Graceful handling of failures
## Decision
**Adopt a Type-Instance pattern** where:
- **Agent Types** define templates (model, expertise, personality)
- **Agent Instances** are spawned from types with unique identities
- **Agent Orchestrator** manages lifecycle and communication
## Architecture
### Agent Type Definition
```python
class AgentType(Base):
    id = Column(UUID, primary_key=True)
    name = Column(String(50), unique=True)     # "Software Engineer"
    role = Column(Enum(AgentRole))             # ENGINEER
    base_model = Column(String(100))           # "claude-3-5-sonnet-20241022"
    failover_model = Column(String(100))       # "gpt-4-turbo"
    expertise = Column(ARRAY(String))          # ["python", "fastapi", "testing"]
    personality = Column(JSONB)                # {"style": "detailed", "tone": "professional"}
    system_prompt = Column(Text)               # Base system prompt template
    capabilities = Column(ARRAY(String))       # ["code_generation", "code_review"]
    is_active = Column(Boolean, default=True)
```
### Agent Instance Definition
```python
class AgentInstance(Base):
    id = Column(UUID, primary_key=True)
    name = Column(String(50))                  # "Dave"
    agent_type_id = Column(UUID, ForeignKey("agent_types.id"))  # table names assumed
    project_id = Column(UUID, ForeignKey("projects.id"))
    status = Column(Enum(InstanceStatus))      # ACTIVE, IDLE, TERMINATED
    context = Column(JSONB)                    # Current working context
    conversation_id = Column(UUID)             # Active conversation
    rag_collection_id = Column(String)         # Domain knowledge collection
    token_usage = Column(JSONB)                # {"prompt": 0, "completion": 0}
    last_active_at = Column(DateTime)
    created_at = Column(DateTime)
    terminated_at = Column(DateTime)
```
### Orchestrator Service
```python
class AgentOrchestrator:
    """Central service for agent lifecycle management."""

    async def spawn_agent(
        self,
        agent_type_id: UUID,
        project_id: UUID,
        name: str,
        domain_knowledge: list[str] | None = None,
    ) -> AgentInstance:
        """Spawn a new agent instance from a type definition."""
        agent_type = await self.get_agent_type(agent_type_id)
        instance = AgentInstance(
            name=name,
            agent_type_id=agent_type_id,
            project_id=project_id,
            status=InstanceStatus.ACTIVE,
            context={"initialized_at": datetime.utcnow().isoformat()},
        )
        # Initialize RAG collection if domain knowledge provided
        if domain_knowledge:
            instance.rag_collection_id = await self._init_rag_collection(
                instance.id, domain_knowledge
            )
        self.db.add(instance)  # Session.add() is synchronous, even on AsyncSession
        await self.db.commit()
        # Publish spawn event
        await self.event_bus.publish(f"project:{project_id}", {
            "type": "agent_spawned",
            "agent_id": str(instance.id),
            "name": name,
            "role": agent_type.role.value,
        })
        return instance

    async def terminate_agent(self, instance_id: UUID) -> None:
        """Terminate an agent instance and release resources."""
        instance = await self.get_instance(instance_id)
        instance.status = InstanceStatus.TERMINATED
        instance.terminated_at = datetime.utcnow()
        # Cleanup RAG collection
        if instance.rag_collection_id:
            await self._cleanup_rag_collection(instance.rag_collection_id)
        await self.db.commit()

    async def send_message(
        self,
        from_id: UUID,
        to_id: UUID,
        message: AgentMessage,
    ) -> None:
        """Route a message from one agent to another."""
        # Validate both agents exist and are active
        sender = await self.get_instance(from_id)
        recipient = await self.get_instance(to_id)
        # Persist message
        await self.message_store.save(message)
        # If recipient is idle, trigger action
        if recipient.status == InstanceStatus.IDLE:
            await self._trigger_agent_action(recipient.id, message)
        # Publish for real-time tracking
        await self.event_bus.publish(f"project:{sender.project_id}", {
            "type": "agent_message",
            "from": str(from_id),
            "to": str(to_id),
            "preview": message.content[:100],
        })

    async def broadcast(
        self,
        from_id: UUID,
        target_role: AgentRole,
        message: AgentMessage,
    ) -> None:
        """Broadcast a message to all agents of a specific role."""
        sender = await self.get_instance(from_id)
        recipients = await self.get_instances_by_role(
            sender.project_id, target_role
        )
        for recipient in recipients:
            await self.send_message(from_id, recipient.id, message)
```
### Agent Execution Pattern
```python
class AgentRunner:
    """Executes agent actions using LLM."""

    def __init__(self, instance: AgentInstance, llm_gateway: LLMGateway):
        self.instance = instance
        self.llm = llm_gateway

    async def execute(self, action: str, context: dict) -> dict:
        """Execute an action using the agent's configured model."""
        agent_type = await self.get_agent_type(self.instance.agent_type_id)
        # Build messages with system prompt and context
        messages = [
            {"role": "system", "content": self._build_system_prompt(agent_type)},
            *self._get_conversation_history(),
            {"role": "user", "content": self._build_action_prompt(action, context)},
        ]
        # Add RAG context if available
        if self.instance.rag_collection_id:
            rag_context = await self._query_rag(action, context)
            messages.insert(1, {
                "role": "system",
                "content": f"Relevant context:\n{rag_context}",
            })
        # Execute with failover
        response = await self.llm.complete(
            agent_id=str(self.instance.id),
            project_id=str(self.instance.project_id),
            messages=messages,
            model_preference=self._get_model_preference(agent_type),
        )
        # Update instance context
        self.instance.context = {
            **self.instance.context,
            "last_action": action,
            "last_response_at": datetime.utcnow().isoformat(),
        }
        return response
```
### Agent Roles
| Role | Instances | Primary Capabilities |
|------|-----------|---------------------|
| Product Owner | 1 | requirements, prioritization, client_communication |
| Project Manager | 1 | planning, tracking, coordination |
| Business Analyst | 1 | analysis, documentation, process_modeling |
| Software Architect | 1 | design, architecture_decisions, tech_selection |
| Software Engineer | 1-5 | code_generation, code_review, testing |
| UI/UX Designer | 1 | design, wireframes, accessibility |
| QA Engineer | 1-2 | test_planning, test_automation, bug_reporting |
| DevOps Engineer | 1 | cicd, infrastructure, deployment |
| AI/ML Engineer | 1 | ml_development, model_training, mlops |
| Security Expert | 1 | security_review, vulnerability_assessment |
## Consequences
### Positive
- Clear separation between type definition and instance runtime
- Multiple instances share type configuration (DRY)
- Easy to add new agent roles
- Full observability through events
- Graceful failure handling with model failover
### Negative
- Complexity in managing instance lifecycle
- State synchronization across instances
- Memory overhead for context storage
### Mitigation
- Context archival for long-running instances
- Periodic cleanup of terminated instances
- State compression for large contexts
## Compliance
This decision aligns with:
- FR-101: Agent type configuration
- FR-102: Agent instance spawning
- FR-103: Agent domain knowledge (RAG)
- FR-104: Inter-agent communication
- FR-105: Agent activity monitoring
---
*This ADR establishes the agent orchestration architecture for Syndarix.*

# Syndarix Architecture Overview
**Version:** 1.0
**Date:** 2025-12-29
**Status:** Draft
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [System Context](#2-system-context)
3. [High-Level Architecture](#3-high-level-architecture)
4. [Core Components](#4-core-components)
5. [Data Architecture](#5-data-architecture)
6. [Integration Architecture](#6-integration-architecture)
7. [Security Architecture](#7-security-architecture)
8. [Deployment Architecture](#8-deployment-architecture)
9. [Cross-Cutting Concerns](#9-cross-cutting-concerns)
10. [Architecture Decisions](#10-architecture-decisions)
---
## 1. Executive Summary
Syndarix is an AI-powered software consulting agency platform that orchestrates specialized AI agents to deliver complete software solutions autonomously. This document describes the technical architecture that enables:
- **Multi-Agent Orchestration:** 10 specialized agent roles collaborating on projects
- **MCP-First Integration:** All external tools via Model Context Protocol
- **Real-time Visibility:** SSE-based event streaming for progress tracking
- **Autonomous Workflows:** Configurable autonomy levels from full control to autonomous
- **Full Artifact Delivery:** Code, documentation, tests, and ADRs
### Architecture Principles
1. **MCP-First:** All integrations through unified MCP servers
2. **Event-Driven:** Async communication via Redis Pub/Sub
3. **Type-Safe:** Full typing in Python and TypeScript
4. **Stateless Services:** Horizontal scaling through stateless design
5. **Explicit Scoping:** All operations scoped to project/agent
---
## 2. System Context
### Context Diagram
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               EXTERNAL ACTORS                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│   │   Client    │    │    Admin    │    │  LLM APIs   │    │  Git Hosts  │  │
│   │   (Human)   │    │   (Human)   │    │ (Anthropic) │    │   (Gitea)   │  │
│   └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘  │
│          │                  │                  │                  │         │
└──────────│──────────────────│──────────────────│──────────────────│─────────┘
           │ Web UI           │ Admin UI         │ API              │ API
           │ SSE              │                  │                  │
           ▼                  ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│                              SYNDARIX PLATFORM                              │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                         Agent Orchestration                         │   │
│   │   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐           │   │
│   │   │   PO   │ │   PM   │ │  Arch  │ │  Eng   │ │   QA   │  ...      │   │
│   │   └────────┘ └────────┘ └────────┘ └────────┘ └────────┘           │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
           │ Storage          │ Events           │ Tasks            │
           ▼                  ▼                  ▼                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               INFRASTRUCTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│   │ PostgreSQL  │    │    Redis    │    │   Celery    │    │ MCP Servers │  │
│   │ + pgvector  │    │   Pub/Sub   │    │   Workers   │    │  (7 types)  │  │
│   └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Key Actors
| Actor | Type | Interaction |
|-------|------|-------------|
| Client | Human | Web UI, approvals, feedback |
| Admin | Human | Configuration, monitoring |
| LLM Providers | External | Claude, GPT-4, local models |
| Git Hosts | External | Gitea, GitHub, GitLab |
| CI/CD Systems | External | Gitea Actions, etc. |
---
## 3. High-Level Architecture
### Layered Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Next.js 16 Frontend │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │Dashboard │ │ Projects │ │ Agents │ │ Issues │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
│ REST + SSE + WebSocket
┌───────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ FastAPI Backend │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Auth │ │ API │ │ Services │ │ Events │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ ORCHESTRATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Agent │ │ Workflow │ │ Project │ │ │
│ │ │ Orchestrator │ │ Engine │ │ Manager │ │ │
│ │ └───────────────┘ └───────────────┘ └───────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ INTEGRATION LAYER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ MCP Client Manager │ │
│ │ Connects to: LLM, Git, KB, Issues, FS, Code, CI/CD MCPs │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ File Store │ │
│ │ + pgvector │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└───────────────────────────────────────────────────────────────────┘
```
---
## 4. Core Components
### 4.1 Agent Orchestrator
**Purpose:** Manages agent lifecycle, spawning, communication, and coordination.
**Responsibilities:**
- Spawn agent instances from type definitions
- Route messages between agents
- Manage agent context and memory
- Handle agent failover
- Track resource usage
**Key Patterns:**
- Type-Instance pattern (types define templates, instances are runtime)
- Message routing with priority queues
- Context compression for long-running agents
See: [ADR-006: Agent Orchestration](../adrs/ADR-006-agent-orchestration.md)
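The Type-Instance pattern can be sketched as follows. This is a minimal illustration with hypothetical field and function names, not the actual Syndarix models:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass(frozen=True)
class AgentType:
    """Immutable template: defines what an agent *is*."""
    name: str
    base_model: str
    system_prompt: str

@dataclass
class AgentInstance:
    """Runtime state: spawned per project from a type."""
    type: AgentType
    project_id: str
    instance_id: str = field(default_factory=lambda: uuid4().hex)
    context: list = field(default_factory=list)

def spawn(agent_type: AgentType, project_id: str) -> AgentInstance:
    """Create a runtime instance from a type definition."""
    return AgentInstance(type=agent_type, project_id=project_id)

eng = AgentType("Engineer", "claude-sonnet", "You write code.")
a, b = spawn(eng, "project-a"), spawn(eng, "project-a")
# a and b share the template but carry independent runtime state
```

The key property is that many instances reference one shared, immutable template while each keeps its own context and identity.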
### 4.2 Workflow Engine
**Purpose:** Orchestrates multi-step workflows and agent collaboration.
**Responsibilities:**
- Execute workflow templates (requirements discovery, sprint, etc.)
- Track workflow state and progress
- Handle branching and conditions
- Manage approval gates
**Workflow Types:**
- Requirements Discovery
- Architecture Spike
- Sprint Planning
- Implementation
- Sprint Demo
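How a workflow template with an approval gate advances can be sketched as below; step names, states, and the `advance` method are illustrative, not the actual engine API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    requires_approval: bool = False

@dataclass
class Workflow:
    steps: list
    current: int = 0
    status: str = "running"

    def advance(self, approved: bool = False) -> str:
        """Move to the next step; block on an unapproved gate."""
        step = self.steps[self.current]
        if step.requires_approval and not approved:
            self.status = "awaiting_approval"
            return self.status
        self.current += 1
        self.status = "completed" if self.current >= len(self.steps) else "running"
        return self.status

sprint = Workflow([Step("plan"), Step("implement"), Step("demo", requires_approval=True)])
```

The gate simply refuses to advance until an explicit approval arrives, which is the behavior the approval-gate responsibility above describes.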
### 4.3 Project Manager (Component)
**Purpose:** Manages project lifecycle, configuration, and state.
**Responsibilities:**
- Create and configure projects
- Manage complexity levels
- Track project status
- Generate reports
### 4.4 LLM Gateway
**Purpose:** Unified LLM access with failover and cost tracking.
**Implementation:** LiteLLM-based router with:
- Multiple model groups (high-reasoning, fast-response)
- Automatic failover chain
- Per-agent token tracking
- Redis-backed caching
See: [ADR-004: LLM Provider Abstraction](../adrs/ADR-004-llm-provider-abstraction.md)
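As a rough sketch, the model groups and failover chain map onto a LiteLLM-style router configuration. The model identifiers and group names here are assumptions, not the production config:

```python
from collections import defaultdict

# Two deployments share the "high-reasoning" group; the router load-balances
# within a group and fails over to "fast-response" when the group is exhausted.
model_list = [
    {"model_name": "high-reasoning",
     "litellm_params": {"model": "anthropic/claude-sonnet"}},
    {"model_name": "high-reasoning",
     "litellm_params": {"model": "openai/gpt-4o"}},
    {"model_name": "fast-response",
     "litellm_params": {"model": "ollama/llama3"}},
]
fallbacks = [{"high-reasoning": ["fast-response"]}]

# This config would be handed to LiteLLM roughly as:
#   router = litellm.Router(model_list=model_list, fallbacks=fallbacks)
#   response = await router.acompletion(model="high-reasoning", messages=[...])

# Grouping by model_name shows the two-deployment failover group:
groups = defaultdict(list)
for entry in model_list:
    groups[entry["model_name"]].append(entry["litellm_params"]["model"])
```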
### 4.5 MCP Client Manager
**Purpose:** Connects to all MCP servers and routes tool calls.
**Implementation:**
- SSE connections to 7 MCP server types
- Automatic reconnection
- Request/response correlation
- Scoped tool calls with project_id/agent_id
See: [ADR-001: MCP Integration Architecture](../adrs/ADR-001-mcp-integration-architecture.md)
### 4.6 Event Bus
**Purpose:** Real-time event distribution using Redis Pub/Sub.
**Channels:**
- `project:{project_id}` - Project-scoped events
- `agent:{agent_id}` - Agent-specific events
- `system` - System-wide announcements
See: [ADR-002: Real-time Communication](../adrs/ADR-002-realtime-communication.md)
---
## 5. Data Architecture
### 5.1 Entity Model
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ User │───1:N─│ Project │───1:N─│ Sprint │
└─────────────┘ └─────────────┘ └─────────────┘
│ 1:N │ 1:N
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ │ │ │
┌──────┴──────┐ ┌────┴────┐ │ ┌─────┴─────┐
│ AgentInstance│ │Repository│ │ │ Issue │
└─────────────┘ └─────────┘ │ └───────────┘
│ │ │ │
│ 1:N │ 1:N │ │ 1:N
┌──────┴──────┐ ┌──────┴────┐│ ┌──────┴──────┐
│ Message │ │PullRequest│└───────│IssueComment │
└─────────────┘ └───────────┘ └─────────────┘
```
### 5.2 Key Entities
| Entity | Purpose | Key Fields |
|--------|---------|------------|
| User | Human users | email, auth |
| Project | Work containers | name, complexity, autonomy_level |
| AgentType | Agent templates | base_model, expertise, system_prompt |
| AgentInstance | Running agents | name, project_id, context |
| Issue | Work items | type, status, external_tracker_fields |
| Sprint | Time-boxed iterations | goal, velocity |
| Repository | Git repos | provider, clone_url |
| KnowledgeDocument | RAG documents | content, embedding_id |
### 5.3 Vector Storage
**pgvector** extension for:
- Document embeddings (RAG)
- Semantic search across knowledge base
- Agent context similarity
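For illustration, a project-scoped similarity search against pgvector might look like the query below; the table and column names are assumptions, not the actual Syndarix schema (`<=>` is pgvector's cosine-distance operator):

```python
# Cosine-distance search over document embeddings, scoped to one project.
SEARCH_SQL = """
SELECT id, content, embedding <=> %(query_vec)s::vector AS distance
FROM knowledge_documents
WHERE project_id = %(project_id)s
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s;
"""
# With psycopg this would be executed roughly as:
#   cur.execute(SEARCH_SQL, {"query_vec": vec, "project_id": pid, "k": 5})
```

Ordering by the distance expression (rather than the alias) lets PostgreSQL use a pgvector index when one exists.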
---
## 6. Integration Architecture
### 6.1 MCP Server Registry
| Server | Port | Purpose | Priority Providers |
|--------|------|---------|-------------------|
| LLM Gateway | 9001 | LLM routing | Anthropic, OpenAI, Ollama |
| Git MCP | 9002 | Git operations | Gitea, GitHub, GitLab |
| Knowledge Base | 9003 | RAG search | pgvector |
| Issues MCP | 9004 | Issue tracking | Gitea, GitHub, GitLab |
| File System | 9005 | Workspace files | Local FS |
| Code Analysis | 9006 | Static analysis | Ruff, ESLint |
| CI/CD MCP | 9007 | Pipelines | Gitea Actions |
### 6.2 External Integration Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ Syndarix Backend │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ MCP Client Manager │ │
│ │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ LLM │ │ Git │ │ KB │ │ Issues │ │ CI/CD │ │ │
│ │ │ Client │ │ Client │ │ Client │ │ Client │ │ Client │ │ │
│ │ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │ │
│ └──────│──────────│──────────│──────────│──────────│──────┘ │
└─────────│──────────│──────────│──────────│──────────│──────────┘
│ │ │ │ │
│ SSE │ SSE │ SSE │ SSE │ SSE
▼ ▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ LLM │ │ Git │ │ KB │ │ Issues │ │ CI/CD │
│ MCP │ │ MCP │ │ MCP │ │ MCP │ │ MCP │
│ Server │ │ Server │ │ Server │ │ Server │ │ Server │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Anthropic│ │ Gitea │ │pgvector│ │ Gitea │ │ Gitea │
│ OpenAI │ │ GitHub │ │ │ │ Issues │ │Actions │
│ Ollama │ │ GitLab │ │ │ │ │ │ │
└─────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```
---
## 7. Security Architecture
### 7.1 Authentication
- **JWT Dual-Token:** Access token (15 min) + Refresh token (7 days)
- **OAuth 2.0 Provider:** For MCP client authentication
- **Service Tokens:** Internal service-to-service auth
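A stdlib-only sketch of the dual-token idea follows; in practice a JWT library such as PyJWT would do the encoding, but the TTLs match the figures above and the shape of the token pair is the same:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"dev-secret"        # loaded from the environment in practice
ACCESS_TTL = 15 * 60          # 15 minutes
REFRESH_TTL = 7 * 24 * 3600   # 7 days

def _sign(payload: dict) -> str:
    """HMAC-sign a payload (stand-in for a real JWT encode)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def issue_pair(user_id: str) -> dict:
    """Issue a short-lived access token and a long-lived refresh token."""
    now = time.time()
    return {
        "access": _sign({"sub": user_id, "typ": "access", "exp": now + ACCESS_TTL}),
        "refresh": _sign({"sub": user_id, "typ": "refresh", "exp": now + REFRESH_TTL}),
    }
```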
### 7.2 Authorization
- **RBAC:** Role-based access control
- **Project Scoping:** All operations scoped to projects
- **Agent Permissions:** Agents operate within project scope
### 7.3 Data Protection
- **TLS 1.3:** All external communications
- **Encryption at Rest:** Database encryption
- **Secrets Management:** Environment-based, never in code
---
## 8. Deployment Architecture
### 8.1 Container Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Docker Compose │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Frontend │ │ Backend │ │ Workers │ │ Flower │ │
│ │ (Next.js)│ │ (FastAPI)│ │ (Celery) │ │(Monitor) │ │
│ │ :3000 │ │ :8000 │ │ │ │ :5555 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ LLM MCP │ │ Git MCP │ │ KB MCP │ │Issues MCP│ │
│ │ :9001 │ │ :9002 │ │ :9003 │ │ :9004 │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ FS MCP │ │ Code MCP │ │CI/CD MCP │ │
│ │ :9005 │ │ :9006 │ │ :9007 │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Infrastructure │ │
│ │ ┌──────────┐ ┌──────────┐ │ │
│ │ │PostgreSQL│ │ Redis │ │ │
│ │ │ :5432 │ │ :6379 │ │ │
│ │ └──────────┘ └──────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
### 8.2 Scaling Strategy
| Component | Scaling | Strategy |
|-----------|---------|----------|
| Frontend | Horizontal | Stateless, behind LB |
| Backend | Horizontal | Stateless, behind LB |
| Celery Workers | Horizontal | Queue-based routing |
| MCP Servers | Horizontal | Stateless singletons |
| PostgreSQL | Vertical + Read Replicas | Primary/replica |
| Redis | Cluster | Sentinel or Cluster mode |
---
## 9. Cross-Cutting Concerns
### 9.1 Logging
- **Format:** Structured JSON
- **Correlation:** Request IDs across services
- **Levels:** DEBUG, INFO, WARNING, ERROR, CRITICAL
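A minimal sketch of structured JSON logging with a correlation ID using only the stdlib; the field names are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a request_id field."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("syndarix")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The request_id would be injected by middleware; here it is passed explicitly.
log.info("agent spawned", extra={"request_id": "req-123"})
```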
### 9.2 Monitoring
- **Metrics:** Prometheus-compatible export
- **Traces:** OpenTelemetry (future)
- **Dashboards:** Grafana (optional)
### 9.3 Error Handling
- **Agent Errors:** Logged, published via SSE
- **Task Failures:** Celery retry with backoff
- **Integration Errors:** Circuit breaker pattern
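The circuit breaker for integration errors can be sketched as below; the thresholds are illustrative, not production values:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures the breaker opens and
    short-circuits calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping each MCP or Git-host call in a breaker keeps a flapping integration from tying up workers in retries.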
---
## 10. Architecture Decisions
### Summary of ADRs
| ADR | Title | Status |
|-----|-------|--------|
| [ADR-001](../adrs/ADR-001-mcp-integration-architecture.md) | MCP Integration Architecture | Accepted |
| [ADR-002](../adrs/ADR-002-realtime-communication.md) | Real-time Communication | Accepted |
| [ADR-003](../adrs/ADR-003-background-task-architecture.md) | Background Task Architecture | Accepted |
| [ADR-004](../adrs/ADR-004-llm-provider-abstraction.md) | LLM Provider Abstraction | Accepted |
| [ADR-005](../adrs/ADR-005-tech-stack-selection.md) | Tech Stack Selection | Accepted |
| [ADR-006](../adrs/ADR-006-agent-orchestration.md) | Agent Orchestration | Accepted |
### Key Decisions Summary
1. **Unified Singleton MCP Servers** with project/agent scoping
2. **SSE for real-time events**, WebSocket only for chat
3. **Celery + Redis** for background tasks
4. **LiteLLM** for unified LLM abstraction with failover
5. **PragmaStack** as foundation with Syndarix extensions
6. **Type-Instance pattern** for agent orchestration
---
## Appendix A: Technology Stack Quick Reference
| Layer | Technology |
|-------|------------|
| Frontend | Next.js 16, React 19, TypeScript, Tailwind, shadcn/ui |
| Backend | FastAPI, Python 3.11+, SQLAlchemy 2.0, Pydantic 2.0 |
| Database | PostgreSQL 15+ with pgvector |
| Cache/Queue | Redis 7.0+ |
| Task Queue | Celery 5.3+ |
| MCP | FastMCP 2.0 |
| LLM | LiteLLM (Claude, GPT-4, Ollama) |
| Testing | pytest, Jest, Playwright |
| Container | Docker, Docker Compose |
---
## Appendix B: Port Reference
| Service | Port |
|---------|------|
| Frontend | 3000 |
| Backend | 8000 |
| PostgreSQL | 5432 |
| Redis | 6379 |
| Flower | 5555 |
| LLM MCP | 9001 |
| Git MCP | 9002 |
| KB MCP | 9003 |
| Issues MCP | 9004 |
| FS MCP | 9005 |
| Code MCP | 9006 |
| CI/CD MCP | 9007 |
---
*This document provides the comprehensive architecture overview for Syndarix. For detailed decisions, see the individual ADRs.*

# SPIKE-001: MCP Integration Pattern
**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #1
---
## Objective
Research the optimal pattern for integrating Model Context Protocol (MCP) servers with FastAPI backend, focusing on unified singleton servers with project/agent scoping.
## Research Questions
1. What is the recommended MCP SDK for Python/FastAPI?
2. How should we structure unified MCP servers vs per-project servers?
3. What is the best pattern for project/agent scoping in MCP tools?
4. How do we handle authentication between Syndarix and MCP servers?
## Findings
### 1. FastMCP 2.0 - Recommended Framework
**FastMCP** is a high-level, Pythonic framework for building MCP servers that significantly reduces boilerplate compared to the low-level MCP SDK.
**Key Features:**
- Decorator-based tool registration (`@mcp.tool()`)
- Built-in context management for resources and prompts
- Support for server-sent events (SSE) and stdio transports
- Type-safe with Pydantic model support
- Async-first design compatible with FastAPI
**Installation:**
```bash
pip install fastmcp
```
**Basic Example:**
```python
from fastmcp import FastMCP

mcp = FastMCP("syndarix-knowledge-base")

@mcp.tool()
def search_knowledge(
    project_id: str,
    query: str,
    scope: str = "project"
) -> list[dict]:
    """Search the knowledge base with project scoping."""
    # Implementation here
    return results

@mcp.resource("project://{project_id}/config")
def get_project_config(project_id: str) -> dict:
    """Get project configuration."""
    return config
```
### 2. Unified Singleton Pattern (Recommended)
**Decision:** Use unified singleton MCP servers instead of per-project servers.
**Architecture:**
```
┌─────────────────────────────────────────────────────────┐
│ Syndarix Backend │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Agent 1 │ │ Agent 2 │ │ Agent 3 │ │
│ │ (project A) │ │ (project A) │ │ (project B) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ MCP Client (Singleton) │ │
│ │ Maintains connections to all MCP servers │ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────┘
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Git MCP │ │ KB MCP │ │ LLM MCP │
│ (Singleton)│ │ (Singleton)│ │ (Singleton)│
└────────────┘ └────────────┘ └────────────┘
```
**Why Singleton:**
- Resource efficiency (one process per MCP type)
- Shared connection pools
- Centralized logging and monitoring
- Simpler deployment (7 services vs N×7)
- Cross-project learning possible (if needed)
**Scoping Pattern:**
```python
from typing import Literal

@mcp.tool()
def search_knowledge(
    project_id: str,  # Required - scopes to project
    agent_id: str,    # Required - identifies calling agent
    query: str,
    scope: Literal["project", "global"] = "project"
) -> SearchResults:
    """
    All tools accept project_id and agent_id for:
    - Access control validation
    - Audit logging
    - Context filtering
    """
    # Validate agent has access to project
    validate_access(agent_id, project_id)

    # Log the access
    log_tool_usage(agent_id, project_id, "search_knowledge")

    # Perform scoped search
    if scope == "project":
        return search_project_kb(project_id, query)
    else:
        return search_global_kb(query)
```
### 3. MCP Server Registry Architecture
```python
# mcp/registry.py
from dataclasses import dataclass
from typing import Dict

@dataclass
class MCPServerConfig:
    name: str
    port: int
    transport: str  # "sse" or "stdio"
    enabled: bool = True

MCP_SERVERS: Dict[str, MCPServerConfig] = {
    "llm_gateway": MCPServerConfig("llm-gateway", 9001, "sse"),
    "git": MCPServerConfig("git-mcp", 9002, "sse"),
    "knowledge_base": MCPServerConfig("kb-mcp", 9003, "sse"),
    "issues": MCPServerConfig("issues-mcp", 9004, "sse"),
    "file_system": MCPServerConfig("fs-mcp", 9005, "sse"),
    "code_analysis": MCPServerConfig("code-mcp", 9006, "sse"),
    "cicd": MCPServerConfig("cicd-mcp", 9007, "sse"),
}
```
### 4. Authentication Pattern
**MCP OAuth 2.0 Integration:**
```python
from fastmcp import FastMCP
from fastmcp.auth import OAuth2Bearer

mcp = FastMCP(
    "syndarix-mcp",
    auth=OAuth2Bearer(
        token_url="https://syndarix.local/oauth/token",
        scopes=["mcp:read", "mcp:write"]
    )
)
```
**Internal Service Auth (Recommended for v1):**
```python
# For internal deployment, use service tokens
@mcp.tool()
def create_issue(
    service_token: str,  # Validated internally
    project_id: str,
    title: str,
    body: str
) -> Issue:
    validate_service_token(service_token)
    # ... implementation
```
### 5. FastAPI Integration Pattern
```python
# app/mcp/client.py
from typing import Any

from mcp import ClientSession
from mcp.client.sse import sse_client

from app.mcp.registry import MCP_SERVERS

class MCPClientManager:
    def __init__(self):
        self._sessions: dict[str, ClientSession] = {}

    async def connect_all(self):
        """Connect to all configured MCP servers."""
        for name, config in MCP_SERVERS.items():
            if config.enabled:
                session = await self._connect_server(config)
                self._sessions[name] = session

    async def call_tool(
        self,
        server: str,
        tool_name: str,
        arguments: dict
    ) -> Any:
        """Call a tool on a specific MCP server."""
        session = self._sessions[server]
        result = await session.call_tool(tool_name, arguments)
        return result.content

# Usage in FastAPI
mcp_client = MCPClientManager()

@app.on_event("startup")  # newer FastAPI versions prefer lifespan handlers
async def startup():
    await mcp_client.connect_all()

@app.post("/api/v1/knowledge/search")
async def search_knowledge(request: SearchRequest):
    result = await mcp_client.call_tool(
        "knowledge_base",
        "search_knowledge",
        {
            "project_id": request.project_id,
            "agent_id": request.agent_id,
            "query": request.query
        }
    )
    return result
```
## Recommendations
### Immediate Actions
1. **Use FastMCP 2.0** for all MCP server implementations
2. **Implement unified singleton pattern** with explicit scoping
3. **Use SSE transport** for MCP server connections
4. **Service tokens** for internal auth (v1), OAuth 2.0 for future
### MCP Server Priority
1. **LLM Gateway** - Critical for agent operation
2. **Knowledge Base** - Required for RAG functionality
3. **Git MCP** - Required for code delivery
4. **Issues MCP** - Required for project management
5. **File System** - Required for workspace operations
6. **Code Analysis** - Enhances code quality
7. **CI/CD** - Automates deployments
### Code Organization
```
syndarix/
├── backend/
│ └── app/
│ └── mcp/
│ ├── __init__.py
│ ├── client.py # MCP client manager
│ ├── registry.py # Server configurations
│ └── schemas.py # Tool argument schemas
└── mcp_servers/
├── llm_gateway/
│ ├── __init__.py
│ ├── server.py
│ └── tools.py
├── knowledge_base/
├── git/
├── issues/
├── file_system/
├── code_analysis/
└── cicd/
```
## References
- [FastMCP Documentation](https://gofastmcp.com)
- [MCP Protocol Specification](https://spec.modelcontextprotocol.io)
- [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk)
## Decision
**Adopt FastMCP 2.0** with unified singleton servers and explicit project/agent scoping for all MCP integrations.
---
*Spike completed. Findings will inform ADR-001: MCP Integration Architecture.*

# SPIKE-003: Real-time Updates Architecture
**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #3
---
## Objective
Evaluate WebSocket vs Server-Sent Events (SSE) for real-time updates in Syndarix, focusing on agent activity streams, progress updates, and client notifications.
## Research Questions
1. What are the trade-offs between WebSocket and SSE?
2. Which pattern best fits Syndarix's use cases?
3. How do we handle reconnection and reliability?
4. What is the FastAPI implementation approach?
## Findings
### 1. Use Case Analysis
| Use Case | Direction | Frequency | Latency Req |
|----------|-----------|-----------|-------------|
| Agent activity feed | Server → Client | High | Low |
| Sprint progress | Server → Client | Medium | Low |
| Build status | Server → Client | Low | Medium |
| Client approval requests | Server → Client | Low | High |
| Client messages | Client → Server | Low | Medium |
| Issue updates | Server → Client | Medium | Low |
**Key Insight:** 90%+ of real-time communication is **server-to-client** (unidirectional).
### 2. Technology Comparison
| Feature | Server-Sent Events (SSE) | WebSocket |
|---------|-------------------------|-----------|
| Direction | Unidirectional (server → client) | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | Custom (ws://) |
| Reconnection | Built-in automatic | Manual implementation |
| Connection limits | 6 per domain on HTTP/1.1; much higher on HTTP/2 | Higher per-domain limits |
| Browser support | Excellent | Excellent |
| Through proxies | Native HTTP | May require config |
| Complexity | Simple | More complex |
| FastAPI support | Native | Native |
### 3. Recommendation: SSE for Primary, WebSocket for Chat
**SSE (Recommended for 90% of use cases):**
- Agent activity streams
- Progress updates
- Build/pipeline status
- Issue change notifications
- Approval request alerts
**WebSocket (For bidirectional needs):**
- Live chat with agents
- Interactive debugging sessions
- Real-time collaboration (future)
### 4. FastAPI SSE Implementation
```python
# app/api/v1/events.py
import asyncio

from fastapi import APIRouter, Depends, Request
from fastapi.responses import StreamingResponse

from app.api.deps import User, get_current_user  # illustrative import path
from app.services.events import EventBus

router = APIRouter()

@router.get("/projects/{project_id}/events")
async def project_events(
    project_id: str,
    request: Request,
    current_user: User = Depends(get_current_user)
):
    """Stream real-time events for a project."""
    async def event_generator():
        event_bus = EventBus()
        subscriber = await event_bus.subscribe(
            channel=f"project:{project_id}",
            user_id=current_user.id
        )
        try:
            while True:
                # Check if client disconnected
                if await request.is_disconnected():
                    break
                # Wait for next event (with timeout for keepalive)
                try:
                    event = await asyncio.wait_for(
                        subscriber.get_event(),
                        timeout=30.0
                    )
                    yield f"event: {event.type}\ndata: {event.json()}\n\n"
                except asyncio.TimeoutError:
                    # Send keepalive comment
                    yield ": keepalive\n\n"
        finally:
            await subscriber.unsubscribe()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        }
    )
```
### 5. Event Bus Architecture with Redis
```python
# app/services/events.py
import json
from dataclasses import dataclass

import redis.asyncio as redis

@dataclass
class Event:
    type: str
    data: dict
    project_id: str
    agent_id: str | None = None
    timestamp: float | None = None

class EventBus:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.pubsub = self.redis.pubsub()

    async def publish(self, channel: str, event: Event):
        """Publish an event to a channel."""
        await self.redis.publish(
            channel,
            json.dumps(event.__dict__)
        )

    async def subscribe(self, channel: str) -> "Subscriber":
        """Subscribe to a channel."""
        await self.pubsub.subscribe(channel)
        return Subscriber(self.pubsub, channel)

class Subscriber:
    def __init__(self, pubsub, channel: str):
        self.pubsub = pubsub
        self.channel = channel

    async def get_event(self) -> Event:
        """Get the next event (blocking)."""
        while True:
            message = await self.pubsub.get_message(
                ignore_subscribe_messages=True,
                timeout=1.0
            )
            if message and message["type"] == "message":
                data = json.loads(message["data"])
                return Event(**data)

    async def unsubscribe(self):
        await self.pubsub.unsubscribe(self.channel)
```
### 6. Client-Side Implementation
```typescript
// frontend/lib/events.ts
// Named ProjectEventStream to avoid shadowing the browser's built-in
// EventSource API, which the original class name would have broken.
class ProjectEventStream {
  private eventSource: EventSource | null = null;
  private reconnectDelay = 1000;
  private maxReconnectDelay = 30000;

  connect(projectId: string, onEvent: (event: ProjectEvent) => void) {
    const url = `/api/v1/projects/${projectId}/events`;
    this.eventSource = new EventSource(url, {
      withCredentials: true
    });

    this.eventSource.onopen = () => {
      console.log('SSE connected');
      this.reconnectDelay = 1000; // Reset on success
    };

    this.eventSource.addEventListener('agent_activity', (e) => {
      onEvent({ type: 'agent_activity', data: JSON.parse(e.data) });
    });
    this.eventSource.addEventListener('issue_update', (e) => {
      onEvent({ type: 'issue_update', data: JSON.parse(e.data) });
    });
    this.eventSource.addEventListener('approval_required', (e) => {
      onEvent({ type: 'approval_required', data: JSON.parse(e.data) });
    });

    this.eventSource.onerror = () => {
      this.eventSource?.close();
      // Exponential backoff reconnect
      setTimeout(() => this.connect(projectId, onEvent), this.reconnectDelay);
      this.reconnectDelay = Math.min(
        this.reconnectDelay * 2,
        this.maxReconnectDelay
      );
    };
  }

  disconnect() {
    this.eventSource?.close();
    this.eventSource = null;
  }
}
```
### 7. Event Types
```python
# app/schemas/events.py
from datetime import datetime
from enum import Enum

from pydantic import BaseModel

class EventType(str, Enum):
    # Agent Events
    AGENT_STARTED = "agent_started"
    AGENT_ACTIVITY = "agent_activity"
    AGENT_COMPLETED = "agent_completed"
    AGENT_ERROR = "agent_error"

    # Project Events
    ISSUE_CREATED = "issue_created"
    ISSUE_UPDATED = "issue_updated"
    ISSUE_CLOSED = "issue_closed"

    # Git Events
    BRANCH_CREATED = "branch_created"
    COMMIT_PUSHED = "commit_pushed"
    PR_CREATED = "pr_created"
    PR_MERGED = "pr_merged"

    # Workflow Events
    APPROVAL_REQUIRED = "approval_required"
    SPRINT_STARTED = "sprint_started"
    SPRINT_COMPLETED = "sprint_completed"

    # Pipeline Events
    PIPELINE_STARTED = "pipeline_started"
    PIPELINE_COMPLETED = "pipeline_completed"
    PIPELINE_FAILED = "pipeline_failed"

class ProjectEvent(BaseModel):
    id: str
    type: EventType
    project_id: str
    agent_id: str | None
    data: dict
    timestamp: datetime
```
### 8. WebSocket for Chat (Secondary)
```python
# app/api/v1/chat.py
from fastapi import WebSocket, WebSocketDisconnect

from app.services.agent_chat import AgentChatService

@router.websocket("/projects/{project_id}/agents/{agent_id}/chat")
async def agent_chat(
    websocket: WebSocket,
    project_id: str,
    agent_id: str
):
    """Bidirectional chat with an agent."""
    await websocket.accept()
    chat_service = AgentChatService(project_id, agent_id)
    try:
        while True:
            # Receive message from client
            message = await websocket.receive_json()
            # Stream response from agent
            async for chunk in chat_service.get_response(message):
                await websocket.send_json({
                    "type": "chunk",
                    "content": chunk
                })
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass
```
## Performance Considerations
### Connection Limits
- Browser limit: ~6 connections per domain (HTTP/1.1)
- Recommendation: Use single SSE connection per project, multiplex events
### Scalability
- Redis Pub/Sub handles cross-instance event distribution
- Consider Redis Streams for message persistence (audit/replay)
### Keepalive
- Send comment every 30 seconds to prevent timeout
- Client reconnects automatically on disconnect
## Recommendations
1. **Use SSE for all server-to-client events** (simpler, auto-reconnect)
2. **Use WebSocket only for interactive chat** with agents
3. **Redis Pub/Sub for event distribution** across instances
4. **Single SSE connection per project** with event multiplexing
5. **Exponential backoff** for client reconnection
## References
- [FastAPI SSE](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse)
- [MDN EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource)
- [Redis Pub/Sub](https://redis.io/topics/pubsub)
## Decision
**Adopt SSE as the primary real-time transport** with WebSocket reserved for bidirectional chat. Use Redis Pub/Sub for event distribution.
---
*Spike completed. Findings will inform ADR-002: Real-time Communication Architecture.*

# SPIKE-004: Celery + Redis Integration
**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #4
---
## Objective
Research best practices for integrating Celery with FastAPI for background task processing, focusing on agent orchestration, long-running workflows, and task monitoring.
## Research Questions
1. How to properly integrate Celery with async FastAPI?
2. What is the optimal task queue architecture for Syndarix?
3. How to handle long-running agent tasks?
4. What monitoring and visibility patterns should we use?
## Findings
### 1. Celery + FastAPI Integration Pattern
**Challenge:** Celery is synchronous, FastAPI is async.
**Solution:** Use `celery.result.AsyncResult` with async polling or callbacks.
```python
# app/core/celery.py
from celery import Celery

from app.core.config import settings

celery_app = Celery(
    "syndarix",
    broker=settings.REDIS_URL,
    backend=settings.REDIS_URL,
    include=[
        "app.tasks.agent_tasks",
        "app.tasks.git_tasks",
        "app.tasks.sync_tasks",
    ]
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600,             # 1 hour max
    task_soft_time_limit=3300,        # 55 min soft limit
    worker_prefetch_multiplier=1,     # One task at a time for LLM tasks
    task_acks_late=True,              # Acknowledge after completion
    task_reject_on_worker_lost=True,  # Retry if worker dies
)
```
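The bridge between Celery's synchronous `AsyncResult` and async request handlers mentioned above can be reduced to a small polling helper. This is a sketch: `wait_for_result` and its callable parameters are illustrative, not Celery API:

```python
import asyncio

async def wait_for_result(get_state, get_result, poll_interval: float = 0.5):
    """Poll a synchronous state probe off the event loop until terminal."""
    while True:
        state = await asyncio.to_thread(get_state)
        if state in ("SUCCESS", "FAILURE"):
            return await asyncio.to_thread(get_result)
        await asyncio.sleep(poll_interval)

# With Celery this would be wired up roughly as:
#   res = celery_app.AsyncResult(task_id)
#   value = await wait_for_result(lambda: res.state, lambda: res.result)
```

`asyncio.to_thread` keeps the blocking Redis round-trips out of the event loop, so one FastAPI worker can await many task results concurrently.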
### 2. Task Queue Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Backend │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ API Layer │ │ Services │ │ Events │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Task Dispatcher │ │
│ │ (Celery send_task) │ │
│ └────────────────┬───────────────┘ │
└──────────────────────────┼──────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ Redis (Broker + Backend) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ agent_queue │ │ git_queue │ │ sync_queue │ │
│ │ (priority) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Worker │ │ Worker │ │ Worker │
│ (agents) │ │ (git) │ │ (sync) │
│ prefetch=1 │ │ prefetch=4 │ │ prefetch=4 │
└────────────┘ └────────────┘ └────────────┘
```
### 3. Queue Configuration
```python
# app/core/celery.py
from kombu import Queue

celery_app.conf.task_queues = [
    Queue("agent_queue", routing_key="agent.#"),
    Queue("git_queue", routing_key="git.#"),
    Queue("sync_queue", routing_key="sync.#"),
    Queue("cicd_queue", routing_key="cicd.#"),
]

celery_app.conf.task_routes = {
    "app.tasks.agent_tasks.*": {"queue": "agent_queue"},
    "app.tasks.git_tasks.*": {"queue": "git_queue"},
    "app.tasks.sync_tasks.*": {"queue": "sync_queue"},
    "app.tasks.cicd_tasks.*": {"queue": "cicd_queue"},
}
}
```
### 4. Agent Task Implementation
```python
# app/tasks/agent_tasks.py
from celery import Task

from app.core.celery import celery_app
from app.services.agent_runner import AgentRunner
from app.services.events import EventBus

class AgentTask(Task):
    """Base class for agent tasks with retry and monitoring."""
    autoretry_for = (ConnectionError, TimeoutError)
    retry_backoff = True
    retry_backoff_max = 600
    retry_jitter = True
    max_retries = 3

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        """Handle task failure."""
        project_id = kwargs.get("project_id")
        agent_id = kwargs.get("agent_id")
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_error",
            "agent_id": agent_id,
            "error": str(exc)
        })

@celery_app.task(bind=True, base=AgentTask)
def run_agent_action(
    self,
    agent_id: str,
    project_id: str,
    action: str,
    context: dict
) -> dict:
    """
    Execute an agent action as a background task.

    Args:
        agent_id: The agent instance ID
        project_id: The project context
        action: The action to perform
        context: Action-specific context

    Returns:
        Action result dictionary
    """
    runner = AgentRunner(agent_id, project_id)

    # Update task state for monitoring
    self.update_state(
        state="RUNNING",
        meta={"agent_id": agent_id, "action": action}
    )

    # Publish start event
    EventBus().publish(f"project:{project_id}", {
        "type": "agent_started",
        "agent_id": agent_id,
        "action": action,
        "task_id": self.request.id
    })

    # Any exception raised here triggers on_failure (and autoretry)
    result = runner.execute(action, context)

    # Publish completion event
    EventBus().publish(f"project:{project_id}", {
        "type": "agent_completed",
        "agent_id": agent_id,
        "action": action,
        "result_summary": result.get("summary")
    })
    return result
```
### 5. Long-Running Task Patterns
**Progress Reporting:**
```python
@celery_app.task(bind=True)
def implement_story(self, story_id: str, agent_id: str, project_id: str):
    """Implement a user story with progress reporting."""
    steps = [
        ("analyzing", "Analyzing requirements"),
        ("designing", "Designing solution"),
        ("implementing", "Writing code"),
        ("testing", "Running tests"),
        ("documenting", "Updating documentation"),
    ]
    for i, (state, description) in enumerate(steps):
        self.update_state(
            state="PROGRESS",
            meta={
                "current": i + 1,
                "total": len(steps),
                "status": description
            }
        )
        # Do the actual work
        execute_step(state, story_id, agent_id)
        # Publish progress event
        EventBus().publish(f"project:{project_id}", {
            "type": "agent_progress",
            "agent_id": agent_id,
            "step": i + 1,
            "total": len(steps),
            "description": description
        })
    return {"status": "completed", "story_id": story_id}
```
**Task Chaining:**
```python
from celery import chain, group
# Sequential workflow
workflow = chain(
    analyze_requirements.s(story_id),
    design_solution.s(),
    implement_code.s(),
    run_tests.s(),
    create_pr.s()
)

# Parallel execution
parallel_tests = group(
    run_unit_tests.s(project_id),
    run_integration_tests.s(project_id),
    run_linting.s(project_id)
)
```
### 6. FastAPI Integration
```python
# app/api/v1/agents.py
from fastapi import APIRouter
from celery.result import AsyncResult

from app.tasks.agent_tasks import run_agent_action

router = APIRouter()


@router.post("/agents/{agent_id}/actions")
async def trigger_agent_action(
    agent_id: str,
    action: AgentActionRequest,  # request schema defined elsewhere in the app
):
    """Trigger an agent action as a background task."""
    # Dispatch to Celery (the worker executes it; FastAPI's BackgroundTasks is not needed)
    task = run_agent_action.delay(
        agent_id=agent_id,
        project_id=action.project_id,
        action=action.action,
        context=action.context
    )
    return {
        "task_id": task.id,
        "status": "queued"
    }


@router.get("/tasks/{task_id}")
async def get_task_status(task_id: str):
    """Get the status of a background task."""
    result = AsyncResult(task_id)
    if result.state == "PENDING":
        return {"status": "pending"}
    elif result.state == "RUNNING":
        return {"status": "running", **result.info}
    elif result.state == "PROGRESS":
        return {"status": "progress", **result.info}
    elif result.state == "SUCCESS":
        return {"status": "completed", "result": result.result}
    elif result.state == "FAILURE":
        return {"status": "failed", "error": str(result.result)}
    return {"status": result.state}
```
### 7. Worker Configuration
```bash
# Run different workers for different queues
# Agent worker (prefetch=1: each process reserves one task at a time, easing LLM rate limits)
celery -A app.core.celery worker \
  -Q agent_queue \
  -c 4 \
  --prefetch-multiplier=1 \
  -n agent_worker@%h

# Git worker (can handle multiple concurrent tasks)
celery -A app.core.celery worker \
  -Q git_queue \
  -c 8 \
  --prefetch-multiplier=4 \
  -n git_worker@%h

# Sync worker
celery -A app.core.celery worker \
  -Q sync_queue \
  -c 4 \
  --prefetch-multiplier=4 \
  -n sync_worker@%h
```
### 8. Monitoring with Flower
```yaml
# docker-compose.yml
services:
  flower:
    image: mher/flower:latest
    command: celery flower --broker=redis://redis:6379/0
    ports:
      - "5555:5555"
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - FLOWER_BASIC_AUTH=admin:password
```
### 9. Task Scheduling (Celery Beat)
```python
# app/core/celery.py
from celery.schedules import crontab
celery_app.conf.beat_schedule = {
    # Sync issues every minute
    "sync-external-issues": {
        "task": "app.tasks.sync_tasks.sync_all_issues",
        "schedule": 60.0,
    },
    # Health check every 5 minutes
    "agent-health-check": {
        "task": "app.tasks.agent_tasks.health_check_all_agents",
        "schedule": 300.0,
    },
    # Daily cleanup at midnight
    "cleanup-old-tasks": {
        "task": "app.tasks.maintenance.cleanup_old_tasks",
        "schedule": crontab(hour=0, minute=0),
    },
}
```
## Best Practices
1. **One task per LLM call** - Avoid rate limiting issues
2. **Progress reporting** - Update state for long-running tasks
3. **Idempotent tasks** - Handle retries gracefully
4. **Separate queues** - Isolate slow tasks from fast ones
5. **Task result expiry** - Set `result_expires` to avoid Redis bloat
6. **Soft time limits** - Allow graceful shutdown before hard kill
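The retry settings from section 4 (`retry_backoff=True`, `retry_backoff_max=600`, `retry_jitter=True`) produce a capped, jittered exponential delay between attempts. A plain-Python sketch approximating Celery's documented backoff formula:

```python
import random

def retry_delay(retries: int, base: int = 1, backoff_max: int = 600,
                jitter: bool = True) -> float:
    """Delay before retry N, approximating Celery: min(base * 2**retries, cap)."""
    delay = min(base * (2 ** retries), backoff_max)
    if jitter:
        # Full jitter: pick a random delay in [0, delay] to de-correlate retries
        delay = random.uniform(0, delay)
    return delay

# Without jitter the schedule is deterministic: 1s, 2s, 4s, 8s, ... capped at 600s
```

Jitter matters when many agent tasks fail at once (e.g., a provider outage): without it, every task retries on the same schedule and hammers the provider in synchronized waves.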
## Recommendations
1. **Use Celery for all long-running operations**
   - Agent actions
   - Git operations
   - External sync
   - CI/CD triggers
2. **Use Redis as both broker and backend**
   - Simplifies infrastructure
   - Fast enough for our scale
3. **Configure separate queues**
   - `agent_queue` with prefetch=1
   - `git_queue` with prefetch=4
   - `sync_queue` with prefetch=4
4. **Implement proper monitoring**
   - Flower for web UI
   - Prometheus metrics export
   - Dead letter queue for failed tasks
## References
- [Celery Documentation](https://docs.celeryq.dev/)
- [FastAPI Background Tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/)
- [Celery Best Practices](https://docs.celeryq.dev/en/stable/userguide/tasks.html#tips-and-best-practices)
## Decision
**Adopt Celery + Redis** for all background task processing with queue-based routing and progress reporting via Redis Pub/Sub events.
---
*Spike completed. Findings will inform ADR-003: Background Task Architecture.*

# SPIKE-005: LLM Provider Abstraction
**Status:** Completed
**Date:** 2025-12-29
**Author:** Architecture Team
**Related Issue:** #5
---
## Objective
Research the best approach for unified LLM provider abstraction with support for multiple providers, automatic failover, and cost tracking.
## Research Questions
1. What libraries exist for unified LLM access?
2. How to implement automatic failover between providers?
3. How to track token usage and costs per agent/project?
4. What caching strategies can reduce API costs?
## Findings
### 1. LiteLLM - Recommended Solution
**LiteLLM** provides a unified interface to 100+ LLM providers using the OpenAI SDK format.
**Key Features:**
- Unified API across providers (Anthropic, OpenAI, local, etc.)
- Built-in failover and load balancing
- Token counting and cost tracking
- Streaming support
- Async support
- Caching with Redis
**Installation:**
```bash
pip install litellm
```
### 2. Basic Usage
```python
import os

import litellm
from litellm import completion, acompletion

# Configure providers
litellm.api_key = os.getenv("ANTHROPIC_API_KEY")
litellm.set_verbose = True  # For debugging

# Synchronous call
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Async call (for FastAPI)
response = await acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
### 3. Model Naming Convention
LiteLLM routes requests by provider-prefixed model names; well-known OpenAI and Anthropic models are also recognized without a prefix:
| Provider | Model Format |
|----------|--------------|
| Anthropic | `claude-3-5-sonnet-20241022` |
| OpenAI | `gpt-4-turbo` |
| Azure OpenAI | `azure/deployment-name` |
| Ollama | `ollama/llama3` |
| Together AI | `together_ai/togethercomputer/llama-2-70b` |
### 4. Failover Configuration
```python
import os

from litellm import Router

# Define model list with fallbacks
model_list = [
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
        },
        "model_info": {"id": 1}
    },
    {
        "model_name": "primary-agent",  # Same name = fallback
        "litellm_params": {
            "model": "gpt-4-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
        "model_info": {"id": 2}
    },
    {
        "model_name": "primary-agent",
        "litellm_params": {
            "model": "ollama/llama3",
            "api_base": "http://localhost:11434",
        },
        "model_info": {"id": 3}
    }
]

# Initialize router with failover
router = Router(
    model_list=model_list,
    fallbacks=[
        {"primary-agent": ["primary-agent"]}  # Try all models with same name
    ],
    routing_strategy="simple-shuffle",  # or "latency-based-routing"
    num_retries=3,
    retry_after=5,  # seconds
    timeout=60,
)

# Use router
response = await router.acompletion(
    model="primary-agent",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
### 5. Syndarix LLM Gateway Architecture
```python
# app/services/llm_gateway.py
from litellm import Router, acompletion
from app.core.config import settings
from app.models.agent import AgentType
from app.services.cost_tracker import CostTracker
from app.services.events import EventBus
class LLMGateway:
    """Unified LLM gateway with failover and cost tracking."""

    def __init__(self):
        self.router = self._build_router()
        self.cost_tracker = CostTracker()
        self.event_bus = EventBus()

    def _build_router(self) -> Router:
        """Build LiteLLM router from configuration."""
        model_list = []

        # Add Anthropic models
        if settings.ANTHROPIC_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "claude-3-5-sonnet-20241022",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "claude-3-haiku-20240307",
                        "api_key": settings.ANTHROPIC_API_KEY,
                    }
                }
            ])

        # Add OpenAI fallbacks
        if settings.OPENAI_API_KEY:
            model_list.extend([
                {
                    "model_name": "high-reasoning",
                    "litellm_params": {
                        "model": "gpt-4-turbo",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                },
                {
                    "model_name": "fast-response",
                    "litellm_params": {
                        "model": "gpt-4o-mini",
                        "api_key": settings.OPENAI_API_KEY,
                    }
                }
            ])

        # Add local models (Ollama)
        if settings.OLLAMA_URL:
            model_list.append({
                "model_name": "local-fallback",
                "litellm_params": {
                    "model": "ollama/llama3",
                    "api_base": settings.OLLAMA_URL,
                }
            })

        return Router(
            model_list=model_list,
            fallbacks=[
                {"high-reasoning": ["high-reasoning", "local-fallback"]},
                {"fast-response": ["fast-response", "local-fallback"]},
            ],
            routing_strategy="latency-based-routing",
            num_retries=3,
            timeout=120,
        )

    async def complete(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str = "high-reasoning",
        stream: bool = False,
        **kwargs
    ) -> dict:
        """
        Generate a completion with automatic failover and cost tracking.

        Args:
            agent_id: The calling agent's ID
            project_id: The project context
            messages: Chat messages
            model_preference: "high-reasoning" or "fast-response"
            stream: Whether to stream the response
            **kwargs: Additional LiteLLM parameters

        Returns:
            Completion response dictionary
        """
        try:
            if stream:
                return self._stream_completion(
                    agent_id, project_id, messages, model_preference, **kwargs
                )

            response = await self.router.acompletion(
                model=model_preference,
                messages=messages,
                **kwargs
            )

            # Track usage
            await self._track_usage(
                agent_id=agent_id,
                project_id=project_id,
                model=response.model,
                usage=response.usage,
            )

            return {
                "content": response.choices[0].message.content,
                "model": response.model,
                "usage": {
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens,
                }
            }
        except Exception as e:
            # Publish error event
            await self.event_bus.publish(f"project:{project_id}", {
                "type": "llm_error",
                "agent_id": agent_id,
                "error": str(e)
            })
            raise

    async def _stream_completion(
        self,
        agent_id: str,
        project_id: str,
        messages: list[dict],
        model_preference: str,
        **kwargs
    ):
        """Stream a completion response."""
        response = await self.router.acompletion(
            model=model_preference,
            messages=messages,
            stream=True,
            **kwargs
        )
        async for chunk in response:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    async def _track_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        usage,  # LiteLLM usage object (prompt_tokens/completion_tokens attributes)
    ):
        """Track token usage and costs."""
        await self.cost_tracker.record_usage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=usage.prompt_tokens,
            completion_tokens=usage.completion_tokens,
        )
```
### 6. Cost Tracking
```python
# app/services/cost_tracker.py
from sqlalchemy.ext.asyncio import AsyncSession
from app.models.usage import TokenUsage
from datetime import datetime
# Cost per 1M tokens (approximate)
MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "ollama/llama3": {"input": 0.00, "output": 0.00},  # Local
}


class CostTracker:
    def __init__(self, db: AsyncSession):
        self.db = db

    async def record_usage(
        self,
        agent_id: str,
        project_id: str,
        model: str,
        prompt_tokens: int,
        completion_tokens: int,
    ):
        """Record token usage and calculate cost."""
        costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
        input_cost = (prompt_tokens / 1_000_000) * costs["input"]
        output_cost = (completion_tokens / 1_000_000) * costs["output"]
        total_cost = input_cost + output_cost

        usage = TokenUsage(
            agent_id=agent_id,
            project_id=project_id,
            model=model,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
            cost_usd=total_cost,
            timestamp=datetime.utcnow(),
        )
        self.db.add(usage)
        await self.db.commit()

    async def get_project_usage(
        self,
        project_id: str,
        start_date: datetime | None = None,
        end_date: datetime | None = None,
    ) -> dict:
        """Get usage summary for a project."""
        # Query aggregated usage
        ...

    async def check_budget(
        self,
        project_id: str,
        budget_limit: float,
    ) -> bool:
        """Check if project is within budget."""
        usage = await self.get_project_usage(project_id)
        return usage["total_cost_usd"] < budget_limit
```
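The per-model rates can be exercised standalone for budget sanity checks; a pure-function sketch (the helper name is illustrative, the arithmetic mirrors `record_usage` above):

```python
# Rates copied from MODEL_COSTS above (USD per 1M tokens)
MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost; unknown models are treated as free, as in record_usage."""
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return (
        (prompt_tokens / 1_000_000) * costs["input"]
        + (completion_tokens / 1_000_000) * costs["output"]
    )

# e.g. 1,000 prompt + 500 completion tokens on Sonnet is roughly $0.0105
```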
### 7. Caching with Redis
```python
import litellm
from litellm import Cache
# Configure Redis cache
litellm.cache = Cache(
    type="redis",
    host=settings.REDIS_HOST,
    port=settings.REDIS_PORT,
    password=settings.REDIS_PASSWORD,
)

# Enable caching
litellm.enable_cache()

# Cached completions (same input = cached response)
response = await litellm.acompletion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    cache={"ttl": 3600}  # Cache for 1 hour
)
```
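A cache hit requires byte-identical request parameters. LiteLLM derives its keys internally; conceptually the key is a stable hash over the request, which this illustrative (not LiteLLM's actual) function sketches:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict]) -> str:
    """Deterministic cache key: SHA-256 of the canonicalized request."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to the prompt, even whitespace, yields a different key, which is why caching pays off mainly for deterministic, frequently repeated queries.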
### 8. Agent Type Model Mapping
```python
# app/models/agent_type.py
from enum import Enum

from sqlalchemy import Column, Enum as SQLEnum, Float, Integer, String, Text
from sqlalchemy.dialects.postgresql import UUID

from app.db.base import Base


class ModelPreference(str, Enum):
    HIGH_REASONING = "high-reasoning"
    FAST_RESPONSE = "fast-response"
    COST_OPTIMIZED = "cost-optimized"


class AgentType(Base):
    __tablename__ = "agent_types"

    id = Column(UUID, primary_key=True)
    name = Column(String(50), unique=True)
    role = Column(String(50))

    # LLM configuration
    model_preference = Column(
        SQLEnum(ModelPreference),
        default=ModelPreference.HIGH_REASONING
    )
    max_tokens = Column(Integer, default=4096)
    temperature = Column(Float, default=0.7)

    # System prompt
    system_prompt = Column(Text)


# Mapping agent types to models
AGENT_MODEL_MAPPING = {
    "Product Owner": ModelPreference.HIGH_REASONING,
    "Project Manager": ModelPreference.FAST_RESPONSE,
    "Business Analyst": ModelPreference.HIGH_REASONING,
    "Software Architect": ModelPreference.HIGH_REASONING,
    "Software Engineer": ModelPreference.HIGH_REASONING,
    "UI/UX Designer": ModelPreference.HIGH_REASONING,
    "QA Engineer": ModelPreference.FAST_RESPONSE,
    "DevOps Engineer": ModelPreference.FAST_RESPONSE,
    "AI/ML Engineer": ModelPreference.HIGH_REASONING,
    "Security Expert": ModelPreference.HIGH_REASONING,
}
```
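At dispatch time, `AGENT_MODEL_MAPPING` resolves an agent's role to the model group passed to the gateway. A hypothetical helper (the function name, the default choice, and the string-valued subset of the mapping are illustrative):

```python
# Illustrative subset of the mapping, with plain-string model groups
AGENT_MODEL_MAPPING = {
    "Software Engineer": "high-reasoning",
    "QA Engineer": "fast-response",
}

def resolve_model_preference(role: str, default: str = "fast-response") -> str:
    """Map an agent role to the model group used by the LLM gateway."""
    return AGENT_MODEL_MAPPING.get(role, default)
```

Defaulting unknown roles to the cheaper group is an assumption; a stricter policy could raise instead, forcing every new agent type to declare a preference.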
## Rate Limiting Strategy
```python
from litellm import Router
import asyncio
# Configure rate limits per model
router = Router(
    model_list=model_list,
    redis_host=settings.REDIS_HOST,
    redis_port=settings.REDIS_PORT,
    routing_strategy="usage-based-routing",  # Route based on rate limits
)


# Custom rate limiter
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.semaphore = asyncio.Semaphore(requests_per_minute)

    async def acquire(self):
        await self.semaphore.acquire()
        # Release after 60 seconds
        asyncio.create_task(self._release_after(60))

    async def _release_after(self, seconds: int):
        await asyncio.sleep(seconds)
        self.semaphore.release()
```
## Recommendations
1. **Use LiteLLM as the unified abstraction layer**
   - Simplifies multi-provider support
   - Built-in failover and retry
   - Consistent API across providers
2. **Configure model groups by use case**
   - `high-reasoning`: Complex analysis, architecture decisions
   - `fast-response`: Quick tasks, simple queries
   - `cost-optimized`: Non-critical, high-volume tasks
3. **Implement automatic failover chain**
   - Primary: Claude 3.5 Sonnet
   - Fallback 1: GPT-4 Turbo
   - Fallback 2: Local Llama 3 (if available)
4. **Track all usage and costs**
   - Per agent, per project
   - Set budget alerts
   - Generate usage reports
5. **Cache frequently repeated queries**
   - Use Redis-backed cache
   - Cache embeddings for RAG
   - Cache deterministic transformations
## References
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [LiteLLM Router](https://docs.litellm.ai/docs/routing)
- [Anthropic Rate Limits](https://docs.anthropic.com/en/api/rate-limits)
## Decision
**Adopt LiteLLM** as the unified LLM abstraction layer with automatic failover, usage-based routing, and Redis-backed caching.
---
*Spike completed. Findings will inform ADR-004: LLM Provider Integration Architecture.*