From ce4d0c7b0dff3b487290635bc0a9510c65b66ef1 Mon Sep 17 00:00:00 2001 From: Felipe Cardoso Date: Sun, 1 Mar 2026 16:30:06 +0100 Subject: [PATCH] feat(backend): enhance performance benchmarking with baseline detection and documentation - Updated `make benchmark-check` in Makefile to detect and handle missing baselines, creating them if not found. - Added `.benchmarks` directory to `.gitignore` for local baseline exclusions. - Linked benchmarking documentation in `ARCHITECTURE.md` and added comprehensive `BENCHMARKS.md` guide. --- .gitignore | 2 +- backend/Makefile | 11 +- backend/docs/ARCHITECTURE.md | 2 + backend/docs/BENCHMARKS.md | 302 +++++++++++++++++++++++++++++++++++ 4 files changed, 314 insertions(+), 3 deletions(-) create mode 100644 backend/docs/BENCHMARKS.md diff --git a/.gitignore b/.gitignore index e0dad4a..2ab89b4 100755 --- a/.gitignore +++ b/.gitignore @@ -187,7 +187,7 @@ coverage.xml .hypothesis/ .pytest_cache/ cover/ - +backend/.benchmarks # Translations *.mo *.pot diff --git a/backend/Makefile b/backend/Makefile index f42d010..7fb35c5 100644 --- a/backend/Makefile +++ b/backend/Makefile @@ -186,8 +186,15 @@ benchmark-save: benchmark-check: @echo "⏱️ Running benchmarks and comparing against baseline..." - @IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts=' - @echo "✅ No performance regressions detected!" + @if find .benchmarks -name '*_baseline*' -print -quit 2>/dev/null | grep -q .; then \ + IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts='; \ + echo "✅ No performance regressions detected!"; \ + else \ + echo "⚠️ No benchmark baseline found. 
Run 'make benchmark-save' first to create one."; \ + echo " Running benchmarks without comparison..."; \ + IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-save=baseline --benchmark-sort=mean -p no:xdist --override-ini='addopts='; \ + echo "✅ Benchmark baseline created. Future runs of 'make benchmark-check' will compare against it."; \ + fi test-all: @echo "🧪 Running ALL tests (unit + E2E)..." diff --git a/backend/docs/ARCHITECTURE.md b/backend/docs/ARCHITECTURE.md index c4fe7d7..b9b0259 100644 --- a/backend/docs/ARCHITECTURE.md +++ b/backend/docs/ARCHITECTURE.md @@ -1169,6 +1169,8 @@ app.add_middleware( ## Performance Considerations +> 📖 For the full benchmarking guide (how to run, read results, write new benchmarks, and manage baselines), see **[BENCHMARKS.md](BENCHMARKS.md)**. + ### Database Connection Pooling - Pool size: 20 connections diff --git a/backend/docs/BENCHMARKS.md b/backend/docs/BENCHMARKS.md new file mode 100644 index 0000000..c418dcc --- /dev/null +++ b/backend/docs/BENCHMARKS.md @@ -0,0 +1,302 @@ +# Performance Benchmarks Guide + +Automated performance benchmarking infrastructure using **pytest-benchmark** to detect latency regressions in critical API endpoints. + +## Table of Contents + +- [Why Benchmark?](#why-benchmark) +- [Quick Start](#quick-start) +- [How It Works](#how-it-works) +- [Understanding Results](#understanding-results) +- [Test Organization](#test-organization) +- [Writing Benchmark Tests](#writing-benchmark-tests) +- [Baseline Management](#baseline-management) +- [CI/CD Integration](#cicd-integration) +- [Troubleshooting](#troubleshooting) + +--- + +## Why Benchmark? + +Performance regressions are silent bugs — they don't break tests or cause errors, but they degrade the user experience over time. 
Common causes include: + +- **Unintended N+1 queries** after adding a relationship +- **Heavier serialization** after adding new fields to a response model +- **Middleware overhead** from new security headers or logging +- **Dependency upgrades** that introduce slower code paths + +Without automated benchmarks, these regressions go unnoticed until users complain. Performance benchmarks serve as an **early warning system** — they measure endpoint latency on every run and flag significant deviations from an established baseline. + +### What benchmarks give you + +| Benefit | Description | +|---------|-------------| +| **Regression detection** | Automatically flags when an endpoint becomes significantly slower | +| **Baseline tracking** | Stores known-good performance numbers for comparison | +| **Confidence in refactors** | Verify that code changes don't degrade response times | +| **Visibility** | Makes performance a first-class, measurable quality attribute | + +--- + +## Quick Start + +```bash +# Run benchmarks (no comparison, just see current numbers) +make benchmark + +# Save current results as the baseline +make benchmark-save + +# Run benchmarks and compare against the saved baseline +make benchmark-check +``` + +--- + +## How It Works + +The benchmarking system has three layers: + +### 1. pytest-benchmark integration + +[pytest-benchmark](https://pytest-benchmark.readthedocs.io/) is a pytest plugin that provides a `benchmark` fixture. It handles: + +- **Calibration**: Automatically determines how many iterations to run for statistical significance +- **Timing**: Uses `time.perf_counter` for high-resolution measurements +- **Statistics**: Computes min, max, mean, median, standard deviation, IQR, and outlier detection +- **Comparison**: Compares current results against saved baselines and flags regressions + +### 2. 
Benchmark types + +The test suite includes two categories of performance tests: + +| Type | How it works | Examples | +|------|-------------|----------| +| **pytest-benchmark tests** | Uses the `benchmark` fixture for precise, multi-round timing | `test_health_endpoint_performance`, `test_openapi_schema_performance` | +| **Manual latency tests** | Uses `time.perf_counter` with explicit thresholds (for async endpoints that pytest-benchmark doesn't support natively) | `test_login_latency`, `test_get_current_user_latency` | + +### 3. Regression detection + +When running `make benchmark-check`, the system: + +1. Runs all benchmark tests +2. Compares results against the saved baseline (`.benchmarks/` directory) +3. **Fails the build** if any test's mean time exceeds the baseline by more than **200%** (i.e., more than 3× the baseline mean) + +The `200%` threshold in `--benchmark-compare-fail=mean:200%` means "fail if the mean increased by more than 200% relative to the baseline." This is deliberately generous to avoid false positives from normal run-to-run variance while still catching real regressions. 
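The threshold arithmetic can be sketched in a few lines of Python. This is only an illustration of the percentage check described above, not pytest-benchmark's own code; the function names `percent_increase` and `is_regression` and the sample values are hypothetical.

```python
def percent_increase(baseline_mean: float, current_mean: float) -> float:
    """Return how much the mean grew relative to the baseline, in percent."""
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    return (current_mean / baseline_mean - 1.0) * 100.0


def is_regression(baseline_mean: float, current_mean: float,
                  threshold_pct: float = 200.0) -> bool:
    """True when the increase is strictly greater than the threshold."""
    return percent_increase(baseline_mean, current_mean) > threshold_pct


if __name__ == "__main__":
    baseline = 1.0  # ms, hypothetical baseline mean
    for current in (1.5, 3.0, 3.5):
        print(f"{current} ms -> regression: {is_regression(baseline, current)}")
```

With a 1.0 ms baseline, a 3.0 ms mean is an increase of exactly 200% and still passes under this sketch; only a mean above 3× the baseline trips the check.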
+ +--- + +## Understanding Results + +A typical benchmark output looks like this: + +``` +--------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------- +Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +test_health_endpoint_performance 0.9841 (1.0) 1.5513 (1.0) 1.1390 (1.0) 0.1098 (1.0) 1.1151 (1.0) 0.1672 (1.0) 39;2 877.9666 (1.0) 133 1 +test_openapi_schema_performance 1.6523 (1.68) 2.0892 (1.35) 1.7843 (1.57) 0.1553 (1.41) 1.7200 (1.54) 0.1727 (1.03) 2;0 560.4471 (0.64) 10 1 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +``` + +### Column reference + +| Column | Meaning | +|--------|---------| +| **Min** | Fastest single execution | +| **Max** | Slowest single execution | +| **Mean** | Average across all rounds; the primary metric for regression detection | +| **StdDev** | How much results vary between rounds (lower = more stable) | +| **Median** | Middle value, less sensitive to outliers than mean | +| **IQR** | Interquartile range: spread of the middle 50% of results | +| **Outliers** | Format `A;B`: A = rounds more than 1 StdDev away from the mean, B = rounds more than 1.5 IQR beyond the first or third quartile | +| **OPS** | Operations per second (`1 / Mean`, with Mean in seconds) | +| **Rounds** | How many times the test was executed (auto-calibrated) | +| **Iterations** | Iterations per round (usually 1 for ms-scale tests) | + +### The ratio numbers `(1.0)`, `(1.68)`, etc. + +These show how each test compares **to the best result in that column**. The fastest test is always `(1.0)`, and others show their relative factor. 
For example, `(1.68)` means "1.68× slower than the fastest." + +### Color coding + +- **Green**: The fastest (best) value in each column +- **Red**: The slowest (worst) value in each column + +This is a **relative ranking within the current run** — red does NOT mean the test failed or that performance is bad. It simply highlights which endpoint is the slower one in the group. + +### What's "normal"? + +For this project's current endpoints: + +| Endpoint | Expected range | Why | +|----------|---------------|-----| +| `GET /health` | ~1–1.5ms | Minimal logic, mocked DB check | +| `GET /api/v1/openapi.json` | ~1.5–2.5ms | Serializes entire API schema | +| `POST /api/v1/auth/login` | < 500ms threshold | Includes bcrypt password hashing | +| `GET /api/v1/users/me` | < 200ms threshold | DB lookup + token validation | + +--- + +## Test Organization + +``` +backend/tests/ +├── benchmarks/ +│ └── test_endpoint_performance.py # All performance benchmark tests +│ +backend/.benchmarks/ # Saved baselines (auto-generated) +└── Linux-CPython-3.12-64bit/ + └── 0001_baseline.json # Platform-specific baseline file +``` + +### Test markers + +All benchmark tests use the `@pytest.mark.benchmark` marker. The `--benchmark-only` flag ensures that only tests using the `benchmark` fixture are executed during benchmark runs, while manual latency tests (async) are skipped. + +--- + +## Writing Benchmark Tests + +### Stateless endpoint (using pytest-benchmark fixture) + +```python +import pytest +from fastapi.testclient import TestClient + +def test_my_endpoint_performance(sync_client, benchmark): + """Benchmark: GET /my-endpoint should respond within acceptable latency.""" + result = benchmark(sync_client.get, "/my-endpoint") + assert result.status_code == 200 +``` + +The `benchmark` fixture handles all timing, calibration, and statistics automatically. Just pass it the callable and arguments. 
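To make "timing, calibration, and statistics" more concrete, here is a rough standard-library stand-in for the measuring part of that work. It is a simplified sketch, not how pytest-benchmark is implemented: it uses a fixed number of rounds instead of calibration, and the helper name `time_callable` is hypothetical.

```python
import statistics
import time


def time_callable(func, *args, rounds: int = 20, **kwargs) -> dict:
    """Call func repeatedly and summarize per-round wall-clock time in ms."""
    samples_ms = []
    result = None
    for _ in range(rounds):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "result": result,  # return value of the last call
        "rounds": rounds,
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        # stdev needs at least two samples
        "stddev": statistics.stdev(samples_ms) if rounds > 1 else 0.0,
    }


if __name__ == "__main__":
    stats = time_callable(sorted, [3, 1, 2], rounds=50)
    print(stats["result"], f"mean={stats['mean']:.4f} ms")
```

Like the `benchmark` fixture in the example above, the helper takes the callable followed by its arguments and hands back the callable's return value, which is why assertions on the response still work after benchmarking.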
+ +### Async / DB-dependent endpoint (manual timing) + +For async endpoints that require database access, use manual timing with an explicit threshold: + +```python +import time +import pytest + +MAX_RESPONSE_MS = 300 + +@pytest.mark.asyncio +async def test_my_async_endpoint_latency(client, setup_fixture): + """Performance: endpoint must respond under threshold.""" + iterations = 5 + total_ms = 0.0 + + for _ in range(iterations): + start = time.perf_counter() + response = await client.get("/api/v1/my-endpoint") + elapsed_ms = (time.perf_counter() - start) * 1000 + total_ms += elapsed_ms + assert response.status_code == 200 + + mean_ms = total_ms / iterations + assert mean_ms < MAX_RESPONSE_MS, ( + f"Latency regression: {mean_ms:.1f}ms exceeds {MAX_RESPONSE_MS}ms threshold" + ) +``` + +### Guidelines for new benchmarks + +1. **Benchmark critical paths** — endpoints users hit frequently or where latency matters most +2. **Mock external dependencies** for stateless tests to isolate endpoint overhead +3. **Set generous thresholds** for manual tests — account for CI variability +4. **Keep benchmarks fast** — they run on every check, so avoid heavy setup + +--- + +## Baseline Management + +### Saving a baseline + +```bash +make benchmark-save +``` + +This runs all benchmarks and saves results to `.benchmarks//0001_baseline.json`. The baseline captures: +- Mean, min, max, median, stddev for each test +- Machine info (CPU, OS, Python version) +- Timestamp + +### Comparing against baseline + +```bash +make benchmark-check +``` + +If no baseline exists, this command automatically creates one and prints a warning. On subsequent runs, it compares current results against the saved baseline. 
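The baseline check in the Makefile is a shell `find .benchmarks -name '*_baseline*'` test. The same decision can be sketched in Python for readers who want to reuse it in a script; the function name `has_baseline` is hypothetical and this code is not part of the patch.

```python
from pathlib import Path


def has_baseline(storage_dir: str = ".benchmarks") -> bool:
    """Return True if anything under storage_dir has '_baseline' in its name."""
    root = Path(storage_dir)
    if not root.is_dir():
        # Mirrors the Makefile, where a missing directory falls into the else branch.
        return False
    # rglob searches every nesting level, e.g. the platform-specific subdirectory.
    return any(root.rglob("*_baseline*"))


if __name__ == "__main__":
    if has_baseline():
        print("Baseline found: compare against it")
    else:
        print("No baseline found: save one first")
```

Matching on the name pattern rather than a fixed path keeps the check independent of the platform-specific subdirectory that the baseline file is stored in.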
+ +### When to update the baseline + +- **After intentional performance changes** (e.g., you optimized an endpoint — save the new, faster baseline) +- **After infrastructure changes** (e.g., new CI runner, different hardware) +- **After adding new benchmark tests** (the new tests need a baseline entry) + +```bash +# Update the baseline after intentional changes +make benchmark-save +``` + +### Version control + +The `.benchmarks/` directory can be committed to version control so that CI pipelines can compare against a known-good baseline. However, since benchmark results are machine-specific, you may prefer to generate baselines in CI rather than committing local results. + +--- + +## CI/CD Integration + +Add benchmark checking to your CI pipeline to catch regressions on every PR: + +```yaml +# Example GitHub Actions step +- name: Performance regression check + run: | + cd backend + make benchmark-save # Create baseline from main branch + # ... apply PR changes ... + make benchmark-check # Compare PR against baseline +``` + +A more robust approach: +1. Save the baseline on the `main` branch after each merge +2. On PR branches, run `make benchmark-check` against the `main` baseline +3. The pipeline fails if any endpoint regresses beyond the 200% threshold + +--- + +## Troubleshooting + +### "No benchmark baseline found" warning + +``` +⚠️ No benchmark baseline found. Run 'make benchmark-save' first to create one. +``` + +This means no baseline file exists yet. The command will auto-create one. Future runs of `make benchmark-check` will compare against it. + +### Machine info mismatch warning + +``` +WARNING: benchmark machine_info is different +``` + +This is expected when comparing baselines generated on a different machine or OS. The comparison still works, but absolute numbers may differ. Re-save the baseline on the current machine if needed. + +### High variance (large StdDev) + +If StdDev is high relative to the Mean, results may be unreliable. 
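Common causes: +- System under load during benchmark run +- Garbage collection interference +- Thermal throttling + +Try running benchmarks on an idle system or increasing the minimum number of rounds (pytest-benchmark's `--benchmark-min-rounds` option, or `min_rounds` on the `@pytest.mark.benchmark` marker). If the option is set through `addopts` in `pyproject.toml`, note that `make benchmark-check` passes `--override-ini='addopts='`, which clears it. + +### Only 2 of 4 tests run + +The async tests (`test_login_latency`, `test_get_current_user_latency`) are skipped during `--benchmark-only` runs because they don't use the `benchmark` fixture. They run as part of the normal test suite (`make test`) with manual threshold assertions.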