2 Commits

Author SHA1 Message Date
Felipe Cardoso
0760a8284d feat(tests): add comprehensive benchmarks for auth and performance-critical endpoints
- Introduced benchmarks for password hashing, verification, and JWT token operations.
- Added latency tests for `/register`, `/refresh`, `/sessions`, and `/users/me` endpoints.
- Updated `BENCHMARKS.md` with new tests, thresholds, and execution details.
2026-03-01 17:01:44 +01:00
Felipe Cardoso
ce4d0c7b0d feat(backend): enhance performance benchmarking with baseline detection and documentation
- Updated `make benchmark-check` in Makefile to detect and handle missing baselines, creating them if not found.
- Added `.benchmarks` directory to `.gitignore` for local baseline exclusions.
- Linked benchmarking documentation in `ARCHITECTURE.md` and added comprehensive `BENCHMARKS.md` guide.
2026-03-01 16:30:06 +01:00
5 changed files with 502 additions and 5 deletions

2
.gitignore vendored

@@ -187,7 +187,7 @@ coverage.xml
.hypothesis/
.pytest_cache/
cover/
backend/.benchmarks
# Translations
*.mo
*.pot


@@ -186,8 +186,15 @@ benchmark-save:
benchmark-check:
@echo "⏱️ Running benchmarks and comparing against baseline..."
@IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts='
@echo "✅ No performance regressions detected!"
@if find .benchmarks -name '*_baseline*' -print -quit 2>/dev/null | grep -q .; then \
IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts='; \
echo "✅ No performance regressions detected!"; \
else \
echo "⚠️ No benchmark baseline found. Run 'make benchmark-save' first to create one."; \
echo " Running benchmarks without comparison..."; \
IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-save=baseline --benchmark-sort=mean -p no:xdist --override-ini='addopts='; \
echo "✅ Benchmark baseline created. Future runs of 'make benchmark-check' will compare against it."; \
fi
test-all:
@echo "🧪 Running ALL tests (unit + E2E)..."


@@ -1169,6 +1169,8 @@ app.add_middleware(
## Performance Considerations
> 📖 For the full benchmarking guide (how to run, read results, write new benchmarks, and manage baselines), see **[BENCHMARKS.md](BENCHMARKS.md)**.
### Database Connection Pooling
- Pool size: 20 connections

311
backend/docs/BENCHMARKS.md Normal file

@@ -0,0 +1,311 @@
# Performance Benchmarks Guide
Automated performance benchmarking infrastructure using **pytest-benchmark** to detect latency regressions in critical API endpoints.
## Table of Contents
- [Why Benchmark?](#why-benchmark)
- [Quick Start](#quick-start)
- [How It Works](#how-it-works)
- [Understanding Results](#understanding-results)
- [Test Organization](#test-organization)
- [Writing Benchmark Tests](#writing-benchmark-tests)
- [Baseline Management](#baseline-management)
- [CI/CD Integration](#cicd-integration)
- [Troubleshooting](#troubleshooting)
---
## Why Benchmark?
Performance regressions are silent bugs — they don't break tests or cause errors, but they degrade the user experience over time. Common causes include:
- **Unintended N+1 queries** after adding a relationship
- **Heavier serialization** after adding new fields to a response model
- **Middleware overhead** from new security headers or logging
- **Dependency upgrades** that introduce slower code paths
Without automated benchmarks, these regressions go unnoticed until users complain. Performance benchmarks serve as an **early warning system** — they measure endpoint latency on every run and flag significant deviations from an established baseline.
### What benchmarks give you
| Benefit | Description |
|---------|-------------|
| **Regression detection** | Automatically flags when an endpoint becomes significantly slower |
| **Baseline tracking** | Stores known-good performance numbers for comparison |
| **Confidence in refactors** | Verify that code changes don't degrade response times |
| **Visibility** | Makes performance a first-class, measurable quality attribute |
---
## Quick Start
```bash
# Run benchmarks (no comparison, just see current numbers)
make benchmark
# Save current results as the baseline
make benchmark-save
# Run benchmarks and compare against the saved baseline
make benchmark-check
```
---
## How It Works
The benchmarking system has three layers:
### 1. pytest-benchmark integration
[pytest-benchmark](https://pytest-benchmark.readthedocs.io/) is a pytest plugin that provides a `benchmark` fixture. It handles:
- **Calibration**: Automatically determines how many iterations to run for statistical significance
- **Timing**: Uses `time.perf_counter` for high-resolution measurements
- **Statistics**: Computes min, max, mean, median, standard deviation, IQR, and outlier detection
- **Comparison**: Compares current results against saved baselines and flags regressions
### 2. Benchmark types
The test suite includes two categories of performance tests:
| Type | How it works | Examples |
|------|-------------|----------|
| **pytest-benchmark tests** | Uses the `benchmark` fixture for precise, multi-round timing | `test_health_endpoint_performance`, `test_openapi_schema_performance`, `test_password_hashing_performance`, `test_password_verification_performance`, `test_access_token_creation_performance`, `test_refresh_token_creation_performance`, `test_token_decode_performance` |
| **Manual latency tests** | Uses `time.perf_counter` with explicit thresholds (for async endpoints that pytest-benchmark doesn't support natively) | `test_login_latency`, `test_get_current_user_latency`, `test_register_latency`, `test_token_refresh_latency`, `test_sessions_list_latency`, `test_user_profile_update_latency` |
### 3. Regression detection
When running `make benchmark-check`, the system:
1. Runs all benchmark tests
2. Compares results against the saved baseline (`.benchmarks/` directory)
3. **Fails the build** if any test's mean time increases by more than **200%** over the baseline (i.e., the new mean is more than 3× the baseline mean)
The `200%` threshold in `--benchmark-compare-fail=mean:200%` means "fail if the mean increased by more than 200% relative to the baseline." This is deliberately generous to avoid false positives from normal run-to-run variance while still catching real regressions.
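The arithmetic behind that threshold can be sketched in a few lines of Python. This is only an illustration of a percentage-increase check, not pytest-benchmark's actual implementation, and the function name `exceeds_threshold` is hypothetical.

```python
def exceeds_threshold(
    baseline_mean: float,
    current_mean: float,
    max_increase_pct: float = 200.0,
) -> bool:
    """Return True if current_mean grew by more than max_increase_pct
    relative to baseline_mean (both in the same time unit)."""
    increase_pct = (current_mean - baseline_mean) / baseline_mean * 100.0
    return increase_pct > max_increase_pct
```

With a 1.0ms baseline, a 2.5ms mean is a 150% increase and passes, while a 3.5ms mean is a 250% increase and fails.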
---
## Understanding Results
A typical benchmark output looks like this:
```
--------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_health_endpoint_performance 0.9841 (1.0) 1.5513 (1.0) 1.1390 (1.0) 0.1098 (1.0) 1.1151 (1.0) 0.1672 (1.0) 39;2 877.9666 (1.0) 133 1
test_openapi_schema_performance 1.6523 (1.68) 2.0892 (1.35) 1.7843 (1.57) 0.1553 (1.41) 1.7200 (1.54) 0.1727 (1.03) 2;0 560.4471 (0.64) 10 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
### Column reference
| Column | Meaning |
|--------|---------|
| **Min** | Fastest single execution |
| **Max** | Slowest single execution |
| **Mean** | Average across all rounds — the primary metric for regression detection |
| **StdDev** | How much results vary between rounds (lower = more stable) |
| **Median** | Middle value, less sensitive to outliers than mean |
| **IQR** | Interquartile range — spread of the middle 50% of results |
| **Outliers** | Format `A;B`: A = rounds more than 1 StdDev away from the mean, B = rounds more than 1.5 IQR outside the 1st/3rd quartiles |
| **OPS** | Operations per second (`1 / Mean`) |
| **Rounds** | How many times the test was executed (auto-calibrated) |
| **Iterations** | Iterations per round (usually 1 for ms-scale tests) |
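To make the columns concrete, the sketch below computes comparable statistics from a list of per-round timings using only the standard library. It is an approximation for illustration: the helper name `summarize` is hypothetical, and pytest-benchmark's own quartile method may differ slightly from `statistics.quantiles`.

```python
import statistics


def summarize(timings_s: list[float]) -> dict[str, float]:
    """Summarize per-round timings (in seconds) in the style of the result table."""
    mean = statistics.fmean(timings_s)
    stddev = statistics.stdev(timings_s)
    # Quartiles via the stdlib default (exclusive) method; an approximation.
    q1, _, q3 = statistics.quantiles(timings_s, n=4)
    iqr = q3 - q1
    return {
        "min": min(timings_s),
        "max": max(timings_s),
        "mean": mean,
        "stddev": stddev,
        "median": statistics.median(timings_s),
        "iqr": iqr,
        # Rounds more than 1 StdDev away from the mean.
        "outliers_stddev": sum(1 for t in timings_s if abs(t - mean) > stddev),
        # Rounds more than 1.5 IQR outside the 1st/3rd quartiles.
        "outliers_iqr": sum(
            1 for t in timings_s if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr
        ),
        # Operations per second is the inverse of the mean time in seconds.
        "ops": 1.0 / mean,
    }
```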
### The ratio numbers `(1.0)`, `(1.68)`, etc.
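These show how each test compares **to the best result in that column**. The best value is always `(1.0)`, and the others show their relative factor. In the timing columns, `(1.68)` means "1.68× slower than the fastest"; in the OPS column, where higher is better, `(0.64)` means 0.64× the throughput of the fastest test.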
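As a small worked example, the ratios in the Mean column of the sample output can be reproduced by dividing each value by the smallest one. The helper name `relative_ratios` is hypothetical and applies to timing columns only, where lower is better.

```python
def relative_ratios(values: dict[str, float]) -> dict[str, float]:
    """Express each timing as a factor of the best (smallest) value, rounded to 2 places."""
    best = min(values.values())
    return {name: round(value / best, 2) for name, value in values.items()}


# Mean column (ms) taken from the sample output above.
means_ms = {
    "test_health_endpoint_performance": 1.1390,
    "test_openapi_schema_performance": 1.7843,
}
ratios = relative_ratios(means_ms)
```

Here `ratios` maps the health test to `1.0` and the OpenAPI schema test to `1.57`, matching the table.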
### Color coding
- **Green**: The fastest (best) value in each column
- **Red**: The slowest (worst) value in each column
This is a **relative ranking within the current run** — red does NOT mean the test failed or that performance is bad. It simply highlights which endpoint is the slower one in the group.
### What's "normal"?
For this project's current endpoints:
| Test | Expected range | Why |
|------|---------------|-----|
| `GET /health` | ~1–1.5ms | Minimal logic, mocked DB check |
| `GET /api/v1/openapi.json` | ~1.5–2.5ms | Serializes entire API schema |
| `get_password_hash` | ~200ms | CPU-bound bcrypt hashing |
| `verify_password` | ~200ms | CPU-bound bcrypt verification |
| `create_access_token` | ~17–20µs | JWT encoding with HMAC-SHA256 |
| `create_refresh_token` | ~17–20µs | JWT encoding with HMAC-SHA256 |
| `decode_token` | ~20–25µs | JWT decoding and claim validation |
| `POST /api/v1/auth/login` | < 500ms threshold | Includes bcrypt password verification |
| `POST /api/v1/auth/register` | < 500ms threshold | Includes bcrypt password hashing |
| `POST /api/v1/auth/refresh` | < 200ms threshold | Token rotation + DB session update |
| `GET /api/v1/users/me` | < 200ms threshold | DB lookup + token validation |
| `GET /api/v1/sessions/me` | < 200ms threshold | Session list query + token validation |
| `PATCH /api/v1/users/me` | < 200ms threshold | DB update + token validation |
---
## Test Organization
```
backend/tests/
├── benchmarks/
│ └── test_endpoint_performance.py # All performance benchmark tests
backend/.benchmarks/ # Saved baselines (auto-generated)
└── Linux-CPython-3.12-64bit/
└── 0001_baseline.json # Platform-specific baseline file
```
### Test markers
All benchmark tests use the `@pytest.mark.benchmark` marker. The `--benchmark-only` flag ensures that only tests using the `benchmark` fixture are executed during benchmark runs, while manual latency tests (async) are skipped.
---
## Writing Benchmark Tests
### Stateless endpoint (using pytest-benchmark fixture)
```python
import pytest
from fastapi.testclient import TestClient
def test_my_endpoint_performance(sync_client, benchmark):
"""Benchmark: GET /my-endpoint should respond within acceptable latency."""
result = benchmark(sync_client.get, "/my-endpoint")
assert result.status_code == 200
```
The `benchmark` fixture handles all timing, calibration, and statistics automatically. Just pass it the callable and arguments.
### Async / DB-dependent endpoint (manual timing)
For async endpoints that require database access, use manual timing with an explicit threshold:
```python
import time
import pytest
MAX_RESPONSE_MS = 300
@pytest.mark.asyncio
async def test_my_async_endpoint_latency(client, setup_fixture):
"""Performance: endpoint must respond under threshold."""
iterations = 5
total_ms = 0.0
for _ in range(iterations):
start = time.perf_counter()
response = await client.get("/api/v1/my-endpoint")
elapsed_ms = (time.perf_counter() - start) * 1000
total_ms += elapsed_ms
assert response.status_code == 200
mean_ms = total_ms / iterations
assert mean_ms < MAX_RESPONSE_MS, (
f"Latency regression: {mean_ms:.1f}ms exceeds {MAX_RESPONSE_MS}ms threshold"
)
```
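Because every manual latency test repeats the same timing loop, one option is to factor it into a small helper. The following is a sketch under that assumption; `mean_latency_ms` and `make_request` are hypothetical names that do not exist in the test suite.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def mean_latency_ms(
    make_request: Callable[[], Awaitable[object]],
    iterations: int = 5,
) -> float:
    """Await make_request() `iterations` times and return the mean latency in ms."""
    total_ms = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        await make_request()
        total_ms += (time.perf_counter() - start) * 1000
    return total_ms / iterations
```

A test could then call `await mean_latency_ms(lambda: client.get("/api/v1/my-endpoint"))` and assert the result against its threshold. Note that this helper does not check the response status on each iteration, which the inline version above does.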
### Guidelines for new benchmarks
1. **Benchmark critical paths** — endpoints users hit frequently or where latency matters most
2. **Mock external dependencies** for stateless tests to isolate endpoint overhead
3. **Set generous thresholds** for manual tests — account for CI variability
4. **Keep benchmarks fast** — they run on every check, so avoid heavy setup
---
## Baseline Management
### Saving a baseline
```bash
make benchmark-save
```
This runs all benchmarks and saves results to `.benchmarks/<platform>/0001_baseline.json`. The baseline captures:
- Mean, min, max, median, stddev for each test
- Machine info (CPU, OS, Python version)
- Timestamp
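For ad-hoc inspection, the saved means can be pulled out of a baseline file with a few lines of Python. This sketch assumes the JSON has a top-level `benchmarks` list whose entries carry a `name` and a `stats` object with a `mean` field; check a real baseline file before relying on that layout. Both function names are hypothetical.

```python
import json
from pathlib import Path


def extract_means(baseline: dict) -> dict[str, float]:
    """Map each benchmark name to its saved mean (assumed layout, see above)."""
    return {entry["name"]: entry["stats"]["mean"] for entry in baseline["benchmarks"]}


def load_baseline_means(path: Path) -> dict[str, float]:
    """Read a saved baseline JSON file and return its per-test means."""
    return extract_means(json.loads(path.read_text(encoding="utf-8")))
```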
### Comparing against baseline
```bash
make benchmark-check
```
If no baseline exists, this command automatically creates one and prints a warning. On subsequent runs, it compares current results against the saved baseline.
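The detection itself is done in the Makefile with `find .benchmarks -name '*_baseline*'`. An equivalent check in Python, shown only to illustrate the logic (the function name `has_baseline` is hypothetical), could look like this:

```python
from pathlib import Path


def has_baseline(benchmarks_dir: Path = Path(".benchmarks")) -> bool:
    """Return True if any file or directory matching *_baseline* exists under benchmarks_dir."""
    if not benchmarks_dir.is_dir():
        return False
    # rglob searches platform subdirectories such as Linux-CPython-3.12-64bit/.
    return any(benchmarks_dir.rglob("*_baseline*"))
```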
### When to update the baseline
- **After intentional performance changes** (e.g., you optimized an endpoint — save the new, faster baseline)
- **After infrastructure changes** (e.g., new CI runner, different hardware)
- **After adding new benchmark tests** (the new tests need a baseline entry)
```bash
# Update the baseline after intentional changes
make benchmark-save
```
### Version control
The `.benchmarks/` directory can be committed to version control so that CI pipelines can compare against a known-good baseline. However, since benchmark results are machine-specific, you may prefer to generate baselines in CI rather than committing local results.
---
## CI/CD Integration
Add benchmark checking to your CI pipeline to catch regressions on every PR:
```yaml
# Example GitHub Actions step
- name: Performance regression check
run: |
cd backend
make benchmark-save # Create baseline from main branch
# ... apply PR changes ...
make benchmark-check # Compare PR against baseline
```
A more robust approach:
1. Save the baseline on the `main` branch after each merge
2. On PR branches, run `make benchmark-check` against the `main` baseline
3. The pipeline fails if any endpoint regresses beyond the 200% threshold
---
## Troubleshooting
### "No benchmark baseline found" warning
```
⚠️ No benchmark baseline found. Run 'make benchmark-save' first to create one.
```
This means no baseline file exists yet. The command will auto-create one. Future runs of `make benchmark-check` will compare against it.
### Machine info mismatch warning
```
WARNING: benchmark machine_info is different
```
This is expected when comparing baselines generated on a different machine or OS. The comparison still works, but absolute numbers may differ. Re-save the baseline on the current machine if needed.
### High variance (large StdDev)
If StdDev is high relative to the Mean, results may be unreliable. Common causes:
- System under load during benchmark run
- Garbage collection interference
- Thermal throttling
Try running benchmarks on an idle system or increasing `min_rounds` in `pyproject.toml`.
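One way to quantify "high relative to the Mean" is the coefficient of variation (StdDev divided by Mean). The sketch below is illustrative: the function names are hypothetical, and the 25% cut-off is an arbitrary assumption, not a value used by this project or by pytest-benchmark.

```python
def coefficient_of_variation(mean: float, stddev: float) -> float:
    """Return StdDev as a fraction of the Mean (both in the same time unit)."""
    return stddev / mean


def is_noisy(mean: float, stddev: float, max_cv: float = 0.25) -> bool:
    """Flag a result as unreliable when its variation exceeds max_cv (assumed cut-off)."""
    return coefficient_of_variation(mean, stddev) > max_cv
```

For the health endpoint in the sample output (Mean 1.1390ms, StdDev 0.1098ms) the coefficient is about 0.096, which would count as stable under this cut-off.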
### Only 7 of 13 tests run
The async tests (`test_login_latency`, `test_get_current_user_latency`, `test_register_latency`, `test_token_refresh_latency`, `test_sessions_list_latency`, `test_user_profile_update_latency`) are skipped during `--benchmark-only` runs because they don't use the `benchmark` fixture. They run as part of the normal test suite (`make test`) with manual threshold assertions.


@@ -2,7 +2,7 @@
Performance Benchmark Tests.
These tests establish baseline performance metrics for critical API endpoints
and detect regressions when response times degrade significantly.
and core operations, detecting regressions when response times degrade.
Usage:
make benchmark # Run benchmarks and save baseline
@@ -20,10 +20,21 @@ import pytest
import pytest_asyncio
from fastapi.testclient import TestClient
from app.core.auth import (
create_access_token,
create_refresh_token,
decode_token,
get_password_hash,
verify_password,
)
from app.main import app
pytestmark = [pytest.mark.benchmark]
# Pre-computed hash for sync benchmarks (avoids hashing in every iteration)
_BENCH_PASSWORD = "BenchPass123!"
_BENCH_HASH = get_password_hash(_BENCH_PASSWORD)
# =============================================================================
# Fixtures
@@ -55,6 +66,50 @@ def test_openapi_schema_performance(sync_client, benchmark):
assert result.status_code == 200
# =============================================================================
# Core Crypto & Token Benchmarks (no DB required)
#
# These benchmark the CPU-intensive operations that underpin auth:
# password hashing, verification, and JWT creation/decoding.
# =============================================================================
def test_password_hashing_performance(benchmark):
"""Benchmark: bcrypt password hashing (CPU-bound, ~100ms expected)."""
result = benchmark(get_password_hash, _BENCH_PASSWORD)
assert result.startswith("$2b$")
def test_password_verification_performance(benchmark):
"""Benchmark: bcrypt password verification against a known hash."""
result = benchmark(verify_password, _BENCH_PASSWORD, _BENCH_HASH)
assert result is True
def test_access_token_creation_performance(benchmark):
"""Benchmark: JWT access token generation."""
user_id = str(uuid.uuid4())
token = benchmark(create_access_token, user_id)
assert isinstance(token, str)
assert len(token) > 0
def test_refresh_token_creation_performance(benchmark):
"""Benchmark: JWT refresh token generation."""
user_id = str(uuid.uuid4())
token = benchmark(create_refresh_token, user_id)
assert isinstance(token, str)
assert len(token) > 0
def test_token_decode_performance(benchmark):
"""Benchmark: JWT token decoding and validation."""
user_id = str(uuid.uuid4())
token = create_access_token(user_id)
payload = benchmark(decode_token, token, "access")
assert payload.sub == user_id
# =============================================================================
# Database-dependent Endpoint Benchmarks (async, manual timing)
#
@@ -65,12 +120,15 @@ def test_openapi_schema_performance(sync_client, benchmark):
MAX_LOGIN_MS = 500
MAX_GET_USER_MS = 200
MAX_REGISTER_MS = 500
MAX_TOKEN_REFRESH_MS = 200
MAX_SESSIONS_LIST_MS = 200
MAX_USER_UPDATE_MS = 200
@pytest_asyncio.fixture
async def bench_user(async_test_db):
"""Create a test user for benchmark tests."""
from app.core.auth import get_password_hash
from app.models.user import User
_test_engine, AsyncTestingSessionLocal = async_test_db
@@ -102,6 +160,17 @@ async def bench_token(client, bench_user):
return response.json()["access_token"]
@pytest_asyncio.fixture
async def bench_refresh_token(client, bench_user):
"""Get a refresh token for the benchmark user."""
response = await client.post(
"/api/v1/auth/login",
json={"email": "bench@example.com", "password": "BenchPass123!"},
)
assert response.status_code == 200, f"Login failed: {response.text}"
return response.json()["refresh_token"]
@pytest.mark.asyncio
async def test_login_latency(client, bench_user):
"""Performance: POST /api/v1/auth/login must respond under threshold."""
@@ -148,3 +217,111 @@ async def test_get_current_user_latency(client, bench_token):
assert mean_ms < MAX_GET_USER_MS, (
f"Get user latency regression: {mean_ms:.1f}ms exceeds {MAX_GET_USER_MS}ms threshold"
)
@pytest.mark.asyncio
async def test_register_latency(client):
"""Performance: POST /api/v1/auth/register must respond under threshold."""
iterations = 3
total_ms = 0.0
for i in range(iterations):
start = time.perf_counter()
response = await client.post(
"/api/v1/auth/register",
json={
"email": f"benchreg{i}@example.com",
"password": "BenchRegPass123!",
"first_name": "Bench",
"last_name": "Register",
},
)
elapsed_ms = (time.perf_counter() - start) * 1000
total_ms += elapsed_ms
assert response.status_code == 201, f"Register failed: {response.text}"
mean_ms = total_ms / iterations
print(
f"\n Register mean latency: {mean_ms:.1f}ms (threshold: {MAX_REGISTER_MS}ms)"
)
assert mean_ms < MAX_REGISTER_MS, (
f"Register latency regression: {mean_ms:.1f}ms exceeds {MAX_REGISTER_MS}ms threshold"
)
@pytest.mark.asyncio
async def test_token_refresh_latency(client, bench_refresh_token):
"""Performance: POST /api/v1/auth/refresh must respond under threshold."""
iterations = 5
total_ms = 0.0
for _ in range(iterations):
start = time.perf_counter()
response = await client.post(
"/api/v1/auth/refresh",
json={"refresh_token": bench_refresh_token},
)
elapsed_ms = (time.perf_counter() - start) * 1000
total_ms += elapsed_ms
assert response.status_code == 200, f"Refresh failed: {response.text}"
# Use the new refresh token for the next iteration
bench_refresh_token = response.json()["refresh_token"]
mean_ms = total_ms / iterations
print(
f"\n Token refresh mean latency: {mean_ms:.1f}ms (threshold: {MAX_TOKEN_REFRESH_MS}ms)"
)
assert mean_ms < MAX_TOKEN_REFRESH_MS, (
f"Token refresh latency regression: {mean_ms:.1f}ms exceeds {MAX_TOKEN_REFRESH_MS}ms threshold"
)
@pytest.mark.asyncio
async def test_sessions_list_latency(client, bench_token):
"""Performance: GET /api/v1/sessions/me must respond under threshold."""
iterations = 10
total_ms = 0.0
for _ in range(iterations):
start = time.perf_counter()
response = await client.get(
"/api/v1/sessions/me",
headers={"Authorization": f"Bearer {bench_token}"},
)
elapsed_ms = (time.perf_counter() - start) * 1000
total_ms += elapsed_ms
assert response.status_code == 200
mean_ms = total_ms / iterations
print(
f"\n Sessions list mean latency: {mean_ms:.1f}ms (threshold: {MAX_SESSIONS_LIST_MS}ms)"
)
assert mean_ms < MAX_SESSIONS_LIST_MS, (
f"Sessions list latency regression: {mean_ms:.1f}ms exceeds {MAX_SESSIONS_LIST_MS}ms threshold"
)
@pytest.mark.asyncio
async def test_user_profile_update_latency(client, bench_token):
"""Performance: PATCH /api/v1/users/me must respond under threshold."""
iterations = 5
total_ms = 0.0
for i in range(iterations):
start = time.perf_counter()
response = await client.patch(
"/api/v1/users/me",
headers={"Authorization": f"Bearer {bench_token}"},
json={"first_name": f"Bench{i}"},
)
elapsed_ms = (time.perf_counter() - start) * 1000
total_ms += elapsed_ms
assert response.status_code == 200, f"Update failed: {response.text}"
mean_ms = total_ms / iterations
print(
f"\n User update mean latency: {mean_ms:.1f}ms (threshold: {MAX_USER_UPDATE_MS}ms)"
)
assert mean_ms < MAX_USER_UPDATE_MS, (
f"User update latency regression: {mean_ms:.1f}ms exceeds {MAX_USER_UPDATE_MS}ms threshold"
)