feat(tests): add comprehensive benchmarks for auth and performance-critical endpoints

- Introduced benchmarks for password hashing, verification, and JWT token operations.
- Added latency tests for `/register`, `/refresh`, `/sessions`, and `/users/me` endpoints.
- Updated `BENCHMARKS.md` with new tests, thresholds, and execution details.
feat(backend): enhance performance benchmarking with baseline detection and documentation
2026-03-01 17:01:44 +01:00 · 2026-03-01 16:30:06 +01:00
5 changed files with 502 additions and 5 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -187,7 +187,7 @@ coverage.xml
 .hypothesis/
 .pytest_cache/
 cover/
-
+backend/.benchmarks
 # Translations
 *.mo
 *.pot
--- a/backend/Makefile
+++ b/backend/Makefile
@@ -186,8 +186,15 @@ benchmark-save:

 benchmark-check:
 	@echo "⏱️  Running benchmarks and comparing against baseline..."
-	@IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts='
-	@echo "✅ No performance regressions detected!"
+	@if find .benchmarks -name '*_baseline*' -print -quit 2>/dev/null | grep -q .; then \
+		IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-compare=0001_baseline --benchmark-sort=mean --benchmark-compare-fail=mean:200% -p no:xdist --override-ini='addopts=' && \
+		echo "✅ No performance regressions detected!"; \
+	else \
+		echo "⚠️  No benchmark baseline found. Run 'make benchmark-save' first to create one."; \
+		echo "   Running benchmarks without comparison..."; \
+		IS_TEST=True PYTHONPATH=. uv run pytest tests/benchmarks/ -v --benchmark-only --benchmark-save=baseline --benchmark-sort=mean -p no:xdist --override-ini='addopts=' && \
+		echo "✅ Benchmark baseline created. Future runs of 'make benchmark-check' will compare against it."; \
+	fi

 test-all:
 	@echo "🧪 Running ALL tests (unit + E2E)..."
--- a/backend/docs/ARCHITECTURE.md
+++ b/backend/docs/ARCHITECTURE.md
@@ -1169,6 +1169,8 @@ app.add_middleware(

 ## Performance Considerations

+> 📖 For the full benchmarking guide (how to run, read results, write new benchmarks, and manage baselines), see **[BENCHMARKS.md](BENCHMARKS.md)**.
+
 ### Database Connection Pooling

 - Pool size: 20 connections
--- a/backend/docs/BENCHMARKS.md
+++ b/backend/docs/BENCHMARKS.md
@@ -0,0 +1,311 @@
+# Performance Benchmarks Guide
+
+Automated performance benchmarking infrastructure using **pytest-benchmark** to detect latency regressions in critical API endpoints.
+
+## Table of Contents
+
+- [Why Benchmark?](#why-benchmark)
+- [Quick Start](#quick-start)
+- [How It Works](#how-it-works)
+- [Understanding Results](#understanding-results)
+- [Test Organization](#test-organization)
+- [Writing Benchmark Tests](#writing-benchmark-tests)
+- [Baseline Management](#baseline-management)
+- [CI/CD Integration](#cicd-integration)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Why Benchmark?
+
+Performance regressions are silent bugs — they don't break tests or cause errors, but they degrade the user experience over time. Common causes include:
+
+- **Unintended N+1 queries** after adding a relationship
+- **Heavier serialization** after adding new fields to a response model
+- **Middleware overhead** from new security headers or logging
+- **Dependency upgrades** that introduce slower code paths
+
+Without automated benchmarks, these regressions go unnoticed until users complain. Performance benchmarks serve as an **early warning system** — they measure endpoint latency on every run and flag significant deviations from an established baseline.
+
+### What benchmarks give you
+
+| Benefit | Description |
+|---------|-------------|
+| **Regression detection** | Automatically flags when an endpoint becomes significantly slower |
+| **Baseline tracking** | Stores known-good performance numbers for comparison |
+| **Confidence in refactors** | Verify that code changes don't degrade response times |
+| **Visibility** | Makes performance a first-class, measurable quality attribute |
+
+---
+
+## Quick Start
+
+```bash
+# Run benchmarks (no comparison, just see current numbers)
+make benchmark
+
+# Save current results as the baseline
+make benchmark-save
+
+# Run benchmarks and compare against the saved baseline
+make benchmark-check
+```
+
+---
+
+## How It Works
+
+The benchmarking system has three layers:
+
+### 1. pytest-benchmark integration
+
+[pytest-benchmark](https://pytest-benchmark.readthedocs.io/) is a pytest plugin that provides a `benchmark` fixture. It handles:
+
+- **Calibration**: Automatically determines how many iterations to run for statistical significance
+- **Timing**: Uses `time.perf_counter` for high-resolution measurements
+- **Statistics**: Computes min, max, mean, median, standard deviation, IQR, and outlier detection
+- **Comparison**: Compares current results against saved baselines and flags regressions
+
+### 2. Benchmark types
+
+The test suite includes two categories of performance tests:
+
+| Type | How it works | Examples |
+|------|-------------|----------|
+| **pytest-benchmark tests** | Uses the `benchmark` fixture for precise, multi-round timing | `test_health_endpoint_performance`, `test_openapi_schema_performance`, `test_password_hashing_performance`, `test_password_verification_performance`, `test_access_token_creation_performance`, `test_refresh_token_creation_performance`, `test_token_decode_performance` |
+| **Manual latency tests** | Uses `time.perf_counter` with explicit thresholds (for async endpoints that pytest-benchmark doesn't support natively) | `test_login_latency`, `test_get_current_user_latency`, `test_register_latency`, `test_token_refresh_latency`, `test_sessions_list_latency`, `test_user_profile_update_latency` |
+
+### 3. Regression detection
+
+When running `make benchmark-check`, the system:
+
+1. Runs all benchmark tests
+2. Compares results against the saved baseline (`.benchmarks/` directory)
+3. **Fails the build** if any test's mean time increases by more than **200%** over the baseline (i.e., the new mean is more than 3× the baseline mean)
+
+The `200%` threshold in `--benchmark-compare-fail=mean:200%` means "fail if the mean increased by more than 200% relative to the baseline." This is deliberately generous to avoid false positives from normal run-to-run variance while still catching real regressions.
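
To make the threshold arithmetic concrete, here is a minimal sketch of the rule as described above. The function name `exceeds_threshold` and its parameters are illustrative and not part of pytest-benchmark; the sketch only mirrors the documented meaning of `mean:200%`.

```python
def exceeds_threshold(
    baseline_mean: float,
    current_mean: float,
    max_increase_pct: float = 200.0,
) -> bool:
    """Return True when the mean grew by more than max_increase_pct percent."""
    increase_pct = (current_mean - baseline_mean) / baseline_mean * 100.0
    return increase_pct > max_increase_pct


# With a 1.0 ms baseline, anything up to 3.0 ms is tolerated.
print(exceeds_threshold(1.0, 2.9))  # within the limit
print(exceeds_threshold(1.0, 3.5))  # flagged as a regression
```

A mean that exactly triples is a 200% increase and is therefore not flagged; only values beyond that are.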
+
+---
+
+## Understanding Results
+
+A typical benchmark output looks like this:
+
+```
+--------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
+Name (time in ms)                       Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+test_health_endpoint_performance     0.9841 (1.0)      1.5513 (1.0)      1.1390 (1.0)      0.1098 (1.0)      1.1151 (1.0)      0.1672 (1.0)          39;2  877.9666 (1.0)         133           1
+test_openapi_schema_performance      1.6523 (1.68)     2.0892 (1.35)     1.7843 (1.57)     0.1553 (1.41)     1.7200 (1.54)     0.1727 (1.03)          2;0  560.4471 (0.64)         10           1
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
+
+### Column reference
+
+| Column | Meaning |
+|--------|---------|
+| **Min** | Fastest single execution |
+| **Max** | Slowest single execution |
+| **Mean** | Average across all rounds — the primary metric for regression detection |
+| **StdDev** | How much results vary between rounds (lower = more stable) |
+| **Median** | Middle value, less sensitive to outliers than mean |
+| **IQR** | Interquartile range — spread of the middle 50% of results |
+| **Outliers** | Format `A;B`: A = rounds more than 1 StdDev from the mean, B = rounds more than 1.5 IQR beyond the 1st or 3rd quartile |
+| **OPS** | Operations per second (`1 / Mean`) |
+| **Rounds** | How many times the test was executed (auto-calibrated) |
+| **Iterations** | Iterations per round (usually 1 for ms-scale tests) |
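
As a rough illustration of how these columns relate to the raw per-round timings, the sketch below computes them with the standard library. The helper name `summarize` and the sample timings are made up for this example, and pytest-benchmark's own quartile calculation may differ slightly on small samples.

```python
import statistics


def summarize(timings_ms: list[float]) -> dict[str, float]:
    """Summarize per-round timings (in milliseconds)."""
    mean = statistics.mean(timings_ms)
    stddev = statistics.stdev(timings_ms)
    # Quartiles from the standard library (an approximation of the plugin's method).
    q1, _, q3 = statistics.quantiles(timings_ms, n=4)
    iqr = q3 - q1
    return {
        "min": min(timings_ms),
        "max": max(timings_ms),
        "mean": mean,
        "stddev": stddev,
        "median": statistics.median(timings_ms),
        "iqr": iqr,
        # Outliers A: rounds more than one standard deviation from the mean.
        "outliers_stddev": sum(abs(t - mean) > stddev for t in timings_ms),
        # Outliers B: rounds more than 1.5 * IQR outside the quartiles.
        "outliers_iqr": sum(
            t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr for t in timings_ms
        ),
        # The mean is in milliseconds, so convert to operations per second.
        "ops": 1000.0 / mean,
        "rounds": len(timings_ms),
    }


stats = summarize([1.0, 1.1, 1.2, 1.3, 5.0])
print(f"mean={stats['mean']:.2f}ms median={stats['median']:.2f}ms ops={stats['ops']:.1f}")
```

The single slow round pulls the mean well above the median, which is why a large gap between the two is worth a second look.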
+
+### The ratio numbers `(1.0)`, `(1.68)`, etc.
+
+These show how each test compares **to the best result in that column**. The best value is always `(1.0)`: the lowest value in the time columns and the highest in the OPS column. In a time column, `(1.68)` means "1.68× slower than the fastest"; in the OPS column, `(0.64)` means 64% of the best throughput.
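
The ratios in the sample output can be reproduced by dividing each value by the best one in its column. This is a small sketch; `relative_to_best` is a hypothetical helper, and the numbers are copied from the table above. For OPS, higher is better, so the best value is the maximum.

```python
def relative_to_best(values: list[float], higher_is_better: bool = False) -> list[float]:
    """Express each value as a factor of the best value in the column."""
    best = max(values) if higher_is_better else min(values)
    return [round(v / best, 2) for v in values]


mean_ms = [1.1390, 1.7843]   # Mean column from the sample output
ops = [877.9666, 560.4471]   # OPS column from the sample output

print(relative_to_best(mean_ms))
print(relative_to_best(ops, higher_is_better=True))
```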
+
+### Color coding
+
+- **Green**: The fastest (best) value in each column
+- **Red**: The slowest (worst) value in each column
+
+This is a **relative ranking within the current run** — red does NOT mean the test failed or that performance is bad. It simply highlights which endpoint is the slower one in the group.
+
+### What's "normal"?
+
+For this project's current endpoints:
+
+| Test | Expected range | Why |
+|------|---------------|-----|
+| `GET /health` | ~1–1.5ms | Minimal logic, mocked DB check |
+| `GET /api/v1/openapi.json` | ~1.5–2.5ms | Serializes entire API schema |
+| `get_password_hash` | ~200ms | CPU-bound bcrypt hashing |
+| `verify_password` | ~200ms | CPU-bound bcrypt verification |
+| `create_access_token` | ~17–20µs | JWT encoding with HMAC-SHA256 |
+| `create_refresh_token` | ~17–20µs | JWT encoding with HMAC-SHA256 |
+| `decode_token` | ~20–25µs | JWT decoding and claim validation |
+| `POST /api/v1/auth/login` | < 500ms threshold | Includes bcrypt password verification |
+| `POST /api/v1/auth/register` | < 500ms threshold | Includes bcrypt password hashing |
+| `POST /api/v1/auth/refresh` | < 200ms threshold | Token rotation + DB session update |
+| `GET /api/v1/users/me` | < 200ms threshold | DB lookup + token validation |
+| `GET /api/v1/sessions/me` | < 200ms threshold | Session list query + token validation |
+| `PATCH /api/v1/users/me` | < 200ms threshold | DB update + token validation |
+
+---
+
+## Test Organization
+
+```
+backend/tests/
+├── benchmarks/
+│   └── test_endpoint_performance.py   # All performance benchmark tests
+│
+backend/.benchmarks/                    # Saved baselines (auto-generated)
+└── Linux-CPython-3.12-64bit/
+    └── 0001_baseline.json             # Platform-specific baseline file
+```
+
+### Test markers
+
+All benchmark tests carry the `benchmark` marker, applied module-wide via `pytestmark = [pytest.mark.benchmark]`. The `--benchmark-only` flag restricts benchmark runs to tests that use the `benchmark` fixture, so the async manual latency tests are skipped there.
+
+---
+
+## Writing Benchmark Tests
+
+### Stateless endpoint (using pytest-benchmark fixture)
+
+```python
+import pytest
+from fastapi.testclient import TestClient
+
+def test_my_endpoint_performance(sync_client, benchmark):
+    """Benchmark: GET /my-endpoint should respond within acceptable latency."""
+    result = benchmark(sync_client.get, "/my-endpoint")
+    assert result.status_code == 200
+```
+
+The `benchmark` fixture handles all timing, calibration, and statistics automatically. Just pass it the callable and arguments.
+
+### Async / DB-dependent endpoint (manual timing)
+
+For async endpoints that require database access, use manual timing with an explicit threshold:
+
+```python
+import time
+import pytest
+
+MAX_RESPONSE_MS = 300
+
+@pytest.mark.asyncio
+async def test_my_async_endpoint_latency(client, setup_fixture):
+    """Performance: endpoint must respond under threshold."""
+    iterations = 5
+    total_ms = 0.0
+
+    for _ in range(iterations):
+        start = time.perf_counter()
+        response = await client.get("/api/v1/my-endpoint")
+        elapsed_ms = (time.perf_counter() - start) * 1000
+        total_ms += elapsed_ms
+        assert response.status_code == 200
+
+    mean_ms = total_ms / iterations
+    assert mean_ms < MAX_RESPONSE_MS, (
+        f"Latency regression: {mean_ms:.1f}ms exceeds {MAX_RESPONSE_MS}ms threshold"
+    )
+```
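
Since the manual timing loop is repeated in every async latency test, it could be factored into a small helper. The sketch below is one possible shape, not code from this PR: `mean_latency_ms` and its `timer` parameter are hypothetical names, and the demo uses `asyncio.sleep` as a stand-in for a real request.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def mean_latency_ms(
    call: Callable[[], Awaitable[object]],
    iterations: int = 5,
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Await `call` repeatedly and return the mean wall-clock latency in ms."""
    total_ms = 0.0
    for _ in range(iterations):
        start = timer()
        await call()
        total_ms += (timer() - start) * 1000
    return total_ms / iterations


async def _demo() -> float:
    # Stand-in for `client.get(...)`: sleeps briefly instead of issuing a request.
    async def fake_request() -> None:
        await asyncio.sleep(0.001)

    return await mean_latency_ms(fake_request, iterations=3)


print(f"mean latency: {asyncio.run(_demo()):.2f}ms")
```

In a test this might be called as `mean_ms = await mean_latency_ms(lambda: client.get("/api/v1/my-endpoint"))`, followed by the threshold assertion. The helper does not check status codes, so a test that needs them would wrap the request in its own coroutine.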
+
+### Guidelines for new benchmarks
+
+1. **Benchmark critical paths** — endpoints users hit frequently or where latency matters most
+2. **Mock external dependencies** for stateless tests to isolate endpoint overhead
+3. **Set generous thresholds** for manual tests — account for CI variability
+4. **Keep benchmarks fast** — they run on every check, so avoid heavy setup
+
+---
+
+## Baseline Management
+
+### Saving a baseline
+
+```bash
+make benchmark-save
+```
+
+This runs all benchmarks and saves results to `.benchmarks/<platform>/0001_baseline.json`. The baseline captures:
+- Mean, min, max, median, stddev for each test
+- Machine info (CPU, OS, Python version)
+- Timestamp
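
To inspect a saved baseline outside of pytest, the JSON file can be read directly. The sketch below assumes the file follows pytest-benchmark's layout with a top-level `benchmarks` list whose entries carry `name` and a `stats` object holding `mean` in seconds; `load_baseline_means` is a hypothetical helper, and the demo writes a tiny hand-made file rather than reading a real baseline.

```python
import json
import tempfile
from pathlib import Path


def load_baseline_means(path: Path) -> dict[str, float]:
    """Map each benchmark name to its saved mean (assumed to be in seconds)."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return {entry["name"]: entry["stats"]["mean"] for entry in data["benchmarks"]}


# Demo with a minimal file in the assumed layout.
sample = {
    "benchmarks": [
        {"name": "test_health_endpoint_performance", "stats": {"mean": 0.001139}},
        {"name": "test_openapi_schema_performance", "stats": {"mean": 0.0017843}},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "0001_baseline.json"
    baseline_path.write_text(json.dumps(sample), encoding="utf-8")
    means = load_baseline_means(baseline_path)

for name, mean_s in means.items():
    print(f"{name}: {mean_s * 1000:.3f} ms")
```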
+
+### Comparing against baseline
+
+```bash
+make benchmark-check
+```
+
+If no baseline exists, this command automatically creates one and prints a warning. On subsequent runs, it compares current results against the saved baseline.
+
+### When to update the baseline
+
+- **After intentional performance changes** (e.g., you optimized an endpoint — save the new, faster baseline)
+- **After infrastructure changes** (e.g., new CI runner, different hardware)
+- **After adding new benchmark tests** (the new tests need a baseline entry)
+
+```bash
+# Update the baseline after intentional changes
+make benchmark-save
+```
+
+### Version control
+
+The repository's `.gitignore` lists `backend/.benchmarks`, so saved baselines are not committed by default. To have CI compare against a known-good baseline, either remove that ignore entry and commit the directory, or generate the baseline in CI. Because benchmark results are machine-specific, generating baselines in CI is usually the more reliable option.
+
+---
+
+## CI/CD Integration
+
+Add benchmark checking to your CI pipeline to catch regressions on every PR:
+
+```yaml
+# Example GitHub Actions step
+- name: Performance regression check
+  run: |
+    cd backend
+    make benchmark-save   # Create baseline from main branch
+    # ... apply PR changes ...
+    make benchmark-check  # Compare PR against baseline
+```
+
+A more robust approach:
+1. Save the baseline on the `main` branch after each merge
+2. On PR branches, run `make benchmark-check` against the `main` baseline
+3. The pipeline fails if any endpoint regresses beyond the 200% threshold
+
+---
+
+## Troubleshooting
+
+### "No benchmark baseline found" warning
+
+```
+⚠️  No benchmark baseline found. Run 'make benchmark-save' first to create one.
+```
+
+This means no baseline file exists yet. The command will auto-create one. Future runs of `make benchmark-check` will compare against it.
+
+### Machine info mismatch warning
+
+```
+WARNING: benchmark machine_info is different
+```
+
+This is expected when comparing baselines generated on a different machine or OS. The comparison still works, but absolute numbers may differ. Re-save the baseline on the current machine if needed.
+
+### High variance (large StdDev)
+
+If StdDev is high relative to the Mean, results may be unreliable. Common causes:
+- System under load during benchmark run
+- Garbage collection interference
+- Thermal throttling
+
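+Try running benchmarks on an idle system or raising the minimum number of rounds, for example with pytest-benchmark's `--benchmark-min-rounds` option or `min_rounds` on the `@pytest.mark.benchmark` marker. Note that the `make` targets pass `--override-ini='addopts='`, so options placed in `addopts` in `pyproject.toml` do not apply to them.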
+
+### Only 7 of 13 tests run
+
+The async tests (`test_login_latency`, `test_get_current_user_latency`, `test_register_latency`, `test_token_refresh_latency`, `test_sessions_list_latency`, `test_user_profile_update_latency`) are skipped during `--benchmark-only` runs because they don't use the `benchmark` fixture. They run as part of the normal test suite (`make test`) with manual threshold assertions.
--- a/backend/tests/benchmarks/test_endpoint_performance.py
+++ b/backend/tests/benchmarks/test_endpoint_performance.py
@@ -2,7 +2,7 @@
 Performance Benchmark Tests.

 These tests establish baseline performance metrics for critical API endpoints
-and detect regressions when response times degrade significantly.
+and core operations, detecting regressions when response times degrade.

 Usage:
    make benchmark          # Run benchmarks and save baseline
@@ -20,10 +20,21 @@ import pytest
 import pytest_asyncio
 from fastapi.testclient import TestClient

+from app.core.auth import (
+    create_access_token,
+    create_refresh_token,
+    decode_token,
+    get_password_hash,
+    verify_password,
+)
 from app.main import app

 pytestmark = [pytest.mark.benchmark]

+# Pre-computed hash for sync benchmarks (avoids hashing in every iteration)
+_BENCH_PASSWORD = "BenchPass123!"
+_BENCH_HASH = get_password_hash(_BENCH_PASSWORD)
+

 # =============================================================================
 # Fixtures
@@ -55,6 +66,50 @@ def test_openapi_schema_performance(sync_client, benchmark):
    assert result.status_code == 200


+# =============================================================================
+# Core Crypto & Token Benchmarks (no DB required)
+#
+# These benchmark the CPU-intensive operations that underpin auth:
+# password hashing, verification, and JWT creation/decoding.
+# =============================================================================
+
+
+def test_password_hashing_performance(benchmark):
+    """Benchmark: bcrypt password hashing (CPU-bound; see BENCHMARKS.md for the expected range)."""
+    result = benchmark(get_password_hash, _BENCH_PASSWORD)
+    assert result.startswith("$2b$")
+
+
+def test_password_verification_performance(benchmark):
+    """Benchmark: bcrypt password verification against a known hash."""
+    result = benchmark(verify_password, _BENCH_PASSWORD, _BENCH_HASH)
+    assert result is True
+
+
+def test_access_token_creation_performance(benchmark):
+    """Benchmark: JWT access token generation."""
+    user_id = str(uuid.uuid4())
+    token = benchmark(create_access_token, user_id)
+    assert isinstance(token, str)
+    assert len(token) > 0
+
+
+def test_refresh_token_creation_performance(benchmark):
+    """Benchmark: JWT refresh token generation."""
+    user_id = str(uuid.uuid4())
+    token = benchmark(create_refresh_token, user_id)
+    assert isinstance(token, str)
+    assert len(token) > 0
+
+
+def test_token_decode_performance(benchmark):
+    """Benchmark: JWT token decoding and validation."""
+    user_id = str(uuid.uuid4())
+    token = create_access_token(user_id)
+    payload = benchmark(decode_token, token, "access")
+    assert payload.sub == user_id
+
+
 # =============================================================================
 # Database-dependent Endpoint Benchmarks (async, manual timing)
 #
@@ -65,12 +120,15 @@ def test_openapi_schema_performance(sync_client, benchmark):

 MAX_LOGIN_MS = 500
 MAX_GET_USER_MS = 200
+MAX_REGISTER_MS = 500
+MAX_TOKEN_REFRESH_MS = 200
+MAX_SESSIONS_LIST_MS = 200
+MAX_USER_UPDATE_MS = 200


@pytest_asyncio.fixture
 async def bench_user(async_test_db):
    """Create a test user for benchmark tests."""
-    from app.core.auth import get_password_hash
    from app.models.user import User

    _test_engine, AsyncTestingSessionLocal = async_test_db
@@ -102,6 +160,17 @@ async def bench_token(client, bench_user):
    return response.json()["access_token"]


+@pytest_asyncio.fixture
+async def bench_refresh_token(client, bench_user):
+    """Get a refresh token for the benchmark user."""
+    response = await client.post(
+        "/api/v1/auth/login",
+        json={"email": "bench@example.com", "password": "BenchPass123!"},
+    )
+    assert response.status_code == 200, f"Login failed: {response.text}"
+    return response.json()["refresh_token"]
+
+
@pytest.mark.asyncio
 async def test_login_latency(client, bench_user):
    """Performance: POST /api/v1/auth/login must respond under threshold."""
@@ -148,3 +217,111 @@ async def test_get_current_user_latency(client, bench_token):
    assert mean_ms < MAX_GET_USER_MS, (
        f"Get user latency regression: {mean_ms:.1f}ms exceeds {MAX_GET_USER_MS}ms threshold"
    )
+
+
+@pytest.mark.asyncio
+async def test_register_latency(client):
+    """Performance: POST /api/v1/auth/register must respond under threshold."""
+    iterations = 3
+    total_ms = 0.0
+
+    for i in range(iterations):
+        start = time.perf_counter()
+        response = await client.post(
+            "/api/v1/auth/register",
+            json={
+                "email": f"benchreg{i}@example.com",
+                "password": "BenchRegPass123!",
+                "first_name": "Bench",
+                "last_name": "Register",
+            },
+        )
+        elapsed_ms = (time.perf_counter() - start) * 1000
+        total_ms += elapsed_ms
+        assert response.status_code == 201, f"Register failed: {response.text}"
+
+    mean_ms = total_ms / iterations
+    print(
+        f"\n  Register mean latency: {mean_ms:.1f}ms (threshold: {MAX_REGISTER_MS}ms)"
+    )
+    assert mean_ms < MAX_REGISTER_MS, (
+        f"Register latency regression: {mean_ms:.1f}ms exceeds {MAX_REGISTER_MS}ms threshold"
+    )
+
+
+@pytest.mark.asyncio
+async def test_token_refresh_latency(client, bench_refresh_token):
+    """Performance: POST /api/v1/auth/refresh must respond under threshold."""
+    iterations = 5
+    total_ms = 0.0
+
+    for _ in range(iterations):
+        start = time.perf_counter()
+        response = await client.post(
+            "/api/v1/auth/refresh",
+            json={"refresh_token": bench_refresh_token},
+        )
+        elapsed_ms = (time.perf_counter() - start) * 1000
+        total_ms += elapsed_ms
+        assert response.status_code == 200, f"Refresh failed: {response.text}"
+        # Use the new refresh token for the next iteration
+        bench_refresh_token = response.json()["refresh_token"]
+
+    mean_ms = total_ms / iterations
+    print(
+        f"\n  Token refresh mean latency: {mean_ms:.1f}ms (threshold: {MAX_TOKEN_REFRESH_MS}ms)"
+    )
+    assert mean_ms < MAX_TOKEN_REFRESH_MS, (
+        f"Token refresh latency regression: {mean_ms:.1f}ms exceeds {MAX_TOKEN_REFRESH_MS}ms threshold"
+    )
+
+
+@pytest.mark.asyncio
+async def test_sessions_list_latency(client, bench_token):
+    """Performance: GET /api/v1/sessions/me must respond under threshold."""
+    iterations = 10
+    total_ms = 0.0
+
+    for _ in range(iterations):
+        start = time.perf_counter()
+        response = await client.get(
+            "/api/v1/sessions/me",
+            headers={"Authorization": f"Bearer {bench_token}"},
+        )
+        elapsed_ms = (time.perf_counter() - start) * 1000
+        total_ms += elapsed_ms
+        assert response.status_code == 200
+
+    mean_ms = total_ms / iterations
+    print(
+        f"\n  Sessions list mean latency: {mean_ms:.1f}ms (threshold: {MAX_SESSIONS_LIST_MS}ms)"
+    )
+    assert mean_ms < MAX_SESSIONS_LIST_MS, (
+        f"Sessions list latency regression: {mean_ms:.1f}ms exceeds {MAX_SESSIONS_LIST_MS}ms threshold"
+    )
+
+
+@pytest.mark.asyncio
+async def test_user_profile_update_latency(client, bench_token):
+    """Performance: PATCH /api/v1/users/me must respond under threshold."""
+    iterations = 5
+    total_ms = 0.0
+
+    for i in range(iterations):
+        start = time.perf_counter()
+        response = await client.patch(
+            "/api/v1/users/me",
+            headers={"Authorization": f"Bearer {bench_token}"},
+            json={"first_name": f"Bench{i}"},
+        )
+        elapsed_ms = (time.perf_counter() - start) * 1000
+        total_ms += elapsed_ms
+        assert response.status_code == 200, f"Update failed: {response.text}"
+
+    mean_ms = total_ms / iterations
+    print(
+        f"\n  User update mean latency: {mean_ms:.1f}ms (threshold: {MAX_USER_UPDATE_MS}ms)"
+    )
+    assert mean_ms < MAX_USER_UPDATE_MS, (
+        f"User update latency regression: {mean_ms:.1f}ms exceeds {MAX_USER_UPDATE_MS}ms threshold"
+    )