Felipe Cardoso 0760a8284d feat(tests): add comprehensive benchmarks for auth and performance-critical endpoints

- Introduced benchmarks for password hashing, verification, and JWT token operations.
- Added latency tests for `/register`, `/refresh`, `/sessions`, and `/users/me` endpoints.
- Updated `BENCHMARKS.md` with new tests, thresholds, and execution details.

2026-03-01 17:01:44 +01:00


Performance Benchmarks Guide

Automated performance benchmarking infrastructure using pytest-benchmark to detect latency regressions in critical API endpoints.

Why Benchmark?
Quick Start
How It Works
Understanding Results
Test Organization
Writing Benchmark Tests
Baseline Management
CI/CD Integration
Troubleshooting

Why Benchmark?

Performance regressions are silent bugs — they don't break tests or cause errors, but they degrade the user experience over time. Common causes include:

Unintended N+1 queries after adding a relationship
Heavier serialization after adding new fields to a response model
Middleware overhead from new security headers or logging
Dependency upgrades that introduce slower code paths

Without automated benchmarks, these regressions go unnoticed until users complain. Performance benchmarks serve as an early warning system — they measure endpoint latency on every run and flag significant deviations from an established baseline.

What benchmarks give you

Benefit	Description
Regression detection	Automatically flags when an endpoint becomes significantly slower
Baseline tracking	Stores known-good performance numbers for comparison
Confidence in refactors	Verify that code changes don't degrade response times
Visibility	Makes performance a first-class, measurable quality attribute

Quick Start

# Run benchmarks (no comparison, just see current numbers)
make benchmark

# Save current results as the baseline
make benchmark-save

# Run benchmarks and compare against the saved baseline
make benchmark-check

How It Works

The benchmarking system has three layers:

1. pytest-benchmark integration

pytest-benchmark is a pytest plugin that provides a benchmark fixture. It handles:

Calibration: Automatically determines how many iterations to run for statistical significance
Timing: Uses time.perf_counter for high-resolution measurements
Statistics: Computes min, max, mean, median, standard deviation, IQR, and outlier detection
Comparison: Compares current results against saved baselines and flags regressions
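To make the timing and statistics steps concrete, the sketch below imitates them using only the standard library. It is not how pytest-benchmark is implemented: `time_rounds` and its fixed `rounds` argument are hypothetical names, and the real plugin chooses the number of rounds itself during calibration.

```python
import statistics
import time


def time_rounds(func, rounds=20):
    """Call func `rounds` times and summarize the per-call wall time in ms."""
    samples_ms = []
    for _ in range(rounds):
        start = time.perf_counter()
        func()
        samples_ms.append((time.perf_counter() - start) * 1000)

    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(samples_ms, n=4)
    return {
        "rounds": rounds,
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        "stddev": statistics.stdev(samples_ms),
        "iqr": q3 - q1,
    }


# Hypothetical workload standing in for an endpoint call.
stats = time_rounds(lambda: sum(range(10_000)))
print({name: round(value, 4) for name, value in stats.items()})
```

The dictionary keys mirror the columns that appear in the benchmark report, which may help when reading the output for the first time.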

2. Benchmark types

The test suite includes two categories of performance tests:

Type	How it works	Examples
pytest-benchmark tests	Uses the `benchmark` fixture for precise, multi-round timing	`test_health_endpoint_performance`, `test_openapi_schema_performance`, `test_password_hashing_performance`, `test_password_verification_performance`, `test_access_token_creation_performance`, `test_refresh_token_creation_performance`, `test_token_decode_performance`
Manual latency tests	Uses `time.perf_counter` with explicit thresholds (for async endpoints that pytest-benchmark doesn't support natively)	`test_login_latency`, `test_get_current_user_latency`, `test_register_latency`, `test_token_refresh_latency`, `test_sessions_list_latency`, `test_user_profile_update_latency`

3. Regression detection

When running make benchmark-check, the system:

Runs all benchmark tests
Compares results against the saved baseline (.benchmarks/ directory)
Fails the build if any test's mean time has increased by more than 200% over the baseline (i.e., it is more than 3× the baseline mean)

The 200% threshold in --benchmark-compare-fail=mean:200% means "fail if the mean increased by more than 200% relative to the baseline." This is deliberately generous to avoid false positives from normal run-to-run variance while still catching real regressions.
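The arithmetic behind the threshold can be sketched as follows. The function names are hypothetical, and treating a result at exactly 200% as passing is an assumption made here for illustration; the sketch only shows that "increased by more than 200%" corresponds to a mean above 3× the baseline.

```python
def percent_increase(baseline_mean, current_mean):
    """Return how much slower current is than baseline, in percent."""
    return (current_mean / baseline_mean - 1.0) * 100.0


def exceeds_threshold(baseline_mean, current_mean, threshold_pct=200.0):
    """True when the mean grew by more than threshold_pct percent."""
    return percent_increase(baseline_mean, current_mean) > threshold_pct


baseline_ms = 1.0
for current_ms in (1.5, 2.5, 3.5):
    verdict = "FAIL" if exceeds_threshold(baseline_ms, current_ms) else "ok"
    print(
        f"{current_ms} ms vs {baseline_ms} ms baseline: "
        f"+{percent_increase(baseline_ms, current_ms):.0f}% -> {verdict}"
    )
```

With a 1.0 ms baseline, only the 3.5 ms run (a 250% increase) is flagged; 2.5 ms is a 150% increase and passes.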

Understanding Results

A typical benchmark output looks like this:

--------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in ms)                       Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_health_endpoint_performance     0.9841 (1.0)      1.5513 (1.0)      1.1390 (1.0)      0.1098 (1.0)      1.1151 (1.0)      0.1672 (1.0)          39;2  877.9666 (1.0)         133           1
test_openapi_schema_performance      1.6523 (1.68)     2.0892 (1.35)     1.7843 (1.57)     0.1553 (1.41)     1.7200 (1.54)     0.1727 (1.03)          2;0  560.4471 (0.64)         10           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Column reference

Column	Meaning
Min	Fastest single execution
Max	Slowest single execution
Mean	Average across all rounds — the primary metric for regression detection
StdDev	How much results vary between rounds (lower = more stable)
Median	Middle value, less sensitive to outliers than mean
IQR	Interquartile range — spread of the middle 50% of results
Outliers	Format `A;B`: A = rounds more than 1 StdDev away from the mean, B = rounds more than 1.5 IQR beyond the 1st or 3rd quartile
OPS	Operations per second (`1 / Mean`, with Mean in seconds)
Rounds	How many times the test was executed (auto-calibrated)
Iterations	Iterations per round (usually 1 for ms-scale tests)

The ratio numbers `(1.0)`, `(1.68)`, etc.

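These show how each value compares to the best value in that column. The best value is always (1.0), and the others show their factor relative to it. For example, (1.68) in a time column means the test took 1.68× as long as the fastest one; in the OPS column, where higher is better, (0.64) means 0.64× the throughput of the best.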
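As a quick check, the ratios and the OPS figure can be recomputed from the sample output shown earlier. The `ratios` helper is a hypothetical name used only for this illustration; the numbers are copied from the table.

```python
# Values copied from the sample output above (times in ms).
rows = {
    "test_health_endpoint_performance": {"min": 0.9841, "mean": 1.1390, "ops": 877.9666},
    "test_openapi_schema_performance": {"min": 1.6523, "mean": 1.7843, "ops": 560.4471},
}


def ratios(rows, column, best=min):
    """Divide each value in a column by that column's best value."""
    reference = best(row[column] for row in rows.values())
    return {name: round(row[column] / reference, 2) for name, row in rows.items()}


print(ratios(rows, "min"))            # time columns: smaller is better
print(ratios(rows, "mean"))
print(ratios(rows, "ops", best=max))  # OPS: larger is better

# OPS is the reciprocal of the mean expressed in seconds.
health_mean_s = rows["test_health_endpoint_performance"]["mean"] / 1000
print(round(1 / health_mean_s))
```

The recomputed factors for `test_openapi_schema_performance` come out as 1.68, 1.57, and 0.64, matching the parenthesized values in the report. The OPS recomputed from the rounded mean is about 878, slightly off from the reported 877.9666 because the report uses the unrounded mean.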

Color coding

Green: The fastest (best) value in each column
Red: The slowest (worst) value in each column

This is a relative ranking within the current run — red does NOT mean the test failed or that performance is bad. It simply highlights which endpoint is the slower one in the group.

What's "normal"?

For this project's current endpoints:

Test	Expected range	Why
`GET /health`	~1–1.5ms	Minimal logic, mocked DB check
`GET /api/v1/openapi.json`	~1.5–2.5ms	Serializes entire API schema
`get_password_hash`	~200ms	CPU-bound bcrypt hashing
`verify_password`	~200ms	CPU-bound bcrypt verification
`create_access_token`	~17–20µs	JWT encoding with HMAC-SHA256
`create_refresh_token`	~17–20µs	JWT encoding with HMAC-SHA256
`decode_token`	~20–25µs	JWT decoding and claim validation
`POST /api/v1/auth/login`	< 500ms threshold	Includes bcrypt password verification
`POST /api/v1/auth/register`	< 500ms threshold	Includes bcrypt password hashing
`POST /api/v1/auth/refresh`	< 200ms threshold	Token rotation + DB session update
`GET /api/v1/users/me`	< 200ms threshold	DB lookup + token validation
`GET /api/v1/sessions/me`	< 200ms threshold	Session list query + token validation
`PATCH /api/v1/users/me`	< 200ms threshold	DB update + token validation

Test Organization

backend/tests/
├── benchmarks/
│   └── test_endpoint_performance.py   # All performance benchmark tests
│
backend/.benchmarks/                    # Saved baselines (auto-generated)
└── Linux-CPython-3.12-64bit/
    └── 0001_baseline.json             # Platform-specific baseline file

Test markers

All benchmark tests use the @pytest.mark.benchmark marker. The --benchmark-only flag ensures that only tests using the benchmark fixture are executed during benchmark runs, while manual latency tests (async) are skipped.

Writing Benchmark Tests

Stateless endpoint (using pytest-benchmark fixture)

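def test_my_endpoint_performance(sync_client, benchmark):
    """Benchmark: GET /my-endpoint should respond within acceptable latency."""
    # benchmark() calls sync_client.get("/my-endpoint") repeatedly, records the
    # timings, and returns the callable's return value.
    result = benchmark(sync_client.get, "/my-endpoint")
    # Check the response so a failing endpoint is not mistaken for a fast one.
    assert result.status_code == 200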

The benchmark fixture handles all timing, calibration, and statistics automatically. Just pass it the callable and arguments.

Async / DB-dependent endpoint (manual timing)

For async endpoints that require database access, use manual timing with an explicit threshold:

import time
import pytest

MAX_RESPONSE_MS = 300

@pytest.mark.asyncio
async def test_my_async_endpoint_latency(client, setup_fixture):
    """Performance: endpoint must respond under threshold."""
    iterations = 5
    total_ms = 0.0

    for _ in range(iterations):
        start = time.perf_counter()
        response = await client.get("/api/v1/my-endpoint")
        elapsed_ms = (time.perf_counter() - start) * 1000
        total_ms += elapsed_ms
        assert response.status_code == 200

    mean_ms = total_ms / iterations
    assert mean_ms < MAX_RESPONSE_MS, (
        f"Latency regression: {mean_ms:.1f}ms exceeds {MAX_RESPONSE_MS}ms threshold"
    )
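If several manual latency tests repeat the same loop, it could be factored into a small helper. The sketch below is one possible shape, not part of the project: `mean_latency_ms` and `fake_request` are hypothetical names, and the fake request sleeps instead of calling a real endpoint so the example runs on its own.

```python
import asyncio
import time


async def mean_latency_ms(make_request, iterations=5):
    """Await make_request() `iterations` times and return the mean latency in ms."""
    total_ms = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        await make_request()
        total_ms += (time.perf_counter() - start) * 1000
    return total_ms / iterations


async def fake_request():
    """Stand-in for `client.get(...)`; sleeps about 10 ms instead of calling an API."""
    await asyncio.sleep(0.01)


mean_ms = asyncio.run(mean_latency_ms(fake_request))
print(f"mean latency: {mean_ms:.1f} ms")
```

Inside an async test, the same helper could be awaited directly, for example `mean_ms = await mean_latency_ms(lambda: client.get("/api/v1/my-endpoint"))`, followed by the threshold assertion. Note that this variant does not check the status code of each response, so that assertion would need to be added separately.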

Guidelines for new benchmarks

Benchmark critical paths — endpoints users hit frequently or where latency matters most
Mock external dependencies for stateless tests to isolate endpoint overhead
Set generous thresholds for manual tests — account for CI variability
Keep benchmarks fast — they run on every check, so avoid heavy setup

Baseline Management

Saving a baseline

make benchmark-save

This runs all benchmarks and saves results to .benchmarks/<platform>/0001_baseline.json. The baseline captures:

Mean, min, max, median, stddev for each test
Machine info (CPU, OS, Python version)
Timestamp
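A saved baseline can be inspected with a few lines of Python. The sketch assumes the layout pytest-benchmark writes, with a top-level `benchmarks` list whose entries carry a `name` and a `stats` object holding times in seconds; verify the field names against a real file before relying on them. The embedded JSON is a trimmed, hand-written stand-in with illustrative values, and `baseline_means_ms` is a hypothetical helper.

```python
import json

# Trimmed, hand-written stand-in for a saved baseline file; values are illustrative.
SAMPLE_BASELINE = """
{
  "machine_info": {"system": "Linux", "python_version": "3.12.0"},
  "benchmarks": [
    {
      "name": "test_health_endpoint_performance",
      "stats": {"min": 0.00098, "max": 0.00155, "mean": 0.00114,
                "median": 0.00112, "stddev": 0.00011}
    }
  ]
}
"""


def baseline_means_ms(raw_json):
    """Map each benchmark name to its saved mean, converted from seconds to ms."""
    data = json.loads(raw_json)
    return {b["name"]: b["stats"]["mean"] * 1000 for b in data["benchmarks"]}


for name, mean_ms in baseline_means_ms(SAMPLE_BASELINE).items():
    print(f"{name}: {mean_ms:.2f} ms")
```

To read an actual baseline, replace `SAMPLE_BASELINE` with the contents of the JSON file under `.benchmarks/<platform>/`.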

Comparing against baseline

make benchmark-check

If no baseline exists, this command automatically creates one and prints a warning. On subsequent runs, it compares current results against the saved baseline.

When to update the baseline

After intentional performance changes (e.g., you optimized an endpoint — save the new, faster baseline)
After infrastructure changes (e.g., new CI runner, different hardware)
After adding new benchmark tests (the new tests need a baseline entry)

# Update the baseline after intentional changes
make benchmark-save

Version control

The .benchmarks/ directory can be committed to version control so that CI pipelines can compare against a known-good baseline. However, since benchmark results are machine-specific, you may prefer to generate baselines in CI rather than committing local results.

CI/CD Integration

Add benchmark checking to your CI pipeline to catch regressions on every PR:

# Example GitHub Actions step
- name: Performance regression check
  run: |
    cd backend
    make benchmark-save   # Create baseline from main branch
    # ... apply PR changes ...
    make benchmark-check  # Compare PR against baseline

A more robust approach:

Save the baseline on the main branch after each merge
On PR branches, run make benchmark-check against the main baseline
The pipeline fails if any endpoint regresses beyond the 200% threshold

Troubleshooting

"No benchmark baseline found" warning

⚠️  No benchmark baseline found. Run 'make benchmark-save' first to create one.

This means no baseline file exists yet. The command will auto-create one. Future runs of make benchmark-check will compare against it.

Machine info mismatch warning

WARNING: benchmark machine_info is different

This is expected when comparing baselines generated on a different machine or OS. The comparison still works, but absolute numbers may differ. Re-save the baseline on the current machine if needed.

High variance (large StdDev)

If StdDev is high relative to the Mean, results may be unreliable. Common causes:

System under load during benchmark run
Garbage collection interference
Thermal throttling

Try running benchmarks on an idle system or increasing min_rounds in pyproject.toml.
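One way to judge whether StdDev is "high relative to the Mean" is the coefficient of variation (StdDev divided by Mean). The sketch below applies it to the sample output from "Understanding Results"; the 25% cut-off is an assumption chosen for illustration, not a project setting, and `coefficient_of_variation` is a hypothetical helper.

```python
def coefficient_of_variation(mean, stddev):
    """Return StdDev as a percentage of Mean."""
    return stddev / mean * 100


# Mean and StdDev (ms) copied from the sample benchmark output.
samples = {
    "test_health_endpoint_performance": (1.1390, 0.1098),
    "test_openapi_schema_performance": (1.7843, 0.1553),
}

NOISY_ABOVE_PCT = 25  # assumed cut-off; tune for your environment

for name, (mean_ms, stddev_ms) in samples.items():
    cv = coefficient_of_variation(mean_ms, stddev_ms)
    label = "noisy" if cv > NOISY_ABOVE_PCT else "stable"
    print(f"{name}: CV = {cv:.1f}% ({label})")
```

Both sample tests land below 10%, which suggests the run they came from was reasonably stable.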

Only 7 of 13 tests run

The async tests (test_login_latency, test_get_current_user_latency, test_register_latency, test_token_refresh_latency, test_sessions_list_latency, test_user_profile_update_latency) are skipped during --benchmark-only runs because they don't use the benchmark fixture. They run as part of the normal test suite (make test) with manual threshold assertions.
