- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test and batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
Both run-baseline.sh and run-suite.sh now support:
- --max-size GB: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: shows usage with examples
Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization run: benchmark baseline (no filters, all models + longctx)
Also: 4 new BATS tests for flag parsing (98 total, all passing)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bugs fixed in production code:
- compare.sh: Python truthiness on 0.0: `if b_val` was False for 0.0 t/s,
  so the value was displayed as a dash instead of "0.0". Fixed with
  `is not None` checks.
- compare.sh: ZeroDivisionError when computing delta % with 0.0 baseline.
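
A minimal Python sketch of the two compare.sh fixes above. The helper names
`format_rate` and `format_delta` are hypothetical (not taken from compare.sh),
and rendering a dash for a zero baseline and an unsigned "0.0%" for no change
are assumptions about the output format, not confirmed behavior.

```python
from typing import Optional


def format_rate(value: Optional[float]) -> str:
    """Render a t/s value; only a missing value becomes a dash."""
    # `if value:` would treat 0.0 as missing, so compare against None instead.
    return f"{value:.1f}" if value is not None else "-"


def format_delta(baseline: Optional[float], current: Optional[float]) -> str:
    """Render the percentage change from baseline to current."""
    if baseline is None or current is None:
        return "-"
    if baseline == 0:
        # Assumption: with a 0.0 baseline the delta is undefined, so show a
        # dash rather than dividing by zero.
        return "-"
    pct = (current - baseline) / baseline * 100
    if pct == 0:
        # Assumption: an unchanged value is shown without a sign.
        return "0.0%"
    return f"{pct:+.1f}%"
```

With these helpers, a 0.0 t/s measurement is rendered as "0.0" instead of a
dash, and a 0.0 baseline no longer raises ZeroDivisionError.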
Test improvements (review findings):
- detect.bats: kernel param tests now use the real detect_kernel_param logic
  pattern (not a separate reimplementation). Added tests for non-GiB-aligned
  RAM, device IDs without the 0x prefix, an empty firmware version,
  llama-bench detection, and detect_total_physical_ram_kb.
- benchmark_compare.bats: assert delta percentages (+20.0%, -25.0%, 0.0%),
  test the 0.0 t/s edge case, test per-directory error messages, and test
  config change detection with specific field assertions.
- log_metrics.bats: add assert_success, a --help test, and timestamp format
  validation. Remove unused mock sysfs setup.
- common.bats: fix data_dir test, remove redundant assertion, add cleanup.
- test_helper.sh: remove unused FIXTURES_DIR.
- Remove empty tests/fixtures/ directory.
94 tests, all passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>