Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide
Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
(ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
(EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
recommendations for agentic use
Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bugs fixed in production code:
- compare.sh: Python truthiness on 0.0 — `if b_val` was False for 0.0 t/s,
displaying it as a dash instead of "0.0". Fixed with `is not None` checks.
- compare.sh: ZeroDivisionError when computing delta % with 0.0 baseline.
Test improvements (review findings):
- detect.bats: kernel param tests now use real detect_kernel_param logic
pattern (not a separate reimplementation). Added non-GiB-aligned RAM test,
device ID without 0x prefix, empty firmware version, llama-bench detection,
detect_total_physical_ram_kb tests.
- benchmark_compare.bats: assert delta percentages (+20.0%, -25.0%, 0.0%),
test 0.0 t/s edge case, test per-directory error messages, test config
change detection with specific field assertions.
- log_metrics.bats: add assert_success, --help test, timestamp format
validation. Remove unused mock sysfs setup.
- common.bats: fix data_dir test, remove redundant assertion, add cleanup.
- test_helper.sh: remove unused FIXTURES_DIR.
- Remove empty tests/fixtures/ directory.
94 tests, all passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- architecture.md: fix kernel param math to match actual computed values,
use cardN placeholder in sysfs paths, clarify system_ram_kb is OS-visible
- benchmarking.md: normalize flags to -ngl 99 / -mmp 0 (matching code),
add llama-rocm7-nightlies backend
- CLAUDE.md: clarify HSA_OVERRIDE_GFX_VERSION is set in containers not
scripts, fix lib sourcing description, specify which scripts need root
- detect.sh: document detect_cpu_cores returns threads not cores
- troubleshooting.md: add link to references.md
- README.md: remove unsupported Fedora 42 claim, describe configs/ content
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>