Fixes and improvements:
- Fix missing BATCH_ARGS in the long-context commands of both benchmark
  scripts (sketch below)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to the serve script, mapping to --reasoning-budget 0
  (sketch below)
- Sanitize model names when creating eval run directories (sketch below)
- Simplify agentic setup to use requirements.txt
- Add a serve --help test and batch-flag assertions to the existing tests
- Add requirements.txt for reproducible venv setup on Python 3.13
  (sketch below)
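
For the BATCH_ARGS fix above, a minimal sketch of the intended behavior,
assuming the benchmark scripts wrap llama-bench; the function name and
flag values are illustrative, not the actual script code:

    run_long_context() {
      # Long-context runs now forward the user's batch flags; previously
      # $BATCH_ARGS was dropped here, so -b/--batch had no effect.
      # $BATCH_ARGS is left unquoted on purpose so it splits into flags.
      llama-bench -m "$MODEL" -p 32768 $BATCH_ARGS
    }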
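
The --no-think flag is a thin alias for the server flag named above; a
sketch of the mapping, with the serve script's actual argument handling
assumed:

    # Hypothetical excerpt from the serve script
    ARGS=()
    for arg in "$@"; do
      case "$arg" in
        --no-think) ARGS+=(--reasoning-budget 0) ;;  # disable thinking
        *)          ARGS+=("$arg") ;;
      esac
    done
    exec llama-server -m "$MODEL" "${ARGS[@]}"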
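
Model name sanitization is along these lines; the exact allowed character
set and run-directory layout are assumptions:

    # Hypothetical helper: model tags like "qwen3.5:35b-a3b" contain
    # characters that are awkward in paths, so map anything outside a
    # safe set to "_". printf '%s' avoids a trailing newline.
    sanitize() { printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'; }
    RUN_DIR="runs/$(sanitize "$MODEL")"   # e.g. runs/qwen3.5_35b-a3b
    mkdir -p "$RUN_DIR"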
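
With requirements.txt in place, rebuilding the venv is the usual two
commands (the .venv path matches the CLAUDE.md fix above):

    python3.13 -m venv .venv
    .venv/bin/pip install -r requirements.txt
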
Models:
- configs/models.conf: model catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), and Qwen3-Coder-30B-A3B (agentic/coding); example
  entries below
- Updated benchmark setup to show the model catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide
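
The catalog entries look roughly like this; the model names are the ones
added in this commit, but the field layout is an assumption:

    # Hypothetical configs/models.conf entries (name|arch|role),
    # read by the benchmark and serve scripts
    Qwen3.5-35B-A3B|moe|top pick
    Qwen3.5-27B|dense|general
    Qwen3-Coder-30B-A3B|moe|agentic/coding
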
Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against a local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all). Usage sketch below
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
recommendations for agentic use
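
Intended usage looks roughly like this; the suite names come from this
commit, but the exact flags on bin/agentic are assumptions:

    # Hypothetical invocations against a local OpenAI-compatible endpoint
    bin/agentic setup                                 # venv + frameworks
    bin/agentic eval --suite quick --base-url http://localhost:8080/v1
    bin/agentic eval --suite code  --model Qwen3-Coder-30B-A3B
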
Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>