Both run-baseline.sh and run-suite.sh now support:
- --max-size N: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: shows usage with examples
Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization run: benchmark baseline (no filters; all models + longctx)
Also: 4 new BATS tests for flag parsing (98 total, all passing)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide
Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
(ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
(EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
recommendations for agentic use
Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>