Comprehensive research (706 lines, dated 2026-03-30) covering evaluation
dimensions, benchmark suites, and open-weight model performance for
software engineering agent use cases on 64GB systems.
Also add evalplus_results/ (runtime outputs) and ztop/ (nested repo) to .gitignore.
- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
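A minimal sketch of the flag mapping, assuming the serve script collects pass-through arguments for the backend server command (the helper name and forwarding scheme are illustrative; `--no-think` and `--reasoning-budget 0` are the flags from the change itself):

```shell
# Translate the script's own --no-think flag into the backend flag
# "--reasoning-budget 0"; all other arguments pass through unchanged.
build_serve_args() {
  out=""
  for arg in "$@"; do
    case "$arg" in
      --no-think) out="$out --reasoning-budget 0" ;;
      *)          out="$out $arg" ;;
    esac
  done
  # Trim the leading space before printing.
  printf '%s' "${out# }"
}

build_serve_args --port 8080 --no-think
```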
- Sanitize model names in eval run directories
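A minimal sketch of the sanitization, assuming eval run directories are named after the model id (the helper name and character whitelist are illustrative, not the script's actual implementation):

```shell
# Map every character outside a filesystem-safe whitelist to "_",
# so HF-style ids containing "/" or ":" become valid directory names.
sanitize_model_name() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

sanitize_model_name "Qwen/Qwen2.5-Coder-7B-Instruct"
```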
- Simplify agentic setup to use requirements.txt
- Add a serve --help test and batch-flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)