fix: address code review findings — batch args, venv path, serve flags

- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
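
For illustration, the model-name sanitization listed above could look roughly like the sketch below. This is not the commit's implementation, which is not shown on this page: the helper name `sanitize_model_name`, the allowed character set, and the `unknown-model` fallback are all assumptions.

```python
import re


def sanitize_model_name(name: str) -> str:
    """Hypothetical helper: turn a model name into a safe directory component."""
    # Replace each run of characters outside [A-Za-z0-9._-] with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    # Drop leading/trailing dots and underscores so the result is never
    # a hidden directory or a ".." path segment.
    cleaned = cleaned.strip("._")
    # Fall back to a placeholder if nothing usable is left (assumed behaviour).
    return cleaned or "unknown-model"
```

The point of such a step is that names containing `/` or `:` would otherwise create nested or invalid run directories.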
Felipe Cardoso
2026-03-31 10:10:48 +02:00
parent dd403a907c
commit 6ab08537ca
10 changed files with 137 additions and 93 deletions

requirements.txt (new file)

@@ -0,0 +1,12 @@
# Agentic evaluation frameworks
# Install: python3.13 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
# Requires Python >=3.10, <3.14 (bigcodebench constraint)
inspect-ai>=0.3.201
inspect-evals>=0.6.0
evalplus>=0.3.1
bigcodebench>=0.2.5
openai>=2.26.0
# IFEval dependency (not on PyPI)
instruction_following_eval @ git+https://github.com/josejg/instruction_following_eval
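
The Python range stated in the file's comment (>=3.10, <3.14) can be checked before creating the venv. The sketch below is a minimal example and not part of the commit; the names `MIN_VERSION`, `MAX_EXCLUSIVE`, and `python_supported` are assumptions for illustration.

```python
import sys

# Bounds taken from the requirements.txt comment: >=3.10, <3.14.
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 14)


def python_supported(version=None) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    if version is None:
        # Default to the interpreter running this script.
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    if not python_supported():
        found = ".".join(map(str, sys.version_info[:2]))
        sys.exit(f"Python {found} is outside the supported range >=3.10, <3.14")
```

Running it with the interpreter intended for the venv fails early instead of partway through `pip install -r requirements.txt`.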