# CLAUDE.md — AI Assistant Context
Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See README.md for user-facing commands.
## Architecture

`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. Scripts source libs as needed: always `common.sh` first, then `detect.sh` if hardware detection is needed, then `format.sh` if formatted output is needed. Some scripts (e.g., `rollback.sh`) only need `common.sh`. Runtime data goes to `data/` (gitignored). Full details in `docs/architecture.md`.
## Safety Rules

- `scripts/optimize/kernel-params.sh` modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
- `scripts/optimize/tuned-profile.sh` and `rollback.sh` require root and save previous state for rollback.
- `data/backups/` contains GRUB backups and tuned profile snapshots — never delete these.
- Optimization scripts that modify system state (`kernel-params.sh`, `tuned-profile.sh`, `rollback.sh`) check `$EUID` at the top and exit immediately if not root. Guidance-only scripts (`vram-gtt.sh`, `verify.sh`) do not require root.
- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.
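The Python-with-env-vars pattern can be sketched as follows — a minimal illustration, not the repo's actual GRUB editor; the parameter value and the `result` capture are assumptions:

```bash
#!/usr/bin/env bash
# Minimal sketch of the Python-with-env-vars pattern (illustrative only;
# the parameter value below is a hypothetical example).
set -euo pipefail

new_param="amd_pstate=active"     # value to add — never spliced into Python source
grub_file="/etc/default/grub"     # target path (not actually modified here)

# Data crosses the shell -> Python boundary only via environment variables,
# so shell metacharacters in $new_param cannot inject Python code.
result=$(NEW_PARAM="$new_param" GRUB_FILE="$grub_file" python3 - <<'PY'
import os
param = os.environ["NEW_PARAM"]   # safe: read from the environment
path = os.environ["GRUB_FILE"]
print(f"would append {param} to GRUB_CMDLINE_LINUX in {path}")
PY
)
echo "$result"
```

Because the Python source is a quoted heredoc (`<<'PY'`), the shell never expands anything inside it; the only channel for data is `os.environ`.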
## Key Technical Details

- GPU sysfs: auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to the first card with `mem_info_vram_total`.
- Memory recommendations: `recommended_gtt_size_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB.
- Kernel param detection: `detect_kernel_param()` uses a word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`.
- Benchmark invocation: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. `ENV_ARGS` is passed as a proper bash array (not string splitting).
- llama-bench output: pipe-delimited table. The Python parser reads fixed column indices (`parts[8]` = test, `parts[9]` = t/s); format changes upstream would break parsing.
- ROCm for gfx1151: scripts set `ROCBLAS_USE_HIPBLASLT=1` in benchmark `ENV_ARGS`. `HSA_OVERRIDE_GFX_VERSION=11.5.1` is set inside the toolbox containers (not by our scripts) — needed for ollama and native ROCm builds.
- Fedora GRUB: prefers `grubby` (BLS), falls back to `grub2-mkconfig`, then `grub-mkconfig`. All three paths are handled.
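The fixed-index parsing can be sketched like this — the sample row below is made up, and the column order assumes the extra `-mmp`/`-fa` columns are present; real llama-bench output may differ across versions:

```bash
#!/usr/bin/env bash
# Parse one llama-bench table row at fixed column indices (sketch; the row
# below is hypothetical and assumes the -mmp/-fa columns are present).
set -euo pipefail

row='| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | 1 | pp512 | 812.34 ± 1.20 |'

result=$(ROW="$row" python3 - <<'PY'
import os
parts = [p.strip() for p in os.environ["ROW"].split("|")]
# The leading "|" makes parts[0] empty, so the 8th column lands at parts[8].
test = parts[8]                     # benchmark name, e.g. pp512
tps = float(parts[9].split()[0])    # tokens/s mean, dropping the "± stddev"
print(f"{test} {tps}")
PY
)
echo "$result"
```

This is exactly why an upstream column change (e.g. a new flag adding a column) silently shifts `parts[8]`/`parts[9]` and breaks parsing.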
## Conventions

- `set -euo pipefail` in every executable script
- `snake_case` function names, `UPPER_CASE` for constants and loop variables
- 4-space indentation, no tabs
- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support
- Colors gated on `[[ -t 1 ]]` (disabled when piped)
- `bc` used for float math; `python3` for JSON and GRUB editing only
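The color-gating convention can be sketched as follows (variable names are illustrative, not the repo's actual identifiers):

```bash
#!/usr/bin/env bash
# Sketch of the [[ -t 1 ]] color gate: ANSI codes only when stdout is a tty.
set -euo pipefail

if [[ -t 1 ]]; then
    C_RED=$'\033[31m'
    C_RESET=$'\033[0m'
else
    C_RED=''                      # piped or redirected: emit plain text
    C_RESET=''
fi

printf '%sFAIL%s\n' "$C_RED" "$C_RESET"
```

Gating on the file descriptor (not on a flag) means `bin/audit | grep FAIL` gets clean, uncolored output for free.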
## Validating Changes

```bash
make audit      # Quick check — shows system status with pass/fail indicators
make verify     # 9-point optimization checklist
bin/audit --json | python3 -m json.tool   # Verify JSON output is valid
```
## Serving

`scripts/serve/launch.sh` with dispatcher at `bin/serve`. Launches llama-server inside toolbox containers with optimized defaults: Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload. Key flags:

- `--ngram` — n-gram speculative decoding (~1.1-1.4x tg for repetitive content)
- `--no-think` — disables thinking/reasoning via `--reasoning-budget 0` (faster for evals)
- `--ctx N` — context size (default 131072)
- `--parallel N` — concurrent request slots
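A hypothetical invocation combining these flags (only the flags listed above are assumed to exist; model selection and defaults are left to the script):

```bash
# Illustrative usage, not a documented command line.
bin/serve --ctx 32768 --parallel 2 --no-think --ngram
```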
## System Tuning

`scripts/optimize/power-profile.sh` applies Phase 2 optimizations: RyzenAdj PPT increase (85 W target, HP caps at 70 W sustained), sysctl tuning (`vm.swappiness=1`, `vm.max_map_count=500000`), THP=`always`, `RADV_PERFTEST=nogttspill`. Systemd services for boot/resume persistence at `configs/ryzenadj-llm.service` and `configs/ryzenadj-resume.service`.
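For reference, the sysctl portion above corresponds to a drop-in like the following (the file path is an assumption for illustration; the values come from the section above):

```bash
# /etc/sysctl.d/99-llm-tuning.conf  (hypothetical path)
vm.swappiness = 1
vm.max_map_count = 500000
```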
## Agentic Evaluation

Scripts in `scripts/agentic/` with dispatcher at `bin/agentic`. Uses a Python venv at `.venv/` (Python 3.13, dependencies in `requirements.txt`). Eval frameworks: inspect-ai (all-in-one), inspect-evals (task definitions), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint — auto-detects llama-server (port 8080) or ollama (port 11434). Model catalog at `configs/models.conf`.
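The endpoint auto-detection can be sketched as follows — ports are from the text, but the probing logic (`curl` against `/v1/models`) is illustrative, not the repo's actual implementation:

```bash
#!/usr/bin/env bash
# Probe llama-server (8080) then ollama (11434) for an OpenAI-compatible API.
# Sketch only: the real scripts' detection logic may differ.
set -euo pipefail

detect_endpoint() {
    local port
    for port in 8080 11434; do
        if curl -fsS -m 2 "http://127.0.0.1:${port}/v1/models" >/dev/null 2>&1; then
            printf 'http://127.0.0.1:%s/v1\n' "$port"
            return 0
        fi
    done
    return 1    # nothing listening on either port
}

endpoint=$(detect_endpoint || echo "none")
echo "OPENAI_BASE_URL=${endpoint}"
```

Probing `/v1/models` works for both backends because ollama also exposes an OpenAI-compatible `/v1` surface.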
## External Resources

All external links are centralized in `docs/references.md`. Key ones:
- AMD ROCm Strix Halo guide (kernel params, GTT configuration)
- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
- Qwen3.5 model family (GGUF quants by Unsloth)
- Agentic eval frameworks (Inspect AI, EvalPlus, BFCL, BigCodeBench)