Felipe Cardoso 6ab08537ca fix: address code review findings — batch args, venv path, serve flags
- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
2026-03-31 10:10:48 +02:00


CLAUDE.md — AI Assistant Context

Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See README.md for user-facing commands.

Architecture

bin/ dispatchers → scripts/ implementations → lib/ shared libraries. Scripts source libs as needed: always common.sh first, then detect.sh for hardware detection, then format.sh for formatted output. Some scripts (e.g., rollback.sh) need only common.sh. Runtime data goes to data/ (gitignored). Full details in docs/architecture.md.

Safety Rules

  • scripts/optimize/kernel-params.sh modifies /etc/default/grub — requires root, backs up to data/backups/ first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
  • scripts/optimize/tuned-profile.sh and rollback.sh require root and save previous state for rollback.
  • data/backups/ contains GRUB backups and tuned profile snapshots — never delete these.
  • Optimization scripts that modify system state (kernel-params.sh, tuned-profile.sh, rollback.sh) check $EUID at the top and exit immediately if not root. Guidance-only scripts (vram-gtt.sh, verify.sh) do not require root.
  • All Python blocks receive data via environment variables (os.environ), never via shell interpolation into Python source. This prevents injection. Do not revert to '''$var''' or "$var" patterns inside Python heredocs.
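The rule above can be sketched as follows — a minimal, hypothetical illustration of passing data into an inline Python block through the environment (the variable name NEW_PARAM and the value are illustrative, not from the repo):

```shell
# Hypothetical sketch of the env-var pattern. The quoted heredoc
# delimiter ('PY') stops the shell from expanding anything inside the
# Python source; the value crosses over only via os.environ.
NEW_PARAM='amdgpu.gttsize=65536; rm -rf /' python3 - <<'PY'
import os

# Even a hostile-looking value arrives as inert string data,
# never as shell or Python code.
param = os.environ["NEW_PARAM"]
print(f"received: {param}")
PY
```

An unquoted delimiter (`<<PY`) or a `'''$var'''` splice would let the shell expand the value into the Python source, which is exactly the injection the convention prevents.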

Key Technical Details

  • GPU sysfs: Auto-detected by find_gpu_card() in lib/detect.sh (matches vendor 0x1002). Falls back to first card with mem_info_vram_total.
  • Memory recommendations: recommended_gttsize_mib() in detect.sh computes the recommendation from total physical RAM = visible RAM + dedicated VRAM (the VRAM carve-out is still physical memory), with a 1 GiB floor.
  • Kernel param detection: detect_kernel_param() uses word-boundary-anchored regex to avoid iommu matching amd_iommu.
  • Benchmark invocation: toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N. ENV_ARGS passed as a proper bash array (not string splitting).
  • llama-bench output: Pipe-delimited table. Python parser at fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing.
  • ROCm for gfx1151: Scripts set ROCBLAS_USE_HIPBLASLT=1 in benchmark ENV_ARGS. HSA_OVERRIDE_GFX_VERSION=11.5.1 is set inside the toolbox containers (not by our scripts) — needed for ollama and native ROCm builds.
  • Fedora GRUB: Prefers grubby (BLS), falls back to grub2-mkconfig, then grub-mkconfig. All three paths are handled.
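The word-boundary issue can be shown with a small sketch (the function body here is an assumption — the repo's actual detect_kernel_param() may differ):

```shell
# Hypothetical sketch: anchor the parameter name so "iommu" cannot
# match inside "amd_iommu" on the kernel command line.
has_kernel_param() {
    local param=$1 cmdline=$2
    grep -qE "(^| )${param}(=| |$)" <<<"$cmdline"
}

cmdline="quiet amd_iommu=off iommu=pt mitigations=auto"
has_kernel_param iommu     "$cmdline" && echo "iommu: present"
has_kernel_param amd_iommu "$cmdline" && echo "amd_iommu: present"
has_kernel_param mmu       "$cmdline" || echo "mmu: absent"
```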
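A minimal sketch of the fixed-index parsing, with a synthetic table row (the column layout below is constructed to match the stated indices; the real llama-bench header order may differ):

```python
# Hypothetical fixed-index parser for llama-bench's pipe-delimited rows.
# parts[8] = test name, parts[9] = tokens/sec (per the notes above);
# any upstream column reshuffle silently breaks these indices.
def parse_bench_row(line: str) -> tuple[str, float]:
    parts = [p.strip() for p in line.split("|")]
    test, tps = parts[8], parts[9]
    return test, float(tps.split()[0])  # drop the "± stddev" suffix

# Synthetic row: index 0 is empty (leading "|"), data fills columns 1..9.
row = "| model | size | params | backend | ngl | fa | mmap | pp512 | 612.34 ± 1.08 |"
print(parse_bench_row(row))  # → ('pp512', 612.34)
```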

Conventions

  • set -euo pipefail in every executable script
  • snake_case function names, UPPER_CASE for constants and loop variables
  • 4-space indentation, no tabs
  • lib/ files are sourced (no shebang enforcement), but include #!/usr/bin/env bash for editor support
  • Colors gated on [[ -t 1 ]] (disabled when piped)
  • bc used for float math; python3 for JSON and GRUB editing only
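The tty-gating convention, sketched (variable names are illustrative, not necessarily the repo's):

```shell
# Colors only when stdout is a terminal; piped output stays clean.
if [[ -t 1 ]]; then
    GREEN=$'\e[32m' RESET=$'\e[0m'
else
    GREEN='' RESET=''
fi
printf '%spass%s\n' "$GREEN" "$RESET"
```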

Validating Changes

make audit          # Quick check — shows system status with pass/fail indicators
make verify         # 9-point optimization checklist
bin/audit --json | python3 -m json.tool   # Verify JSON output is valid

Serving

scripts/serve/launch.sh with dispatcher at bin/serve. Launches llama-server inside toolbox containers with optimized defaults: Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload. Key flags:

  • --ngram — n-gram speculative decoding (~1.1-1.4x token-generation speedup for repetitive content)
  • --no-think — disables thinking/reasoning via --reasoning-budget 0 (faster for evals)
  • --ctx N — context size (default 131072)
  • --parallel N — concurrent request slots

System Tuning

scripts/optimize/power-profile.sh applies Phase 2 optimizations: RyzenAdj PPT increase (85W target, HP caps at 70W sustained), sysctl tuning (vm.swappiness=1, vm.max_map_count=500000), THP=always, RADV_PERFTEST=nogttspill. Systemd services for boot/resume persistence at configs/ryzenadj-llm.service and configs/ryzenadj-resume.service.
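For reference, the sysctl portion corresponds to a drop-in like the following (the file path is an assumption — the repo applies these settings via power-profile.sh, not necessarily through this file):

```ini
# /etc/sysctl.d/99-strix-llm.conf (hypothetical path)
vm.swappiness = 1
vm.max_map_count = 500000
```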

Agentic Evaluation

Scripts in scripts/agentic/ with dispatcher at bin/agentic. Uses a Python venv at .venv/ (Python 3.13, dependencies in requirements.txt). Eval frameworks: inspect-ai (all-in-one), inspect-evals (task definitions), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint — auto-detects llama-server (port 8080) or ollama (port 11434). Model catalog at configs/models.conf.

External Resources

All external links are centralized in docs/references.md. Key ones:

  • AMD ROCm Strix Halo guide (kernel params, GTT configuration)
  • Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
  • Qwen3.5 model family (GGUF quants by Unsloth)
  • Agentic eval frameworks (Inspect AI, EvalPlus, BFCL, BigCodeBench)