# CLAUDE.md — AI Assistant Context
Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See README.md for user-facing commands.
## Architecture

`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. Scripts source libs as needed: always `common.sh` first, then `detect.sh` if hardware detection is needed, then `format.sh` if formatted output is needed. Some scripts (e.g., `rollback.sh`) only need `common.sh`. Runtime data goes to `data/` (gitignored). Full details in `docs/architecture.md`.
## Safety Rules

- `scripts/optimize/kernel-params.sh` modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
- `scripts/optimize/tuned-profile.sh` and `rollback.sh` require root and save previous state for rollback.
- `data/backups/` contains GRUB backups and tuned profile snapshots — never delete these.
- Optimization scripts that modify system state (`kernel-params.sh`, `tuned-profile.sh`, `rollback.sh`) check `$EUID` at the top and exit immediately if not root. Guidance-only scripts (`vram-gtt.sh`, `verify.sh`) do not require root.
- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.
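The Python-with-env-vars pattern can be sketched as follows — a minimal illustration, not the repo's actual GRUB editor; the parameter value and the `result` capture are assumptions:

```bash
#!/usr/bin/env bash
# Minimal sketch of the Python-with-env-vars pattern (illustrative only;
# the parameter value below is a hypothetical example).
set -euo pipefail

new_param="amd_pstate=active"     # value to add — never spliced into Python source
grub_file="/etc/default/grub"     # target path (not actually modified here)

# Data crosses the shell -> Python boundary only via environment variables,
# so shell metacharacters in $new_param cannot inject Python code.
result=$(NEW_PARAM="$new_param" GRUB_FILE="$grub_file" python3 - <<'PY'
import os
param = os.environ["NEW_PARAM"]   # safe: read from the environment
path = os.environ["GRUB_FILE"]
print(f"would append {param} to GRUB_CMDLINE_LINUX in {path}")
PY
)
echo "$result"
```

Because the Python source is a quoted heredoc (`<<'PY'`), the shell never expands anything inside it; the only channel for data is `os.environ`.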
## Key Technical Details

- GPU sysfs: auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to the first card with `mem_info_vram_total`.
- Memory recommendations: `recommended_gtt_size_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB.
- Kernel param detection: `detect_kernel_param()` uses a word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`.
- Benchmark invocation: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. `ENV_ARGS` is passed as a proper bash array (not string splitting).
- llama-bench output: pipe-delimited table. The Python parser reads fixed column indices (`parts[8]` = test, `parts[9]` = t/s); format changes upstream would break parsing.
- ROCm for gfx1151: scripts set `ROCBLAS_USE_HIPBLASLT=1` in benchmark `ENV_ARGS`. `HSA_OVERRIDE_GFX_VERSION=11.5.1` is set inside the toolbox containers (not by our scripts) — needed for ollama and native ROCm builds.
- Fedora GRUB: prefers `grubby` (BLS), falls back to `grub2-mkconfig`, then `grub-mkconfig`. All three paths are handled.
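The fixed-index parsing can be sketched like this — the sample row below is made up, and the column order assumes the extra `-mmp`/`-fa` columns are present; real llama-bench output may differ across versions:

```bash
#!/usr/bin/env bash
# Parse one llama-bench table row at fixed column indices (sketch; the row
# below is hypothetical and assumes the -mmp/-fa columns are present).
set -euo pipefail

row='| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 0 | 1 | pp512 | 812.34 ± 1.20 |'

result=$(ROW="$row" python3 - <<'PY'
import os
parts = [p.strip() for p in os.environ["ROW"].split("|")]
# The leading "|" makes parts[0] empty, so the 8th column lands at parts[8].
test = parts[8]                     # benchmark name, e.g. pp512
tps = float(parts[9].split()[0])    # tokens/s mean, dropping the "± stddev"
print(f"{test} {tps}")
PY
)
echo "$result"
```

This is exactly why an upstream column change (e.g. a new flag adding a column) silently shifts `parts[8]`/`parts[9]` and breaks parsing.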
## Conventions

- `set -euo pipefail` in every executable script
- `snake_case` function names, `UPPER_CASE` for constants and loop variables
- 4-space indentation, no tabs
- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support
- Colors gated on `[[ -t 1 ]]` (disabled when piped)
- `bc` used for float math; `python3` for JSON and GRUB editing only
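The color-gating convention can be sketched as follows (variable names are illustrative, not the repo's actual identifiers):

```bash
#!/usr/bin/env bash
# Sketch of the [[ -t 1 ]] color gate: ANSI codes only when stdout is a tty.
set -euo pipefail

if [[ -t 1 ]]; then
    C_RED=$'\033[31m'
    C_RESET=$'\033[0m'
else
    C_RED=''                      # piped or redirected: emit plain text
    C_RESET=''
fi

printf '%sFAIL%s\n' "$C_RED" "$C_RESET"
```

Gating on the file descriptor (not on a flag) means `bin/audit | grep FAIL` gets clean, uncolored output for free.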
## Validating Changes

```bash
make audit      # Quick check — shows system status with pass/fail indicators
make verify     # 9-point optimization checklist
bin/audit --json | python3 -m json.tool   # Verify JSON output is valid
```
## Serving

`scripts/serve/launch.sh` with dispatcher at `bin/serve`. Launches llama-server inside toolbox containers with optimized defaults: Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload. Key flags:

- `--ngram` — n-gram speculative decoding (~1.1-1.4x tg for repetitive content)
- `--no-think` — disables thinking/reasoning via `--reasoning-budget 0` (faster for evals)
- `--ctx N` — context size (default 131072)
- `--parallel N` — concurrent request slots
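A hypothetical invocation combining these flags (only the flags listed above are assumed to exist; model selection and defaults are left to the script):

```bash
# Illustrative usage, not a documented command line.
bin/serve --ctx 32768 --parallel 2 --no-think --ngram
```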
## System Tuning

`scripts/optimize/power-profile.sh` applies Phase 2 optimizations: RyzenAdj PPT increase (85 W target, HP caps at 70 W sustained), sysctl tuning (`vm.swappiness=1`, `vm.max_map_count=500000`), THP=`always`, `RADV_PERFTEST=nogttspill`. Systemd services for boot/resume persistence at `configs/ryzenadj-llm.service` and `configs/ryzenadj-resume.service`.
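For reference, the sysctl portion above corresponds to a drop-in like the following (the file path is an assumption for illustration; the values come from the section above):

```bash
# /etc/sysctl.d/99-llm-tuning.conf  (hypothetical path)
vm.swappiness = 1
vm.max_map_count = 500000
```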
## Agentic Evaluation

Scripts in `scripts/agentic/` with dispatcher at `bin/agentic`. Uses a Python venv at `.venv/` (Python 3.13, dependencies in `requirements.txt`). Eval frameworks: inspect-ai (all-in-one), inspect-evals (task definitions), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint — auto-detects llama-server (port 8080) or ollama (port 11434). Model catalog at `configs/models.conf`.
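The endpoint auto-detection can be sketched as follows — ports are from the text, but the probing logic (`curl` against `/v1/models`) is illustrative, not the repo's actual implementation:

```bash
#!/usr/bin/env bash
# Probe llama-server (8080) then ollama (11434) for an OpenAI-compatible API.
# Sketch only: the real scripts' detection logic may differ.
set -euo pipefail

detect_endpoint() {
    local port
    for port in 8080 11434; do
        if curl -fsS -m 2 "http://127.0.0.1:${port}/v1/models" >/dev/null 2>&1; then
            printf 'http://127.0.0.1:%s/v1\n' "$port"
            return 0
        fi
    done
    return 1    # nothing listening on either port
}

endpoint=$(detect_endpoint || echo "none")
echo "OPENAI_BASE_URL=${endpoint}"
```

Probing `/v1/models` works for both backends because ollama also exposes an OpenAI-compatible `/v1` surface.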
## External Resources

All external links are centralized in `docs/references.md`. Key ones:
- AMD ROCm Strix Halo guide (kernel params, GTT configuration)
- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
- Qwen3.5 model family (GGUF quants by Unsloth)
- Agentic eval frameworks (Inspect AI, EvalPlus, BFCL, BigCodeBench)