# CLAUDE.md — AI Assistant Context
Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See [README.md](README.md) for user-facing commands.

## Architecture

`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. Scripts source libs as needed: always `common.sh` first, then `detect.sh` if hardware detection is needed, then `format.sh` if formatted output is needed. Some scripts (e.g., `rollback.sh`) only need `common.sh`. Runtime data goes to `data/` (gitignored). Full details in [docs/architecture.md](docs/architecture.md).

## Safety Rules

- **`scripts/optimize/kernel-params.sh`** modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
- **`scripts/optimize/tuned-profile.sh`** and **`rollback.sh`** require root and save previous state for rollback.
- **`data/backups/`** contains GRUB backups and tuned profile snapshots — never delete these.
- Optimization scripts that modify system state (`kernel-params.sh`, `tuned-profile.sh`, `rollback.sh`) check `$EUID` at the top and exit immediately if not root. Guidance-only scripts (`vram-gtt.sh`, `verify.sh`) do not require root.
- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. **Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.**
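A minimal sketch of the pattern (the variable and key names here are illustrative, not the scripts' actual ones):

```shell
# Data reaches Python only through the environment; the heredoc delimiter
# is single-quoted, so the shell expands nothing inside the Python source.
NEW_PARAM="amd_iommu=off" python3 - <<'PY'
import os

param = os.environ["NEW_PARAM"]  # safe: no shell text spliced into Python
print(f"would append: {param}")
PY
```

Single-quoting the heredoc delimiter (`<<'PY'`) is what guarantees the shell performs no expansion inside the Python block.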

## Key Technical Details

- **GPU sysfs**: Auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to first card with `mem_info_vram_total`.
- **Memory recommendations**: `recommended_gttsize_mib()` in `detect.sh` computes its recommendation from total physical RAM (visible RAM + dedicated VRAM, since the carve-out VRAM is still physical memory), with a floor of 1 GiB.
- **Kernel param detection**: `detect_kernel_param()` uses word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`.
- **Benchmark invocation**: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. ENV_ARGS passed as a proper bash array (not string splitting).
- **llama-bench output**: Pipe-delimited table. Python parser at fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing.
- **ROCm for gfx1151**: Scripts set `ROCBLAS_USE_HIPBLASLT=1` in benchmark ENV_ARGS. `HSA_OVERRIDE_GFX_VERSION=11.5.1` is set inside the toolbox containers (not by our scripts) — needed for ollama and native ROCm builds.
- **Fedora GRUB**: Prefers `grubby` (BLS), falls back to `grub2-mkconfig`, then `grub-mkconfig`. All three paths are handled.
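The word-boundary idea behind `detect_kernel_param()` can be sketched as follows (the function body below is an assumption about the approach, not the library's exact regex):

```shell
# Match a kernel parameter only as a whole word, so that asking for
# "iommu" never matches inside "amd_iommu". Sketch only.
kernel_param_present() {
    local param=$1 cmdline=$2
    [[ "$cmdline" =~ (^|[[:space:]])"$param"(=|[[:space:]]|$) ]]
}

cmdline="amd_iommu=on quiet rhgb"
kernel_param_present amd_iommu "$cmdline" && echo "amd_iommu: set"
kernel_param_present iommu "$cmdline" || echo "iommu: not set"
```

Quoting `"$param"` inside `[[ =~ ]]` makes bash treat it literally, so parameter names containing regex metacharacters cannot change the pattern.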

## Conventions

- `set -euo pipefail` in every executable script
- `snake_case` function names, `UPPER_CASE` for constants and loop variables
- 4-space indentation, no tabs
- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support
- Colors gated on `[[ -t 1 ]]` (disabled when piped)
- `bc` used for float math; `python3` for JSON and GRUB editing only
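The color-gating convention looks roughly like this (variable names are illustrative):

```shell
# Emit ANSI colors only when stdout is a terminal; piped or redirected
# output stays plain, so logs and JSON are never polluted by escape codes.
if [[ -t 1 ]]; then
    C_GREEN=$'\e[32m'; C_RED=$'\e[31m'; C_RESET=$'\e[0m'
else
    C_GREEN=""; C_RED=""; C_RESET=""
fi
printf '%sPASS%s\n' "$C_GREEN" "$C_RESET"
```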

## Validating Changes

```bash
make audit                               # Quick check — shows system status with pass/fail indicators
make verify                              # 9-point optimization checklist
bin/audit --json | python3 -m json.tool  # Verify JSON output is valid
```
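The JSON check works because `python3 -m json.tool` both pretty-prints and validates: a parse error makes it exit non-zero. A standalone illustration (the sample documents are made up):

```shell
# Valid JSON is pretty-printed and the pipeline exits 0.
echo '{"status": "pass", "checks": 9}' | python3 -m json.tool

# Invalid JSON makes json.tool exit non-zero, which the || branch catches.
echo 'not json' | python3 -m json.tool 2>/dev/null || echo "invalid JSON detected"
```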

## Serving

`scripts/serve/launch.sh` with dispatcher at `bin/serve`. Launches llama-server inside toolbox containers with optimized defaults: Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload. Key flags:
- `--ngram` — n-gram speculative decoding (~1.1-1.4x tg for repetitive content)
- `--no-think` — disables thinking/reasoning via `--reasoning-budget 0` (faster for evals)
- `--ctx N` — context size (default 131072)
- `--parallel N` — concurrent request slots
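How `--no-think` might expand internally can be sketched as follows (this is an assumption about the launcher's internals, not its actual code):

```shell
# Translate wrapper flags into llama-server arguments: --no-think becomes
# --reasoning-budget 0, everything else passes through unchanged.
translate_serve_args() {
    SERVER_ARGS=()
    local arg
    for arg in "$@"; do
        case "$arg" in
            --no-think) SERVER_ARGS+=(--reasoning-budget 0) ;;
            *)          SERVER_ARGS+=("$arg") ;;
        esac
    done
}

translate_serve_args --no-think --ctx 32768
echo "${SERVER_ARGS[@]}"   # --reasoning-budget 0 --ctx 32768
```

Building the argument list as a bash array (rather than a string) keeps values with spaces intact, matching the ENV_ARGS convention noted above.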

## System Tuning

`scripts/optimize/power-profile.sh` applies Phase 2 optimizations: RyzenAdj PPT increase (85W target, HP caps at 70W sustained), sysctl tuning (vm.swappiness=1, vm.max_map_count=500000), THP=always, RADV_PERFTEST=nogttspill. Systemd services for boot/resume persistence at `configs/ryzenadj-llm.service` and `configs/ryzenadj-resume.service`.
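The sysctl side of this could be persisted with a drop-in like the following (the file name is an assumption; the values come from the paragraph above):

```
# /etc/sysctl.d/99-strix-llm.conf (illustrative file name)
vm.swappiness = 1
vm.max_map_count = 500000
```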

## Agentic Evaluation

Scripts in `scripts/agentic/` with dispatcher at `bin/agentic`. Uses a Python venv at `.venv/` (Python 3.13, dependencies in `requirements.txt`). Eval frameworks: inspect-ai (all-in-one), inspect-evals (task definitions), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint — auto-detects llama-server (port 8080) or ollama (port 11434). Model catalog at `configs/models.conf`.

## External Resources

All external links are centralized in [docs/references.md](docs/references.md). Key ones:
- AMD ROCm Strix Halo guide (kernel params, GTT configuration)
- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
- Qwen3.5 model family (GGUF quants by Unsloth)
- Agentic eval frameworks (Inspect AI, EvalPlus, BFCL, BigCodeBench)