docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite

- README.md: project overview, quick start, command reference, workflow
- CLAUDE.md: AI safety rules, technical details, conventions
- AGENTS.md: agent workflows, file responsibility map, dependency matrix
- docs/architecture.md: script layers, data flow, unified memory, JSON schemas
- docs/optimization.md: step-by-step optimization walkthrough
- docs/benchmarking.md: methodology, test params, result interpretation
- docs/troubleshooting.md: common issues and fixes
- docs/references.md: centralized external links (single source of truth)
- docs/bios-vram-guide.md: add back-link to optimization workflow

Cross-linked non-redundantly: each doc owns one layer, others link to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Felipe Cardoso
Date: 2026-03-25 20:50:00 +01:00
Parent: af0515d05d
Commit: 5b81437637
9 changed files with 667 additions and 0 deletions

docs/architecture.md
@@ -0,0 +1,118 @@
# Architecture
## Script Layers
```
bin/ User entry points (thin dispatchers)
audit → scripts/audit/
monitor → scripts/monitor/
benchmark → scripts/benchmark/
optimize → scripts/optimize/
scripts/ Implementation (sourcing lib/ for shared logic)
audit/ System assessment
monitor/ GPU/system monitoring + metrics logging
benchmark/ llama-bench via toolbox containers
optimize/ GRUB, tuned, BIOS guidance, verify, rollback
lib/ Shared bash libraries (sourced, not executed)
common.sh Logging, root checks, confirm prompts, paths
detect.sh Hardware/config detection from sysfs + system tools
format.sh Color output, human_bytes, status indicators, tables
configs/ Reference configuration templates
data/ Runtime output (gitignored)
docs/ Documentation
```
Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`.
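The sourcing step can be sketched as a small helper (a sketch only: the function name `source_libs` and the default-path logic are assumptions for illustration, not the toolkit's actual code):

```shell
source_libs() {   # $1 = repo root (defaults to the caller's parent directory)
  local root=${1:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)} lib
  # Order matters: format.sh reads color variables that common.sh defines.
  for lib in common.sh detect.sh format.sh; do
    source "$root/lib/$lib"
  done
}
```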
## Data Flow
```
/sys/class/drm/card1/device/ ──→ lib/detect.sh ──→ scripts/audit/
/proc/cpuinfo, /proc/meminfo ──→ (detect_*) ──→ scripts/monitor/
/proc/cmdline ──→ ──→ scripts/optimize/
tuned-adm, rpm, lspci ──→ ──→ scripts/benchmark/
data/
├── audits/*.json
├── logs/*.csv
├── baselines/*/summary.json
└── benchmarks/*/summary.json
```
## Unified Memory Model
AMD Strix Halo shares physical RAM between CPU and GPU. Two allocation mechanisms:
| Type | Description | Configuration |
|------|-------------|---------------|
| **VRAM (dedicated)** | Permanently reserved for GPU framebuffer | BIOS setting (UMA Frame Buffer Size) |
| **GTT (dynamic)** | System RAM mapped into GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` |
**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on a 64 GB system). The GPU borrows memory when needed and releases it when idle.
### Kernel Parameter Math (64 GB system)
```
Total physical RAM: 64 GiB
OS reserve: 4 GiB
Available for GTT: 60 GiB = 61440 MiB
amdgpu.gttsize = 60 * 1024 = 61440 (MiB)
ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages)
iommu = pt (passthrough, lower latency)
```
The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM).
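An illustrative reconstruction of that math (the function names match `lib/detect.sh` as described above, but these bodies are a sketch: the real versions detect total RAM rather than taking it as an argument):

```shell
recommended_gttsize_mib() {   # $1 = total physical RAM in GiB
  local total_gib=$1 os_reserve_gib=4
  echo $(( (total_gib - os_reserve_gib) * 1024 ))
}
recommended_pages_limit() {   # $1 = total physical RAM in GiB
  echo $(( $(recommended_gttsize_mib "$1") * 256 ))  # 256 four-KiB pages per MiB
}
```

On the 64 GB example, `recommended_gttsize_mib 64` prints 61440 and `recommended_pages_limit 64` prints 15728640, matching the table.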
### Sysfs Paths
| Path | Content |
|------|---------|
| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes |
| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use |
| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes |
| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use |
| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 |
| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C |
| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts |
Card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`).
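A minimal sketch of how such detection can work (the real `find_gpu_card` lives in `lib/detect.sh`; the optional base-path argument here exists only for illustration and testing):

```shell
find_gpu_card() {
  local base=${1:-/sys/class/drm} card vendor
  for card in "$base"/card[0-9]; do       # card0..card9, skips connector dirs
    [ -r "$card/device/vendor" ] || continue
    read -r vendor < "$card/device/vendor"
    [ "$vendor" = "0x1002" ] && { basename "$card"; return 0; }  # AMD vendor ID
  done
  return 1
}
```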
## JSON Output Schemas
### system-state.json (from `audit --json`)
```json
{
"timestamp": "20260325-120000",
"hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 },
"memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 },
"kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" },
"firmware": "...", "tuned_profile": "...", "rocm_version": "...",
"vulkan": { "driver": "...", "version": "..." },
"sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 },
"toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." }
}
```
### summary.json (from benchmark runs)
```json
{
"results": [
{ "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" }
]
}
```
### metrics.csv (from `monitor --log`)
```
timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib
```
Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample).
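The fork-free pattern can be sketched as below (the function name and reduced field set are assumptions; the real logger lives in `scripts/monitor/`):

```shell
log_gpu_sample() {            # $1 = sysfs device dir, $2 = csv file
  local dev=$1 busy vram gtt
  read -r busy < "$dev/gpu_busy_percent"      # read builtin: no fork
  read -r vram < "$dev/mem_info_vram_used"    # bytes
  read -r gtt  < "$dev/mem_info_gtt_used"     # bytes
  # $EPOCHSECONDS (bash 5+) avoids forking $(date) on every sample
  printf '%s,%s,%s,%s\n' "$EPOCHSECONDS" "$busy" \
    "$(( vram / 1048576 ))" "$(( gtt / 1048576 ))" >> "$2"
}
```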

docs/benchmarking.md
@@ -0,0 +1,94 @@
# Benchmarking
## What We Measure
All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types:
| Metric | Meaning | Test Params |
|--------|---------|-------------|
| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens |
| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens |
Results are in **tokens/second (t/s)**. Higher is better.
## Test Parameters
### Standard Test
```
-ngl 99 -mmp 0 -fa 1 -r 5
```
- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence
### Long-Context Test
```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```
- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)
The `-fa 1`, `-mmp 0` (`--no-mmap`), and `-ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.
## Available Backends
| Backend | Driver | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |
Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.
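For a one-off manual run outside the Makefile, the standard test maps to roughly this invocation (the model path is an example; `toolbox run -c` executes the command inside the named container):

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m data/models/Qwen3-4B-Q4_K_M.gguf \
  -ngl 99 -mmp 0 -fa 1 -r 5
```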
## Workflow
```bash
# 1. Setup (one-time)
make benchmark-setup
# 2. Capture baseline (before optimization)
make benchmark-baseline
# 3. After optimizing, run again
make benchmark # or: bin/benchmark run --tag post-opt
# 4. Compare
make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000
```
## Result Format
Each run produces a directory under `data/baselines/` or `data/benchmarks/`:
```
TIMESTAMP/
system-state.json # Full system audit snapshot
summary.json # Parsed results (model, backend, test, t/s)
metrics.csv # GPU/CPU metrics during the run
*.log # Raw llama-bench output per backend+model+test
```
### Comparison Output
```
Backend | Model | Test | Before | After | Delta
vulkan-radv | qwen3-4b | pp512 | 548 t/s | 612 t/s | +11.7%
vulkan-radv | qwen3-4b | tg128 | 13.9 | 15.2 | +9.4%
```
Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs.
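The delta column is plain percent math; a sketch in integer bash (the real comparison lives in `scripts/benchmark/`, and its rounding may differ by a tenth of a percent):

```shell
delta_pct_tenths() {   # $1 = before t/s, $2 = after t/s (integers)
  echo $(( ($2 - $1) * 1000 / $1 ))   # result in tenths of a percent
}
```

For the pp512 row above, `delta_pct_tenths 548 612` prints 116, i.e. +11.6% (vs +11.7% in the table, a truncation difference).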
## Recommended Test Models
| Size | Model | File | Disk | Use Case |
|------|-------|------|------|----------|
| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests |
| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks |
| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests |
Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit.

docs/bios-vram-guide.md
@@ -1,5 +1,7 @@
# BIOS VRAM Configuration — HP ZBook Ultra G1a
> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md).
## Why Change VRAM?
AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it.

docs/optimization.md
@@ -0,0 +1,84 @@
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches from `throughput-performance` to `accelerator-performance`, which blocks deeper, higher-latency CPU C-states. Provides a 5-8% improvement in prompt processing throughput.
Takes effect immediately. Previous profile is saved for rollback.
## Step 2: Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.
See [docs/architecture.md](architecture.md) for the math behind these values.
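On a BLS-based Fedora install this amounts to roughly the following (a sketch: the values are the 64 GB example from the table, and the script computes them dynamically and keeps a GRUB backup):

```shell
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
```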
## Step 3: BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram
```
This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.
**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
## Step 4: Verify
```bash
make verify
```
Checks 9 criteria and reports a score. Target: 9/9.
## Step 5: Measure Impact
```bash
make benchmark
make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
```
See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation.
## Expected Impact
| Optimization | pp512 Improvement | tg128 Improvement |
|-------------|-------------------|-------------------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
## Rollback
```bash
sudo make rollback
```
Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size).
## Troubleshooting
If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md).

docs/references.md
@@ -0,0 +1,49 @@
# External References
Single source of truth for all external links used across this project.
## AMD Official
- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration
- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning
- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation
- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API
- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker
## Strix Halo Toolboxes (Donato Capitella)
The most comprehensive community resource for Strix Halo LLM optimization.
- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides
- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator
- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts
## Community
- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
## Other Strix Halo Repos
- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization
- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3)
## Monitoring Tools
- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON)
- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor
- [btop](https://github.com/aristocratos/btop) — System resource monitor
## LLM Inference
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm)
- [ollama](https://ollama.com/) — LLM runtime with model management
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling
- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis

docs/troubleshooting.md
@@ -0,0 +1,96 @@
# Troubleshooting
## Firmware: linux-firmware 20251125 Causes ROCm Crashes
**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.
**Check**: `rpm -qa | grep linux-firmware`
**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware:
```bash
sudo dracut -f --kver $(uname -r)
```
The toolkit checks this automatically — `make audit` shows firmware status.
## amdgpu_top: Cargo Build Fails (gix-hash error)
**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`.
**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency.
**Fix**: Use the pre-built RPM instead:
```bash
make monitor-install
```
The install script downloads the RPM from GitHub releases, bypassing cargo entirely.
## Toolbox GPU Access Failure
**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container.
**Check**: Device mappings when creating the toolbox:
- Vulkan backends need: `--device /dev/dri`
- ROCm backends need: `--device /dev/dri --device /dev/kfd`
**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically.
Also ensure your user is in the `video` and `render` groups:
```bash
sudo usermod -aG video,render $USER
```
## GRUB Changes Not Taking Effect
**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params.
**Possible causes**:
1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. The script uses `grubby` when available, but verify:
```bash
grubby --info=ALL | grep args
```
2. **Wrong GRUB config path**: Check which config is actually used:
```bash
cat /proc/cmdline # what the kernel actually booted with
cat /etc/default/grub # what the script modified
```
3. **GRUB not regenerated**: Manually regenerate:
```bash
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```
## Memory Unchanged After BIOS Change
**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB.
**Check**:
```bash
cat /sys/class/drm/card1/device/mem_info_vram_total
```
**Possible causes**:
- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (freed RAM only becomes GPU-reachable once the `amdgpu.gttsize`/`ttm.pages_limit` params are in place)
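A quick sanity conversion for the raw sysfs value (a sketch; after a successful change, 0.5 GiB reads back as 536870912 bytes):

```shell
bytes_to_gib() { echo $(( $1 / 1073741824 )); }   # truncating integer GiB
bytes_to_gib 34359738368   # the unchanged 32 GiB default -> prints 32
```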
## Benchmark Failures
**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends.
**Common fixes**:
- Ensure model exists: `ls data/models/*.gguf`
- Check model fits in memory: small models (4B) for initial testing
- Try `llama-vulkan-radv` first (most stable backend)
- Check dmesg for GPU errors: `dmesg | tail -30`
## Rollback
If optimization causes issues:
```bash
sudo make rollback
```
This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.