docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite

- README.md: project overview, quick start, command reference, workflow
- CLAUDE.md: AI safety rules, technical details, conventions
- AGENTS.md: agent workflows, file responsibility map, dependency matrix
- docs/architecture.md: script layers, data flow, unified memory, JSON schemas
- docs/optimization.md: step-by-step optimization walkthrough
- docs/benchmarking.md: methodology, test params, result interpretation
- docs/troubleshooting.md: common issues and fixes
- docs/references.md: centralized external links (single source of truth)
- docs/bios-vram-guide.md: add back-link to optimization workflow

Cross-linked non-redundantly: each doc owns one layer, others link to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Felipe Cardoso
Date: 2026-03-25 20:50:00 +01:00
Parent: af0515d05d
Commit: 5b81437637
9 changed files with 667 additions and 0 deletions

docs/architecture.md
@@ -0,0 +1,118 @@
# Architecture
## Script Layers
```
bin/ User entry points (thin dispatchers)
audit → scripts/audit/
monitor → scripts/monitor/
benchmark → scripts/benchmark/
optimize → scripts/optimize/
scripts/ Implementation (sourcing lib/ for shared logic)
audit/ System assessment
monitor/ GPU/system monitoring + metrics logging
benchmark/ llama-bench via toolbox containers
optimize/ GRUB, tuned, BIOS guidance, verify, rollback
lib/ Shared bash libraries (sourced, not executed)
common.sh Logging, root checks, confirm prompts, paths
detect.sh Hardware/config detection from sysfs + system tools
format.sh Color output, human_bytes, status indicators, tables
configs/ Reference configuration templates
data/ Runtime output (gitignored)
docs/ Documentation
```
Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`.
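The sourcing step can be sketched as a small helper (a sketch only: the function name `source_libs` and the default-path logic are assumptions for illustration, not the toolkit's actual code):

```shell
source_libs() {   # $1 = repo root (defaults to the caller's parent directory)
  local root=${1:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)} lib
  # Order matters: format.sh reads color variables that common.sh defines.
  for lib in common.sh detect.sh format.sh; do
    source "$root/lib/$lib"
  done
}
```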
## Data Flow
```
/sys/class/drm/card1/device/ ──→ lib/detect.sh ──→ scripts/audit/
/proc/cpuinfo, /proc/meminfo ──→ (detect_*) ──→ scripts/monitor/
/proc/cmdline ──→ ──→ scripts/optimize/
tuned-adm, rpm, lspci ──→ ──→ scripts/benchmark/
data/
├── audits/*.json
├── logs/*.csv
├── baselines/*/summary.json
└── benchmarks/*/summary.json
```
## Unified Memory Model
AMD Strix Halo shares physical RAM between CPU and GPU. Two allocation mechanisms:
| Type | Description | Configuration |
|------|-------------|---------------|
| **VRAM (dedicated)** | Permanently reserved for GPU framebuffer | BIOS setting (UMA Frame Buffer Size) |
| **GTT (dynamic)** | System RAM mapped into GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` |
**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on a 64 GB system). The GPU borrows memory when needed and releases it when idle.
### Kernel Parameter Math (64 GB system)
```
Total physical RAM: 64 GiB
OS reserve: 4 GiB
Available for GTT: 60 GiB = 61440 MiB
amdgpu.gttsize = 60 * 1024 = 61440 (MiB)
ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages)
iommu = pt (passthrough, lower latency)
```
The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM).
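An illustrative reconstruction of that math (the function names match `lib/detect.sh` as described above, but these bodies are a sketch: the real versions detect total RAM rather than taking it as an argument):

```shell
recommended_gttsize_mib() {   # $1 = total physical RAM in GiB
  local total_gib=$1 os_reserve_gib=4
  echo $(( (total_gib - os_reserve_gib) * 1024 ))
}
recommended_pages_limit() {   # $1 = total physical RAM in GiB
  echo $(( $(recommended_gttsize_mib "$1") * 256 ))  # 256 four-KiB pages per MiB
}
```

On the 64 GB example, `recommended_gttsize_mib 64` prints 61440 and `recommended_pages_limit 64` prints 15728640, matching the table.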
### Sysfs Paths
| Path | Content |
|------|---------|
| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes |
| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use |
| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes |
| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use |
| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 |
| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C |
| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts |
Card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`).
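A minimal sketch of how such detection can work (the real `find_gpu_card` lives in `lib/detect.sh`; the optional base-path argument here exists only for illustration and testing):

```shell
find_gpu_card() {
  local base=${1:-/sys/class/drm} card vendor
  for card in "$base"/card[0-9]; do       # card0..card9, skips connector dirs
    [ -r "$card/device/vendor" ] || continue
    read -r vendor < "$card/device/vendor"
    [ "$vendor" = "0x1002" ] && { basename "$card"; return 0; }  # AMD vendor ID
  done
  return 1
}
```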
## JSON Output Schemas
### system-state.json (from `audit --json`)
```json
{
"timestamp": "20260325-120000",
"hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 },
"memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 },
"kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" },
"firmware": "...", "tuned_profile": "...", "rocm_version": "...",
"vulkan": { "driver": "...", "version": "..." },
"sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 },
"toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." }
}
```
### summary.json (from benchmark runs)
```json
{
"results": [
{ "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" }
]
}
```
### metrics.csv (from `monitor --log`)
```
timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib
```
Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample).
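The fork-free pattern can be sketched as below (the function name and reduced field set are assumptions; the real logger lives in `scripts/monitor/`):

```shell
log_gpu_sample() {            # $1 = sysfs device dir, $2 = csv file
  local dev=$1 busy vram gtt
  read -r busy < "$dev/gpu_busy_percent"      # read builtin: no fork
  read -r vram < "$dev/mem_info_vram_used"    # bytes
  read -r gtt  < "$dev/mem_info_gtt_used"     # bytes
  # $EPOCHSECONDS (bash 5+) avoids forking $(date) on every sample
  printf '%s,%s,%s,%s\n' "$EPOCHSECONDS" "$busy" \
    "$(( vram / 1048576 ))" "$(( gtt / 1048576 ))" >> "$2"
}
```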

docs/benchmarking.md
@@ -0,0 +1,94 @@
# Benchmarking
## What We Measure
All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types:
| Metric | Meaning | Test Params |
|--------|---------|-------------|
| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens |
| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens |
Results are in **tokens/second (t/s)**. Higher is better.
## Test Parameters
### Standard Test
```
-ngl 99 -mmp 0 -fa 1 -r 5
```
- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence
### Long-Context Test
```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```
- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)
The `-fa 1`, `-mmp 0` (`--no-mmap`), and `-ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.
## Available Backends
| Backend | Driver | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |
Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.
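For a one-off manual run outside the Makefile, the standard test maps to roughly this invocation (the model path is an example; `toolbox run -c` executes the command inside the named container):

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m data/models/Qwen3-4B-Q4_K_M.gguf \
  -ngl 99 -mmp 0 -fa 1 -r 5
```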
## Workflow
```bash
# 1. Setup (one-time)
make benchmark-setup
# 2. Capture baseline (before optimization)
make benchmark-baseline
# 3. After optimizing, run again
make benchmark # or: bin/benchmark run --tag post-opt
# 4. Compare
make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000
```
## Result Format
Each run produces a directory under `data/baselines/` or `data/benchmarks/`:
```
TIMESTAMP/
system-state.json # Full system audit snapshot
summary.json # Parsed results (model, backend, test, t/s)
metrics.csv # GPU/CPU metrics during the run
*.log # Raw llama-bench output per backend+model+test
```
### Comparison Output
```
Backend | Model | Test | Before | After | Delta
vulkan-radv | qwen3-4b | pp512 | 548 t/s | 612 t/s | +11.7%
vulkan-radv | qwen3-4b | tg128 | 13.9 | 15.2 | +9.4%
```
Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs.
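The delta column is plain percent math; a sketch in integer bash (the real comparison lives in `scripts/benchmark/`, and its rounding may differ by a tenth of a percent):

```shell
delta_pct_tenths() {   # $1 = before t/s, $2 = after t/s (integers)
  echo $(( ($2 - $1) * 1000 / $1 ))   # result in tenths of a percent
}
```

For the pp512 row above, `delta_pct_tenths 548 612` prints 116, i.e. +11.6% (vs +11.7% in the table, a truncation difference).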
## Recommended Test Models
| Size | Model | File | Disk | Use Case |
|------|-------|------|------|----------|
| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests |
| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks |
| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests |
Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit.

docs/bios-vram-guide.md
@@ -1,5 +1,7 @@
# BIOS VRAM Configuration — HP ZBook Ultra G1a
> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md).
## Why Change VRAM?
AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it.

docs/optimization.md
@@ -0,0 +1,84 @@
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches from `throughput-performance` to `accelerator-performance`, which blocks deeper, higher-latency CPU C-states. Provides a 5-8% improvement in prompt processing throughput.
Takes effect immediately. Previous profile is saved for rollback.
## Step 2: Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.
See [docs/architecture.md](architecture.md) for the math behind these values.
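On a BLS-based Fedora install this amounts to roughly the following (a sketch: the values are the 64 GB example from the table, and the script computes them dynamically and keeps a GRUB backup):

```shell
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
```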
## Step 3: BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram
```
This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.
**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
## Step 4: Verify
```bash
make verify
```
Checks 9 criteria and reports a score. Target: 9/9.
## Step 5: Measure Impact
```bash
make benchmark
make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
```
See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation.
## Expected Impact
| Optimization | pp512 Improvement | tg128 Improvement |
|-------------|-------------------|-------------------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
## Rollback
```bash
sudo make rollback
```
Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size).
## Troubleshooting
If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md).

docs/references.md
@@ -0,0 +1,49 @@
# External References
Single source of truth for all external links used across this project.
## AMD Official
- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration
- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning
- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation
- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API
- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker
## Strix Halo Toolboxes (Donato Capitella)
The most comprehensive community resource for Strix Halo LLM optimization.
- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides
- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator
- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts
## Community
- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
## Other Strix Halo Repos
- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization
- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3)
## Monitoring Tools
- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON)
- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor
- [btop](https://github.com/aristocratos/btop) — System resource monitor
## LLM Inference
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm)
- [ollama](https://ollama.com/) — LLM runtime with model management
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling
- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis

docs/troubleshooting.md
@@ -0,0 +1,96 @@
# Troubleshooting
## Firmware: linux-firmware 20251125 Causes ROCm Crashes
**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.
**Check**: `rpm -qa | grep linux-firmware`
**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware:
```bash
sudo dracut -f --kver $(uname -r)
```
The toolkit checks this automatically — `make audit` shows firmware status.
## amdgpu_top: Cargo Build Fails (gix-hash error)
**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`.
**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency.
**Fix**: Use the pre-built RPM instead:
```bash
make monitor-install
```
The install script downloads the RPM from GitHub releases, bypassing cargo entirely.
## Toolbox GPU Access Failure
**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container.
**Check**: Device mappings when creating the toolbox:
- Vulkan backends need: `--device /dev/dri`
- ROCm backends need: `--device /dev/dri --device /dev/kfd`
**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically.
Also ensure your user is in the `video` and `render` groups:
```bash
sudo usermod -aG video,render $USER
```
## GRUB Changes Not Taking Effect
**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params.
**Possible causes**:
1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. The script uses `grubby` when available, but verify:
```bash
grubby --info=ALL | grep args
```
2. **Wrong GRUB config path**: Check which config is actually used:
```bash
cat /proc/cmdline # what the kernel actually booted with
cat /etc/default/grub # what the script modified
```
3. **GRUB not regenerated**: Manually regenerate:
```bash
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```
## Memory Unchanged After BIOS Change
**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB.
**Check**:
```bash
cat /sys/class/drm/card1/device/mem_info_vram_total
```
**Possible causes**:
- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (freed RAM only becomes GPU-reachable once the `amdgpu.gttsize`/`ttm.pages_limit` params are in place)
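A quick sanity conversion for the raw sysfs value (a sketch; after a successful change, 0.5 GiB reads back as 536870912 bytes):

```shell
bytes_to_gib() { echo $(( $1 / 1073741824 )); }   # truncating integer GiB
bytes_to_gib 34359738368   # the unchanged 32 GiB default -> prints 32
```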
## Benchmark Failures
**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends.
**Common fixes**:
- Ensure model exists: `ls data/models/*.gguf`
- Check model fits in memory: small models (4B) for initial testing
- Try `llama-vulkan-radv` first (most stable backend)
- Check dmesg for GPU errors: `dmesg | tail -30`
## Rollback
If optimization causes issues:
```bash
sudo make rollback
```
This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.