docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite

- README.md: project overview, quick start, command reference, workflow
- CLAUDE.md: AI safety rules, technical details, conventions
- AGENTS.md: agent workflows, file responsibility map, dependency matrix
- docs/architecture.md: script layers, data flow, unified memory, JSON schemas
- docs/optimization.md: step-by-step optimization walkthrough
- docs/benchmarking.md: methodology, test params, result interpretation
- docs/troubleshooting.md: common issues and fixes
- docs/references.md: centralized external links (single source of truth)
- docs/bios-vram-guide.md: add back-link to optimization workflow

Cross-linked non-redundantly: each doc owns one layer, others link to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Felipe Cardoso
2026-03-25 20:50:00 +01:00
parent af0515d05d
commit 5b81437637
9 changed files with 667 additions and 0 deletions

AGENTS.md — new file (62 lines)
# AGENTS.md — Agent Workflows
Read [CLAUDE.md](CLAUDE.md) first for safety rules and technical context.
## Common Workflows
### Add a Detection Function
1. Add the function to `lib/detect.sh` following `detect_*` naming convention
2. If it reads sysfs, use `$GPU_SYSFS` (auto-detected) with a `2>/dev/null` fallback
3. Wire it into `scripts/audit/quick-glance.sh` (display) and/or `scripts/audit/system-report.sh` (JSON output)
4. If it has an optimal value, add a check to `scripts/optimize/verify.sh`
5. Validate: `make audit` and `bin/audit --json | python3 -m json.tool`
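Steps 1-2 can be sketched like this (the function name and sysfs file are illustrative, not from the repo; `GPU_SYSFS` is assumed to be set by `lib/detect.sh`'s auto-detection):

```shell
# Hypothetical detector following the detect_* naming convention.
detect_gtt_total_mib() {
    local bytes
    # $GPU_SYSFS with a 2>/dev/null fallback, per the convention above
    bytes=$(cat "${GPU_SYSFS}/mem_info_gtt_total" 2>/dev/null) || bytes=0
    echo $(( ${bytes:-0} / 1024 / 1024 ))
}
```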
### Add a Benchmark Backend
1. Add the toolbox entry to `BENCH_PATHS` associative array in both `scripts/benchmark/run-baseline.sh` and `scripts/benchmark/run-suite.sh`
2. Map the toolbox name → llama-bench binary path (Vulkan: `/usr/sbin/llama-bench`, ROCm: `/usr/local/bin/llama-bench`)
3. If ROCm, the `ENV_ARGS` logic already handles `ROCBLAS_USE_HIPBLASLT=1`
4. Add the toolbox to `refresh-toolboxes.sh` in the [toolboxes repo](https://github.com/kyuz0/amd-strix-halo-toolboxes)
5. Validate: `make benchmark-setup` then `make benchmark`
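The mapping in steps 1-2 might look like this (a sketch using the paths above; the repo's actual array contents may list different toolboxes):

```shell
# Toolbox name → llama-bench binary path, one entry per backend.
declare -A BENCH_PATHS=(
    ["llama-vulkan-radv"]="/usr/sbin/llama-bench"
    ["llama-rocm-6.4.4"]="/usr/local/bin/llama-bench"
)
# Adding a backend is one new entry:
BENCH_PATHS["llama-rocm-7.2"]="/usr/local/bin/llama-bench"
```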
### Add an Optimization Script
1. Create `scripts/optimize/my-optimization.sh` sourcing `lib/common.sh` (and `detect.sh` / `format.sh` as needed)
2. Add root check at top if the script modifies system state: `[[ $EUID -ne 0 ]] && { log_error "Requires root"; exit 1; }`
3. Add a corresponding case to `bin/optimize`
4. Add a Makefile target
5. Add verification criteria to `scripts/optimize/verify.sh`
6. If the optimization is reversible, add rollback logic to `scripts/optimize/rollback.sh`
7. Document in [docs/optimization.md](docs/optimization.md)
### Add a Monitoring Metric
1. In `scripts/monitor/log-metrics.sh`, cache the sysfs path at startup (avoid per-sample globbing)
2. Read with `read -r var < "$SYSFS_PATH" 2>/dev/null || var=0` (no subshells in the hot loop)
3. Add the column to the CSV header and the `echo` line
4. Update the CSV schema in [docs/architecture.md](docs/architecture.md)
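The pattern from steps 1-2, sketched (the sampler function is illustrative — the real hot loop lives in `log-metrics.sh`):

```shell
# Path cached once at startup; no per-sample globbing.
METRIC_PATH=/sys/class/drm/card1/device/gpu_busy_percent

sample() {
    local value
    # bash read builtin: no subshell fork in the hot loop
    read -r value < "$METRIC_PATH" 2>/dev/null || value=0
    printf '%s\n' "$value"
}
```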
## File Responsibility Map
| Want to change... | Touch these files |
|-------------------|-------------------|
| What `make audit` shows | `scripts/audit/quick-glance.sh`, `lib/detect.sh` |
| JSON audit output | `scripts/audit/system-report.sh`, `lib/detect.sh` |
| Dashboard layout | `scripts/monitor/dashboard.sh` |
| Metric collection | `scripts/monitor/log-metrics.sh` |
| Benchmark parameters | `scripts/benchmark/run-baseline.sh`, `run-suite.sh` |
| Result comparison | `scripts/benchmark/compare.sh` |
| Kernel params | `scripts/optimize/kernel-params.sh`, `lib/detect.sh` (recommended values) |
| Optimization checks | `scripts/optimize/verify.sh`, `scripts/audit/quick-glance.sh` |
| Shared utilities | `lib/common.sh` (logging), `lib/format.sh` (output), `lib/detect.sh` (hardware) |
| External links | `docs/references.md` (single source of truth) |
## Dependencies by Workflow
| Workflow | Requires |
|----------|----------|
| Audit | `bc`, `python3` |
| Monitor | `tmux`, `amdgpu_top` or `nvtop`, `btop` or `htop` |
| Benchmark | `toolbox`, `podman`, GGUF models in `data/models/`, `python3` |
| Optimize | `sudo`/root, `grubby` or `grub2-mkconfig`, `tuned-adm`, `python3` |

CLAUDE.md — new file (48 lines)
# CLAUDE.md — AI Assistant Context
Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See [README.md](README.md) for user-facing commands.
## Architecture
`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. All scripts source libs in order: `common.sh` → `detect.sh` → `format.sh`. Runtime data goes to `data/` (gitignored). Full details in [docs/architecture.md](docs/architecture.md).
## Safety Rules
- **`scripts/optimize/kernel-params.sh`** modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
- **`scripts/optimize/tuned-profile.sh`** and **`rollback.sh`** require root and save previous state for rollback.
- **`data/backups/`** contains GRUB backups and tuned profile snapshots — never delete these.
- Optimization scripts that require root check `$EUID` at the top and exit immediately if not root.
- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. **Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.**
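A minimal sketch of the required pattern (variable names illustrative): the value crosses the bash→Python boundary through the environment, so even hostile input stays inert data. The quoted heredoc delimiter (`<<'PY'`) also prevents the shell from expanding anything inside the Python source.

```shell
# Value deliberately contains quote characters that would corrupt Python
# source if interpolated; via os.environ it is just a string.
value='61440" ; import os ; x = "'
length=$(VALUE="$value" python3 - <<'PY'
import os
print(len(os.environ["VALUE"]))   # plain data, never executed as code
PY
)
echo "$length"
```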
## Key Technical Details
- **GPU sysfs**: Auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to first card with `mem_info_vram_total`.
- **Memory recommendations**: `recommended_gttsize_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB.
- **Kernel param detection**: `detect_kernel_param()` uses word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`.
- **Benchmark invocation**: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. ENV_ARGS passed as a proper bash array (not string splitting).
- **llama-bench output**: Pipe-delimited table. Python parser at fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing.
- **ROCm for gfx1151**: `ROCBLAS_USE_HIPBLASLT=1`, `HSA_OVERRIDE_GFX_VERSION=11.5.1`.
- **Fedora GRUB**: Prefers `grubby` (BLS) over `grub2-mkconfig`. Both paths are handled.
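The word-boundary anchoring noted above can be sketched like this (illustrative — the real `detect_kernel_param()` may differ):

```shell
# Match the param only at a word start, so "iommu" never matches the
# tail of "amd_iommu". The quoted "$param" is treated literally (dots too).
has_kernel_param() {
    local cmdline=" $1 " param=$2
    [[ $cmdline =~ [[:space:]]"$param"(=[^[:space:]]*)?[[:space:]] ]]
}
```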
## Conventions
- `set -euo pipefail` in every executable script
- `snake_case` function names, `UPPER_CASE` for constants and loop variables
- 4-space indentation, no tabs
- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support
- Colors gated on `[[ -t 1 ]]` (disabled when piped)
- `bc` used for float math; `python3` for JSON and GRUB editing only
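The TTY color gating, sketched (variable names illustrative):

```shell
# Enable ANSI colors only when stdout is a terminal; piped output stays plain.
if [[ -t 1 ]]; then
    C_RED=$'\033[0;31m' C_RESET=$'\033[0m'
else
    C_RED='' C_RESET=''
fi
printf '%s%s%s\n' "$C_RED" "[!!] check failed" "$C_RESET"
```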
## Validating Changes
```bash
make audit # Quick check — shows system status with pass/fail indicators
make verify # 9-point optimization checklist
bin/audit --json | python3 -m json.tool # Verify JSON output is valid
```
## External Resources
All external links are centralized in [docs/references.md](docs/references.md). Key ones:
- AMD ROCm Strix Halo guide (kernel params, GTT configuration)
- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)

README.md — new file (114 lines)
# Strix Halo Optimization Toolkit
Audit, monitor, benchmark, and optimize AMD Strix Halo integrated GPU systems for LLM inference workloads.
**Target hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) with 64 GB unified memory, on Fedora 43. Tested on HP ZBook Ultra G1a.
## Quick Start
```bash
make audit # See current system status and optimization score
make monitor-install # Install amdgpu_top + btop
make benchmark-setup # Create toolbox containers + download test model
make benchmark-baseline # Capture performance before optimization
```
## System Status
`make audit` produces a single-screen overview:
```
=== Memory Allocation ===
[!!] VRAM (dedicated) 32.0 GiB — should be 0.5 GiB in BIOS
[!!] GTT (dynamic) 15.5 GiB — should be ~59.0 GiB with kernel params
=== Kernel Boot Parameters ===
[!!] iommu=pt MISSING
[!!] amdgpu.gttsize MISSING — recommended: 60416
[!!] ttm.pages_limit MISSING — recommended: 15466496
=== Performance Profile ===
[!!] Tuned profile throughput-performance — recommended: accelerator-performance
=== Optimization Score ===
2 / 8 checks passing
```
Each `[!!]` is an optimization opportunity. Run `make optimize` to address them.
## Commands
| Command | Description |
|---------|-------------|
| `make audit` | Quick system status (single screen) |
| `make audit-full` | Full system report (saved to data/audits/) |
| `make monitor` | Launch tmux monitoring dashboard |
| `make monitor-simple` | Launch amdgpu_top only |
| `make monitor-install` | Install monitoring tools (amdgpu_top, btop) |
| `make monitor-log` | Start background CSV metric logger |
| `make benchmark-setup` | Ensure toolboxes and test models are ready |
| `make benchmark-baseline` | Capture pre-optimization baseline |
| `make benchmark` | Run full benchmark suite |
| `make benchmark-compare` | Compare two runs (`BEFORE=dir AFTER=dir`) |
| `sudo make optimize` | Interactive optimization walkthrough |
| `sudo make optimize-kernel` | Configure kernel boot parameters |
| `sudo make optimize-tuned` | Switch to accelerator-performance profile |
| `make optimize-vram` | BIOS VRAM guidance + GTT verification |
| `make verify` | Post-optimization verification checklist |
| `sudo make rollback` | Rollback optimizations |
## Optimization Workflow
```
1. Audit         make audit
2. Monitor       make monitor-install && make monitor
3. Baseline      make benchmark-setup && make benchmark-baseline
4. Optimize      sudo make optimize
   ├── tuned profile (instant, +5-8% pp)
   ├── kernel params (reboot required)
   └── BIOS VRAM (reboot + BIOS access)
5. Verify        make verify
6. Re-benchmark  make benchmark && make benchmark-compare BEFORE=... AFTER=...
```
See [docs/optimization.md](docs/optimization.md) for the full walkthrough with explanations.
## Project Structure
```
bin/ Entry points (audit, monitor, benchmark, optimize)
lib/ Shared bash libraries (common, detect, format)
scripts/ Implementation organized by function
configs/ Reference configuration templates
data/ Runtime output: audits, benchmarks, logs, backups (gitignored)
docs/ Technical documentation
```
See [docs/architecture.md](docs/architecture.md) for the full architecture, data flow, and JSON schemas.
## Requirements
- **OS**: Fedora 43 (tested), Fedora 42+ should work
- **Hardware**: AMD Strix Halo (Ryzen AI MAX / MAX+) with RDNA 3.5 iGPU
- **Tools**: `bc`, `python3`, `tmux`, `podman`, `toolbox`
- **Optional**: `amdgpu_top` (installed via `make monitor-install`), `huggingface-cli` (for model downloads)
## Documentation
| Document | Contents |
|----------|----------|
| [docs/architecture.md](docs/architecture.md) | Script layers, data flow, unified memory model, JSON schemas |
| [docs/optimization.md](docs/optimization.md) | Step-by-step optimization walkthrough |
| [docs/benchmarking.md](docs/benchmarking.md) | Benchmark methodology, test params, result interpretation |
| [docs/bios-vram-guide.md](docs/bios-vram-guide.md) | HP ZBook BIOS configuration for VRAM |
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common issues and fixes |
| [docs/references.md](docs/references.md) | External links: AMD docs, toolboxes, community resources |
## Contributing
AI assistants: see [CLAUDE.md](CLAUDE.md) for safety rules and technical context. Agent workflows are in [AGENTS.md](AGENTS.md).

docs/architecture.md — new file (118 lines)
# Architecture
## Script Layers
```
bin/ User entry points (thin dispatchers)
audit → scripts/audit/
monitor → scripts/monitor/
benchmark → scripts/benchmark/
optimize → scripts/optimize/
scripts/ Implementation (sourcing lib/ for shared logic)
audit/ System assessment
monitor/ GPU/system monitoring + metrics logging
benchmark/ llama-bench via toolbox containers
optimize/ GRUB, tuned, BIOS guidance, verify, rollback
lib/ Shared bash libraries (sourced, not executed)
common.sh Logging, root checks, confirm prompts, paths
detect.sh Hardware/config detection from sysfs + system tools
format.sh Color output, human_bytes, status indicators, tables
configs/ Reference configuration templates
data/ Runtime output (gitignored)
docs/ Documentation
```
Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`.
## Data Flow
```
/sys/class/drm/card1/device/  ──→                 ──→  scripts/audit/
/proc/cpuinfo, /proc/meminfo  ──→  lib/detect.sh  ──→  scripts/monitor/
/proc/cmdline                 ──→  (detect_*)     ──→  scripts/optimize/
tuned-adm, rpm, lspci         ──→                 ──→  scripts/benchmark/
data/
├── audits/*.json
├── logs/*.csv
├── baselines/*/summary.json
└── benchmarks/*/summary.json
```
## Unified Memory Model
AMD Strix Halo shares physical RAM between CPU and GPU. Two allocation mechanisms:
| Type | Description | Configuration |
|------|-------------|---------------|
| **VRAM (dedicated)** | Permanently reserved for GPU framebuffer | BIOS setting (UMA Frame Buffer Size) |
| **GTT (dynamic)** | System RAM mapped into GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` |
**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on 64 GB system). The GPU borrows memory when needed and releases it when idle.
### Kernel Parameter Math (64 GB system)
```
Total physical RAM: 64 GiB
OS reserve: 4 GiB
Available for GTT: 60 GiB = 61440 MiB
amdgpu.gttsize = 60 * 1024 = 61440 (MiB)
ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages)
iommu = pt (passthrough, lower latency)
```
The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM).
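The math above, as the dynamic computation might look (function names from the doc; the bodies are an illustrative sketch of the nominal 64 GB case, not the repo's code — the real helpers start from detected visible RAM + VRAM):

```shell
recommended_gttsize_mib() {           # arg: total physical RAM in MiB
    local total_mib=$1 reserve_mib=$(( 4 * 1024 ))   # keep 4 GiB for the OS
    local gtt=$(( total_mib - reserve_mib ))
    if (( gtt < 1024 )); then gtt=1024; fi           # floor at 1 GiB
    echo "$gtt"
}

recommended_pages_limit() {           # arg: GTT size in MiB
    echo $(( $1 * 256 ))              # 1 MiB = 256 × 4K pages
}
```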
### Sysfs Paths
| Path | Content |
|------|---------|
| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes |
| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use |
| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes |
| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use |
| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 |
| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C |
| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts |
Card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`).
## JSON Output Schemas
### system-state.json (from `audit --json`)
```json
{
"timestamp": "20260325-120000",
"hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 },
"memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 },
"kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" },
"firmware": "...", "tuned_profile": "...", "rocm_version": "...",
"vulkan": { "driver": "...", "version": "..." },
"sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 },
"toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." }
}
```
### summary.json (from benchmark runs)
```json
{
"results": [
{ "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" }
]
}
```
### metrics.csv (from `monitor --log`)
```
timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib
```
Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample).

docs/benchmarking.md — new file (94 lines)
# Benchmarking
## What We Measure
All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types:
| Metric | Meaning | Test Params |
|--------|---------|-------------|
| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens |
| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens |
Results are in **tokens/second (t/s)**. Higher is better.
## Test Parameters
### Standard Test
```
-ngl 99 -mmp 0 -fa 1 -r 5
```
- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence
### Long-Context Test
```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```
- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)
The `-fa 1 -mmp 0 -ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.
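Assembled into an invocation, run-suite.sh-style (a sketch; container and model names are illustrative, and the arrays mirror the proper-bash-array convention from CLAUDE.md):

```shell
# ROCm backends prepend env vars as an array — no string splitting.
ENV_ARGS=(env ROCBLAS_USE_HIPBLASLT=1)
BENCH_ARGS=(-ngl 99 -mmp 0 -fa 1 -r 5 -m data/models/Qwen3-4B-Q4_K_M.gguf)
echo toolbox run -c llama-rocm-6.4.4 -- "${ENV_ARGS[@]}" \
    /usr/local/bin/llama-bench "${BENCH_ARGS[@]}"
```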
## Available Backends
| Backend | Container | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |
Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.
## Workflow
```bash
# 1. Setup (one-time)
make benchmark-setup
# 2. Capture baseline (before optimization)
make benchmark-baseline
# 3. After optimizing, run again
make benchmark # or: bin/benchmark run --tag post-opt
# 4. Compare
make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000
```
## Result Format
Each run produces a directory under `data/baselines/` or `data/benchmarks/`:
```
TIMESTAMP/
system-state.json # Full system audit snapshot
summary.json # Parsed results (model, backend, test, t/s)
metrics.csv # GPU/CPU metrics during the run
*.log # Raw llama-bench output per backend+model+test
```
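summary.json entries are parsed out of the raw logs at fixed pipe-delimited column indices (per CLAUDE.md). An illustrative standalone version, fed a synthetic table row — the real format comes from llama-bench and may shift upstream:

```python
def parse_bench_line(line: str) -> tuple[str, float]:
    """Extract (test, tokens/sec) from one llama-bench table row."""
    parts = [p.strip() for p in line.split("|")]
    test, raw = parts[8], parts[9]          # fixed indices, per CLAUDE.md
    return test, float(raw.split()[0])      # "548.18 +/- 1.59" → 548.18

# Synthetic row shaped to put test and t/s at columns 8 and 9:
row = "| qwen3-4b | 3.95 GiB | 4.02 B | Vulkan | 99 | 0 | 1 | pp512 | 548.18 +/- 1.59 |"
print(parse_bench_line(row))
```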
### Comparison Output
```
Backend | Model | Test | Before | After | Delta
vulkan-radv | qwen3-4b | pp512 | 548 t/s | 612 t/s | +11.7%
vulkan-radv | qwen3-4b | tg128 | 13.9 | 15.2 | +9.4%
```
Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs.
## Recommended Test Models
| Size | Model | File | Disk | Use Case |
|------|-------|------|------|----------|
| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests |
| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks |
| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests |
Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit.

docs/bios-vram-guide.md — modified
# BIOS VRAM Configuration — HP ZBook Ultra G1a
> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md).
## Why Change VRAM?
AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it.

docs/optimization.md — new file (84 lines)
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches from `throughput-performance` to `accelerator-performance`, which caps CPU wake-up latency so cores stay out of deep C-states. Typically yields a 5-8% improvement in prompt processing throughput.
Takes effect immediately. Previous profile is saved for rollback.
## Step 2: Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.
See [docs/architecture.md](architecture.md) for the math behind these values.
## Step 3: BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram
```
This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.
See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.
**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
## Step 4: Verify
```bash
make verify
```
Checks 9 criteria and reports a score. Target: 9/9.
## Step 5: Measure Impact
```bash
make benchmark
make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
```
See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation.
## Expected Impact
| Optimization | pp512 Improvement | tg128 Improvement |
|-------------|-------------------|-------------------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
## Rollback
```bash
sudo make rollback
```
Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size).
## Troubleshooting
If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md).

docs/references.md — new file (49 lines)
# External References
Single source of truth for all external links used across this project.
## AMD Official
- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration
- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning
- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation
- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API
- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker
## Strix Halo Toolboxes (Donato Capitella)
The most comprehensive community resource for Strix Halo LLM optimization.
- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides
- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator
- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts
## Community
- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
## Other Strix Halo Repos
- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization
- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3)
## Monitoring Tools
- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON)
- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor
- [btop](https://github.com/aristocratos/btop) — System resource monitor
## LLM Inference
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm)
- [ollama](https://ollama.com/) — LLM runtime with model management
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling
- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis

docs/troubleshooting.md — new file (96 lines)
# Troubleshooting
## Firmware: linux-firmware 20251125 Causes ROCm Crashes
**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.
**Check**: `rpm -qa | grep linux-firmware`
**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware:
```bash
sudo dracut -f --kver $(uname -r)
```
The toolkit checks this automatically — `make audit` shows firmware status.
## amdgpu_top: Cargo Build Fails (gix-hash error)
**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`.
**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency.
**Fix**: Use the pre-built RPM instead:
```bash
make monitor-install
```
The install script downloads the RPM from GitHub releases, bypassing cargo entirely.
## Toolbox GPU Access Failure
**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container.
**Check**: Device mappings when creating the toolbox:
- Vulkan backends need: `--device /dev/dri`
- ROCm backends need: `--device /dev/dri --device /dev/kfd`
**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically.
Also ensure your user is in the `video` and `render` groups:
```bash
sudo usermod -aG video,render $USER
```
## GRUB Changes Not Taking Effect
**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params.
**Possible causes**:
1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. The script uses `grubby` when available, but verify:
```bash
grubby --info=ALL | grep args
```
2. **Wrong GRUB config path**: Check which config is actually used:
```bash
cat /proc/cmdline # what the kernel actually booted with
cat /etc/default/grub # what the script modified
```
3. **GRUB not regenerated**: Manually regenerate:
```bash
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```
## Memory Unchanged After BIOS Change
**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB.
**Check**:
```bash
cat /sys/class/drm/card1/device/mem_info_vram_total
```
**Possible causes**:
- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (reducing VRAM only helps if the GTT kernel params are also in place)
## Benchmark Failures
**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends.
**Common fixes**:
- Ensure model exists: `ls data/models/*.gguf`
- Check model fits in memory: small models (4B) for initial testing
- Try `llama-vulkan-radv` first (most stable backend)
- Check dmesg for GPU errors: `dmesg | tail -30`
## Rollback
If optimization causes issues:
```bash
sudo make rollback
```
This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.