From 5b81437637295a37627cd8169334ae80fbaa80f0 Mon Sep 17 00:00:00 2001 From: Felipe Cardoso Date: Wed, 25 Mar 2026 20:50:00 +0100 Subject: [PATCH] docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite - README.md: project overview, quick start, command reference, workflow - CLAUDE.md: AI safety rules, technical details, conventions - AGENTS.md: agent workflows, file responsibility map, dependency matrix - docs/architecture.md: script layers, data flow, unified memory, JSON schemas - docs/optimization.md: step-by-step optimization walkthrough - docs/benchmarking.md: methodology, test params, result interpretation - docs/troubleshooting.md: common issues and fixes - docs/references.md: centralized external links (single source of truth) - docs/bios-vram-guide.md: add back-link to optimization workflow Cross-linked non-redundantly: each doc owns one layer, others link to it. Co-Authored-By: Claude Opus 4.6 (1M context) --- AGENTS.md | 62 +++++++++++++++++++++ CLAUDE.md | 48 ++++++++++++++++ README.md | 114 ++++++++++++++++++++++++++++++++++++++ docs/architecture.md | 118 ++++++++++++++++++++++++++++++++++++++++ docs/benchmarking.md | 94 ++++++++++++++++++++++++++++++++ docs/bios-vram-guide.md | 2 + docs/optimization.md | 84 ++++++++++++++++++++++++++++ docs/references.md | 49 +++++++++++++++++ docs/troubleshooting.md | 96 ++++++++++++++++++++++++++++++++ 9 files changed, 667 insertions(+) create mode 100644 AGENTS.md create mode 100644 CLAUDE.md create mode 100644 README.md create mode 100644 docs/architecture.md create mode 100644 docs/benchmarking.md create mode 100644 docs/optimization.md create mode 100644 docs/references.md create mode 100644 docs/troubleshooting.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..4d6573d --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,62 @@ +# AGENTS.md — Agent Workflows + +Read [CLAUDE.md](CLAUDE.md) first for safety rules and technical context. + +## Common Workflows + +### Add a Detection Function + +1. 
Add the function to `lib/detect.sh` following `detect_*` naming convention +2. If it reads sysfs, use `$GPU_SYSFS` (auto-detected) with a `2>/dev/null` fallback +3. Wire it into `scripts/audit/quick-glance.sh` (display) and/or `scripts/audit/system-report.sh` (JSON output) +4. If it has an optimal value, add a check to `scripts/optimize/verify.sh` +5. Validate: `make audit` and `bin/audit --json | python3 -m json.tool` + +### Add a Benchmark Backend + +1. Add the toolbox entry to `BENCH_PATHS` associative array in both `scripts/benchmark/run-baseline.sh` and `scripts/benchmark/run-suite.sh` +2. Map the toolbox name → llama-bench binary path (Vulkan: `/usr/sbin/llama-bench`, ROCm: `/usr/local/bin/llama-bench`) +3. If ROCm, the `ENV_ARGS` logic already handles `ROCBLAS_USE_HIPBLASLT=1` +4. Add the toolbox to `refresh-toolboxes.sh` in the [toolboxes repo](https://github.com/kyuz0/amd-strix-halo-toolboxes) +5. Validate: `make benchmark-setup` then `make benchmark` + +### Add an Optimization Script + +1. Create `scripts/optimize/my-optimization.sh` sourcing `lib/common.sh` (and `detect.sh` / `format.sh` as needed) +2. Add root check at top if the script modifies system state: `[[ $EUID -ne 0 ]] && { log_error "Requires root"; exit 1; }` +3. Add a corresponding case to `bin/optimize` +4. Add a Makefile target +5. Add verification criteria to `scripts/optimize/verify.sh` +6. If the optimization is reversible, add rollback logic to `scripts/optimize/rollback.sh` +7. Document in [docs/optimization.md](docs/optimization.md) + +### Add a Monitoring Metric + +1. In `scripts/monitor/log-metrics.sh`, cache the sysfs path at startup (avoid per-sample globbing) +2. Read with `read -r var < "$SYSFS_PATH" 2>/dev/null || var=0` (no subshells in the hot loop) +3. Add the column to the CSV header and the `echo` line +4. Update the CSV schema in [docs/architecture.md](docs/architecture.md) + +## File Responsibility Map + +| Want to change... 
| Touch these files | +|-------------------|-------------------| +| What `make audit` shows | `scripts/audit/quick-glance.sh`, `lib/detect.sh` | +| JSON audit output | `scripts/audit/system-report.sh`, `lib/detect.sh` | +| Dashboard layout | `scripts/monitor/dashboard.sh` | +| Metric collection | `scripts/monitor/log-metrics.sh` | +| Benchmark parameters | `scripts/benchmark/run-baseline.sh`, `run-suite.sh` | +| Result comparison | `scripts/benchmark/compare.sh` | +| Kernel params | `scripts/optimize/kernel-params.sh`, `lib/detect.sh` (recommended values) | +| Optimization checks | `scripts/optimize/verify.sh`, `scripts/audit/quick-glance.sh` | +| Shared utilities | `lib/common.sh` (logging), `lib/format.sh` (output), `lib/detect.sh` (hardware) | +| External links | `docs/references.md` (single source of truth) | + +## Dependencies by Workflow + +| Workflow | Requires | +|----------|----------| +| Audit | `bc`, `python3` | +| Monitor | `tmux`, `amdgpu_top` or `nvtop`, `btop` or `htop` | +| Benchmark | `toolbox`, `podman`, GGUF models in `data/models/`, `python3` | +| Optimize | `sudo`/root, `grubby` or `grub2-mkconfig`, `tuned-adm`, `python3` | diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..9319d35 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,48 @@ +# CLAUDE.md — AI Assistant Context + +Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See [README.md](README.md) for user-facing commands. + +## Architecture + +`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. All scripts source libs in order: `common.sh` → `detect.sh` → `format.sh`. Runtime data goes to `data/` (gitignored). Full details in [docs/architecture.md](docs/architecture.md). + +## Safety Rules + +- **`scripts/optimize/kernel-params.sh`** modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. 
Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code). +- **`scripts/optimize/tuned-profile.sh`** and **`rollback.sh`** require root and save previous state for rollback. +- **`data/backups/`** contains GRUB backups and tuned profile snapshots — never delete these. +- Optimization scripts that require root check `$EUID` at the top and exit immediately if not root. +- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. **Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.** + +## Key Technical Details + +- **GPU sysfs**: Auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to first card with `mem_info_vram_total`. +- **Memory recommendations**: `recommended_gttsize_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB. +- **Kernel param detection**: `detect_kernel_param()` uses word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`. +- **Benchmark invocation**: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. ENV_ARGS passed as a proper bash array (not string splitting). +- **llama-bench output**: Pipe-delimited table. Python parser at fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing. +- **ROCm for gfx1151**: `ROCBLAS_USE_HIPBLASLT=1`, `HSA_OVERRIDE_GFX_VERSION=11.5.1`. +- **Fedora GRUB**: Prefers `grubby` (BLS) over `grub2-mkconfig`. Both paths are handled. 
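The env-var rule above is the load-bearing safety property of the GRUB editor. A minimal sketch of the pattern (the file path and parameter below are illustrative examples, not the toolkit's actual code):

```shell
# Illustrative sketch of the required pattern: data reaches Python via
# os.environ, never via shell interpolation into the Python source.
# GRUB_FILE here is a temp file, not the real /etc/default/grub.
GRUB_FILE="$(mktemp)"
printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > "$GRUB_FILE"

NEW_PARAM="amdgpu.gttsize=61440" GRUB_FILE="$GRUB_FILE" python3 - <<'PY'
import os, re
param = os.environ["NEW_PARAM"]   # injected via the environment -- safe
path  = os.environ["GRUB_FILE"]
with open(path) as f:
    text = f.read()
# Append the parameter inside GRUB_CMDLINE_LINUX="..."
text = re.sub(r'(GRUB_CMDLINE_LINUX=")([^"]*)',
              lambda m: m.group(1) + (m.group(2) + " " + param).strip(),
              text)
with open(path, "w") as f:
    f.write(text)
PY
cat "$GRUB_FILE"   # GRUB_CMDLINE_LINUX="quiet splash amdgpu.gttsize=61440"
```

The quoted heredoc delimiter (`<<'PY'`) is what keeps the shell from expanding anything inside the Python source.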
+ +## Conventions + +- `set -euo pipefail` in every executable script +- `snake_case` function names, `UPPER_CASE` for constants and loop variables +- 4-space indentation, no tabs +- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support +- Colors gated on `[[ -t 1 ]]` (disabled when piped) +- `bc` used for float math; `python3` for JSON and GRUB editing only + +## Validating Changes + +```bash +make audit # Quick check — shows system status with pass/fail indicators +make verify # 9-point optimization checklist +bin/audit --json | python3 -m json.tool # Verify JSON output is valid +``` + +## External Resources + +All external links are centralized in [docs/references.md](docs/references.md). Key ones: +- AMD ROCm Strix Halo guide (kernel params, GTT configuration) +- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator) diff --git a/README.md b/README.md new file mode 100644 index 0000000..e4d76de --- /dev/null +++ b/README.md @@ -0,0 +1,114 @@ +# Strix Halo Optimization Toolkit + +Audit, monitor, benchmark, and optimize AMD Strix Halo integrated GPU systems for LLM inference workloads. + +**Target hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) with 64 GB unified memory, on Fedora 43. Tested on HP ZBook Ultra G1a. + +## Quick Start + +```bash +make audit # See current system status and optimization score +make monitor-install # Install amdgpu_top + btop +make benchmark-setup # Create toolbox containers + download test model +make benchmark-baseline # Capture performance before optimization +``` + +## System Status + +`make audit` produces a single-screen overview: + +``` +=== Memory Allocation === + [!!] VRAM (dedicated) 32.0 GiB — should be 0.5 GiB in BIOS + [!!] GTT (dynamic) 15.5 GiB — should be ~59.0 GiB with kernel params + +=== Kernel Boot Parameters === + [!!] iommu=pt MISSING + [!!] amdgpu.gttsize MISSING — recommended: 60416 + [!!] 
ttm.pages_limit MISSING — recommended: 15466496 + +=== Performance Profile === + [!!] Tuned profile throughput-performance — recommended: accelerator-performance + +=== Optimization Score === + 2 / 8 checks passing +``` + +Each `[!!]` is an optimization opportunity. Run `make optimize` to address them. + +## Commands + +| Command | Description | +|---------|-------------| +| `make audit` | Quick system status (single screen) | +| `make audit-full` | Full system report (saved to data/audits/) | +| `make monitor` | Launch tmux monitoring dashboard | +| `make monitor-simple` | Launch amdgpu_top only | +| `make monitor-install` | Install monitoring tools (amdgpu_top, btop) | +| `make monitor-log` | Start background CSV metric logger | +| `make benchmark-setup` | Ensure toolboxes and test models are ready | +| `make benchmark-baseline` | Capture pre-optimization baseline | +| `make benchmark` | Run full benchmark suite | +| `make benchmark-compare` | Compare two runs (`BEFORE=dir AFTER=dir`) | +| `sudo make optimize` | Interactive optimization walkthrough | +| `sudo make optimize-kernel` | Configure kernel boot parameters | +| `sudo make optimize-tuned` | Switch to accelerator-performance profile | +| `make optimize-vram` | BIOS VRAM guidance + GTT verification | +| `make verify` | Post-optimization verification checklist | +| `sudo make rollback` | Rollback optimizations | + +## Optimization Workflow + +``` +1. Audit make audit + │ +2. Monitor make monitor-install && make monitor + │ +3. Baseline make benchmark-setup && make benchmark-baseline + │ +4. Optimize sudo make optimize + │ ├── tuned profile (instant, +5-8% pp) + │ ├── kernel params (reboot required) + │ └── BIOS VRAM (reboot + BIOS access) + │ +5. Verify make verify + │ +6. Re-benchmark make benchmark && make benchmark-compare BEFORE=... AFTER=... +``` + +See [docs/optimization.md](docs/optimization.md) for the full walkthrough with explanations. 
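The comparison in step 6 reports plain relative change per result row. A one-line sketch of that math (illustrative only; the actual logic lives in `scripts/benchmark/compare.sh`):

```shell
# Hypothetical sketch of the delta column printed by `make benchmark-compare`
# (not the real implementation, which is scripts/benchmark/compare.sh).
pct_delta() {    # pct_delta BEFORE AFTER -> signed percent change
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

pct_delta 548 612    # prompt-processing t/s before vs. after -> +11.7%
```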
+ +## Project Structure + +``` +bin/ Entry points (audit, monitor, benchmark, optimize) +lib/ Shared bash libraries (common, detect, format) +scripts/ Implementation organized by function +configs/ Reference configuration templates +data/ Runtime output: audits, benchmarks, logs, backups (gitignored) +docs/ Technical documentation +``` + +See [docs/architecture.md](docs/architecture.md) for the full architecture, data flow, and JSON schemas. + +## Requirements + +- **OS**: Fedora 43 (tested), Fedora 42+ should work +- **Hardware**: AMD Strix Halo (Ryzen AI MAX / MAX+) with RDNA 3.5 iGPU +- **Tools**: `bc`, `python3`, `tmux`, `podman`, `toolbox` +- **Optional**: `amdgpu_top` (installed via `make monitor-install`), `huggingface-cli` (for model downloads) + +## Documentation + +| Document | Contents | +|----------|----------| +| [docs/architecture.md](docs/architecture.md) | Script layers, data flow, unified memory model, JSON schemas | +| [docs/optimization.md](docs/optimization.md) | Step-by-step optimization walkthrough | +| [docs/benchmarking.md](docs/benchmarking.md) | Benchmark methodology, test params, result interpretation | +| [docs/bios-vram-guide.md](docs/bios-vram-guide.md) | HP ZBook BIOS configuration for VRAM | +| [docs/troubleshooting.md](docs/troubleshooting.md) | Common issues and fixes | +| [docs/references.md](docs/references.md) | External links: AMD docs, toolboxes, community resources | + +## Contributing + +AI assistants: see [CLAUDE.md](CLAUDE.md) for safety rules and technical context. Agent workflows are in [AGENTS.md](AGENTS.md). 
diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..72aa1ac --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,118 @@ +# Architecture + +## Script Layers + +``` +bin/ User entry points (thin dispatchers) + audit → scripts/audit/ + monitor → scripts/monitor/ + benchmark → scripts/benchmark/ + optimize → scripts/optimize/ + +scripts/ Implementation (sourcing lib/ for shared logic) + audit/ System assessment + monitor/ GPU/system monitoring + metrics logging + benchmark/ llama-bench via toolbox containers + optimize/ GRUB, tuned, BIOS guidance, verify, rollback + +lib/ Shared bash libraries (sourced, not executed) + common.sh Logging, root checks, confirm prompts, paths + detect.sh Hardware/config detection from sysfs + system tools + format.sh Color output, human_bytes, status indicators, tables + +configs/ Reference configuration templates +data/ Runtime output (gitignored) +docs/ Documentation +``` + +Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`. + +## Data Flow + +``` +/sys/class/drm/card1/device/ ──→ lib/detect.sh ──→ scripts/audit/ +/proc/cpuinfo, /proc/meminfo ──→ (detect_*) ──→ scripts/monitor/ +/proc/cmdline ──→ ──→ scripts/optimize/ +tuned-adm, rpm, lspci ──→ ──→ scripts/benchmark/ + │ + ▼ + data/ + ├── audits/*.json + ├── logs/*.csv + ├── baselines/*/summary.json + └── benchmarks/*/summary.json +``` + +## Unified Memory Model + +AMD Strix Halo shares physical RAM between CPU and GPU. 
Two allocation mechanisms: + +| Type | Description | Configuration | +|------|-------------|---------------| +| **VRAM (dedicated)** | Permanently reserved for GPU framebuffer | BIOS setting (UMA Frame Buffer Size) | +| **GTT (dynamic)** | System RAM mapped into GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` | + +**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on 64 GB system). The GPU borrows memory when needed and releases it when idle. + +### Kernel Parameter Math (64 GB system) + +``` +Total physical RAM: 64 GiB +OS reserve: 4 GiB +Available for GTT: 60 GiB = 61440 MiB + +amdgpu.gttsize = 60 * 1024 = 61440 (MiB) +ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages) +iommu = pt (passthrough, lower latency) +``` + +The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM). + +### Sysfs Paths + +| Path | Content | +|------|---------| +| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes | +| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use | +| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes | +| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use | +| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 | +| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C | +| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts | + +Card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`). 
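The two-pass detection strategy can be sketched as follows (illustrative; the real function is `find_gpu_card()` in `lib/detect.sh`, and the base-directory argument exists here only to make the sketch testable):

```shell
# Sketch of the card-detection strategy described above.
find_amd_card() {
    local base="${1:-/sys/class/drm}" card
    # Pass 1: match the AMD PCI vendor ID
    for card in "$base"/card[0-9]*; do
        [[ "$(cat "$card/device/vendor" 2>/dev/null)" == "0x1002" ]] &&
            { echo "$card"; return 0; }
    done
    # Pass 2 (fallback): first card that exposes a VRAM counter
    for card in "$base"/card[0-9]*; do
        [[ -r "$card/device/mem_info_vram_total" ]] &&
            { echo "$card"; return 0; }
    done
    return 1
}

find_amd_card || echo "no AMD card detected on this machine"
```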
+ +## JSON Output Schemas + +### system-state.json (from `audit --json`) + +```json +{ + "timestamp": "20260325-120000", + "hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 }, + "memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 }, + "kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" }, + "firmware": "...", "tuned_profile": "...", "rocm_version": "...", + "vulkan": { "driver": "...", "version": "..." }, + "sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 }, + "toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." } +} +``` + +### summary.json (from benchmark runs) + +```json +{ + "results": [ + { "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" } + ] +} +``` + +### metrics.csv (from `monitor --log`) + +``` +timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib +``` + +Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample). diff --git a/docs/benchmarking.md b/docs/benchmarking.md new file mode 100644 index 0000000..ab9158a --- /dev/null +++ b/docs/benchmarking.md @@ -0,0 +1,94 @@ +# Benchmarking + +## What We Measure + +All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types: + +| Metric | Meaning | Test Params | +|--------|---------|-------------| +| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens | +| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens | + +Results are in **tokens/second (t/s)**. 
Higher is better.

## Test Parameters

### Standard Test
```
-ngl 99 -mmp 0 -fa 1 -r 5
```
- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence

### Long-Context Test
```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```
- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)

The `-fa 1`, `-mmp 0` (`--no-mmap`), and `-ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.

## Available Backends

| Backend | Container | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2 GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |

Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.

## Workflow

```bash
# 1. Setup (one-time)
make benchmark-setup

# 2. Capture baseline (before optimization)
make benchmark-baseline

# 3. After optimizing, run again
make benchmark        # or: bin/benchmark run --tag post-opt

# 4.
Compare +make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000 +``` + +## Result Format + +Each run produces a directory under `data/baselines/` or `data/benchmarks/`: + +``` +TIMESTAMP/ + system-state.json # Full system audit snapshot + summary.json # Parsed results (model, backend, test, t/s) + metrics.csv # GPU/CPU metrics during the run + *.log # Raw llama-bench output per backend+model+test +``` + +### Comparison Output + +``` +Backend | Model | Test | Before | After | Delta +vulkan-radv | qwen3-4b | pp512 | 548 t/s | 612 t/s | +11.7% +vulkan-radv | qwen3-4b | tg128 | 13.9 | 15.2 | +9.4% +``` + +Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs. + +## Recommended Test Models + +| Size | Model | File | Disk | Use Case | +|------|-------|------|------|----------| +| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests | +| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks | +| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests | + +Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit. diff --git a/docs/bios-vram-guide.md b/docs/bios-vram-guide.md index e8fa32b..31a2218 100644 --- a/docs/bios-vram-guide.md +++ b/docs/bios-vram-guide.md @@ -1,5 +1,7 @@ # BIOS VRAM Configuration — HP ZBook Ultra G1a +> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md). + ## Why Change VRAM? AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it. 
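To confirm the current carve-out on a running system before touching the BIOS, read the sysfs counter (a sketch; the card index varies per machine, and the toolkit auto-detects it):

```shell
# Quick check of the current dedicated-VRAM carve-out.
# card1 is an example index, not guaranteed on every machine.
bytes_to_gib() { awk '{ printf "%.1f\n", $1 / (1024 ^ 3) }'; }

bytes_to_gib 2>/dev/null < /sys/class/drm/card1/device/mem_info_vram_total \
    || echo "sysfs path not found on this machine"
```

On the default HP ZBook configuration this prints `32.0`; after the BIOS change it should print `0.5`.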
diff --git a/docs/optimization.md b/docs/optimization.md
new file mode 100644
index 0000000..6f0ed19
--- /dev/null
+++ b/docs/optimization.md
@@ -0,0 +1,84 @@
# Optimization Guide

Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.

**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.

## Step 1: Tuned Profile (no reboot)

```bash
sudo make optimize-tuned
```

Switches from `throughput-performance` to `accelerator-performance`, which prevents the CPU from entering higher-latency idle (C-)states. Provides a 5-8% improvement in prompt-processing throughput.

Takes effect immediately. The previous profile is saved for rollback.

## Step 2: Kernel Boot Parameters (reboot required)

```bash
sudo make optimize-kernel
```

Adds three parameters to GRUB:

| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |

Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.

See [docs/architecture.md](architecture.md) for the math behind these values.

## Step 3: BIOS VRAM Reduction (reboot + BIOS access)

```bash
make optimize-vram
```

This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.

See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.

**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.

## Step 4: Verify

```bash
make verify
```

Checks 9 criteria and reports a score. Target: 9/9.
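The parameter checks mirror the word-boundary-anchored matching described in CLAUDE.md. An illustrative spot-check (not the toolkit's verify script):

```shell
# Illustrative spot-check: match each expected parameter against a kernel
# command line with a left word boundary, so `iommu=pt` is not satisfied
# by `amd_iommu=on`. Not the toolkit's implementation.
has_kernel_param() {    # has_kernel_param CMDLINE PARAM-PREFIX
    local cmdline="$1" want="$2"
    grep -Eq "(^| )${want//./\\.}" <<< "$cmdline"
}

if has_kernel_param "$(cat /proc/cmdline)" "iommu=pt"; then
    echo "iommu=pt present"
else
    echo "iommu=pt missing"
fi
```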
+ +## Step 5: Measure Impact + +```bash +make benchmark +make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP +``` + +See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation. + +## Expected Impact + +| Optimization | pp512 Improvement | tg128 Improvement | +|-------------|-------------------|-------------------| +| Tuned profile | +5-8% | +2-3% | +| Kernel params + BIOS VRAM | +10-20% | +5-15% | +| **Combined** | **+15-25%** | **+8-18%** | + +Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion. + +## Rollback + +```bash +sudo make rollback +``` + +Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size). + +## Troubleshooting + +If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md). diff --git a/docs/references.md b/docs/references.md new file mode 100644 index 0000000..41fd423 --- /dev/null +++ b/docs/references.md @@ -0,0 +1,49 @@ +# External References + +Single source of truth for all external links used across this project. + +## AMD Official + +- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration +- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning +- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation +- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API +- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker + +## Strix Halo Toolboxes (Donato Capitella) + +The most comprehensive community resource for Strix Halo LLM optimization. 
+ +- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides +- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator +- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts + +## Community + +- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility +- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration +- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results +- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database + +## Other Strix Halo Repos + +- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization +- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3) + +## Monitoring Tools + +- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON) +- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor +- [btop](https://github.com/aristocratos/btop) — System resource monitor + +## LLM Inference + +- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm) +- [ollama](https://ollama.com/) — LLM runtime with model management +- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving +- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking + +## AMD GPU Profiling + +- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — 
Hardware-level Vulkan/HIP profiling +- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..d8bb729 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,96 @@ +# Troubleshooting + +## Firmware: linux-firmware 20251125 Causes ROCm Crashes + +**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads. + +**Check**: `rpm -qa | grep linux-firmware` + +**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware: +```bash +sudo dracut -f --kver $(uname -r) +``` + +The toolkit checks this automatically — `make audit` shows firmware status. + +## amdgpu_top: Cargo Build Fails (gix-hash error) + +**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`. + +**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency. + +**Fix**: Use the pre-built RPM instead: +```bash +make monitor-install +``` +The install script downloads the RPM from GitHub releases, bypassing cargo entirely. + +## Toolbox GPU Access Failure + +**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container. + +**Check**: Device mappings when creating the toolbox: +- Vulkan backends need: `--device /dev/dri` +- ROCm backends need: `--device /dev/dri --device /dev/kfd` + +**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically. + +Also ensure your user is in the `video` and `render` groups: +```bash +sudo usermod -aG video,render $USER +``` + +## GRUB Changes Not Taking Effect + +**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params. + +**Possible causes**: + +1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. 
The script uses `grubby` when available, but verify: + ```bash + grubby --info=ALL | grep args + ``` + +2. **Wrong GRUB config path**: Check which config is actually used: + ```bash + cat /proc/cmdline # what the kernel actually booted with + cat /etc/default/grub # what the script modified + ``` + +3. **GRUB not regenerated**: Manually regenerate: + ```bash + sudo grub2-mkconfig -o /boot/grub2/grub.cfg + ``` + +## Memory Unchanged After BIOS Change + +**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB. + +**Check**: +```bash +cat /sys/class/drm/card1/device/mem_info_vram_total +``` + +**Possible causes**: +- BIOS change not saved (verify by re-entering BIOS) +- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory") +- Kernel params not applied (VRAM reduction requires kernel params to be useful) + +## Benchmark Failures + +**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends. + +**Common fixes**: +- Ensure model exists: `ls data/models/*.gguf` +- Check model fits in memory: small models (4B) for initial testing +- Try `llama-vulkan-radv` first (most stable backend) +- Check dmesg for GPU errors: `dmesg | tail -30` + +## Rollback + +If optimization causes issues: +```bash +sudo make rollback +``` + +This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.
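For the toolbox GPU-access issue above, a small illustrative helper (not part of the toolkit) that reports which of the required device-access groups are missing from a group list:

```shell
# Illustrative helper: given a group list such as the output of `id -nG`,
# print which GPU device-access groups are missing. Empty output = all set.
missing_gpu_groups() {
    local have=" $* " g out=""
    for g in video render; do
        [[ "$have" == *" $g "* ]] || out+="$g "
    done
    printf '%s\n' "${out% }"
}

missing_gpu_groups $(id -nG)
```

Any group it prints is a candidate for `sudo usermod -aG ... $USER` followed by a re-login.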