docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite
- README.md: project overview, quick start, command reference, workflow
- CLAUDE.md: AI safety rules, technical details, conventions
- AGENTS.md: agent workflows, file responsibility map, dependency matrix
- docs/architecture.md: script layers, data flow, unified memory, JSON schemas
- docs/optimization.md: step-by-step optimization walkthrough
- docs/benchmarking.md: methodology, test params, result interpretation
- docs/troubleshooting.md: common issues and fixes
- docs/references.md: centralized external links (single source of truth)
- docs/bios-vram-guide.md: add back-link to optimization workflow

Cross-linked non-redundantly: each doc owns one layer, others link to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AGENTS.md (new file)
@@ -0,0 +1,62 @@
# AGENTS.md — Agent Workflows

Read [CLAUDE.md](CLAUDE.md) first for safety rules and technical context.

## Common Workflows
### Add a Detection Function

1. Add the function to `lib/detect.sh` following the `detect_*` naming convention
2. If it reads sysfs, use `$GPU_SYSFS` (auto-detected) with a `2>/dev/null` fallback
3. Wire it into `scripts/audit/quick-glance.sh` (display) and/or `scripts/audit/system-report.sh` (JSON output)
4. If it has an optimal value, add a check to `scripts/optimize/verify.sh`
5. Validate: `make audit` and `bin/audit --json | python3 -m json.tool`
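Steps 1 and 2 above can be sketched as follows. The function name and the sysfs attribute it reads are illustrative; only the `detect_*` shape and the `2>/dev/null` fallback mirror the conventions stated here.

```shell
#!/usr/bin/env bash
# Hypothetical detector: reads one sysfs attribute under $GPU_SYSFS and
# falls back to 0 when the file is missing (e.g. on non-AMD machines).
GPU_SYSFS="${GPU_SYSFS:-/sys/class/drm/card1/device}"

detect_gpu_busy_percent() {
    local value
    value=$(cat "$GPU_SYSFS/gpu_busy_percent" 2>/dev/null) || value=0
    echo "${value:-0}"
}

detect_gpu_busy_percent
```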
### Add a Benchmark Backend

1. Add the toolbox entry to the `BENCH_PATHS` associative array in both `scripts/benchmark/run-baseline.sh` and `scripts/benchmark/run-suite.sh`
2. Map the toolbox name → llama-bench binary path (Vulkan: `/usr/sbin/llama-bench`, ROCm: `/usr/local/bin/llama-bench`)
3. For ROCm backends, the existing `ENV_ARGS` logic already handles `ROCBLAS_USE_HIPBLASLT=1`
4. Add the toolbox to `refresh-toolboxes.sh` in the [toolboxes repo](https://github.com/kyuz0/amd-strix-halo-toolboxes)
5. Validate: `make benchmark-setup`, then `make benchmark`
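A minimal sketch of the `BENCH_PATHS` / `ENV_ARGS` wiring the steps above describe. The binary paths are the ones documented here; treat the array bodies as illustrative rather than the scripts' exact contents.

```shell
#!/usr/bin/env bash
# Toolbox name -> llama-bench path, as described in step 2.
declare -A BENCH_PATHS=(
    ["llama-vulkan-radv"]="/usr/sbin/llama-bench"
    ["llama-rocm-7.2"]="/usr/local/bin/llama-bench"
)

# Build ENV_ARGS as a real bash array so `toolbox run` receives separate
# words rather than one string subjected to word splitting.
build_env_args() {
    ENV_ARGS=()
    if [[ "$1" == llama-rocm-* ]]; then
        ENV_ARGS=(env ROCBLAS_USE_HIPBLASLT=1)
    fi
}

build_env_args "llama-rocm-7.2"
# Real call shape: toolbox run -c "$tb" -- "${ENV_ARGS[@]}" "${BENCH_PATHS[$tb]}" ...
echo "${ENV_ARGS[@]}"
```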
### Add an Optimization Script

1. Create `scripts/optimize/my-optimization.sh` sourcing `lib/common.sh` (and `detect.sh` / `format.sh` as needed)
2. Add a root check at the top if the script modifies system state: `[[ $EUID -ne 0 ]] && { log_error "Requires root"; exit 1; }`
3. Add a corresponding case to `bin/optimize`
4. Add a Makefile target
5. Add verification criteria to `scripts/optimize/verify.sh`
6. If the optimization is reversible, add rollback logic to `scripts/optimize/rollback.sh`
7. Document it in [docs/optimization.md](docs/optimization.md)
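A skeleton of steps 1 and 2 above. `log_error` is stubbed so the sketch runs standalone; the real one comes from `lib/common.sh`.

```shell
#!/usr/bin/env bash
# Skeleton for a new scripts/optimize/*.sh script.
set -euo pipefail

log_error() { echo "ERROR: $*" >&2; }   # stub; real one lives in lib/common.sh

# Root check for scripts that modify system state (step 2).
require_root() {
    [[ $EUID -ne 0 ]] && { log_error "Requires root"; exit 1; }
    return 0
}

main() {
    # In the real script: source lib/common.sh (and detect.sh/format.sh),
    # call require_root, then apply and verify the change.
    echo "optimization steps go here"
}

main "$@"
```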
### Add a Monitoring Metric

1. In `scripts/monitor/log-metrics.sh`, cache the sysfs path at startup (avoid per-sample globbing)
2. Read with `read -r var < "$SYSFS_PATH" 2>/dev/null || var=0` (no subshells in the hot loop)
3. Add the column to the CSV header and the `echo` line
4. Update the CSV schema in [docs/architecture.md](docs/architecture.md)
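The hot-loop pattern from steps 1 and 2 above, in miniature: the path is cached once at startup and samples use only the `read` builtin. The path shown is the usual amdgpu location and is illustrative.

```shell
#!/usr/bin/env bash
# Cached at startup (step 1); no globbing or subshell per sample (step 2).
BUSY_PATH="/sys/class/drm/card1/device/gpu_busy_percent"

sample() {
    local busy
    read -r busy < "$BUSY_PATH" 2>/dev/null || busy=0
    echo "$busy"
}

sample   # prints 0 on machines without the sysfs file
```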
## File Responsibility Map

| Want to change... | Touch these files |
|-------------------|-------------------|
| What `make audit` shows | `scripts/audit/quick-glance.sh`, `lib/detect.sh` |
| JSON audit output | `scripts/audit/system-report.sh`, `lib/detect.sh` |
| Dashboard layout | `scripts/monitor/dashboard.sh` |
| Metric collection | `scripts/monitor/log-metrics.sh` |
| Benchmark parameters | `scripts/benchmark/run-baseline.sh`, `run-suite.sh` |
| Result comparison | `scripts/benchmark/compare.sh` |
| Kernel params | `scripts/optimize/kernel-params.sh`, `lib/detect.sh` (recommended values) |
| Optimization checks | `scripts/optimize/verify.sh`, `scripts/audit/quick-glance.sh` |
| Shared utilities | `lib/common.sh` (logging), `lib/format.sh` (output), `lib/detect.sh` (hardware) |
| External links | `docs/references.md` (single source of truth) |
## Dependencies by Workflow

| Workflow | Requires |
|----------|----------|
| Audit | `bc`, `python3` |
| Monitor | `tmux`, `amdgpu_top` or `nvtop`, `btop` or `htop` |
| Benchmark | `toolbox`, `podman`, GGUF models in `data/models/`, `python3` |
| Optimize | `sudo`/root, `grubby` or `grub2-mkconfig`, `tuned-adm`, `python3` |
CLAUDE.md (new file)
@@ -0,0 +1,48 @@
# CLAUDE.md — AI Assistant Context

Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See [README.md](README.md) for user-facing commands.

## Architecture

`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. All scripts source libs in order: `common.sh` → `detect.sh` → `format.sh`. Runtime data goes to `data/` (gitignored). Full details in [docs/architecture.md](docs/architecture.md).
## Safety Rules

- **`scripts/optimize/kernel-params.sh`** modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
- **`scripts/optimize/tuned-profile.sh`** and **`rollback.sh`** require root and save previous state for rollback.
- **`data/backups/`** contains GRUB backups and tuned profile snapshots — never delete these.
- Optimization scripts that require root check `$EUID` at the top and exit immediately if not root.
- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. **Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.**
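The env-var pattern from the first and last bullets, in miniature: the shell exports the value, the single-quoted heredoc guarantees nothing is interpolated into the Python source, and Python reads it back via `os.environ`. The variable names here are illustrative.

```shell
#!/usr/bin/env bash
# Data crosses into Python through the environment only. The <<'EOF'
# quoting means $NEW_CMDLINE is never expanded inside the Python text.
NEW_CMDLINE='quiet iommu=pt amdgpu.gttsize=61440'

GRUB_NEW_CMDLINE="$NEW_CMDLINE" python3 - <<'EOF'
import os
cmdline = os.environ["GRUB_NEW_CMDLINE"]
print(len(cmdline.split()))   # 3 parameters, regardless of their content
EOF
```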
## Key Technical Details

- **GPU sysfs**: Auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to the first card with `mem_info_vram_total`.
- **Memory recommendations**: `recommended_gttsize_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB.
- **Kernel param detection**: `detect_kernel_param()` uses a word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`.
- **Benchmark invocation**: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. `ENV_ARGS` is passed as a proper bash array (not string splitting).
- **llama-bench output**: Pipe-delimited table. The Python parser reads fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing.
- **ROCm for gfx1151**: `ROCBLAS_USE_HIPBLASLT=1`, `HSA_OVERRIDE_GFX_VERSION=11.5.1`.
- **Fedora GRUB**: Prefers `grubby` (BLS) over `grub2-mkconfig`. Both paths are handled.
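A sketch of the fixed-index parse described above. The sample line is synthetic; it only mimics the pipe-delimited shape so that, after splitting on `|`, the test name lands at `parts[8]` and the t/s cell at `parts[9]` as stated.

```shell
#!/usr/bin/env bash
# Synthetic line with the documented shape (not verbatim llama-bench output).
LINE='| qwen3-4b | 2.4 GiB | 4.0 B | Vulkan | 99 | 0 | 1 | pp512 | 548.18 +/- 1.59 |'

BENCH_LINE="$LINE" python3 - <<'EOF'
import os
parts = [p.strip() for p in os.environ["BENCH_LINE"].split("|")]
test, tps = parts[8], parts[9]
print(test, tps.split()[0])   # test name and mean t/s, dropping the +/- part
EOF
```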
## Conventions

- `set -euo pipefail` in every executable script
- `snake_case` function names, `UPPER_CASE` for constants and loop variables
- 4-space indentation, no tabs
- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support
- Colors gated on `[[ -t 1 ]]` (disabled when piped)
- `bc` used for float math; `python3` for JSON and GRUB editing only
## Validating Changes

```bash
make audit                                # Quick check — system status with pass/fail indicators
make verify                               # 9-point optimization checklist
bin/audit --json | python3 -m json.tool   # Verify JSON output is valid
```
## External Resources

All external links are centralized in [docs/references.md](docs/references.md). Key ones:

- AMD ROCm Strix Halo guide (kernel params, GTT configuration)
- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
README.md (new file)
@@ -0,0 +1,114 @@
# Strix Halo Optimization Toolkit

Audit, monitor, benchmark, and optimize AMD Strix Halo integrated GPU systems for LLM inference workloads.

**Target hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) with 64 GB unified memory, on Fedora 43. Tested on an HP ZBook Ultra G1a.
## Quick Start

```bash
make audit               # See current system status and optimization score
make monitor-install     # Install amdgpu_top + btop
make benchmark-setup     # Create toolbox containers + download test model
make benchmark-baseline  # Capture performance before optimization
```
## System Status

`make audit` produces a single-screen overview:

```
=== Memory Allocation ===
[!!] VRAM (dedicated)   32.0 GiB — should be 0.5 GiB in BIOS
[!!] GTT (dynamic)      15.5 GiB — should be ~59.0 GiB with kernel params

=== Kernel Boot Parameters ===
[!!] iommu=pt           MISSING
[!!] amdgpu.gttsize     MISSING — recommended: 60416
[!!] ttm.pages_limit    MISSING — recommended: 15466496

=== Performance Profile ===
[!!] Tuned profile      throughput-performance — recommended: accelerator-performance

=== Optimization Score ===
2 / 8 checks passing
```

Each `[!!]` is an optimization opportunity. Run `make optimize` to address them.
## Commands

| Command | Description |
|---------|-------------|
| `make audit` | Quick system status (single screen) |
| `make audit-full` | Full system report (saved to `data/audits/`) |
| `make monitor` | Launch tmux monitoring dashboard |
| `make monitor-simple` | Launch amdgpu_top only |
| `make monitor-install` | Install monitoring tools (amdgpu_top, btop) |
| `make monitor-log` | Start background CSV metric logger |
| `make benchmark-setup` | Ensure toolboxes and test models are ready |
| `make benchmark-baseline` | Capture pre-optimization baseline |
| `make benchmark` | Run full benchmark suite |
| `make benchmark-compare` | Compare two runs (`BEFORE=dir AFTER=dir`) |
| `sudo make optimize` | Interactive optimization walkthrough |
| `sudo make optimize-kernel` | Configure kernel boot parameters |
| `sudo make optimize-tuned` | Switch to accelerator-performance profile |
| `make optimize-vram` | BIOS VRAM guidance + GTT verification |
| `make verify` | Post-optimization verification checklist |
| `sudo make rollback` | Rollback optimizations |
## Optimization Workflow

```
1. Audit          make audit
       │
2. Monitor        make monitor-install && make monitor
       │
3. Baseline       make benchmark-setup && make benchmark-baseline
       │
4. Optimize       sudo make optimize
       │            ├── tuned profile (instant, +5-8% pp)
       │            ├── kernel params (reboot required)
       │            └── BIOS VRAM (reboot + BIOS access)
       │
5. Verify         make verify
       │
6. Re-benchmark   make benchmark && make benchmark-compare BEFORE=... AFTER=...
```

See [docs/optimization.md](docs/optimization.md) for the full walkthrough with explanations.
## Project Structure

```
bin/       Entry points (audit, monitor, benchmark, optimize)
lib/       Shared bash libraries (common, detect, format)
scripts/   Implementation organized by function
configs/   Reference configuration templates
data/      Runtime output: audits, benchmarks, logs, backups (gitignored)
docs/      Technical documentation
```

See [docs/architecture.md](docs/architecture.md) for the full architecture, data flow, and JSON schemas.
## Requirements

- **OS**: Fedora 43 (tested); Fedora 42+ should also work
- **Hardware**: AMD Strix Halo (Ryzen AI MAX / MAX+) with RDNA 3.5 iGPU
- **Tools**: `bc`, `python3`, `tmux`, `podman`, `toolbox`
- **Optional**: `amdgpu_top` (installed via `make monitor-install`), `huggingface-cli` (for model downloads)
## Documentation

| Document | Contents |
|----------|----------|
| [docs/architecture.md](docs/architecture.md) | Script layers, data flow, unified memory model, JSON schemas |
| [docs/optimization.md](docs/optimization.md) | Step-by-step optimization walkthrough |
| [docs/benchmarking.md](docs/benchmarking.md) | Benchmark methodology, test params, result interpretation |
| [docs/bios-vram-guide.md](docs/bios-vram-guide.md) | HP ZBook BIOS configuration for VRAM |
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common issues and fixes |
| [docs/references.md](docs/references.md) | External links: AMD docs, toolboxes, community resources |
## Contributing

AI assistants: see [CLAUDE.md](CLAUDE.md) for safety rules and technical context. Agent workflows are in [AGENTS.md](AGENTS.md).
docs/architecture.md (new file)
@@ -0,0 +1,118 @@
# Architecture

## Script Layers

```
bin/         User entry points (thin dispatchers)
    audit      → scripts/audit/
    monitor    → scripts/monitor/
    benchmark  → scripts/benchmark/
    optimize   → scripts/optimize/

scripts/     Implementation (sourcing lib/ for shared logic)
    audit/       System assessment
    monitor/     GPU/system monitoring + metrics logging
    benchmark/   llama-bench via toolbox containers
    optimize/    GRUB, tuned, BIOS guidance, verify, rollback

lib/         Shared bash libraries (sourced, not executed)
    common.sh    Logging, root checks, confirm prompts, paths
    detect.sh    Hardware/config detection from sysfs + system tools
    format.sh    Color output, human_bytes, status indicators, tables

configs/     Reference configuration templates
data/        Runtime output (gitignored)
docs/        Documentation
```

Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`.
## Data Flow

```
/sys/class/drm/card1/device/  ──→  lib/detect.sh  ──→  scripts/audit/
/proc/cpuinfo, /proc/meminfo  ──→   (detect_*)    ──→  scripts/monitor/
/proc/cmdline                 ──→                 ──→  scripts/optimize/
tuned-adm, rpm, lspci         ──→                 ──→  scripts/benchmark/
                                                         │
                                                         ▼
                                                       data/
                                                         ├── audits/*.json
                                                         ├── logs/*.csv
                                                         ├── baselines/*/summary.json
                                                         └── benchmarks/*/summary.json
```
## Unified Memory Model

AMD Strix Halo shares physical RAM between CPU and GPU. Two allocation mechanisms:

| Type | Description | Configuration |
|------|-------------|---------------|
| **VRAM (dedicated)** | Permanently reserved for the GPU framebuffer | BIOS setting (UMA Frame Buffer Size) |
| **GTT (dynamic)** | System RAM mapped into the GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` |

**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on a 64 GB system). The GPU borrows memory when needed and releases it when idle.
### Kernel Parameter Math (64 GB system)

```
Total physical RAM:   64 GiB
OS reserve:            4 GiB
Available for GTT:    60 GiB = 61440 MiB

amdgpu.gttsize  = 60 * 1024       = 61440    (MiB)
ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages)
iommu           = pt              (passthrough, lower latency)
```

The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM).
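The arithmetic above can be sketched as two helpers. The names match the ones this doc cites, but the bodies are illustrative; the real functions detect total RAM rather than take it as an argument.

```shell
#!/usr/bin/env bash
# Sketch of the 64 GB math: reserve 4 GiB for the OS, convert to MiB,
# then to 4K pages (256 pages per MiB).
recommended_gttsize_mib() {
    local total_gib="$1" reserve_gib=4
    echo $(( (total_gib - reserve_gib) * 1024 ))
}

recommended_pages_limit() {
    echo $(( $1 * 256 ))   # 256 four-KiB pages per MiB
}

gtt=$(recommended_gttsize_mib 64)
echo "amdgpu.gttsize=$gtt ttm.pages_limit=$(recommended_pages_limit "$gtt")"
# → amdgpu.gttsize=61440 ttm.pages_limit=15728640
```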
### Sysfs Paths

| Path | Content |
|------|---------|
| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes |
| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use |
| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes |
| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use |
| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 |
| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C |
| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts |

The card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`).
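The hwmon rows above need unit conversion: `temp1_input` arrives in millidegrees C and `power1_average` in microwatts. A sketch with a zero fallback so it degrades gracefully off-target; the concrete `hwmon0` path is illustrative (the real code globs `hwmon*`).

```shell
#!/usr/bin/env bash
# Read one sysfs attribute, falling back to 0 when it is absent.
read_sensor() {
    local value
    read -r value < "$1" 2>/dev/null || value=0
    echo "$value"
}

temp_mc=$(read_sensor /sys/class/drm/card1/device/hwmon/hwmon0/temp1_input)
power_uw=$(read_sensor /sys/class/drm/card1/device/hwmon/hwmon0/power1_average)
echo "temp_c=$(( temp_mc / 1000 )) power_w=$(( power_uw / 1000000 ))"
```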
## JSON Output Schemas

### system-state.json (from `audit --json`)

```json
{
  "timestamp": "20260325-120000",
  "hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 },
  "memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 },
  "kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" },
  "firmware": "...", "tuned_profile": "...", "rocm_version": "...",
  "vulkan": { "driver": "...", "version": "..." },
  "sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 },
  "toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." }
}
```
### summary.json (from benchmark runs)

```json
{
  "results": [
    { "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" }
  ]
}
```
### metrics.csv (from `monitor --log`)

```
timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib
```

Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample).
docs/benchmarking.md (new file)
@@ -0,0 +1,94 @@
# Benchmarking

## What We Measure

All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types:

| Metric | Meaning | Test Params |
|--------|---------|-------------|
| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens |
| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens |

Results are in **tokens/second (t/s)**. Higher is better.
## Test Parameters

### Standard Test

```
-ngl 99 -mmp 0 -fa 1 -r 5
```

- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence

### Long-Context Test

```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```

- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)

The flash-attention, no-mmap, and full-GPU-offload flags (`-fa 1 -mmp 0 -ngl 99`) are **mandatory** on Strix Halo to avoid crashes.
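Outside the Makefile, the standard test maps to a direct toolbox invocation. A sketch that assembles the command as an argument array (the toolkit's own style) and prints it rather than running it; the model filename is illustrative.

```shell
#!/usr/bin/env bash
# Standard-test invocation against the RADV backend, built as an array.
MODEL="data/models/qwen3-4b-Q4_K_M.gguf"   # illustrative filename
CMD=(toolbox run -c llama-vulkan-radv -- /usr/sbin/llama-bench
     -m "$MODEL" -ngl 99 -mmp 0 -fa 1 -r 5)
echo "${CMD[@]}"
```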
## Available Backends

| Backend | Container | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2 GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |

Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set them up with `make benchmark-setup`.
## Workflow

```bash
# 1. Setup (one-time)
make benchmark-setup

# 2. Capture baseline (before optimization)
make benchmark-baseline

# 3. After optimizing, run again
make benchmark   # or: bin/benchmark run --tag post-opt

# 4. Compare
make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000
```
## Result Format

Each run produces a directory under `data/baselines/` or `data/benchmarks/`:

```
TIMESTAMP/
    system-state.json   # Full system audit snapshot
    summary.json        # Parsed results (model, backend, test, t/s)
    metrics.csv         # GPU/CPU metrics during the run
    *.log               # Raw llama-bench output per backend+model+test
```

### Comparison Output

```
Backend      | Model    | Test  | Before    | After     | Delta
vulkan-radv  | qwen3-4b | pp512 | 548 t/s   | 612 t/s   | +11.7%
vulkan-radv  | qwen3-4b | tg128 | 13.9 t/s  | 15.2 t/s  | +9.4%
```

Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs.
## Recommended Test Models

| Size | Model | File | Disk | Use Case |
|------|-------|------|------|----------|
| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests |
| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks |
| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests |

Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit.
docs/bios-vram-guide.md (modified)
@@ -1,5 +1,7 @@
 # BIOS VRAM Configuration — HP ZBook Ultra G1a
 
+> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md).
+
 ## Why Change VRAM?
 
 AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it.
docs/optimization.md (new file)
@@ -0,0 +1,84 @@
# Optimization Guide

Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.

**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
## Step 1: Tuned Profile (no reboot)

```bash
sudo make optimize-tuned
```

Switches from `throughput-performance` to `accelerator-performance`, which disables the deeper, higher-latency CPU idle states (C-states). Provides a 5-8% improvement in prompt processing throughput.

Takes effect immediately. The previous profile is saved for rollback.
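What the make target does, roughly, as a manual equivalent: record the active profile (the toolkit saves it for rollback), then switch. The `parse_active` helper is split out so the `Current active profile: X` line can be parsed even without tuned installed; the exact bookkeeping is the toolkit's own and is not reproduced here.

```shell
#!/usr/bin/env bash
# Extract the profile name from tuned-adm's "Current active profile: X" line.
parse_active() { awk -F': ' '{print $2}' <<<"$1"; }

if command -v tuned-adm >/dev/null 2>&1; then
    prev=$(parse_active "$(tuned-adm active)")
    echo "previous profile: $prev"
    sudo -n tuned-adm profile accelerator-performance 2>/dev/null \
        || echo "switching requires root"
else
    echo "tuned-adm not installed"
fi
```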
## Step 2: Kernel Boot Parameters (reboot required)

```bash
sudo make optimize-kernel
```

Adds three parameters to GRUB:

| Parameter | Value (64 GB) | Purpose |
|-----------|---------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |

Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.

See [docs/architecture.md](architecture.md) for the math behind these values.
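On Fedora's BLS setup, the grubby path boils down to a single command. A sketch using the 64 GB example values from the table above; the script computes its own values and performs a backup first, so this only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Assemble the grubby invocation with the 64 GB example values.
ARGS="iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
echo "grubby --update-kernel=ALL --args=\"$ARGS\""
# The real (root) invocation, after the script's automatic GRUB backup:
#   sudo grubby --update-kernel=ALL --args="$ARGS"
```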
## Step 3: BIOS VRAM Reduction (reboot + BIOS access)

```bash
make optimize-vram
```

This prints guidance — it cannot modify the BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.

See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.

**Combine Steps 2 and 3 into a single reboot**: apply the kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
## Step 4: Verify

```bash
make verify
```

Checks 9 criteria and reports a score. Target: 9/9.
## Step 5: Measure Impact

```bash
make benchmark
make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
```

See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation.
## Expected Impact

| Optimization | pp512 Improvement | tg128 Improvement |
|--------------|-------------------|-------------------|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | +10-20% | +5-15% |
| **Combined** | **+15-25%** | **+8-18%** |

Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
## Rollback

```bash
sudo make rollback
```

Restores the GRUB backup and the previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore the previous UMA Frame Buffer Size).
## Troubleshooting

If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md).
docs/references.md (new file)
@@ -0,0 +1,49 @@
# External References

Single source of truth for all external links used across this project.

## AMD Official

- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration
- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning
- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation
- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API
- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker

## Strix Halo Toolboxes (Donato Capitella)

The most comprehensive community resource for Strix Halo LLM optimization.

- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides
- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator
- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts

## Community

- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database

## Other Strix Halo Repos

- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization
- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3)

## Monitoring Tools

- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON)
- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor
- [btop](https://github.com/aristocratos/btop) — System resource monitor

## LLM Inference

- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm)
- [ollama](https://ollama.com/) — LLM runtime with model management
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking

## AMD GPU Profiling

- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling
- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis
96
docs/troubleshooting.md
Normal file
@@ -0,0 +1,96 @@
# Troubleshooting

## Firmware: linux-firmware 20251125 Causes ROCm Crashes

**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.

**Check**: `rpm -qa | grep linux-firmware`

**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware, regenerate the initramfs:

```bash
sudo dracut -f --kver $(uname -r)
```

The toolkit checks this automatically — `make audit` shows firmware status.
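The version check can be scripted; a minimal sketch (the `check_fw` helper is hypothetical, not part of the toolkit — the version strings mirror the known-bad and fixed builds above):

```bash
# check_fw VERSION: flag the linux-firmware build known to crash ROCm workloads.
check_fw() {
  case "$1" in
    *20251125*) echo "WARNING: linux-firmware $1 is the known-bad build" ;;
    *)          echo "linux-firmware $1 looks OK" ;;
  esac
}

# Query the installed package on an RPM-based system; fall back to "unknown".
installed=$(rpm -q --qf '%{VERSION}' linux-firmware 2>/dev/null || echo unknown)
check_fw "$installed"
```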
## amdgpu_top: Cargo Build Fails (gix-hash error)

**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`.

**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency.

**Fix**: Use the pre-built RPM instead:

```bash
make monitor-install
```

The install script downloads the RPM from GitHub releases, bypassing cargo entirely.
## Toolbox GPU Access Failure

**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container.

**Check**: Device mappings when creating the toolbox:

- Vulkan backends need: `--device /dev/dri`
- ROCm backends need: `--device /dev/dri --device /dev/kfd`

**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically.

Also ensure your user is in the `video` and `render` groups:

```bash
sudo usermod -aG video,render $USER
```
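The group check can be scripted too; a minimal sketch (the `check_groups` helper is hypothetical):

```bash
# check_groups "GROUP LIST": report whether video and render are both present.
check_groups() {
  missing=""
  for g in video render; do
    case " $1 " in *" $g "*) ;; *) missing="$missing $g" ;; esac
  done
  if [ -z "$missing" ]; then echo "group membership OK"
  else echo "missing groups:$missing (usermod -aG, then log out and back in)"
  fi
}

# id -nG prints the current user's groups space-separated.
check_groups "$(id -nG)"
```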
## GRUB Changes Not Taking Effect

**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params.

**Possible causes**:

1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. The script uses `grubby` when available, but verify:

   ```bash
   grubby --info=ALL | grep args
   ```

2. **Wrong GRUB config path**: Check which config is actually used:

   ```bash
   cat /proc/cmdline      # what the kernel actually booted with
   cat /etc/default/grub  # what the script modified
   ```

3. **GRUB not regenerated**: Manually regenerate:

   ```bash
   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
   ```
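All three checks boil down to one question: is the parameter on `/proc/cmdline`? A sketch (the `param_active` helper is hypothetical, and `amdgpu.gttsize` is only an example parameter):

```bash
# param_active PARAM CMDLINE: true if PARAM appears as a token or token prefix,
# so "amdgpu.gttsize" also matches "amdgpu.gttsize=131072".
param_active() {
  case " $2 " in *" $1"*) return 0 ;; *) return 1 ;; esac
}

# Compare the running kernel against the config the script edits.
# The parameter below is an example; substitute the one you added.
if param_active "amdgpu.gttsize" "$(cat /proc/cmdline 2>/dev/null)"; then
  echo "parameter is active on the running kernel"
else
  echo "parameter missing: inspect /etc/default/grub and 'grubby --info=DEFAULT'"
fi
```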
## Memory Unchanged After BIOS Change

**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB.

**Check**:

```bash
cat /sys/class/drm/card1/device/mem_info_vram_total
```

**Possible causes**:

- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (VRAM reduction requires kernel params to be useful)
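The sysfs value is in bytes; a small sketch to print every card's report in GiB (the `bytes_to_gib` helper is hypothetical, and the glob covers the fact that the card index varies between systems):

```bash
# bytes_to_gib BYTES: convert a raw sysfs byte count to GiB.
bytes_to_gib() { awk -v b="$1" 'BEGIN { printf "%.1f", b / (1024 ^ 3) }'; }

# card0 vs card1 differs between machines, so scan every card.
for f in /sys/class/drm/card*/device/mem_info_vram_total; do
  [ -r "$f" ] || continue
  echo "$f: $(bytes_to_gib "$(cat "$f")") GiB"
done
```

If the printed value still corresponds to 32 GiB after the BIOS change, the setting did not take.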
## Benchmark Failures

**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends.

**Common fixes**:

- Ensure model exists: `ls data/models/*.gguf`
- Check model fits in memory: small models (4B) for initial testing
- Try `llama-vulkan-radv` first (most stable backend)
- Check dmesg for GPU errors: `dmesg | tail -30`
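The first check can be sketched as a pre-flight helper (the `preflight` function is hypothetical; `data/models/` is this repo's model directory):

```bash
# preflight DIR: verify at least one .gguf model exists before benchmarking.
preflight() {
  set -- "$1"/*.gguf
  if [ -e "$1" ]; then
    echo "found model: $1"
  else
    echo "no .gguf model found; download a small (4B) model first"
    return 1
  fi
}

preflight data/models || true
```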
## Rollback

If optimization causes issues:

```bash
sudo make rollback
```

This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.