From 5b81437637295a37627cd8169334ae80fbaa80f0 Mon Sep 17 00:00:00 2001 From: Felipe Cardoso Date: Wed, 25 Mar 2026 20:50:00 +0100 Subject: [PATCH] docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite - README.md: project overview, quick start, command reference, workflow - CLAUDE.md: AI safety rules, technical details, conventions - AGENTS.md: agent workflows, file responsibility map, dependency matrix - docs/architecture.md: script layers, data flow, unified memory, JSON schemas - docs/optimization.md: step-by-step optimization walkthrough - docs/benchmarking.md: methodology, test params, result interpretation - docs/troubleshooting.md: common issues and fixes - docs/references.md: centralized external links (single source of truth) - docs/bios-vram-guide.md: add back-link to optimization workflow Cross-linked non-redundantly: each doc owns one layer, others link to it. Co-Authored-By: Claude Opus 4.6 (1M context) --- AGENTS.md | 62 +++++++++++++++++++++ CLAUDE.md | 48 ++++++++++++++++ README.md | 114 ++++++++++++++++++++++++++++++++++++++ docs/architecture.md | 118 ++++++++++++++++++++++++++++++++++++++++ docs/benchmarking.md | 94 ++++++++++++++++++++++++++++++++ docs/bios-vram-guide.md | 2 + docs/optimization.md | 84 ++++++++++++++++++++++++++++ docs/references.md | 49 +++++++++++++++++ docs/troubleshooting.md | 96 ++++++++++++++++++++++++++++++++ 9 files changed, 667 insertions(+) create mode 100644 AGENTS.md create mode 100644 CLAUDE.md create mode 100644 README.md create mode 100644 docs/architecture.md create mode 100644 docs/benchmarking.md create mode 100644 docs/optimization.md create mode 100644 docs/references.md create mode 100644 docs/troubleshooting.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..4d6573d --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,62 @@ +# AGENTS.md — Agent Workflows + +Read [CLAUDE.md](CLAUDE.md) first for safety rules and technical context. + +## Common Workflows + +### Add a Detection Function + +1. 
Add the function to `lib/detect.sh` following `detect_*` naming convention +2. If it reads sysfs, use `$GPU_SYSFS` (auto-detected) with a `2>/dev/null` fallback +3. Wire it into `scripts/audit/quick-glance.sh` (display) and/or `scripts/audit/system-report.sh` (JSON output) +4. If it has an optimal value, add a check to `scripts/optimize/verify.sh` +5. Validate: `make audit` and `bin/audit --json | python3 -m json.tool` + +### Add a Benchmark Backend + +1. Add the toolbox entry to `BENCH_PATHS` associative array in both `scripts/benchmark/run-baseline.sh` and `scripts/benchmark/run-suite.sh` +2. Map the toolbox name → llama-bench binary path (Vulkan: `/usr/sbin/llama-bench`, ROCm: `/usr/local/bin/llama-bench`) +3. If ROCm, the `ENV_ARGS` logic already handles `ROCBLAS_USE_HIPBLASLT=1` +4. Add the toolbox to `refresh-toolboxes.sh` in the [toolboxes repo](https://github.com/kyuz0/amd-strix-halo-toolboxes) +5. Validate: `make benchmark-setup` then `make benchmark` + +### Add an Optimization Script + +1. Create `scripts/optimize/my-optimization.sh` sourcing `lib/common.sh` (and `detect.sh` / `format.sh` as needed) +2. Add root check at top if the script modifies system state: `[[ $EUID -ne 0 ]] && { log_error "Requires root"; exit 1; }` +3. Add a corresponding case to `bin/optimize` +4. Add a Makefile target +5. Add verification criteria to `scripts/optimize/verify.sh` +6. If the optimization is reversible, add rollback logic to `scripts/optimize/rollback.sh` +7. Document in [docs/optimization.md](docs/optimization.md) + +### Add a Monitoring Metric + +1. In `scripts/monitor/log-metrics.sh`, cache the sysfs path at startup (avoid per-sample globbing) +2. Read with `read -r var < "$SYSFS_PATH" 2>/dev/null || var=0` (no subshells in the hot loop) +3. Add the column to the CSV header and the `echo` line +4. Update the CSV schema in [docs/architecture.md](docs/architecture.md) + +## File Responsibility Map + +| Want to change... 
| Touch these files | +|-------------------|-------------------| +| What `make audit` shows | `scripts/audit/quick-glance.sh`, `lib/detect.sh` | +| JSON audit output | `scripts/audit/system-report.sh`, `lib/detect.sh` | +| Dashboard layout | `scripts/monitor/dashboard.sh` | +| Metric collection | `scripts/monitor/log-metrics.sh` | +| Benchmark parameters | `scripts/benchmark/run-baseline.sh`, `run-suite.sh` | +| Result comparison | `scripts/benchmark/compare.sh` | +| Kernel params | `scripts/optimize/kernel-params.sh`, `lib/detect.sh` (recommended values) | +| Optimization checks | `scripts/optimize/verify.sh`, `scripts/audit/quick-glance.sh` | +| Shared utilities | `lib/common.sh` (logging), `lib/format.sh` (output), `lib/detect.sh` (hardware) | +| External links | `docs/references.md` (single source of truth) | + +## Dependencies by Workflow + +| Workflow | Requires | +|----------|----------| +| Audit | `bc`, `python3` | +| Monitor | `tmux`, `amdgpu_top` or `nvtop`, `btop` or `htop` | +| Benchmark | `toolbox`, `podman`, GGUF models in `data/models/`, `python3` | +| Optimize | `sudo`/root, `grubby` or `grub2-mkconfig`, `tuned-adm`, `python3` | diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..9319d35 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,48 @@ +# CLAUDE.md — AI Assistant Context + +Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See [README.md](README.md) for user-facing commands. + +## Architecture + +`bin/` dispatchers → `scripts/` implementations → `lib/` shared libraries. All scripts source libs in order: `common.sh` → `detect.sh` → `format.sh`. Runtime data goes to `data/` (gitignored). Full details in [docs/architecture.md](docs/architecture.md). + +## Safety Rules + +- **`scripts/optimize/kernel-params.sh`** modifies `/etc/default/grub` — requires root, backs up to `data/backups/` first. 
Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code). +- **`scripts/optimize/tuned-profile.sh`** and **`rollback.sh`** require root and save previous state for rollback. +- **`data/backups/`** contains GRUB backups and tuned profile snapshots — never delete these. +- Optimization scripts that require root check `$EUID` at the top and exit immediately if not root. +- All Python blocks receive data via environment variables (`os.environ`), never via shell interpolation into Python source. This prevents injection. **Do not revert to `'''$var'''` or `"$var"` patterns inside Python heredocs.** + +## Key Technical Details + +- **GPU sysfs**: Auto-detected by `find_gpu_card()` in `lib/detect.sh` (matches vendor `0x1002`). Falls back to first card with `mem_info_vram_total`. +- **Memory recommendations**: `recommended_gttsize_mib()` in `detect.sh` computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB. +- **Kernel param detection**: `detect_kernel_param()` uses word-boundary-anchored regex to avoid `iommu` matching `amd_iommu`. +- **Benchmark invocation**: `toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N`. ENV_ARGS passed as a proper bash array (not string splitting). +- **llama-bench output**: Pipe-delimited table. Python parser at fixed column indices (parts[8]=test, parts[9]=t/s). Format changes upstream would break parsing. +- **ROCm for gfx1151**: `ROCBLAS_USE_HIPBLASLT=1`, `HSA_OVERRIDE_GFX_VERSION=11.5.1`. +- **Fedora GRUB**: Prefers `grubby` (BLS) over `grub2-mkconfig`. Both paths are handled. 
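The env-var rule above is the load-bearing safety property of the GRUB editor. A minimal sketch of the pattern (the file path and parameter below are illustrative examples, not the toolkit's actual code):

```shell
# Illustrative sketch of the required pattern: data reaches Python via
# os.environ, never via shell interpolation into the Python source.
# GRUB_FILE here is a temp file, not the real /etc/default/grub.
GRUB_FILE="$(mktemp)"
printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > "$GRUB_FILE"

NEW_PARAM="amdgpu.gttsize=61440" GRUB_FILE="$GRUB_FILE" python3 - <<'PY'
import os, re
param = os.environ["NEW_PARAM"]   # injected via the environment -- safe
path  = os.environ["GRUB_FILE"]
with open(path) as f:
    text = f.read()
# Append the parameter inside GRUB_CMDLINE_LINUX="..."
text = re.sub(r'(GRUB_CMDLINE_LINUX=")([^"]*)',
              lambda m: m.group(1) + (m.group(2) + " " + param).strip(),
              text)
with open(path, "w") as f:
    f.write(text)
PY
cat "$GRUB_FILE"   # GRUB_CMDLINE_LINUX="quiet splash amdgpu.gttsize=61440"
```

The quoted heredoc delimiter (`<<'PY'`) is what keeps the shell from expanding anything inside the Python source.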
+ +## Conventions + +- `set -euo pipefail` in every executable script +- `snake_case` function names, `UPPER_CASE` for constants and loop variables +- 4-space indentation, no tabs +- `lib/` files are sourced (no shebang enforcement), but include `#!/usr/bin/env bash` for editor support +- Colors gated on `[[ -t 1 ]]` (disabled when piped) +- `bc` used for float math; `python3` for JSON and GRUB editing only + +## Validating Changes + +```bash +make audit # Quick check — shows system status with pass/fail indicators +make verify # 9-point optimization checklist +bin/audit --json | python3 -m json.tool # Verify JSON output is valid +``` + +## External Resources + +All external links are centralized in [docs/references.md](docs/references.md). Key ones: +- AMD ROCm Strix Halo guide (kernel params, GTT configuration) +- Donato Capitella toolboxes (container images, benchmarks, VRAM estimator) diff --git a/README.md b/README.md new file mode 100644 index 0000000..e4d76de --- /dev/null +++ b/README.md @@ -0,0 +1,114 @@ +# Strix Halo Optimization Toolkit + +Audit, monitor, benchmark, and optimize AMD Strix Halo integrated GPU systems for LLM inference workloads. + +**Target hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151) with 64 GB unified memory, on Fedora 43. Tested on HP ZBook Ultra G1a. + +## Quick Start + +```bash +make audit # See current system status and optimization score +make monitor-install # Install amdgpu_top + btop +make benchmark-setup # Create toolbox containers + download test model +make benchmark-baseline # Capture performance before optimization +``` + +## System Status + +`make audit` produces a single-screen overview: + +``` +=== Memory Allocation === + [!!] VRAM (dedicated) 32.0 GiB — should be 0.5 GiB in BIOS + [!!] GTT (dynamic) 15.5 GiB — should be ~59.0 GiB with kernel params + +=== Kernel Boot Parameters === + [!!] iommu=pt MISSING + [!!] amdgpu.gttsize MISSING — recommended: 60416 + [!!] 
ttm.pages_limit MISSING — recommended: 15466496 + +=== Performance Profile === + [!!] Tuned profile throughput-performance — recommended: accelerator-performance + +=== Optimization Score === + 2 / 8 checks passing +``` + +Each `[!!]` is an optimization opportunity. Run `make optimize` to address them. + +## Commands + +| Command | Description | +|---------|-------------| +| `make audit` | Quick system status (single screen) | +| `make audit-full` | Full system report (saved to data/audits/) | +| `make monitor` | Launch tmux monitoring dashboard | +| `make monitor-simple` | Launch amdgpu_top only | +| `make monitor-install` | Install monitoring tools (amdgpu_top, btop) | +| `make monitor-log` | Start background CSV metric logger | +| `make benchmark-setup` | Ensure toolboxes and test models are ready | +| `make benchmark-baseline` | Capture pre-optimization baseline | +| `make benchmark` | Run full benchmark suite | +| `make benchmark-compare` | Compare two runs (`BEFORE=dir AFTER=dir`) | +| `sudo make optimize` | Interactive optimization walkthrough | +| `sudo make optimize-kernel` | Configure kernel boot parameters | +| `sudo make optimize-tuned` | Switch to accelerator-performance profile | +| `make optimize-vram` | BIOS VRAM guidance + GTT verification | +| `make verify` | Post-optimization verification checklist | +| `sudo make rollback` | Rollback optimizations | + +## Optimization Workflow + +``` +1. Audit make audit + │ +2. Monitor make monitor-install && make monitor + │ +3. Baseline make benchmark-setup && make benchmark-baseline + │ +4. Optimize sudo make optimize + │ ├── tuned profile (instant, +5-8% pp) + │ ├── kernel params (reboot required) + │ └── BIOS VRAM (reboot + BIOS access) + │ +5. Verify make verify + │ +6. Re-benchmark make benchmark && make benchmark-compare BEFORE=... AFTER=... +``` + +See [docs/optimization.md](docs/optimization.md) for the full walkthrough with explanations. 
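The comparison in step 6 reports plain relative change per result row. A one-line sketch of that math (illustrative only; the actual logic lives in `scripts/benchmark/compare.sh`):

```shell
# Hypothetical sketch of the delta column printed by `make benchmark-compare`
# (not the real implementation, which is scripts/benchmark/compare.sh).
pct_delta() {    # pct_delta BEFORE AFTER -> signed percent change
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'
}

pct_delta 548 612    # prompt-processing t/s before vs. after -> +11.7%
```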
+ +## Project Structure + +``` +bin/ Entry points (audit, monitor, benchmark, optimize) +lib/ Shared bash libraries (common, detect, format) +scripts/ Implementation organized by function +configs/ Reference configuration templates +data/ Runtime output: audits, benchmarks, logs, backups (gitignored) +docs/ Technical documentation +``` + +See [docs/architecture.md](docs/architecture.md) for the full architecture, data flow, and JSON schemas. + +## Requirements + +- **OS**: Fedora 43 (tested), Fedora 42+ should work +- **Hardware**: AMD Strix Halo (Ryzen AI MAX / MAX+) with RDNA 3.5 iGPU +- **Tools**: `bc`, `python3`, `tmux`, `podman`, `toolbox` +- **Optional**: `amdgpu_top` (installed via `make monitor-install`), `huggingface-cli` (for model downloads) + +## Documentation + +| Document | Contents | +|----------|----------| +| [docs/architecture.md](docs/architecture.md) | Script layers, data flow, unified memory model, JSON schemas | +| [docs/optimization.md](docs/optimization.md) | Step-by-step optimization walkthrough | +| [docs/benchmarking.md](docs/benchmarking.md) | Benchmark methodology, test params, result interpretation | +| [docs/bios-vram-guide.md](docs/bios-vram-guide.md) | HP ZBook BIOS configuration for VRAM | +| [docs/troubleshooting.md](docs/troubleshooting.md) | Common issues and fixes | +| [docs/references.md](docs/references.md) | External links: AMD docs, toolboxes, community resources | + +## Contributing + +AI assistants: see [CLAUDE.md](CLAUDE.md) for safety rules and technical context. Agent workflows are in [AGENTS.md](AGENTS.md). 
diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..72aa1ac --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,118 @@ +# Architecture + +## Script Layers + +``` +bin/ User entry points (thin dispatchers) + audit → scripts/audit/ + monitor → scripts/monitor/ + benchmark → scripts/benchmark/ + optimize → scripts/optimize/ + +scripts/ Implementation (sourcing lib/ for shared logic) + audit/ System assessment + monitor/ GPU/system monitoring + metrics logging + benchmark/ llama-bench via toolbox containers + optimize/ GRUB, tuned, BIOS guidance, verify, rollback + +lib/ Shared bash libraries (sourced, not executed) + common.sh Logging, root checks, confirm prompts, paths + detect.sh Hardware/config detection from sysfs + system tools + format.sh Color output, human_bytes, status indicators, tables + +configs/ Reference configuration templates +data/ Runtime output (gitignored) +docs/ Documentation +``` + +Every script sources libs in order: `common.sh` → `detect.sh` → `format.sh`. `format.sh` depends on color variables defined in `common.sh`. + +## Data Flow + +``` +/sys/class/drm/card1/device/ ──→ lib/detect.sh ──→ scripts/audit/ +/proc/cpuinfo, /proc/meminfo ──→ (detect_*) ──→ scripts/monitor/ +/proc/cmdline ──→ ──→ scripts/optimize/ +tuned-adm, rpm, lspci ──→ ──→ scripts/benchmark/ + │ + ▼ + data/ + ├── audits/*.json + ├── logs/*.csv + ├── baselines/*/summary.json + └── benchmarks/*/summary.json +``` + +## Unified Memory Model + +AMD Strix Halo shares physical RAM between CPU and GPU. 
Two allocation mechanisms: + +| Type | Description | Configuration | +|------|-------------|---------------| +| **VRAM (dedicated)** | Permanently reserved for GPU framebuffer | BIOS setting (UMA Frame Buffer Size) | +| **GTT (dynamic)** | System RAM mapped into GPU address space on demand | Kernel boot params: `amdgpu.gttsize`, `ttm.pages_limit` | + +**Optimal for LLM workloads**: Minimize VRAM (0.5 GiB), maximize GTT (~60 GiB on 64 GB system). The GPU borrows memory when needed and releases it when idle. + +### Kernel Parameter Math (64 GB system) + +``` +Total physical RAM: 64 GiB +OS reserve: 4 GiB +Available for GTT: 60 GiB = 61440 MiB + +amdgpu.gttsize = 60 * 1024 = 61440 (MiB) +ttm.pages_limit = 60 * 1024 * 256 = 15728640 (4K pages) +iommu = pt (passthrough, lower latency) +``` + +The toolkit computes these dynamically via `recommended_gttsize_mib()` and `recommended_pages_limit()` in `lib/detect.sh`, based on detected total physical RAM (visible + VRAM). + +### Sysfs Paths + +| Path | Content | +|------|---------| +| `/sys/class/drm/card1/device/mem_info_vram_total` | Dedicated VRAM in bytes | +| `/sys/class/drm/card1/device/mem_info_vram_used` | VRAM currently in use | +| `/sys/class/drm/card1/device/mem_info_gtt_total` | GTT allocation in bytes | +| `/sys/class/drm/card1/device/mem_info_gtt_used` | GTT currently in use | +| `/sys/class/drm/card1/device/gpu_busy_percent` | GPU utilization 0-100 | +| `/sys/class/drm/card1/device/hwmon/hwmon*/temp1_input` | Temperature in millidegrees C | +| `/sys/class/drm/card1/device/hwmon/hwmon*/power1_average` | Power in microwatts | + +Card number is auto-detected by `find_gpu_card()` (matches AMD vendor ID `0x1002`). 
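The two-pass detection strategy can be sketched as follows (illustrative; the real function is `find_gpu_card()` in `lib/detect.sh`, and the base-directory argument exists here only to make the sketch testable):

```shell
# Sketch of the card-detection strategy described above.
find_amd_card() {
    local base="${1:-/sys/class/drm}" card
    # Pass 1: match the AMD PCI vendor ID
    for card in "$base"/card[0-9]*; do
        [[ "$(cat "$card/device/vendor" 2>/dev/null)" == "0x1002" ]] &&
            { echo "$card"; return 0; }
    done
    # Pass 2 (fallback): first card that exposes a VRAM counter
    for card in "$base"/card[0-9]*; do
        [[ -r "$card/device/mem_info_vram_total" ]] &&
            { echo "$card"; return 0; }
    done
    return 1
}

find_amd_card || echo "no AMD card detected on this machine"
```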
+ +## JSON Output Schemas + +### system-state.json (from `audit --json`) + +```json +{ + "timestamp": "20260325-120000", + "hardware": { "cpu_model": "...", "cpu_cores": 16, "cpu_threads": 32, "gpu_name": "...", "gpu_device_id": "1586", "system_ram_kb": 32609248 }, + "memory": { "vram_total_bytes": 0, "vram_used_bytes": 0, "gtt_total_bytes": 0, "gtt_used_bytes": 0, "recommended_gttsize_mib": 0, "recommended_pages_limit": 0 }, + "kernel": { "version": "...", "cmdline": "...", "param_iommu": "", "param_gttsize": "", "param_pages_limit": "" }, + "firmware": "...", "tuned_profile": "...", "rocm_version": "...", + "vulkan": { "driver": "...", "version": "..." }, + "sensors": { "gpu_temp_mc": 0, "gpu_power_uw": 0, "gpu_busy_pct": 0 }, + "toolboxes": [], "stacks": { "ollama": "...", "lmstudio": "...", "llamacpp": "...", "opencode": "..." } +} +``` + +### summary.json (from benchmark runs) + +```json +{ + "results": [ + { "file": "model__backend__fa1.log", "model": "...", "size": "...", "backend": "Vulkan", "test": "pp512", "tokens_per_sec": 548.18, "raw": "548.18 +/- 1.59" } + ] +} +``` + +### metrics.csv (from `monitor --log`) + +``` +timestamp,gpu_busy_pct,vram_used_mib,gtt_used_mib,gpu_temp_c,gpu_power_w,cpu_pct,ram_used_mib +``` + +Sampled every 2 seconds by default. Pure bash implementation (no subshell forks per sample). diff --git a/docs/benchmarking.md b/docs/benchmarking.md new file mode 100644 index 0000000..ab9158a --- /dev/null +++ b/docs/benchmarking.md @@ -0,0 +1,94 @@ +# Benchmarking + +## What We Measure + +All benchmarks use [llama-bench](https://github.com/ggml-org/llama.cpp) (part of llama.cpp) running inside toolbox containers. Two test types: + +| Metric | Meaning | Test Params | +|--------|---------|-------------| +| **pp** (prompt processing) | How fast the model ingests input tokens | Default: 512 tokens | +| **tg** (token generation) | How fast the model produces output tokens | Default: 128 tokens | + +Results are in **tokens/second (t/s)**. 
Higher is better.

## Test Parameters

### Standard Test
```
-ngl 99 -mmp 0 -fa 1 -r 5
```
- `-ngl 99` — all layers on GPU
- `-mmp 0` — disable memory mapping (`--no-mmap`)
- `-fa 1` — flash attention enabled
- `-r 5` — 5 repetitions for statistical confidence

### Long-Context Test
```
-ngl 99 -mmp 0 -fa 1 -p 2048 -n 32 -d 32768 -ub SIZE -r 3
```
- `-p 2048` — 2048 prompt tokens
- `-n 32` — generate 32 tokens
- `-d 32768` — 32K context window
- `-ub SIZE` — micro-batch size (512 for Vulkan, 2048 for ROCm)
- `-r 3` — 3 repetitions (long-context tests are slow)

The `-fa 1`, `-mmp 0` (`--no-mmap`), and `-ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.

## Available Backends

| Backend | Container | Technology | Notes |
|---------|-----------|------------|-------|
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2 GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Latest, compiler fixes applied |

Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.

## Workflow

```bash
# 1. Setup (one-time)
make benchmark-setup

# 2. Capture baseline (before optimization)
make benchmark-baseline

# 3. After optimizing, run again
make benchmark        # or: bin/benchmark run --tag post-opt

# 4.
Compare +make benchmark-compare BEFORE=data/baselines/20260325-120000 AFTER=data/benchmarks/post-opt-20260326-100000 +``` + +## Result Format + +Each run produces a directory under `data/baselines/` or `data/benchmarks/`: + +``` +TIMESTAMP/ + system-state.json # Full system audit snapshot + summary.json # Parsed results (model, backend, test, t/s) + metrics.csv # GPU/CPU metrics during the run + *.log # Raw llama-bench output per backend+model+test +``` + +### Comparison Output + +``` +Backend | Model | Test | Before | After | Delta +vulkan-radv | qwen3-4b | pp512 | 548 t/s | 612 t/s | +11.7% +vulkan-radv | qwen3-4b | tg128 | 13.9 | 15.2 | +9.4% +``` + +Configuration changes between runs (VRAM, GTT, kernel params, tuned profile) are shown if system-state.json differs. + +## Recommended Test Models + +| Size | Model | File | Disk | Use Case | +|------|-------|------|------|----------| +| Small | Qwen3-4B | Q4_K_M.gguf | ~3 GB | Quick smoke tests | +| Medium | Qwen3-14B | Q4_K_M.gguf | ~9 GB | Standard benchmarks | +| Large | Qwen3-32B | Q4_K_M.gguf | ~20 GB | Memory pressure tests | + +Place models in `data/models/`. The VRAM estimator from the [toolboxes project](https://github.com/kyuz0/amd-strix-halo-toolboxes) (`gguf-vram-estimator.py`) can help plan which models fit. diff --git a/docs/bios-vram-guide.md b/docs/bios-vram-guide.md index e8fa32b..31a2218 100644 --- a/docs/bios-vram-guide.md +++ b/docs/bios-vram-guide.md @@ -1,5 +1,7 @@ # BIOS VRAM Configuration — HP ZBook Ultra G1a +> Part of the [optimization workflow](optimization.md). For the full context on unified memory, see [architecture.md](architecture.md). + ## Why Change VRAM? AMD Strix Halo uses **unified memory** — the CPU and GPU share the same physical RAM. By default, the HP ZBook allocates **32 GB as dedicated VRAM**, permanently locking that memory away from the OS even when the GPU isn't using it. 
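To confirm the current carve-out on a running system before touching the BIOS, read the sysfs counter (a sketch; the card index varies per machine, and the toolkit auto-detects it):

```shell
# Quick check of the current dedicated-VRAM carve-out.
# card1 is an example index, not guaranteed on every machine.
bytes_to_gib() { awk '{ printf "%.1f\n", $1 / (1024 ^ 3) }'; }

bytes_to_gib 2>/dev/null < /sys/class/drm/card1/device/mem_info_vram_total \
    || echo "sysfs path not found on this machine"
```

On the default HP ZBook configuration this prints `32.0`; after the BIOS change it should print `0.5`.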
diff --git a/docs/optimization.md b/docs/optimization.md
new file mode 100644
index 0000000..6f0ed19
--- /dev/null
+++ b/docs/optimization.md
@@ -0,0 +1,84 @@
# Optimization Guide

Complete walkthrough for optimizing AMD Strix Halo for LLM workloads.

**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.

## Step 1: Tuned Profile (no reboot)

```bash
sudo make optimize-tuned
```

Switches from `throughput-performance` to `accelerator-performance`, which prevents the CPU from entering higher-latency idle (C-)states. Provides a 5-8% improvement in prompt-processing throughput.

Takes effect immediately. The previous profile is saved for rollback.

## Step 2: Kernel Boot Parameters (reboot required)

```bash
sudo make optimize-kernel
```

Adds three parameters to GRUB:

| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | — | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |

Values are computed dynamically based on your system's total physical RAM. The script backs up `/etc/default/grub` before modifying it.

See [docs/architecture.md](architecture.md) for the math behind these values.

## Step 3: BIOS VRAM Reduction (reboot + BIOS access)

```bash
make optimize-vram
```

This prints guidance — it cannot modify BIOS directly. The goal is to reduce dedicated VRAM from 32 GB to 0.5 GB, freeing 31.5 GB back to the OS for dynamic GPU access via GTT.

See [docs/bios-vram-guide.md](bios-vram-guide.md) for the full BIOS walkthrough.

**Combine Steps 2 and 3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.

## Step 4: Verify

```bash
make verify
```

Checks 9 criteria and reports a score. Target: 9/9.
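The parameter checks mirror the word-boundary-anchored matching described in CLAUDE.md. An illustrative spot-check (not the toolkit's verify script):

```shell
# Illustrative spot-check: match each expected parameter against a kernel
# command line with a left word boundary, so `iommu=pt` is not satisfied
# by `amd_iommu=on`. Not the toolkit's implementation.
has_kernel_param() {    # has_kernel_param CMDLINE PARAM-PREFIX
    local cmdline="$1" want="$2"
    grep -Eq "(^| )${want//./\\.}" <<< "$cmdline"
}

if has_kernel_param "$(cat /proc/cmdline)" "iommu=pt"; then
    echo "iommu=pt present"
else
    echo "iommu=pt missing"
fi
```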
+ +## Step 5: Measure Impact + +```bash +make benchmark +make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP +``` + +See [docs/benchmarking.md](benchmarking.md) for methodology and result interpretation. + +## Expected Impact + +| Optimization | pp512 Improvement | tg128 Improvement | +|-------------|-------------------|-------------------| +| Tuned profile | +5-8% | +2-3% | +| Kernel params + BIOS VRAM | +10-20% | +5-15% | +| **Combined** | **+15-25%** | **+8-18%** | + +Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion. + +## Rollback + +```bash +sudo make rollback +``` + +Restores GRUB backup and previous tuned profile. BIOS VRAM must be reverted manually (F10 → restore previous UMA Frame Buffer Size). + +## Troubleshooting + +If anything goes wrong, see [docs/troubleshooting.md](troubleshooting.md). diff --git a/docs/references.md b/docs/references.md new file mode 100644 index 0000000..41fd423 --- /dev/null +++ b/docs/references.md @@ -0,0 +1,49 @@ +# External References + +Single source of truth for all external links used across this project. + +## AMD Official + +- [ROCm Strix Halo Optimization Guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html) — BIOS, kernel params, GTT/TTM configuration +- [ROCm System Optimization Index](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/index.html) — General ROCm tuning +- [ROCm Installation Guide (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) — Package installation +- [AMD SMI Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/) — GPU monitoring API +- [ROCm GitHub](https://github.com/ROCm/ROCm) — Source and issue tracker + +## Strix Halo Toolboxes (Donato Capitella) + +The most comprehensive community resource for Strix Halo LLM optimization. 
+ +- [strix-halo-toolboxes.com](https://strix-halo-toolboxes.com/) — Documentation, benchmarks, guides +- [GitHub: kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) — Container images, benchmark scripts, VRAM estimator +- [Benchmark Results Viewer](https://kyuz0.github.io/amd-strix-halo-toolboxes/) — Interactive performance charts + +## Community + +- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility +- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration +- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results +- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database + +## Other Strix Halo Repos + +- [pablo-ross/strix-halo-gmktec-evo-x2](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2) — GMKtec EVO X2 optimization +- [kyuz0/amd-strix-halo-llm-finetuning](https://github.com/kyuz0/amd-strix-halo-llm-finetuning) — Fine-tuning guides (Gemma-3, Qwen-3) + +## Monitoring Tools + +- [amdgpu_top](https://github.com/Umio-Yasuno/amdgpu_top) — Best AMD GPU monitor (TUI/GUI/JSON) +- [nvtop](https://github.com/Syllo/nvtop) — Cross-vendor GPU monitor +- [btop](https://github.com/aristocratos/btop) — System resource monitor + +## LLM Inference + +- [llama.cpp](https://github.com/ggml-org/llama.cpp) — LLM inference engine (Vulkan + ROCm) +- [ollama](https://ollama.com/) — LLM runtime with model management +- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving +- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking + +## AMD GPU Profiling + +- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — 
Hardware-level Vulkan/HIP profiling +- [Radeon GPU Analyzer (RGA)](https://gpuopen.com/rga/) — Offline shader/kernel analysis diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..d8bb729 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,96 @@ +# Troubleshooting + +## Firmware: linux-firmware 20251125 Causes ROCm Crashes + +**Symptoms**: Arbitrary crashes, instability, or mysterious failures with ROCm workloads. + +**Check**: `rpm -qa | grep linux-firmware` + +**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware: +```bash +sudo dracut -f --kver $(uname -r) +``` + +The toolkit checks this automatically — `make audit` shows firmware status. + +## amdgpu_top: Cargo Build Fails (gix-hash error) + +**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`. + +**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency. + +**Fix**: Use the pre-built RPM instead: +```bash +make monitor-install +``` +The install script downloads the RPM from GitHub releases, bypassing cargo entirely. + +## Toolbox GPU Access Failure + +**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container. + +**Check**: Device mappings when creating the toolbox: +- Vulkan backends need: `--device /dev/dri` +- ROCm backends need: `--device /dev/dri --device /dev/kfd` + +**Fix**: Recreate the toolbox with correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically. + +Also ensure your user is in the `video` and `render` groups: +```bash +sudo usermod -aG video,render $USER +``` + +## GRUB Changes Not Taking Effect + +**Symptoms**: After `make optimize-kernel` and reboot, `make audit` still shows missing params. + +**Possible causes**: + +1. **BLS (Boot Loader Spec)**: Modern Fedora uses BLS entries. 
The script uses `grubby` when available, but verify: + ```bash + grubby --info=ALL | grep args + ``` + +2. **Wrong GRUB config path**: Check which config is actually used: + ```bash + cat /proc/cmdline # what the kernel actually booted with + cat /etc/default/grub # what the script modified + ``` + +3. **GRUB not regenerated**: Manually regenerate: + ```bash + sudo grub2-mkconfig -o /boot/grub2/grub.cfg + ``` + +## Memory Unchanged After BIOS Change + +**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB. + +**Check**: +```bash +cat /sys/class/drm/card1/device/mem_info_vram_total +``` + +**Possible causes**: +- BIOS change not saved (verify by re-entering BIOS) +- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory") +- Kernel params not applied (VRAM reduction requires kernel params to be useful) + +## Benchmark Failures + +**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends. + +**Common fixes**: +- Ensure model exists: `ls data/models/*.gguf` +- Check model fits in memory: small models (4B) for initial testing +- Try `llama-vulkan-radv` first (most stable backend) +- Check dmesg for GPU errors: `dmesg | tail -30` + +## Rollback + +If optimization causes issues: +```bash +sudo make rollback +``` + +This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.
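For the toolbox GPU-access issue above, a small illustrative helper (not part of the toolkit) that reports which of the required device-access groups are missing from a group list:

```shell
# Illustrative helper: given a group list such as the output of `id -nG`,
# print which GPU device-access groups are missing. Empty output = all set.
missing_gpu_groups() {
    local have=" $* " g out=""
    for g in video render; do
        [[ "$have" == *" $g "* ]] || out+="$g "
    done
    printf '%s\n' "${out% }"
}

missing_gpu_groups $(id -nG)
```

Any group it prints is a candidate for `sudo usermod -aG ... $USER` followed by a re-login.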