docs: add strix-benchmarks deep-dive and exploitation plan

Aggregate findings from the 0xSero strix-benchmarks project (identical gfx1151 hardware) into a single reference: production-config comparison, real-workload timings, PR #21344 (ROCm MMQ tuning) and PR #20075 (Vulkan spec decode) status, kernel-param caveats, and a payoff-ordered exploitation plan against the current repo state to drive follow-up optimisation work.
2026-04-26 20:06:23 +02:00
parent 751180fdc1
commit 8872e2c96e
1 changed files with 419 additions and 0 deletions
--- a/docs/strix-benchmarks-findings.md
+++ b/docs/strix-benchmarks-findings.md
@@ -0,0 +1,419 @@
 # Strix Halo Benchmark Deep-Dive — Research + Exploitation Plan
 Source: https://strix-benchmarks.vercel.app/ (0xSero / framework-max project)
 Captured: 2026-04-17
 Revised: 2026-04-17 (post-review fixes — see "Revision notes" at end)
 Target system (verified from `/proc/meminfo`, `/proc/cmdline`, sysfs): **HP ZBook Ultra G1a, Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 65 074 108 kB RAM (~62 GiB visible + 512 MiB BIOS VRAM = ~64 GB total), Fedora 43 kernel 6.19.8, currently booted with `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`, actual GTT total ~59.0 GiB, 70 W sustained (HP firmware cap)**
 Reference system: Framework Desktop, same CPU/GPU, **128 GB LPDDR5X-8000, 120 W sustained**
 ---
 ## 1. Context
 The 0xSero team spent ~1 month exhaustively benchmarking llama.cpp on Strix Halo hardware identical to ours (same APU, same driver family), documenting every regression and filing 2 upstream PRs + 1 issue. Their hardware has 2× our RAM and 1.7× our power budget, so their absolute numbers don't transfer 1:1, but the **relative findings, flag choices, and build recipes do**. This document aggregates their findings, cross-references them against the current repo state (`scripts/optimize/`, `scripts/serve/`, `configs/models.conf`), and describes how to exploit each insight — ordered by payoff-to-effort ratio.
 **TL;DR outcome**: ~7 concrete levers, of which 3 are zero-risk one-liners and 2 are larger experiments (custom ROCm container with PR #21344, draft-model spec decode) that could unlock +20–40 % real-workload throughput on MoE models we already run. Plus 3 genuine enhancement opportunities surfaced during review (§9): host-memory prompt cache for agentic evals, GLM-4.7-Flash Vulkan workaround, and the `HSA_OVERRIDE_GFX_VERSION` rocBLAS kernel fallback.
 **Cross-reference**: [`docs/inference-optimization-landscape.md`](inference-optimization-landscape.md) is the broader survey (engines, quantisation, attention, MoE, OS). This document is the **strix-benchmarks-specific overlay** — findings, PR status, and the exploitation plan against the current repo state. Where the two overlap (e.g. `-DGGML_HIP_UMA=ON`, `--no-mmap`), the landscape doc is the canonical reference; this doc should be read as an addendum with the specific numbers and PR traces.
 ---
 ## 2. Source summary — what strix-benchmarks.vercel.app actually proves
 ### 2.1 Four production configurations compared (Qwen3.5-122B-A10B-REAP-20-Q6_K, 76 GB)
 | Config | Backend | Patches | pp512 | tg128 | Best for |
 |---|---|---|---|---|---|
 | Vulkan stock | Vulkan RADV | — | 303 | 24.7 | Minimum latency |
 | Vulkan + spec | Vulkan | PR #20075 + 4 fixes | 302 | **35.3** | Fast decode, easy deploy |
 | ROCm + MMQ | ROCm 7.2.1 | PR #21344 | **406** | 18.2 | Long prefill, RAG |
 | ROCm full stack | ROCm 7.2.1 | #21344 + #20075 + 4 fixes | 404 | **40.1** | Overall chat throughput |
 ### 2.2 Real-workload timing (server mode, 256-token generation, no-think, temp=0.3)
 | Workload | Vulkan stock | Vk + spec | ROCm + MMQ | ROCm full | Winner |
 |---|---|---|---|---|---|
 | Chat (30 in / 1K out) | 41.5 s | 31.9 s (+30 %) | 56.9 s (−27 %) | **28.3 s (+47 %)** | Full stack |
 | Code gen (2K in / 2K out) | 118.3 s | 99.1 s (+19 %) | 138.0 s (−14 %) | **80.9 s (+46 %)** | Full stack |
 | Summarize (8K in / 256 out) | 155.9 s | 153.5 s (+2 %) | 114.5 s (+36 %) | **107.2 s (+46 %)** | Full stack |
 ### 2.3 Five key technical findings
 1. **Vulkan wins memory mapping**. `vkAllocateMemory(HOST_VISIBLE_BIT)` maps the full GTT correctly. ROCm's `hipMalloc` is trapped in BIOS VRAM **unless `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` is set** — without it, the 122B model takes >1 hour due to 94 graph splits.
 2. **ROCm wins batched compute**. PR #21344 (MMQ VGPR tuning for gfx1151) gives **+19–35 % prefill** on large MoE models. Still open; gfx1151-specific tunings: `mmq_x_max=48`, `mmq_y=64`, `nwarps=4`.
 3. **Spec decode stacks orthogonally**. ROCm single-token decode is slower than Vulkan, but its batched verify is so cheap that ROCm + draft-model spec decode becomes **the fastest overall config** (40.1 tg vs Vulkan stock's 24.7 = +62 %).
 4. **Decode is context-flat; prefill is context-expensive**. 512 tokens or 131K tokens → tg stays at 22–24 t/s. Only prefill degrades.
 5. **MUL_MAT_ID is the hidden villain on Vulkan**. 42–66 % of prefill time on large MoE models. Filed as issue #21948 — **closed as "not planned"**. No upstream fix coming. Metal's PR #13388 (map → batched matmul → unmap, 1.8–4.1× speedup) is the template but no Vulkan port exists.
 ### 2.4 Kernel parameters (critical distinction vs our current setup)
 Strix-benchmarks prescribes **`amd_iommu=off`** (not `iommu=pt`) for 128 GB systems. This is required for any Vulkan model >64 GB to load — without it the kernel hangs for >20 min (llama.cpp issue #14854). They also use `ttm.pages_limit=335544321` and `amdgpu.gttsize=122880` (120 GiB GTT).
 Our repo currently applies `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496` (60 GiB GTT, 64 GB system). **The `amd_iommu=off` threshold (>64 GB model) is never crossed on a 64 GB box** — our biggest cataloged model is REAP-40 Q4_K_M at 44 GB, well under the cliff. So `iommu=pt` remains correct for us. No change needed here.
 ### 2.5 KV cache quantisation (tested on Vulkan at 32K context)
 | Combo | pp t/s | tg t/s | KV mem | Reduction |
 |---|---|---|---|---|
 | f16/f16 | 264.4 | 24.53 | 8.19 GB | baseline |
 | q8_0/q8_0 | 244.7 | 24.32 | 4.09 GB | −50 % |
 | **q4_0/q4_0** | 246.2 | 24.35 | **2.05 GB** | **−75 %** |
 **All asymmetric K/V combos timed out or degraded >3×** at 32K (q8_0/f16, f16/q8_0, q4_0/f16, f16/q4_0, q4_0/q8_0, q8_0/q4_0). Our `launch.sh` already defaults to symmetric `q4_0/q4_0` — aligned ✓.
 ### 2.6 Failure catalogue (what NOT to do)
 | Attempt | Failure | Takeaway |
 |---|---|---|
 | Q8_0 full 122B benchmarks | 105 GB > 120 GiB GTT → OOM kills | Stay ≤ Q6_K for 128 GB, ≤ Q4 for 64 GB |
 | ROCm Q6_K without UMA env var | 94 graph splits, >1 hr runs | **`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` is mandatory** |
 | ROCm + lhl's FATTN patches | Won't apply to current master | Needs rebase per llama.cpp version |
 | Stacked `-ffast-math` + hipBLASLt + #21344 | 354 → 230 t/s regression | Test optimisations in isolation |
 | Q6_K Vulkan without kernel fix (128 GB sys) | >20 min hang to load | `amd_iommu=off` required (>64 GB only) |
 ### 2.7 Upstream status of referenced PRs (verified 2026-04-17)
 | PR/Issue | Status | Relevance to us |
 |---|---|---|
 | [#21344 MMQ VGPR gfx1151](https://github.com/ggml-org/llama.cpp/pull/21344) | **Open** — has regressions on dense models at large batch | Apply via custom build **for MoE only** (21B–44B cataloged) |
 | [#20075 Hybrid SSM/MoE spec decode](https://github.com/ggml-org/llama.cpp/pull/20075) | **Open** — plus 4 extra fixes needed for non-Metal | Required if we want draft-model spec decode on Qwen3.5 MoE |
 | [#21948 Vulkan MUL_MAT_ID](https://github.com/ggml-org/llama.cpp/issues/21948) | **Closed, not planned** | Use ROCm for prefill-heavy MoE workloads |
 | [#13388 Metal MoE optimisation](https://github.com/ggml-org/llama.cpp/pull/13388) | **Merged (May 2025)** | Metal-only — reference for future Vulkan port |
 | [#15524 Vulkan subgroup opt](https://github.com/ggml-org/llama.cpp/pull/15524) | **Merged** | Already in our upstream-tracking toolbox images |
 | [#14854 >64 GB load hang](https://github.com/ggml-org/llama.cpp/issues/14854) | Open, unconfirmed | Workaround (`amd_iommu=off`) irrelevant for 64 GB box |
 ---
 ## 3. Delta analysis — current repo vs strix-benchmarks findings
 Legend: ✅ already applied · 🟡 partial · ❌ missing · ⚪️ n/a for our hardware
 | Item | Status | Location / notes |
 |---|---|---|
 | Symmetric q4_0/q4_0 KV cache default | ✅ | `scripts/serve/launch.sh:111` |
 | `-fa on`, `-ngl 99`, `--no-mmap` | ✅ | `scripts/serve/launch.sh:106–108` |
 | `ROCBLAS_USE_HIPBLASLT=1` for ROCm | ✅ | `scripts/serve/launch.sh:101` |
 | RADV `nogttspill` | ✅ | `scripts/optimize/power-profile.sh` (env.d) |
 | GTT sized to total RAM – 4 GiB | ✅ | `lib/detect.sh:recommended_gttsize_mib` |
 | `amd_iommu=off` for >64 GB models | ⚪️ | Not applicable — 64 GB system, largest model 44 GB |
 | **`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` when using ROCm** | ❌ | **Missing from `ENV_ARGS` in `launch.sh:99–102`** |
 | Draft-model speculative decoding (`-md` flag) | ❌ | Only n-gram simple is wired (`launch.sh:123–129`). Draft model `qwen3.5-0.8b-q8-draft` is in `models.conf:41` but unused |
 | PR #21344 MMQ VGPR gfx1151 tuning | ❌ | Toolbox containers track llama.cpp master; PR still open. Needs custom build |
 | `--reasoning-budget 0` wired for no-think | ✅ | `launch.sh:119` |
 | Docs: pp/tg vs context length decision matrix | ❌ | No "when to pick Vulkan vs ROCm" guidance in `docs/` |
 | Benchmark matrix covers `--ngram` / draft spec | 🟡 | `launch.sh` supports `--ngram`, but `scripts/benchmark/` doesn't include spec in sweeps |
 | Mixed K/V cache warning | ❌ | Not documented in `docs/` — users could try asymmetric combos and hit the 3× regression |
 | Batch stabilisation (`-b 256`) on small-batch crashes | ❌ | No fallback; community reports GPU hangs on small batches |
 Four real gaps worth closing, in priority order: (1) UMA env var for ROCm, (2) draft-model spec decode, (3) documentation of Vulkan-vs-ROCm decision matrix, (4) benchmark coverage of spec decode.
 ---
 ## 4. Hardware-adjusted expectations
 The 0xSero numbers are on **128 GB / 120 W** Framework Desktop. Our HP ZBook is **64 GB / 70 W sustained**. Two scalings matter:
 - **Memory ceiling**: REAP-20 Q6_K (76 GB) and Q8_0 stock (105 GB) are out of reach. Our realistic "max" is **REAP-40 Q4_K_M at 46 GB** (listed in `configs/models.conf:38` as 46 GB) on a GTT that currently exposes ~59 GiB — leaves ~13 GiB for KV cache/activations. Meaning: the +46 % "Full stack" chat number (28.3 s vs 41.5 s) is achievable qualitatively on our smaller models but absolute t/s will differ, and we should not chase larger quants without shrinking context first.
 - **Power ceiling**: 70 W limits sustained t/s by ~25–35 % vs 120 W based on our own Phase 2 power-profile benchmarks (41 → 57 t/s with ryzenadj 70 W cap). We already push as hard as the HP firmware allows.
 **What this means for the plan**: favour changes with high % wins (spec decode +30–40 %, UMA env var fixes pathological slowness) over absolute numbers that assume 120 W headroom.
 ---
 ## 5. Prioritised exploitation plan
 ### P0 — Zero-risk one-liners
 #### 5.1 Enable ROCm unified memory — carefully
 **File**: `scripts/serve/launch.sh:99–102`, and any ROCm path in `scripts/benchmark/run-suite.sh`.
 **The nuance (do not skip this)**: llama.cpp has **two** UMA mechanisms for HIP builds:
 - `-DGGML_HIP_UMA=ON` at build time (the canonical ROCm unified-memory flag; already baked into kyuz0's `llama-rocm-*` toolbox containers that this repo consumes per `docs/references.md`).
 - `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` at runtime (historically a CUDA flag; works on HIP builds because llama.cpp's CUDA backend is re-used via HIP translation — this is what 0xSero prescribes).
 Additionally, llama.cpp issue [#18159](https://github.com/ggml-org/llama.cpp/issues/18159) documents a **UMA-detection bug on AMD APUs with large TTM allocations** (our exact case): the detector reads `/proc/meminfo` when `integrated=true` and can under-report usable memory. Forcing `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can **aggravate** this on systems where it's already set via build flag.
 **Prescribed action** (with test gates, not a blind flag flip):
 ```bash
 if [[ "$BACKEND" == *rocm* ]]; then
    ENV_ARGS=(env ROCBLAS_USE_HIPBLASLT=1)
    if [[ "${ROCM_UMA_ENV:-auto}" == "1" ]]; then
        ENV_ARGS+=(GGML_CUDA_ENABLE_UNIFIED_MEMORY=1)
    fi
 fi
 ```
 Add opt-in flag `--rocm-uma-env` that sets `ROCM_UMA_ENV=1`. **Test plan**:
 1. Baseline: launch any ROCm backend with REAP-40 Q4_K_M (46 GB, largest model in catalog) and `-ngl 99`. Confirm there are no "graph split" warnings in server output and model loads in < 60 s. If yes → `GGML_HIP_UMA=ON` in the container is already doing the job, **do not set** the env var.
 2. If graph splits appear or load is pathologically slow (minutes for 46 GB), flip `--rocm-uma-env` and re-test.
 3. If still broken, inspect: `podman inspect <toolbox>` for the llama.cpp build flags, or build a custom container with `-DGGML_HIP_UMA=ON` explicitly.
 **Why this ordering**: 0xSero's warning ("without UMA env var, ROCm does 94 graph splits, >1 hr load") is from a custom container they built **without** `-DGGML_HIP_UMA`, where the runtime env var was the only path. The kyuz0 toolboxes we already use ship with HIP UMA baked in, so the env var may be redundant or harmful. Test before shipping.
 #### 5.2 Document the Vulkan-vs-ROCm decision matrix
 **File**: new section in `docs/inference-optimization-landscape.md` (existing doc, right home) or `docs/benchmarking.md`.
 **Content**: publish the context-scaling winners table from 0xSero (Appendix B below), annotated with "Vulkan < 2K ctx, ROCm ≥ 8K ctx for prefill" and "ROCm + spec decode = best tg in every bucket".
 **Why**: currently the repo has two ROCm toolboxes and two Vulkan ones but zero prose on when to pick which. This unblocks anyone (including future-you) from having to rediscover the same trade-off.
 #### 5.3 Add a warning against asymmetric K/V cache quantisation
 **File**: `scripts/serve/launch.sh` header comment + `docs/inference-optimization-landscape.md`.
 **Content**: explicit "do not mix cache-type-k ≠ cache-type-v on Vulkan — 3× regression or timeout at 32K context." Keep the current symmetric q4_0 default.
 ### P1 — Medium effort, high payoff
 #### 5.4 Wire draft-model speculative decoding in `launch.sh`
 **File**: `scripts/serve/launch.sh` — add `--draft` flag (long form `-md`/`--model-draft` in llama.cpp) alongside existing `--ngram`.
 **Implementation sketch** (aligned with current flag parsing style at lines 19–52):
 ```bash
 --draft)  DRAFT_MODEL="$2"; shift 2 ;;
 # ...
 if [[ -n "$DRAFT_MODEL" ]]; then
    DRAFT_PATH="$(find -L "$MODEL_DIR" -type f -name "$DRAFT_MODEL" -print -quit)"
    SERVER_ARGS+=(
        -md "$(realpath_for_toolbox "$DRAFT_PATH")"
        -ngld 99
        --draft-max 8 --draft-min 0
        --draft-p-min 0.5
    )
 fi
 ```
 **⚠️ Tokenizer compatibility — the make-or-break constraint for draft-model spec decode.** llama.cpp requires the target and draft models to share the **same tokenizer/vocabulary** (otherwise token IDs are incoherent across the two models and it refuses to load, or produces garbage). The families in our current catalog split into **three disjoint tokenizer buckets**:
 | Target family | Tokenizer / vocab | Compatible draft from our catalog? |
 |---|---|---|
 | Qwen3.5 dense + Qwen3.5 MoE (incl. REAP-40) | Qwen3.5-tokenizer, vocab **248320** | ✅ `qwen3.5-0.8b-q8-draft` (confirmed by lmstudio-bug-tracker #1597 — same vocab size and token IDs across dense & MoE; llama.cpp itself does not block this, only LM Studio filters it out) |
 | Qwen3.6 MoE (hybrid DeltaNet) | Qwen3.6 tokenizer, vocab likely ≠ 248320 | ❌ need a Qwen3.6 draft — not in catalog |
 | Qwen3-Coder family (Qwen3-Coder-30B-A3B, Qwen3-Coder-Next) | Qwen2Tokenizer, vocab **151 936** | ❌ **cannot use `qwen3.5-0.8b-q8-draft`** — different tokenizer, different vocab size (Qwen3-Coder retains the older 151936 Qwen2 vocab; Qwen3.5 uses 248320) |
 | GLM-4.7-Flash | ChatGLM tokenizer | ❌ no draft in catalog |
 | Gemma-4 family | Gemma tokenizer | ❌ no draft in catalog |
 | Nemotron-Cascade-2 | Mixed SSM/attention; separate tokenizer | ❌ no draft in catalog |
 **What to actually enable first (corrected from v1 of this doc)**:
 - ✅ `--draft qwen3.5-0.8b-q8-draft.gguf` with target `qwen3.5-35b-a3b-*` or `qwen3.5-122b-a10b-reap40-q4` — **this is the only safe pair in our catalog today.**
 - ❌ Do **not** pair `qwen3.5-0.8b-q8-draft` with Qwen3-Coder models; load will fail. (Options for Qwen3-Coder: either add a small Qwen3-Coder draft to the catalog if Unsloth publishes one, or use `--ngram` instead — repetitive coding outputs are the exact case n-gram excels at.)
 **Additional caveat for SSM/MoE hybrids**: Qwen3.5 MoE + draft decode is a hybrid SSM/attention architecture, and 0xSero documented that stock llama.cpp master has four bugs that corrupt spec-decode state on these models (hybrid `seq_rm` attention cleanup, soft rollback position erase, compat check needs checkpoint, recurrent reserve undersize). Upstream PR #20075 addresses some; the four extra fixes are not yet upstream. **Concrete implication**: even with matching tokenizer, expect wrong outputs on Qwen3.5-35B-A3B until those patches land (see P2 container build). Validate on a pure-attention target first — but our catalog doesn't have one in the draft-eligible size range. **Net recommendation for tomorrow**: wire the `--draft` flag in launch.sh, but only flip it on after P2.6 + P2.7 (patched ROCm container) ships, OR use it on the Vulkan stock build with the expectation that hybrid rollback corruption may bite on long outputs. Smoke-test with a short chat workload first.
 **Expected impact (if target–draft pair is compatible AND patches are in place)**: +20–40 % real-workload tg on Qwen3.5 MoE per 0xSero (no_think mode hits 90 % acceptance, +40 %).
 **⚠️ llama-server spec-decode reuse bug (issue [#19231](https://github.com/ggml-org/llama.cpp/issues/19231))**: spec decoding was confirmed to work only on the **first** request to `/v1/chat/completions`; subsequent requests on the same slot lose all draft acceptance and run **slower** than no-spec (46.55 vs 88.90 t/s reported). Filed Jan 31 2026. Related fix PR #19261 exists but not confirmed merged/effective. **Validation protocol** before adding `--draft` to the user-facing launch script: run 10 consecutive `/v1/chat/completions` requests against the same slot and confirm acceptance rate remains stable in the server log. If it decays, hold the feature on `/v1/completions` only (raw completion endpoint) or pin a known-good llama.cpp commit. The agentic eval harness at `scripts/agentic/run-eval.sh` uses the chat endpoint, so this bug would actively harm eval runs if left unvalidated.
 **Slot initialization quirk (issue #17989)**: `--parallel 1` in current llama-server initializes **4 slots** instead of 1 despite what the docs say. On a 64 GB system this silently eats KV-cache headroom. Mitigation: pass `-np 1` explicitly (accepted as truthy) and inspect server startup log for `n_slots = 1`. If log shows 4, either update to a fixed llama.cpp commit or adjust `--ctx` expectations accordingly (KV for 4 slots at 131 K ctx with q4_0 ≈ 8 GB, not 2 GB).
 #### 5.5 Extend benchmark matrix to cover spec decode
 **File**: `scripts/benchmark/run-suite.sh` — add `--spec` / `--draft MODEL` flag that, when set, runs a second pass with spec decode enabled and produces a delta row in `summary.json`.
 **Why**: the existing suite gives us no visibility into whether draft decode is actually helping on *our* hardware with *our* quants. Can't optimise what we don't measure.
 ### P2 — Larger experiments (backlog, require custom build)
 #### 5.6 Build a custom ROCm container with PR #21344 applied
 **What**: fork one of the `llama-rocm-7.2.1` toolbox Containerfiles (kyuz0/amd-strix-halo-toolboxes is the current upstream the repo references in `docs/references.md`), apply PR #21344, and produce a sibling image `llama-rocm-7.2.1-mmq` or similar. Then expose it as a selectable backend in `launch.sh:77–84`.
 **Reframing** (important context for the decision): PR #21344 is arguably **regression recovery** rather than pure optimisation. llama.cpp issue [#17917](https://github.com/ggml-org/llama.cpp/issues/17917) documents that commit 668ed76 (enabled WMMA-MMQ INT kernels for RDNA 3) **regressed pp2048 on gfx1151 from ~900 t/s to ~543 t/s** (−40 %). Issue was closed as "not planned". PR #21344 tunes the exact same MMQ code path (tile sizes, warp counts, VGPR pressure). So when 0xSero reports +19–35 % prefill on a patched ROCm 7.2.1 container, part of that gain is likely **recovering the 40 % that 668ed76 threw away**. This is also why the stock kyuz0 `llama-rocm-7.2.1` container (which tracks master including 668ed76) looks slow on MoE prefill. Our repo's existing [`docs/inference-optimization-landscape.md:48–49`](inference-optimization-landscape.md) already flags #17917 — this fixes it.
 **Why**: PR #21344 is the only path to the +35 % prefill / +19 % ROCm path. Upstream is unlikely to merge soon (dense-model regressions unresolved). Our cataloged models are **all MoE** in the target size range (GLM-4.7-Flash, Qwen3.5-35B-A3B, Qwen3.6-35B-A3B, Qwen3-Coder-30B-A3B, REAP-40) — exactly the class where PR #21344 helps. Dense-model regression is not our problem (we have no dense model >27 B in rotation for ROCm).
 **Cost**: ~1–2 hours to set up the fork + CI to rebuild weekly.
 **Build flag nuance**: the PR author's isolated-validation build uses `-DGGML_HIP_ROCWMMA_FATTN=OFF` to isolate MMQ gains; production builds on gfx1151 for **long-context** decode (≥ 16 K) benefit from `-DGGML_HIP_ROCWMMA_FATTN=ON` per llm-tracker's findings and per the existing `inference-optimization-landscape.md:35–38` (which prescribes both `UMA=ON` and `ROCWMMA_FATTN=ON`). For our MoE-prefill-heavy case, **prefer `=ON`** (we hit 8 K–32 K contexts routinely via `--ctx 131072`), and confirm via A/B. Keep `-DGGML_HIP_UMA=ON` so the container doesn't depend on the runtime env var (see 5.1 nuance). Full build flag set consistent with the existing landscape doc: `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`.
 **Guardrail**: keep stock `llama-rocm-7.2.1` alongside so we can A/B.
 #### 5.7 Add PR #20075 + 4 spec-decode fixes to the patched ROCm container
 **What**: same container as 5.6 picks up PR #20075 plus the four additional fixes documented by 0xSero (hybrid `seq_rm` attention cleanup, soft rollback position erase, compat check checkpoint creation, recurrent reserve = `max(16, n_seq_max * 16)`).
 **Why**: unlocks draft spec decode for Qwen3.5 MoE hybrids — the REAP-40 and Qwen3.6-35B-A3B models we actually run. Without this, draft decode on these models produces stale-state corruption.
 **Risk**: patches are not upstream; if/when they merge, the fork becomes redundant. Track the PR status.
 ### P3 — Nice-to-have, low urgency
 - Add a `--batch-size 256` fallback in `launch.sh` (`-b 256`) documented as "enable only if you hit GPU hangs on small-batch MoE." Not default — the stabilisation comes with throughput cost. (llm-tracker documents small-batch GPU hangs on HIP and Vulkan; not currently observed on our system but cheap insurance.)
 - Track PR #13388's porting to Vulkan (unlikely soon but worth watching; would fix MUL_MAT_ID bottleneck and flip Vulkan prefill competitiveness).
 - Monitor issue #14854 resolution for the general-purpose fix to >64 GB Vulkan loads (irrelevant to us at 64 GB today but becomes relevant if we ever move to 128 GB hardware).
 - Track [TurboQuant KV cache](https://github.com/ggml-org/llama.cpp/discussions/20969) upstream integration — TQ3 (3-bit) / TQ4 (4-bit) KV cache types claim 4.6–4.9× compression vs f16 with near-zero quality loss, and a HIP/ROCm port for gfx1100 already exists in discussion #21526. Direct relevance on 64 GB: would push the practical context ceiling from ~131 K to ~256 K on MoE without hitting GTT pressure. Already watchlisted in `inference-optimization-landscape.md:246–254`.
 ---
 ## 9. Enhancement opportunities (surfaced during review, not in the source site)
 These are genuinely net-new on top of the strix-benchmarks findings — they are not documented at 0xSero's site but are high-ROI additions for our specific use cases (agentic evals, mixed-model catalog).
 ### 9.1 Host-memory prompt cache (`--cram`) for agentic eval loops
 **What**: llama-server gained a host-memory prompt cache in 2026 (llama.cpp discussion [#20574](https://github.com/ggml-org/llama.cpp/discussions/20574)). Flag: `--cram N` (N = MiB, typical 256–1024). Stores pre-computed prompt representations in system RAM; on subsequent requests with a shared prefix, the server skips prefill entirely for the matched portion.
 **Why it matters for us**: `scripts/agentic/run-eval.sh` runs hundreds of requests against the same system prompt + tool schema. Reported TTFT reduction: **93 %** on 8 K system prompts (4.3 s → 0.3 s). On a 70 W power-capped machine, eliminating redundant prefill is pure profit.
 **Action**: add `--cram 512` to `launch.sh` when `--parallel > 1` OR when a new `--agentic` convenience flag is set. Cost: ~2 GB system RAM. Risk: low (fails open if cache misses).
 **Caveat**: interaction with spec decode + quantised KV cache is not documented upstream. Validate that cache hits don't corrupt draft acceptance stats before combining with P1.4.
 ### 9.2 GLM-4.7-Flash — do not run on Vulkan (output-corruption regression)
 **What**: llama.cpp issue [#18835](https://github.com/ggml-org/llama.cpp/issues/18835) documents a Vulkan regression affecting **all GLM-4.7 releases after b7667**: longer prompts produce repeating Chinese characters / symbols. CUDA/ROCm work correctly. No upstream fix landed.
 **Direct impact on us**: `configs/models.conf:15` lists `glm-4.7-flash-q6` as a MoE daily-driver-class model in the catalog. Our default backend `launch.sh:11` is `llama-vulkan-radv`. **A user launching `bin/serve -m GLM-4.7-Flash-UD-Q6_K_XL.gguf` today on defaults gets corrupted long-prompt output.**
 **Action**: either (a) add a model → backend preference table in `launch.sh` that routes GLM-4.7-* to a ROCm backend automatically, or (b) hard-warn at startup when launching GLM-4.7-* on a Vulkan backend. Option (a) is preferred because it's a silent correctness fix. Cost: ~10 lines of bash in `launch.sh`.
 ### 9.3 `HSA_OVERRIDE_GFX_VERSION=11.0.0` — gfx1100 kernel fallback for rocBLAS paths
 **What**: llm-tracker documents that hipBLASLt lacks many optimized kernels for gfx1151 and that `HSA_OVERRIDE_GFX_VERSION=11.0.0` forces rocBLAS to use gfx1100 kernels, which can be **2–6× faster** on small matmul paths for certain operations (1024×1024 at ~6 TFLOPS → ~20 TFLOPS). This does NOT override llama.cpp's AMDGPU target — the kernels are still compiled for gfx1151 — but the rocBLAS runtime library picks gfx1100 paths when available.
 **Why it matters for us**: complementary to `ROCBLAS_USE_HIPBLASLT=1` (already set). Zero build cost, just an env var, reversible. Could help ROCm backends on prefill paths where MMQ isn't the bottleneck.
 **Action**: treat the same way as 5.1 — opt-in env var (`ROCM_GFX_OVERRIDE=1`), validate on our benchmark suite before making default. Risk: some operations may produce incorrect results if gfx1100 kernels have instruction-set expectations that don't hold on gfx1151; the claim of "correctness" comes from community reports, not AMD validation.
 ### 9.4 Wire spec-decode (`--draft`, `--ngram`) through `bin/agentic`
 **What**: `scripts/agentic/run-eval.sh` currently hits whatever llama-server is running without controlling its config. If P1.4 ships a `--draft` flag on launch.sh, the agentic eval binary should be able to pass it through so eval runs can A/B spec decode on/off.
 **Why it matters**: our agentic eval is the most repetitive-output workload we run (structured tool calls, JSON outputs) — exactly the case where draft acceptance is highest. Could halve eval wall-clock time once P1.4 is solid.
 **Action**: add matching `--draft`, `--ngram`, `--cram` pass-through flags to `scripts/agentic/run-eval.sh`; document in `docs/agentic-benchmarks.md`. Small change, wait until P1.4 has been validated against bug #19231.
 ### 9.5 Model → backend routing table in `launch.sh`
 **What**: summarise the decision matrix (§5.2 + §9.2) into an executable default. Route known-problematic target → known-good backend automatically; let the user override with `--backend` as today.
 **Why it matters**: right now users must read docs to know that GLM-4.7 wants ROCm, that > 16 K context prefers ROCm+MMQ, that Qwen3.5 MoE + spec decode needs the patched container. A 20-line table in `launch.sh` turns that knowledge into correct-by-default behaviour.
 **Action**: define as `declare -A DEFAULT_BACKEND` mapping glob patterns to backends; resolve after model detection at `launch.sh:64–68`; print the routing decision in the startup log so it's visible and overridable.
 ---
 ## 6. Critical files (for implementation when the plan is executed)
 - `scripts/serve/launch.sh` — env vars, flag wiring for P0.1 and P1.4
 - `scripts/benchmark/run-suite.sh` — spec-decode matrix for P1.5
 - `configs/models.conf` — draft-target pairing hint (line 41 already exists)
 - `docs/inference-optimization-landscape.md` — decision matrix, K/V warning (P0.2, P0.3)
 - `docs/references.md` — add strix-benchmarks.vercel.app and framework-max repo
 - Future: new `containers/llama-rocm-7.2.1-mmq/` tree for P2.6
 ---
 ## 7. Verification (after each change is applied)
 1. `make audit` — ensure no regression in system status.
 2. `make verify` — 9-point optimisation checklist still passes.
 3. For P0.1 (UMA env var): launch a ROCm backend with a >35 GB MoE model and confirm **no graph-split warnings** in server output, load completes in seconds (not hours).
 4. For P1.4 (draft decode): run `bin/serve -m qwen3-coder-30b-a3b-...gguf --draft qwen3.5-0.8b-q8-draft.gguf`, measure tg/s against the no-draft baseline on an agentic coding prompt.
 5. For P1.5 (bench matrix): compare `summary.json` delta between `--spec`/no-spec runs; expect +15–40 % tg depending on workload repetitiveness.
 6. Do not stack PR #21344 + `-ffast-math` + hipBLASLt simultaneously (strix-benchmarks confirmed 354 → 230 t/s regression). Test each patch in isolation first.
 ---
 ## 8. References
 - **Source site**: https://strix-benchmarks.vercel.app/
 - **Framework-max repo (404 at time of check)**: https://github.com/0xSero/framework-max
 - **HF models**: https://huggingface.co/0xSero (REAP-20, REAP-40 variants)
 - **Upstream PRs**: llama.cpp #21344, #20075, #13388, #15524
 - **Upstream issues**: llama.cpp #21948 (closed not-planned), #14854 (unconfirmed)
 - **Chipsandcheese RDNA3 Infinity Cache**: https://chipsandcheese.com/p/rdna-3s-infinity-cache-friend-or
 - **Community cross-refs**: kyuz0/amd-strix-halo-toolboxes (toolbox images we already consume), llm-tracker.info Strix-Halo page (hipBLASLt recipe)
 ---
 ## Appendix A — Full upstream-contribution log (for credit/attribution)
 | Date | Type | Link | Impact |
 |---|---|---|---|
 | Apr 15 | Issue | #21948 | Documented MUL_MAT_ID profiling bottleneck |
 | Apr 15 | PR | #21344 | Validated MMQ VGPR tuning: +19–35 % prefill on gfx1151 MoE |
 | Apr 15 | PR | #20075 | Four hybrid SSM/MoE spec decode fixes |
 | Apr 14 | Kernel workaround | N/A | `amd_iommu=off` resolution (for 128 GB systems) |
 ## Appendix B — Full context-scaling table (0xSero, Vulkan vs ROCm+MMQ)
 | Ctx | Vk pp | Vk tg | Vk+spec tg | ROCm pp | ROCm tg | ROCm+spec tg | Best decode |
 |---|---|---|---|---|---|---|---|
 | 64 | 98 | 24.3 | 26.6 | 130 | 17.8 | 28.5 | ROCm+spec |
 | 512 | 303 | 24.4 | 28.1 | 376 | 17.7 | 23.3 | Vk+spec |
 | 2K | 353 | 24.2 | 25.2 | 423 | 17.7 | 26.7 | ROCm+spec |
 | 4K | 362 | 24.2 | 25.5 | 413 | 17.7 | 25.3 | ROCm+spec |
 | 8K | 371 | 23.9 | 25.1 | 407 | 17.4 | 27.4 | ROCm+spec |
 | 16K | 353 | 23.4 | 25.0 | 371 | 17.0 | 23.6 | Vk+spec |
 | 32K | 314 | 22.6 | 18.6 | 315 | 16.2 | 22.9 | ROCm+spec |
 Rule-of-thumb: **ROCm's MMQ prefill advantage fades after 16K**; beyond that flash attention dominates and Vulkan catches up.
 ## Appendix C — MUL_MAT_ID profiling (why MoE prefill is slow on Vulkan)
 | Context | MUL_MAT_ID % | Total prefill ms |
 |---|---|---|
 | 512 | 66.2 % | 1700 |
 | 8K | 57.6 % | 1839 |
 | 32K | 41.9 % | 2556 |
 | 128K | 19.7 % | 5351 |
 Constant ~1050 ms overhead regardless of context → compute dispatch inefficiency in expert routing, **not** bandwidth. Closed as not-planned upstream. Workaround: use ROCm for MoE prefill-heavy workloads (8K–16K ctx sweet spot).
 ---
 ## Revision notes (2026-04-17 review pass)
 A self-review sweep after v1 caught and fixed the following issues. Recording them here so future-you can see what was wrong and why, in case residual errors surface.
 1. **Wrong draft-model pairing in P1.4 (critical)**. v1 recommended enabling `--draft qwen3.5-0.8b-q8-draft.gguf` first with `qwen3-coder-30b-a3b`. **Wrong**: Qwen3-Coder uses the Qwen2 tokenizer (vocab **151 936**) while Qwen3.5 uses vocab **248 320** (lmstudio-bug-tracker #1597 confirms the Qwen3.5 dense–MoE pair is vocab-identical; Qwen3-Coder is not). llama.cpp will refuse that pair. Corrected: the only tokenizer-compatible draft pair in the current catalog is `qwen3.5-0.8b-q8-draft` with `qwen3.5-35b-a3b-*` or `qwen3.5-122b-a10b-reap40-q4`. Qwen3-Coder should use `--ngram` until a Qwen3-Coder-specific draft ships.
 2. **UMA env var recommendation was too blunt**. v1 prescribed always setting `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` for ROCm backends. llama.cpp issue [#18159](https://github.com/ggml-org/llama.cpp/issues/18159) documents that on AMD APUs with large TTM allocations (exactly our case) the UMA detector can mis-read `/proc/meminfo` and **forcing the env var can aggravate it**. The kyuz0 containers we already consume are likely built with `-DGGML_HIP_UMA=ON`, making the runtime env var redundant or counterproductive. Corrected: opt-in flag with test gates, not unconditional.
 3. **REAP-40 size off by 2 GB**. v1 said 44 GB; `configs/models.conf:38` lists 46 GB. Corrected throughout.
 4. **GTT actual ≠ requested**. v1 said "60 GB GTT"; actual sysfs reports ~59.0 GiB (63 350 767 616 bytes). Leaves ~13 GiB headroom above the 46 GB REAP-40 — enough for KV cache at q4_0 but tight. Corrected in §4.
 5. **PR #21344 build flag nuance**. v1 quoted `-DGGML_HIP_ROCWMMA_FATTN=OFF` from the PR without context. That's the author's isolation-test build; production gfx1151 users (llm-tracker) prefer `ON` for long-context decode. Flagged in §5.6.
 6. **Hardware baseline was from memory, not sysfs**. v1 recited 64 GB / gfx1151 / Fedora 43 from user memory. Re-verified against `/proc/meminfo` (65 074 108 kB), `/proc/cmdline` (kernel 6.19.8, `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`), and sysfs (GTT total 63.35 GB, VRAM 512 MiB). Updated preamble.
 **Still-unverified claims in this doc** (worth validating empirically before committing effort):
 - "+20–40 % spec decode gain transfers to our 35B MoE on our 70 W cap" — the 0xSero numbers are on a 122B MoE on 120 W. Scaling is not guaranteed. Measure once 5.4 + 5.5 ship.
 - "PR #21344 helps all MoE in our catalog" — PR validated on Qwen3.5-122B and Mistral-Small/Nemotron-120B; extrapolation to Qwen3.6-35B-A3B / Qwen3-Coder-30B-A3B / Gemma-4-26B-A4B is an assumption, not a measurement.
 - "kyuz0 containers ship `-DGGML_HIP_UMA=ON`" — Containerfile fetch returned 404; could not directly inspect. Assumption based on DeepWiki description and the existing `inference-optimization-landscape.md:35–37` which prescribes that flag as canonical. Verify by cloning the repo locally and reading `Containerfile.rocm-*` before building our fork.
 - "llama.cpp issue #19231 is still live on master" — confirmed as of Jan 31 2026 with related PR #19261; did not confirm whether #19261 has merged. Validate with a live test against our current kyuz0 container before trusting spec decode in production.
 ---
 ## Revision notes — second sweep (2026-04-17)
 A deeper review after v2 surfaced the following issues and enhancement opportunities. Recording here so the history is auditable.
 ### Additional corrections
 1. **PR #21344 reframed as regression recovery**. First sweep characterised it as a pure +19–35 % optimisation. Second sweep found [issue #17917](https://github.com/ggml-org/llama.cpp/issues/17917) documenting that commit 668ed76 (WMMA-MMQ INT kernels for RDNA 3) regressed gfx1151 prefill by **−40 %** (pp2048 900 → 543 t/s). PR #21344 tunes the same code path. Much of 0xSero's reported "gain" is likely recovering what 668ed76 threw away. §5.6 now says so explicitly; this changes how you pitch the work ("we're catching up to our own April 2025 performance") rather than "we're unlocking something new".
 2. **llama-server spec decode `/v1/chat/completions` reuse bug** added to §5.4. llama.cpp issue [#19231](https://github.com/ggml-org/llama.cpp/issues/19231) shows spec decode only works on the first chat-completion request; subsequent slots drop draft acceptance entirely and run **slower** than no-spec. This is live as of Jan 31 2026 and has an unmerged fix PR #19261. Validate before trusting spec decode in agentic eval loops.
 3. **GLM-4.7-Flash Vulkan output corruption** (§9.2). llama.cpp issue [#18835](https://github.com/ggml-org/llama.cpp/issues/18835) — Vulkan produces repeating Chinese characters on long prompts post-b7667. GLM-4.7-Flash is in our catalog (`configs/models.conf:15`) and defaults to Vulkan. A user running `bin/serve -m GLM-4.7-Flash-UD-Q6_K_XL.gguf` today gets corrupted output. Needs either automatic backend routing or a startup warning.
 4. **`--parallel 1` slot bug** (§5.4). Issue [#17989](https://github.com/ggml-org/llama.cpp/issues/17989) — passing `--parallel 1` initialises 4 slots on current master. Silent 4× KV cache overhead on a constrained 64 GB system. Mitigation: pass `-np 1` and verify `n_slots=1` in startup log.
 ### New enhancement opportunities (§9)
 5. **Host-memory prompt cache `--cram`** (§9.1) — 93 % TTFT reduction for agentic eval loops per llama.cpp discussion #20574. Not covered by 0xSero.
 6. **Model-to-backend routing table** (§9.5) — turns the decision matrix into a correct-by-default `launch.sh`. Silences the GLM-4.7 corruption bug without user intervention.
 7. **`HSA_OVERRIDE_GFX_VERSION=11.0.0`** (§9.3) — gfx1100 rocBLAS kernel fallback, 2–6× on small-matmul paths per llm-tracker. Complementary to the existing `ROCBLAS_USE_HIPBLASLT=1`.
 8. **TurboQuant KV cache watch-item** in P3 — 4.6–4.9× KV compression vs f16; HIP port for gfx1100 exists per discussion [#21526](https://github.com/ggml-org/llama.cpp/discussions/21526); would push our context ceiling from 131 K to ~256 K without GTT pressure.
 9. **Cross-reference to `inference-optimization-landscape.md`** added at the top of §1 to prevent duplication. That existing doc already covers the canonical Strix Halo build flags, TurboQuant, Unsloth Dynamic 2.0, and issue #17917 — this overlay doc shouldn't re-litigate those.
 ### Confirmed OK (no change needed)
 - `--no-mmap` flag — matches `docs/inference-optimization-landscape.md:43–45` ("critical for ROCm on Strix Halo to avoid catastrophically slow loading"), already in `launch.sh:107`.
 - `iommu=pt` choice for 64 GB system (not `amd_iommu=off`) — aligns with kyuz0 toolboxes' production config and Gygeek/Framework-strix-halo-llm-setup; the strix-benchmarks `amd_iommu=off` threshold is >64 GB model loads, never hit on a 64 GB box.
 - Firmware version `linux-firmware-20260309-1.fc43.noarch` is **not** the known-bad 20251125 flagged in `lib/detect.sh:detect_firmware_bad`.
 - Q4_0/Q4_0 symmetric KV cache default — matches both 0xSero's Vulkan sweep and our existing `launch.sh:111–112`.
 ### Things worth considering but deliberately out of scope here
 - Evaluating KTransformers / vLLM for agentic serving — covered in `inference-optimization-landscape.md:§1.2, §1.7`; requires a separate decision point about when local llama.cpp stops being sufficient.
 - Fine-tuning flows (PyTorch + ROCm + aotriton) — covered in llm-tracker; not in this repo's remit.
 - Dense Qwen3.5-27B via ROCm with PR #21344 — PR has documented **regressions** on 8B dense at batch ≥ 128. Not tested for 27B; assume risk and prefer Vulkan for dense models until measured.
 ## Appendix D — Exact reproduce commands (from 0xSero, for reference)
 ### Hardware/OS baseline
 - AMD Ryzen AI MAX+ 395 (16C/32T Zen 5), Radeon 8060S (gfx1151, RDNA 3.5, 40 CU), 128 GB LPDDR5X-8000, Fedora 43, kernel 6.17.1
 ### Kernel params (128 GB system)
 ```
 sudo grubby --update-kernel=ALL --args='amd_iommu=off ttm.pages_limit=335544321 amdgpu.gttsize=122880'
 # verify:
 cat /sys/class/drm/card*/device/mem_info_gtt_total   # ~128849018880 (120 GiB)
 ```
 ### Vulkan stock baseline bench
 ```
 export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH
 $HOME/.local/opt/llama.cpp/current/llama-bench \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 -n 128
 ```
 ### ROCm with PR #21344 (inside custom podman container)
 ```
 export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
 llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 -n 128
 # Build flags used: -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DGGML_HIP_ROCWMMA_FATTN=OFF
 ```
 ### Speculative decoding (server mode)
 ```
 $HOME/.local/opt/llama.cpp/current/llama-server \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -md $HOME/.local/share/models/gguf/Qwen3.5-0.8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -ngld 99 -fa 1 -c 4096 \
  --draft-max 8 --parallel 1
 ```