Felipe Cardoso 8872e2c96e docs: add strix-benchmarks deep-dive and exploitation plan

Aggregate findings from the 0xSero strix-benchmarks project (identical
gfx1151 hardware) into a single reference: production-config comparison,
real-workload timings, PR #21344 (ROCm MMQ tuning) and PR #20075 (Vulkan
spec decode) status, kernel-param caveats, and a payoff-ordered
exploitation plan against the current repo state to drive follow-up
optimisation work.

2026-04-26 20:06:23 +02:00


Strix Halo Benchmark Deep-Dive — Research + Exploitation Plan

Source: https://strix-benchmarks.vercel.app/ (0xSero / framework-max project)
Captured: 2026-04-17
Revised: 2026-04-17 (post-review fixes; see "Revision notes" at end)
Target system (verified from /proc/meminfo, /proc/cmdline, sysfs): HP ZBook Ultra G1a, Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 65 074 108 kB RAM (~62 GiB visible + 512 MiB BIOS VRAM = ~64 GB total), Fedora 43 kernel 6.19.8, currently booted with iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496, actual GTT total ~59.0 GiB, 70 W sustained (HP firmware cap)
Reference system: Framework Desktop, same CPU/GPU, 128 GB LPDDR5X-8000, 120 W sustained

1. Context

The 0xSero team spent ~1 month exhaustively benchmarking llama.cpp on Strix Halo hardware identical to ours (same APU, same driver family), documenting every regression and filing 2 upstream PRs + 1 issue. Their hardware has 2× our RAM and 1.7× our power budget, so their absolute numbers don't transfer 1:1, but the relative findings, flag choices, and build recipes do. This document aggregates their findings, cross-references them against the current repo state (scripts/optimize/, scripts/serve/, configs/models.conf), and describes how to exploit each insight — ordered by payoff-to-effort ratio.

TL;DR outcome: ~7 concrete levers, of which 3 are zero-risk one-liners and 2 are larger experiments (custom ROCm container with PR #21344, draft-model spec decode) that could unlock +20–40 % real-workload throughput on MoE models we already run. Plus 3 genuine enhancement opportunities surfaced during review (§9): host-memory prompt cache for agentic evals, GLM-4.7-Flash Vulkan workaround, and the HSA_OVERRIDE_GFX_VERSION rocBLAS kernel fallback.

Cross-reference: docs/inference-optimization-landscape.md is the broader survey (engines, quantisation, attention, MoE, OS). This document is the strix-benchmarks-specific overlay — findings, PR status, and the exploitation plan against the current repo state. Where the two overlap (e.g. -DGGML_HIP_UMA=ON, --no-mmap), the landscape doc is the canonical reference; this doc should be read as an addendum with the specific numbers and PR traces.

2. Source summary — what strix-benchmarks.vercel.app actually proves

2.1 Four production configurations compared (Qwen3.5-122B-A10B-REAP-20-Q6_K, 76 GB)

Config	Backend	Patches	pp512	tg128	Best for
Vulkan stock	Vulkan RADV	—	303	24.7	Minimum latency
Vulkan + spec	Vulkan	PR #20075 + 4 fixes	302	35.3	Fast decode, easy deploy
ROCm + MMQ	ROCm 7.2.1	PR #21344	406	18.2	Long prefill, RAG
ROCm full stack	ROCm 7.2.1	#21344 + #20075 + 4 fixes	404	40.1	Overall chat throughput

2.2 Real-workload timing (server mode, 256-token generation, no-think, temp=0.3)

Workload	Vulkan stock	Vk + spec	ROCm + MMQ	ROCm full	Winner
Chat (30 in / 1K out)	41.5 s	31.9 s (+30 %)	56.9 s (−27 %)	28.3 s (+47 %)	Full stack
Code gen (2K in / 2K out)	118.3 s	99.1 s (+19 %)	138.0 s (−14 %)	80.9 s (+46 %)	Full stack
Summarize (8K in / 256 out)	155.9 s	153.5 s (+2 %)	114.5 s (+36 %)	107.2 s (+46 %)	Full stack
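
The percentages in this table read as speedups relative to Vulkan stock, that is, baseline time divided by the new time, minus one. A minimal sketch of that arithmetic, assuming this interpretation is right (the function name `speedup_pct` is illustrative and does not come from the benchmark site):

```shell
# speedup_pct BASELINE_SECONDS NEW_SECONDS
# Prints the speedup of NEW relative to BASELINE as a whole percent.
# Assumption: the table's percentages are (baseline / new - 1) * 100.
speedup_pct() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (base / new - 1) * 100 }'
}

speedup_pct 41.5 28.3   # chat, ROCm full stack vs Vulkan stock
speedup_pct 41.5 56.9   # chat, ROCm + MMQ vs Vulkan stock
```

For the chat row this prints 47 and -27, matching the table. The same helper can be reused when comparing our own timings against a Vulkan stock baseline.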

2.3 Five key technical findings

Vulkan wins memory mapping. vkAllocateMemory(HOST_VISIBLE_BIT) maps the full GTT correctly. ROCm's hipMalloc is trapped in BIOS VRAM unless GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set — without it, the 122B model takes >1 hour due to 94 graph splits.
ROCm wins batched compute. PR #21344 (MMQ VGPR tuning for gfx1151) gives +19–35 % prefill on large MoE models. Still open; gfx1151-specific tunings: mmq_x_max=48, mmq_y=64, nwarps=4.
Spec decode stacks orthogonally. ROCm single-token decode is slower than Vulkan, but its batched verify is so cheap that ROCm + draft-model spec decode becomes the fastest overall config (40.1 tg vs Vulkan stock's 24.7 = +62 %).
Decode is context-flat; prefill is context-expensive. 512 tokens or 131K tokens → tg stays at 22–24 t/s. Only prefill degrades.
MUL_MAT_ID is the hidden villain on Vulkan. 42–66 % of prefill time on large MoE models. Filed as issue #21948 — closed as "not planned". No upstream fix coming. Metal's PR #13388 (map → batched matmul → unmap, 1.8–4.1× speedup) is the template but no Vulkan port exists.

2.4 Kernel parameters (critical distinction vs our current setup)

Strix-benchmarks prescribes amd_iommu=off (not iommu=pt) for 128 GB systems. This is required for any Vulkan model >64 GB to load — without it the kernel hangs for >20 min (llama.cpp issue #14854). They also use ttm.pages_limit=335544321 and amdgpu.gttsize=122880 (120 GiB GTT).

Our repo currently applies iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496 (~59 GiB GTT, 64 GB system). The amd_iommu=off threshold (>64 GB model) is never crossed on a 64 GB box: our biggest cataloged model is REAP-40 Q4_K_M at 46 GB (configs/models.conf:38), well under the cliff. So iommu=pt remains correct for us. No change needed here.
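
As a sanity check on our own kernel parameters, the two limits can be converted to GiB to confirm they describe the same ceiling. This is a small sketch that assumes 4 KiB pages (the usual x86-64 base page size); the helper names are illustrative:

```shell
# Convert the TTM page limit and the amdgpu GTT size into GiB.
# Assumption: ttm.pages_limit counts 4 KiB pages; amdgpu.gttsize is in MiB.
pages_to_gib() {
    awk -v pages="$1" 'BEGIN { printf "%.1f\n", pages * 4096 / 1073741824 }'
}
mib_to_gib() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib / 1024 }'
}

pages_to_gib 15466496   # ttm.pages_limit from our kernel command line
mib_to_gib 60416        # amdgpu.gttsize from our kernel command line
```

Both calls print 59.0, which agrees with the ~59.0 GiB GTT total reported by sysfs.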

2.5 KV cache quantisation (tested on Vulkan at 32K context)

Combo	pp t/s	tg t/s	KV mem	Reduction
f16/f16	264.4	24.53	8.19 GB	baseline
q8_0/q8_0	244.7	24.32	4.09 GB	−50 %
q4_0/q4_0	246.2	24.35	2.05 GB	−75 %

All asymmetric K/V combos timed out or degraded >3× at 32K (q8_0/f16, f16/q8_0, q4_0/f16, f16/q4_0, q4_0/q8_0, q8_0/q4_0). Our launch.sh already defaults to symmetric q4_0/q4_0 — aligned ✓.

2.6 Failure catalogue (what NOT to do)

Attempt	Failure	Takeaway
Q8_0 full 122B benchmarks	105 GB weights plus runtime buffers exceed the 120 GiB GTT → OOM kills	Stay ≤ Q6_K for 128 GB, ≤ Q4 for 64 GB
ROCm Q6_K without UMA env var	94 graph splits, >1 hr runs	`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` is mandatory
ROCm + lhl's FATTN patches	Won't apply to current master	Needs rebase per llama.cpp version
Stacked `-ffast-math` + hipBLASLt + #21344	354 → 230 t/s regression	Test optimisations in isolation
Q6_K Vulkan without kernel fix (128 GB sys)	>20 min hang to load	`amd_iommu=off` required (>64 GB only)

2.7 Upstream status of referenced PRs (verified 2026-04-17)

PR/Issue	Status	Relevance to us
#21344 MMQ VGPR gfx1151	Open — has regressions on dense models at large batch	Apply via custom build for MoE only (21B–44B cataloged)
#20075 Hybrid SSM/MoE spec decode	Open — plus 4 extra fixes needed for non-Metal	Required if we want draft-model spec decode on Qwen3.5 MoE
#21948 Vulkan MUL_MAT_ID	Closed, not planned	Use ROCm for prefill-heavy MoE workloads
#13388 Metal MoE optimisation	Merged (May 2025)	Metal-only — reference for future Vulkan port
#15524 Vulkan subgroup opt	Merged	Already in our upstream-tracking toolbox images
#14854 >64 GB load hang	Open, unconfirmed	Workaround (`amd_iommu=off`) irrelevant for 64 GB box

3. Delta analysis — current repo vs strix-benchmarks findings

Legend: ✅ already applied · 🟡 partial · ❌ missing · ⚪️ n/a for our hardware

Item	Status	Location / notes
Symmetric q4_0/q4_0 KV cache default	✅	`scripts/serve/launch.sh:111`
`-fa on`, `-ngl 99`, `--no-mmap`	✅	`scripts/serve/launch.sh:106–108`
`ROCBLAS_USE_HIPBLASLT=1` for ROCm	✅	`scripts/serve/launch.sh:101`
RADV `nogttspill`	✅	`scripts/optimize/power-profile.sh` (env.d)
GTT sized to total RAM – 4 GiB	✅	`lib/detect.sh:recommended_gttsize_mib`
`amd_iommu=off` for >64 GB models	⚪️	Not applicable (64 GB system, largest model 46 GB)
`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` when using ROCm	❌	Missing from `ENV_ARGS` in `launch.sh:99–102`
Draft-model speculative decoding (`-md` flag)	❌	Only n-gram simple is wired (`launch.sh:123–129`). Draft model `qwen3.5-0.8b-q8-draft` is in `models.conf:41` but unused
PR #21344 MMQ VGPR gfx1151 tuning	❌	Toolbox containers track llama.cpp master; PR still open. Needs custom build
`--reasoning-budget 0` wired for no-think	✅	`launch.sh:119`
Docs: pp/tg vs context length decision matrix	❌	No "when to pick Vulkan vs ROCm" guidance in `docs/`
Benchmark matrix covers `--ngram` / draft spec	🟡	`launch.sh` supports `--ngram`, but `scripts/benchmark/` doesn't include spec in sweeps
Mixed K/V cache warning	❌	Not documented in `docs/` — users could try asymmetric combos and hit the 3× regression
Batch stabilisation (`-b 256`) on small-batch crashes	❌	No fallback; community reports GPU hangs on small batches

Four real gaps worth closing, in priority order: (1) UMA env var for ROCm, (2) draft-model spec decode, (3) documentation of Vulkan-vs-ROCm decision matrix, (4) benchmark coverage of spec decode.

4. Hardware-adjusted expectations

The 0xSero numbers are from a 128 GB / 120 W Framework Desktop. Our HP ZBook is 64 GB / 70 W sustained. Two scalings matter:

Memory ceiling: REAP-20 Q6_K (76 GB) and Q8_0 stock (105 GB) are out of reach. Our realistic "max" is REAP-40 Q4_K_M at 46 GB (configs/models.conf:38) on a GTT that currently exposes ~59 GiB, which leaves ~13 GiB for KV cache and activations. Meaning: the +47 % "Full stack" chat result (28.3 s vs 41.5 s) is achievable qualitatively on our smaller models, but absolute t/s will differ, and we should not chase larger quants without shrinking context first.
Power ceiling: 70 W limits sustained t/s by ~25–35 % vs 120 W based on our own Phase 2 power-profile benchmarks (41 → 57 t/s with ryzenadj 70 W cap). We already push as hard as the HP firmware allows.

What this means for the plan: favour changes with high % wins (spec decode +30–40 %, UMA env var fixes pathological slowness) over absolute numbers that assume 120 W headroom.

5. Prioritised exploitation plan

P0 — Zero-risk one-liners

5.1 Enable ROCm unified memory — carefully

File: scripts/serve/launch.sh:99–102, and any ROCm path in scripts/benchmark/run-suite.sh.

The nuance (do not skip this): llama.cpp has two UMA mechanisms for HIP builds:

-DGGML_HIP_UMA=ON at build time (the canonical ROCm unified-memory flag; already baked into kyuz0's llama-rocm-* toolbox containers that this repo consumes per docs/references.md).
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 at runtime (historically a CUDA flag; works on HIP builds because llama.cpp's CUDA backend is re-used via HIP translation — this is what 0xSero prescribes).

Additionally, llama.cpp issue #18159 documents a UMA-detection bug on AMD APUs with large TTM allocations (our exact case): the detector reads /proc/meminfo when integrated=true and can under-report usable memory. Forcing GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can aggravate this on systems where it's already set via build flag.

Prescribed action (with test gates, not a blind flag flip):

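# ROCm backends always get hipBLASLt; the UMA env var stays opt-in.
if [[ "$BACKEND" == *rocm* ]]; then
    ENV_ARGS=(env ROCBLAS_USE_HIPBLASLT=1)
    # ROCM_UMA_ENV defaults to "auto": rely on the container's build-time UMA flag.
    if [[ "${ROCM_UMA_ENV:-auto}" == "1" ]]; then
        ENV_ARGS+=(GGML_CUDA_ENABLE_UNIFIED_MEMORY=1)
    fi
fi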

Add opt-in flag --rocm-uma-env that sets ROCM_UMA_ENV=1. Test plan:

Baseline: launch any ROCm backend with REAP-40 Q4_K_M (46 GB, largest model in catalog) and -ngl 99. Confirm there are no "graph split" warnings in server output and model loads in < 60 s. If yes → GGML_HIP_UMA=ON in the container is already doing the job, do not set the env var.
If graph splits appear or load is pathologically slow (minutes for 46 GB), flip --rocm-uma-env and re-test.
If still broken, inspect: podman inspect <toolbox> for the llama.cpp build flags, or build a custom container with -DGGML_HIP_UMA=ON explicitly.
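
The first gate in the test plan can be scripted by pulling the graph-split count out of the server log. This is a sketch only: it assumes the log contains a line of the form "graph splits = N", which should be confirmed against the actual output of the container in use, and the function name is hypothetical.

```shell
# graph_splits LOGFILE
# Prints the largest "graph splits = N" value found in a llama-server log,
# or 0 if no such line appears.
# Assumption: the log format is "... graph splits = N" (verify on a real log).
graph_splits() {
    sed -n 's/.*graph splits = \([0-9][0-9]*\).*/\1/p' "$1" |
        sort -n | tail -n 1 | grep . || echo 0
}
```

What counts as pathological is a judgment call; the broken case 0xSero describes reported 94 splits, so a single-digit value on a fully offloaded model is probably not the failure mode described here.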

Why this ordering: 0xSero's warning ("without UMA env var, ROCm does 94 graph splits, >1 hr load") is from a custom container they built without -DGGML_HIP_UMA, where the runtime env var was the only path. The kyuz0 toolboxes we already use ship with HIP UMA baked in, so the env var may be redundant or harmful. Test before shipping.

5.2 Document the Vulkan-vs-ROCm decision matrix

File: new section in docs/inference-optimization-landscape.md (existing doc, right home) or docs/benchmarking.md. Content: publish the context-scaling winners table from 0xSero (Appendix B below), annotated with "Vulkan < 2K ctx, ROCm ≥ 8K ctx for prefill" and "ROCm + spec decode = best tg in most buckets (Vk + spec still wins some, e.g. 512 and 16K)". Why: the repo currently has two ROCm toolboxes and two Vulkan ones but zero prose on when to pick which. Documenting this saves anyone (including future-you) from having to rediscover the same trade-off.

5.3 Add a warning against asymmetric K/V cache quantisation

File: scripts/serve/launch.sh header comment + docs/inference-optimization-landscape.md. Content: explicit "do not mix cache-type-k ≠ cache-type-v on Vulkan — 3× regression or timeout at 32K context." Keep the current symmetric q4_0 default.
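
Beyond a comment, the warning could also be enforced at launch time. A minimal sketch, assuming launch.sh has the backend name and both cache types available as plain strings (the function name and argument order are hypothetical):

```shell
# warn_if_asymmetric_kv BACKEND CACHE_TYPE_K CACHE_TYPE_V
# Returns 1 and prints a warning to stderr when K and V cache types differ
# on a Vulkan backend; returns 0 otherwise.
warn_if_asymmetric_kv() {
    case "$1" in
        *vulkan*) ;;        # the source measurements cover Vulkan only
        *) return 0 ;;
    esac
    if [ "$2" != "$3" ]; then
        echo "warning: cache-type-k=$2 != cache-type-v=$3 on $1;" \
             "expect a large slowdown or timeout at 32K context" >&2
        return 1
    fi
    return 0
}
```

Whether a mismatch should abort the launch or only warn is left open; returning a status lets the caller decide.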

P1 — Medium effort, high payoff

5.4 Wire draft-model speculative decoding in `launch.sh`

File: scripts/serve/launch.sh. Add a --draft flag (mapped to llama.cpp's -md / --model-draft) alongside the existing --ngram. Implementation sketch (aligned with current flag parsing style at lines 19–52):

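# In the flag-parsing case statement:
--draft)  DRAFT_MODEL="$2"; shift 2 ;;
# ...
# After parsing: resolve the draft file and append the spec-decode arguments.
if [[ -n "${DRAFT_MODEL:-}" ]]; then
    DRAFT_PATH="$(find -L "$MODEL_DIR" -type f -name "$DRAFT_MODEL" -print -quit)"
    if [[ -z "$DRAFT_PATH" ]]; then
        echo "error: draft model '$DRAFT_MODEL' not found under $MODEL_DIR" >&2
        exit 1
    fi
    SERVER_ARGS+=(
        -md "$(realpath_for_toolbox "$DRAFT_PATH")"
        -ngld 99                      # offload all draft-model layers to the GPU
        --draft-max 8 --draft-min 0
        --draft-p-min 0.5
    )
fi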

⚠️ Tokenizer compatibility: the make-or-break constraint for draft-model spec decode. llama.cpp requires the target and draft models to share the same tokenizer/vocabulary (otherwise token IDs are incoherent across the two models and it refuses to load, or produces garbage). The families in our current catalog split into several disjoint tokenizer buckets:

Target family	Tokenizer / vocab	Compatible draft from our catalog?
Qwen3.5 dense + Qwen3.5 MoE (incl. REAP-40)	Qwen3.5-tokenizer, vocab 248320	✅ `qwen3.5-0.8b-q8-draft` (confirmed by lmstudio-bug-tracker #1597 — same vocab size and token IDs across dense & MoE; llama.cpp itself does not block this, only LM Studio filters it out)
Qwen3.6 MoE (hybrid DeltaNet)	Qwen3.6 tokenizer, vocab likely ≠ 248320	❌ need a Qwen3.6 draft — not in catalog
Qwen3-Coder family (Qwen3-Coder-30B-A3B, Qwen3-Coder-Next)	Qwen2Tokenizer, vocab 151 936	❌ cannot use `qwen3.5-0.8b-q8-draft` — different tokenizer, different vocab size (Qwen3-Coder retains the older 151936 Qwen2 vocab; Qwen3.5 uses 248320)
GLM-4.7-Flash	ChatGLM tokenizer	❌ no draft in catalog
Gemma-4 family	Gemma tokenizer	❌ no draft in catalog
Nemotron-Cascade-2	Mixed SSM/attention; separate tokenizer	❌ no draft in catalog

What to actually enable first (corrected from v1 of this doc):

✅ --draft qwen3.5-0.8b-q8-draft.gguf with target qwen3.5-35b-a3b-* or qwen3.5-122b-a10b-reap40-q4 — this is the only safe pair in our catalog today.
❌ Do not pair qwen3.5-0.8b-q8-draft with Qwen3-Coder models; load will fail. (Options for Qwen3-Coder: either add a small Qwen3-Coder draft to the catalog if Unsloth publishes one, or use --ngram instead — repetitive coding outputs are the exact case n-gram excels at.)
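
The pairing rule above could be enforced before launch with a coarse filename check. This is only a sketch: the glob patterns and bucket labels are illustrative assumptions rather than the real catalog names, and the authoritative answer is the tokenizer metadata inside the GGUF files, which this heuristic does not read.

```shell
# tokenizer_bucket MODEL_FILENAME
# Maps a model file name to a tokenizer bucket label (filename heuristic only).
tokenizer_bucket() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        qwen3.5-*)     echo qwen3.5 ;;
        qwen3.6-*)     echo qwen3.6 ;;
        qwen3-coder-*) echo qwen2 ;;
        glm-4.7-*)     echo glm ;;
        gemma-4-*)     echo gemma ;;
        nemotron-*)    echo nemotron ;;
        *)             echo unknown ;;
    esac
}

# draft_compatible TARGET DRAFT
# Succeeds only when both files map to the same known bucket.
draft_compatible() {
    _target_bucket="$(tokenizer_bucket "$1")"
    _draft_bucket="$(tokenizer_bucket "$2")"
    [ "$_target_bucket" != unknown ] && [ "$_target_bucket" = "$_draft_bucket" ]
}
```

Unknown families are rejected on purpose, so a new model has to be added to the table before it can be paired with a draft.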

Additional caveat for SSM/MoE hybrids: Qwen3.5 MoE is a hybrid SSM/attention architecture, and 0xSero documented that stock llama.cpp master has four bugs that corrupt spec-decode state on these models (hybrid seq_rm attention cleanup, soft rollback position erase, compat check needs checkpoint, recurrent reserve undersize). Upstream PR #20075 addresses some; the four extra fixes are not yet upstream. Concrete implication: even with a matching tokenizer, expect wrong outputs on Qwen3.5-35B-A3B until those patches land (see P2 container build). Ideally we would validate on a pure-attention target first, but our catalog doesn't have one in the draft-eligible size range.

Net recommendation for tomorrow: wire the --draft flag in launch.sh, but only flip it on after the patched ROCm container (P2.6 + P2.7) ships, or use it on the Vulkan stock build with the expectation that hybrid rollback corruption may bite on long outputs. Smoke-test with a short chat workload first.

Expected impact (if target–draft pair is compatible AND patches are in place): +20–40 % real-workload tg on Qwen3.5 MoE per 0xSero (no_think mode hits 90 % acceptance, +40 %).

⚠️ llama-server spec-decode reuse bug (issue #19231): spec decoding was confirmed to work only on the first request to /v1/chat/completions; subsequent requests on the same slot lose all draft acceptance and run slower than no-spec (46.55 vs 88.90 t/s reported). Filed Jan 31 2026. Related fix PR #19261 exists but not confirmed merged/effective. Validation protocol before adding --draft to the user-facing launch script: run 10 consecutive /v1/chat/completions requests against the same slot and confirm acceptance rate remains stable in the server log. If it decays, hold the feature on /v1/completions only (raw completion endpoint) or pin a known-good llama.cpp commit. The agentic eval harness at scripts/agentic/run-eval.sh uses the chat endpoint, so this bug would actively harm eval runs if left unvalidated.
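
The validation protocol needs a pass/fail rule for "acceptance rate remains stable". One possible rule, sketched here, compares the last request's acceptance rate with the first. The 0.5 ratio is an arbitrary illustrative threshold, and the function name is hypothetical.

```shell
# acceptance_decayed
# Reads one draft acceptance rate per line on stdin (one line per consecutive
# request) and succeeds if the last rate fell below half of the first.
acceptance_decayed() {
    awk 'NR == 1 { first = $1 }
         { last = $1 }
         END { exit !(NR > 1 && last < first * 0.5) }'
}
```

If the server log reports a line such as "draft acceptance rate = 0.57" per request (an assumption to verify against a real log), the rates could be extracted with `sed -n 's/.*draft acceptance rate = \([0-9.]*\).*/\1/p' server.log | acceptance_decayed` after sending the 10 consecutive requests.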

Slot initialization quirk (issue #17989): --parallel 1 in current llama-server initializes 4 slots instead of 1 despite what the docs say. On a 64 GB system this silently eats KV-cache headroom. Mitigation: pass -np 1 explicitly (accepted as truthy) and inspect server startup log for n_slots = 1. If log shows 4, either update to a fixed llama.cpp commit or adjust --ctx expectations accordingly (KV for 4 slots at 131 K ctx with q4_0 ≈ 8 GB, not 2 GB).

5.5 Extend benchmark matrix to cover spec decode

File: scripts/benchmark/run-suite.sh — add --spec / --draft MODEL flag that, when set, runs a second pass with spec decode enabled and produces a delta row in summary.json. Why: the existing suite gives us no visibility into whether draft decode is actually helping on our hardware with our quants. Can't optimise what we don't measure.

P2 — Larger experiments (backlog, require custom build)

5.6 Build a custom ROCm container with PR #21344 applied

What: fork one of the llama-rocm-7.2.1 toolbox Containerfiles (kyuz0/amd-strix-halo-toolboxes is the current upstream the repo references in docs/references.md), apply PR #21344, and produce a sibling image llama-rocm-7.2.1-mmq or similar. Then expose it as a selectable backend in launch.sh:77–84.

Reframing (important context for the decision): PR #21344 is arguably regression recovery rather than pure optimisation. llama.cpp issue #17917 documents that commit 668ed76 (which enabled WMMA-MMQ INT kernels for RDNA 3) regressed pp2048 on gfx1151 from ~900 t/s to ~543 t/s (−40 %). The issue was closed as "not planned". PR #21344 tunes the exact same MMQ code path (tile sizes, warp counts, VGPR pressure). So when 0xSero reports +19–35 % prefill on a patched ROCm 7.2.1 container, part of that gain is likely recovering the 40 % that 668ed76 threw away. This is also why the stock kyuz0 llama-rocm-7.2.1 container (which tracks master, including 668ed76) looks slow on MoE prefill. Our repo's existing docs/inference-optimization-landscape.md:48–49 already flags #17917; PR #21344 is the candidate fix.

Why: PR #21344 is the only path to the +19–35 % ROCm prefill gain. Upstream is unlikely to merge soon (dense-model regressions unresolved). Our cataloged models in the target size range are all MoE (GLM-4.7-Flash, Qwen3.5-35B-A3B, Qwen3.6-35B-A3B, Qwen3-Coder-30B-A3B, REAP-40), exactly the class where PR #21344 helps. The dense-model regression is not our problem (we have no dense model >27 B in rotation for ROCm).

Cost: ~1–2 hours to set up the fork + CI to rebuild weekly.

Build flag nuance: the PR author's isolated-validation build uses -DGGML_HIP_ROCWMMA_FATTN=OFF to isolate MMQ gains; production builds on gfx1151 for long-context decode (≥ 16 K) benefit from -DGGML_HIP_ROCWMMA_FATTN=ON per llm-tracker's findings and per the existing inference-optimization-landscape.md:35–38 (which prescribes both UMA=ON and ROCWMMA_FATTN=ON). For our MoE-prefill-heavy case, prefer =ON (we hit 8 K–32 K contexts routinely via --ctx 131072), and confirm via A/B. Keep -DGGML_HIP_UMA=ON so the container doesn't depend on the runtime env var (see 5.1 nuance). Full build flag set consistent with the existing landscape doc: -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON.

Guardrail: keep stock llama-rocm-7.2.1 alongside so we can A/B.
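
To keep the ON/OFF A/B honest, the flag set named in this section can be generated from one place. The sketch below only prints a configure command (a dry run); the function name is hypothetical, and whether the llama.cpp revision being built still honours every flag listed here should be checked against its build documentation.

```shell
# rocm_cmake_flags [ON|OFF]
# Prints the CMake flags named in this section, one per line. The argument
# toggles rocWMMA flash attention so both variants can be built for A/B tests.
rocm_cmake_flags() {
    _fattn="${1:-ON}"
    printf '%s\n' \
        -DGGML_HIP=ON \
        -DAMDGPU_TARGETS=gfx1151 \
        "-DGGML_HIP_ROCWMMA_FATTN=$_fattn" \
        -DGGML_HIP_UMA=ON
}

# Dry run: show the configure command instead of executing it.
echo cmake -S . -B build $(rocm_cmake_flags ON)
```

The PR itself can be fetched from GitHub's pull refs (for example `git fetch origin pull/21344/head`), though it may need a rebase onto whatever commit the container tracks.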

5.7 Add PR #20075 + 4 spec-decode fixes to the patched ROCm container

What: same container as 5.6 picks up PR #20075 plus the four additional fixes documented by 0xSero (hybrid seq_rm attention cleanup, soft rollback position erase, compat check checkpoint creation, recurrent reserve = max(16, n_seq_max * 16)). Why: unlocks draft spec decode for Qwen3.5 MoE hybrids — the REAP-40 and Qwen3.6-35B-A3B models we actually run. Without this, draft decode on these models produces stale-state corruption. Risk: patches are not upstream; if/when they merge, the fork becomes redundant. Track the PR status.

P3 — Nice-to-have, low urgency

Add a --batch-size 256 fallback in launch.sh (-b 256) documented as "enable only if you hit GPU hangs on small-batch MoE." Not default — the stabilisation comes with throughput cost. (llm-tracker documents small-batch GPU hangs on HIP and Vulkan; not currently observed on our system but cheap insurance.)
Track PR #13388's porting to Vulkan (unlikely soon but worth watching; would fix MUL_MAT_ID bottleneck and flip Vulkan prefill competitiveness).
Monitor issue #14854 resolution for the general-purpose fix to >64 GB Vulkan loads (irrelevant to us at 64 GB today but becomes relevant if we ever move to 128 GB hardware).
Track TurboQuant KV cache upstream integration — TQ3 (3-bit) / TQ4 (4-bit) KV cache types claim 4.6–4.9× compression vs f16 with near-zero quality loss, and a HIP/ROCm port for gfx1100 already exists in discussion #21526. Direct relevance on 64 GB: would push the practical context ceiling from ~131 K to ~256 K on MoE without hitting GTT pressure. Already watchlisted in inference-optimization-landscape.md:246–254.

9. Enhancement opportunities (surfaced during review, not in the source site)

These are genuinely net-new on top of the strix-benchmarks findings — they are not documented at 0xSero's site but are high-ROI additions for our specific use cases (agentic evals, mixed-model catalog).

9.1 Host-memory prompt cache (`--cache-ram`) for agentic eval loops

What: llama-server gained a host-memory prompt cache in 2026 (llama.cpp discussion #20574). Flag: --cache-ram N (short form -cram; N = MiB, typical 256–1024). Stores pre-computed prompt representations in system RAM; on subsequent requests with a shared prefix, the server skips prefill entirely for the matched portion. Why it matters for us: scripts/agentic/run-eval.sh runs hundreds of requests against the same system prompt + tool schema. Reported TTFT reduction: 93 % on 8 K system prompts (4.3 s → 0.3 s). On a 70 W power-capped machine, eliminating redundant prefill is pure profit. Action: add --cache-ram 512 to launch.sh when --parallel > 1 OR when a new --agentic convenience flag is set. Cost: system RAM up to the configured cap (512 MiB at this setting). Risk: low (fails open if cache misses). Caveat: interaction with spec decode + quantised KV cache is not documented upstream. Validate that cache hits don't corrupt draft acceptance stats before combining with P1.4.

9.2 GLM-4.7-Flash — do not run on Vulkan (output-corruption regression)

What: llama.cpp issue #18835 documents a Vulkan regression affecting all GLM-4.7 releases after b7667: longer prompts produce repeating Chinese characters / symbols. CUDA/ROCm work correctly. No upstream fix has landed. Direct impact on us: configs/models.conf:15 lists glm-4.7-flash-q6 as an MoE daily-driver-class model in the catalog. Our default backend (launch.sh:11) is llama-vulkan-radv. A user launching bin/serve -m GLM-4.7-Flash-UD-Q6_K_XL.gguf today on defaults gets corrupted long-prompt output. Action: either (a) add a model → backend preference table in launch.sh that routes GLM-4.7-* to a ROCm backend automatically, or (b) hard-warn at startup when launching GLM-4.7-* on a Vulkan backend. Option (a) is preferred because it's a silent correctness fix. Cost: ~10 lines of bash in launch.sh.

9.3 `HSA_OVERRIDE_GFX_VERSION=11.0.0` — gfx1100 kernel fallback for rocBLAS paths

What: llm-tracker documents that hipBLASLt lacks many optimized kernels for gfx1151 and that HSA_OVERRIDE_GFX_VERSION=11.0.0 forces rocBLAS to use gfx1100 kernels, which can be 2–6× faster on small matmul paths for certain operations (1024×1024 at ~6 TFLOPS → ~20 TFLOPS). This does NOT override llama.cpp's AMDGPU target — the kernels are still compiled for gfx1151 — but the rocBLAS runtime library picks gfx1100 paths when available. Why it matters for us: complementary to ROCBLAS_USE_HIPBLASLT=1 (already set). Zero build cost, just an env var, reversible. Could help ROCm backends on prefill paths where MMQ isn't the bottleneck. Action: treat the same way as 5.1 — opt-in env var (ROCM_GFX_OVERRIDE=1), validate on our benchmark suite before making default. Risk: some operations may produce incorrect results if gfx1100 kernels have instruction-set expectations that don't hold on gfx1151; the claim of "correctness" comes from community reports, not AMD validation.

9.4 Wire spec-decode (`--draft`, `--ngram`) through `bin/agentic`

What: scripts/agentic/run-eval.sh currently hits whatever llama-server is running without controlling its config. If P1.4 ships a --draft flag on launch.sh, the agentic eval binary should be able to pass it through so eval runs can A/B spec decode on/off. Why it matters: our agentic eval is the most repetitive-output workload we run (structured tool calls, JSON outputs), which is exactly the case where draft acceptance is highest. Could halve eval wall-clock time once P1.4 is solid. Action: add matching --draft, --ngram, --cram pass-through flags to scripts/agentic/run-eval.sh; document in docs/agentic-benchmarks.md. This is a small change, but wait until P1.4 has been validated against bug #19231.

9.5 Model → backend routing table in `launch.sh`

What: summarise the decision matrix (§5.2 + §9.2) into an executable default. Route each known-problematic target to a known-good backend automatically; let the user override with --backend as today. Why it matters: right now users must read docs to know that GLM-4.7 wants ROCm, that prefill-heavy 8 K–16 K contexts prefer ROCm+MMQ, that Qwen3.5 MoE + spec decode needs the patched container. A 20-line table in launch.sh turns that knowledge into correct-by-default behaviour. Action: define as declare -A DEFAULT_BACKEND mapping glob patterns to backends; resolve after model detection at launch.sh:64–68; print the routing decision in the startup log so it's visible and overridable.
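
A minimal sketch of the routing idea, using a `case` statement rather than an associative array. The patterns are illustrative, only the GLM-4.7 rule from §9.2 is filled in, and the function name is hypothetical; the backend names are the ones already used in this document.

```shell
# default_backend MODEL_FILENAME
# Prints the backend to use when the user did not pass --backend.
# First matching pattern wins, so more specific patterns must come first.
default_backend() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        glm-4.7-*) echo llama-rocm-7.2.1 ;;    # Vulkan output corruption, see 9.2
        *)         echo llama-vulkan-radv ;;   # current default backend
    esac
}

echo "routing GLM-4.7-Flash-UD-Q6_K_XL.gguf -> $(default_backend GLM-4.7-Flash-UD-Q6_K_XL.gguf)"
```

The `case` form is a deliberate swap: bash does not guarantee the iteration order of associative-array keys, so overlapping glob patterns in a `declare -A` table could resolve differently between runs, whereas `case` always takes the first match.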

6. Critical files (for implementation when the plan is executed)

scripts/serve/launch.sh — env vars, flag wiring for P0.1 and P1.4
scripts/benchmark/run-suite.sh — spec-decode matrix for P1.5
configs/models.conf — draft-target pairing hint (line 41 already exists)
docs/inference-optimization-landscape.md — decision matrix, K/V warning (P0.2, P0.3)
docs/references.md — add strix-benchmarks.vercel.app and framework-max repo
Future: new containers/llama-rocm-7.2.1-mmq/ tree for P2.6

7. Verification (after each change is applied)

make audit — ensure no regression in system status.
make verify — 9-point optimisation checklist still passes.
For P0.1 (UMA env var): launch a ROCm backend with a >35 GB MoE model and confirm no graph-split warnings in server output, load completes in seconds (not hours).
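For P1.4 (draft decode): run bin/serve with a qwen3.5-35b-a3b-* target and --draft qwen3.5-0.8b-q8-draft.gguf (the only tokenizer-compatible pair in the catalog; Qwen3-Coder targets will fail to load with this draft), then measure tg/s against the no-draft baseline on an agentic coding prompt.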
For P1.5 (bench matrix): compare summary.json delta between --spec/no-spec runs; expect +15–40 % tg depending on workload repetitiveness.
Do not stack PR #21344 + -ffast-math + hipBLASLt simultaneously (strix-benchmarks confirmed 354 → 230 t/s regression). Test each patch in isolation first.

8. References

Source site: https://strix-benchmarks.vercel.app/
Framework-max repo (404 at time of check): https://github.com/0xSero/framework-max
HF models: https://huggingface.co/0xSero (REAP-20, REAP-40 variants)
Upstream PRs: llama.cpp #21344, #20075, #13388, #15524
Upstream issues: llama.cpp #21948 (closed not-planned), #14854 (unconfirmed)
Chipsandcheese RDNA3 Infinity Cache: https://chipsandcheese.com/p/rdna-3s-infinity-cache-friend-or
Community cross-refs: kyuz0/amd-strix-halo-toolboxes (toolbox images we already consume), llm-tracker.info Strix-Halo page (hipBLASLt recipe)

Appendix A — Full upstream-contribution log (for credit/attribution)

Date	Type	Link	Impact
Apr 15	Issue	#21948	Documented MUL_MAT_ID profiling bottleneck
Apr 15	PR	#21344	Validated MMQ VGPR tuning: +19–35 % prefill on gfx1151 MoE
Apr 15	PR	#20075	Four hybrid SSM/MoE spec decode fixes
Apr 14	Kernel workaround	N/A	`amd_iommu=off` resolution (for 128 GB systems)

Appendix B — Full context-scaling table (0xSero, Vulkan vs ROCm+MMQ)

Ctx	Vk pp	Vk tg	Vk+spec tg	ROCm pp	ROCm tg	ROCm+spec tg	Best decode
64	98	24.3	26.6	130	17.8	28.5	ROCm+spec
512	303	24.4	28.1	376	17.7	23.3	Vk+spec
2K	353	24.2	25.2	423	17.7	26.7	ROCm+spec
4K	362	24.2	25.5	413	17.7	25.3	Vk+spec
8K	371	23.9	25.1	407	17.4	27.4	ROCm+spec
16K	353	23.4	25.0	371	17.0	23.6	Vk+spec
32K	314	22.6	18.6	315	16.2	22.9	ROCm+spec

Rule-of-thumb: ROCm's MMQ prefill advantage fades after 16K; beyond that flash attention dominates and Vulkan catches up.
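
The fade is easier to see as a ratio of ROCm prefill to Vulkan prefill per context. A small sketch over the pp columns of the table above (the function name is illustrative):

```shell
# pp_ratios
# Reads "context vulkan_pp rocm_pp" lines on stdin and prints the
# ROCm / Vulkan prefill ratio for each context.
pp_ratios() {
    awk '{ printf "%s %.2f\n", $1, $3 / $2 }'
}

printf '%s\n' \
    '64 98 130' '512 303 376' '2K 353 423' '4K 362 413' \
    '8K 371 407' '16K 353 371' '32K 314 315' | pp_ratios
```

The ratio falls steadily from 1.33 at 64 tokens to 1.05 at 16K and 1.00 at 32K, which is the rule-of-thumb above in numeric form.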

Appendix C — MUL_MAT_ID profiling (why MoE prefill is slow on Vulkan)

Context	MUL_MAT_ID %	Total prefill ms
512	66.2 %	1700
8K	57.6 %	1839
32K	41.9 %	2556
128K	19.7 %	5351

Constant ~1050 ms overhead regardless of context → compute dispatch inefficiency in expert routing, not bandwidth. Closed as not-planned upstream. Workaround: use ROCm for MoE prefill-heavy workloads (8K–16K ctx sweet spot).
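
The "constant overhead" reading follows from multiplying each row's MUL_MAT_ID share by its total prefill time. A short sketch of that arithmetic over the table above (the function name is illustrative):

```shell
# mul_mat_id_ms
# Reads "context percent total_ms" lines on stdin and prints the absolute
# time attributed to MUL_MAT_ID, rounded to whole milliseconds.
mul_mat_id_ms() {
    awk '{ printf "%s %.0f\n", $1, $2 / 100 * $3 }'
}

printf '%s\n' '512 66.2 1700' '8K 57.6 1839' '32K 41.9 2556' '128K 19.7 5351' | mul_mat_id_ms
```

This yields 1125, 1059, 1071 and 1054 ms for the four contexts: roughly flat while the total prefill time more than triples.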

Revision notes (2026-04-17 review pass)

A self-review sweep after v1 caught and fixed the following issues. Recording them here so future-you can see what was wrong and why, in case residual errors surface.

Wrong draft-model pairing in P1.4 (critical). v1 recommended enabling --draft qwen3.5-0.8b-q8-draft.gguf first with qwen3-coder-30b-a3b. Wrong: Qwen3-Coder uses the Qwen2 tokenizer (vocab 151 936) while Qwen3.5 uses vocab 248 320 (lmstudio-bug-tracker #1597 confirms the Qwen3.5 dense–MoE pair is vocab-identical; Qwen3-Coder is not). llama.cpp will refuse that pair. Corrected: the only tokenizer-compatible draft pair in the current catalog is qwen3.5-0.8b-q8-draft with qwen3.5-35b-a3b-* or qwen3.5-122b-a10b-reap40-q4. Qwen3-Coder should use --ngram until a Qwen3-Coder-specific draft ships.
UMA env var recommendation was too blunt. v1 prescribed always setting GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 for ROCm backends. llama.cpp issue #18159 documents that on AMD APUs with large TTM allocations (exactly our case) the UMA detector can mis-read /proc/meminfo and forcing the env var can aggravate it. The kyuz0 containers we already consume are likely built with -DGGML_HIP_UMA=ON, making the runtime env var redundant or counterproductive. Corrected: opt-in flag with test gates, not unconditional.
REAP-40 size off by 2 GB. v1 said 44 GB; configs/models.conf:38 lists 46 GB. Corrected throughout.
GTT actual ≠ requested. v1 said "60 GB GTT"; actual sysfs reports ~59.0 GiB (63 350 767 616 bytes). Leaves ~13 GiB headroom above the 46 GB REAP-40 — enough for KV cache at q4_0 but tight. Corrected in §4.
PR #21344 build flag nuance. v1 quoted -DGGML_HIP_ROCWMMA_FATTN=OFF from the PR without context. That's the author's isolation-test build; production gfx1151 users (llm-tracker) prefer ON for long-context decode. Flagged in §5.6.
Hardware baseline was from memory, not sysfs. v1 recited 64 GB / gfx1151 / Fedora 43 from user memory. Re-verified against /proc/meminfo (65 074 108 kB), /proc/cmdline (kernel 6.19.8, iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496), and sysfs (GTT total 63.35 GB, VRAM 512 MiB). Updated preamble.

Still-unverified claims in this doc (worth validating empirically before committing effort):

"+20–40 % spec decode gain transfers to our 35B MoE on our 70 W cap" — the 0xSero numbers are on a 122B MoE on 120 W. Scaling is not guaranteed. Measure once 5.4 + 5.5 ship.
"PR #21344 helps all MoE in our catalog" — PR validated on Qwen3.5-122B and Mistral-Small/Nemotron-120B; extrapolation to Qwen3.6-35B-A3B / Qwen3-Coder-30B-A3B / Gemma-4-26B-A4B is an assumption, not a measurement.
"kyuz0 containers ship -DGGML_HIP_UMA=ON" — Containerfile fetch returned 404; could not directly inspect. Assumption based on DeepWiki description and the existing inference-optimization-landscape.md:35–37 which prescribes that flag as canonical. Verify by cloning the repo locally and reading Containerfile.rocm-* before building our fork.
"llama.cpp issue #19231 is still live on master" — confirmed as of Jan 31 2026 with related PR #19261; did not confirm whether #19261 has merged. Validate with a live test against our current kyuz0 container before trusting spec decode in production.

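The Containerfile check suggested above could be scripted roughly as follows once the kyuz0 repo is cloned locally. This is a sketch only: `has_hip_uma_flag` is a hypothetical helper, and the `Containerfile.rocm-*` glob is taken from the note above without having been confirmed against the actual repo layout.

```shell
# Hypothetical helper: succeed if the given file passes -DGGML_HIP_UMA=ON.
has_hip_uma_flag() {
  # -F: fixed-string match; "--" stops grep treating the pattern as an option.
  grep -qF -- '-DGGML_HIP_UMA=ON' "$1"
}

# Run from the root of the cloned repo; the glob is an assumption.
for cf in Containerfile.rocm-*; do
  [ -f "$cf" ] || continue
  if has_hip_uma_flag "$cf"; then
    echo "$cf: UMA build flag present"
  else
    echo "$cf: UMA build flag absent"
  fi
done
```

If the flag turns out to be absent, the opt-in `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` gate becomes the only UMA path and deserves its own measurement.
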
Revision notes — second sweep (2026-04-17)

A deeper review after v2 surfaced the following issues and enhancement opportunities. Recording here so the history is auditable.

Additional corrections

PR #21344 reframed as regression recovery. First sweep characterised it as a pure +19–35 % optimisation. Second sweep found issue #17917 documenting that commit 668ed76 (WMMA-MMQ INT kernels for RDNA 3) regressed gfx1151 prefill by −40 % (pp2048 900 → 543 t/s). PR #21344 tunes the same code path. Much of 0xSero's reported "gain" is likely recovering what 668ed76 threw away. §5.6 now says so explicitly. This changes how you pitch the work: "we're catching up to our own April 2025 performance" rather than "we're unlocking something new".
llama-server spec decode /v1/chat/completions reuse bug added to §5.4. llama.cpp issue #19231 shows spec decode only works on the first chat-completion request; subsequent slots drop draft acceptance entirely and run slower than no-spec. This is live as of Jan 31 2026 and has an unmerged fix PR #19261. Validate before trusting spec decode in agentic eval loops.
GLM-4.7-Flash Vulkan output corruption (§9.2). llama.cpp issue #18835 — Vulkan produces repeating Chinese characters on long prompts post-b7667. GLM-4.7-Flash is in our catalog (configs/models.conf:15) and defaults to Vulkan. A user running bin/serve -m GLM-4.7-Flash-UD-Q6_K_XL.gguf today gets corrupted output. Needs either automatic backend routing or a startup warning.
--parallel 1 slot bug (§5.4). Issue #17989 — passing --parallel 1 initialises 4 slots on current master. Silent 4× KV cache overhead on a constrained 64 GB system. Mitigation: pass -np 1 and verify n_slots=1 in startup log.
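
The slot-count check from the mitigation above can be automated against a captured startup log. A minimal sketch: the function names are hypothetical, and the exact wording of the `n_slots` log line is an assumption, so the pattern tolerates optional spaces around `=`.

```shell
# Extract the first n_slots value from a llama-server startup log on stdin.
# Assumption: the log contains "n_slots=<N>" or "n_slots = <N>".
log_n_slots() {
  grep -oE 'n_slots *= *[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Succeed only if the log file reports exactly one slot.
assert_single_slot() {
  [ "$(log_n_slots < "$1")" = "1" ]
}

# Usage (not executed here):
#   llama-server ... -np 1 2>&1 | tee server.log
#   assert_single_slot server.log || echo "WARNING: more than one slot allocated"
```

Failing loudly at startup is cheaper than discovering the 4× KV cache overhead through an out-of-memory error mid-run.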

New enhancement opportunities (§9)

Host-memory prompt cache --cache-ram (short form -cram) (§9.1): 93 % TTFT reduction for agentic eval loops per llama.cpp discussion #20574. Not covered by 0xSero.
Model-to-backend routing table (§9.5) — turns the decision matrix into a correct-by-default launch.sh. Silences the GLM-4.7 corruption bug without user intervention.
HSA_OVERRIDE_GFX_VERSION=11.0.0 (§9.3) — gfx1100 rocBLAS kernel fallback, 2–6× on small-matmul paths per llm-tracker. Complementary to the existing ROCBLAS_USE_HIPBLASLT=1.
TurboQuant KV cache watch-item in P3 — 4.6–4.9× KV compression vs f16; HIP port for gfx1100 exists per discussion #21526; would push our context ceiling from 131 K to ~256 K without GTT pressure.
Cross-reference to inference-optimization-landscape.md added at the top of §1 to prevent duplication. That existing doc already covers the canonical Strix Halo build flags, TurboQuant, Unsloth Dynamic 2.0, and issue #17917 — this overlay doc shouldn't re-litigate those.
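
The model-to-backend routing idea can be sketched as a small lookup keyed on the model filename. Everything here is illustrative: `pick_backend` is a hypothetical name, the Vulkan default is assumed from the catalog behaviour described above, and the real table in §9.5 may carry more entries.

```shell
# Hypothetical routing helper: map a GGUF path to a backend name.
pick_backend() {
  case "$(basename "$1")" in
    GLM-4.7-Flash*) echo rocm ;;    # avoids Vulkan output corruption (issue #18835)
    *)              echo vulkan ;;  # assumed default backend
  esac
}

pick_backend /models/GLM-4.7-Flash-UD-Q6_K_XL.gguf   # rocm
pick_backend /models/some-other-model.gguf           # vulkan
```

Keeping the table in one function means a fixed upstream bug is retired by deleting a single `case` arm.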

Confirmed OK (no change needed)

--no-mmap flag — matches docs/inference-optimization-landscape.md:43–45 ("critical for ROCm on Strix Halo to avoid catastrophically slow loading"), already in launch.sh:107.
iommu=pt choice for 64 GB system (not amd_iommu=off) — aligns with kyuz0 toolboxes' production config and Gygeek/Framework-strix-halo-llm-setup; the strix-benchmarks amd_iommu=off threshold is >64 GB model loads, never hit on a 64 GB box.
Firmware version linux-firmware-20260309-1.fc43.noarch is not the known-bad 20251125 flagged in lib/detect.sh:detect_firmware_bad.
Q4_0/Q4_0 symmetric KV cache default — matches both 0xSero's Vulkan sweep and our existing launch.sh:111–112.
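
The firmware check above reduces to comparing the date stamp embedded in the package name against the known-bad build. A sketch under stated assumptions: these helpers are hypothetical stand-ins, not the actual `detect_firmware_bad` in lib/detect.sh, and they assume the package string follows the `linux-firmware-YYYYMMDD-...` shape shown above.

```shell
# Pull the 8-digit date stamp out of a linux-firmware package string.
firmware_date() {
  echo "$1" | sed -nE 's/^linux-firmware-([0-9]{8}).*/\1/p'
}

# Succeed if the package matches the known-bad 20251125 build.
firmware_is_known_bad() {
  [ "$(firmware_date "$1")" = "20251125" ]
}

# Usage on Fedora (not executed here):
#   firmware_is_known_bad "$(rpm -q linux-firmware)" && echo "known-bad firmware installed"
```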

Things worth considering but deliberately out of scope here

Evaluating KTransformers / vLLM for agentic serving — covered in inference-optimization-landscape.md:§1.2, §1.7; requires a separate decision point about when local llama.cpp stops being sufficient.
Fine-tuning flows (PyTorch + ROCm + aotriton) — covered in llm-tracker; not in this repo's remit.
Dense Qwen3.5-27B via ROCm with PR #21344 — PR has documented regressions on 8B dense at batch ≥ 128. Not tested for 27B; assume risk and prefer Vulkan for dense models until measured.

Appendix D — Exact reproduce commands (from 0xSero, for reference)

Hardware/OS baseline

AMD Ryzen AI MAX+ 395 (16C/32T Zen 5), Radeon 8060S (gfx1151, RDNA 3.5, 40 CU), 128 GB LPDDR5X-8000, Fedora 43, kernel 6.17.1

Kernel params (128 GB system)

sudo grubby --update-kernel=ALL --args='amd_iommu=off ttm.pages_limit=335544321 amdgpu.gttsize=122880'
# verify:
cat /sys/class/drm/card*/device/mem_info_gtt_total   # ~128849018880 (120 GiB)

Vulkan stock baseline bench

export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH
$HOME/.local/opt/llama.cpp/current/llama-bench \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 -n 128

ROCm with PR #21344 (inside custom podman container)

export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 -n 128
# Build flags used: -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DGGML_HIP_ROCWMMA_FATTN=OFF

Speculative decoding (server mode)

$HOME/.local/opt/llama.cpp/current/llama-server \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -md $HOME/.local/share/models/gguf/Qwen3.5-0.8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -ngld 99 -fa 1 -c 4096 \
  --draft-max 8 --parallel 1
