docs: update optimization guide with measured hardware data

Replace estimated values with clpeak measurements: DRAM 216-233 GB/s,
GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting
only). Correct backend recommendation to Vulkan RADV (2.7x faster tg
than ROCm at 131K). Update KV cache recommendation to q4_0. Add
Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano
from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
Felipe Cardoso
2026-03-30 19:56:18 +02:00
parent 1549bc27c0
commit ea70687cd2
3 changed files with 157 additions and 208 deletions


@@ -18,10 +18,9 @@ glm-4.7-flash-q6|lmstudio-community/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-Q6_K.gguf|2
qwen3.5-27b-q4|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q4_K_M.gguf|17|dense|Dense 27B, quality-first
# ── MoE models (fast generation, best for 64GB) ─────────
qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf|21|moe|MoE 35B, 3B active, Unsloth dynamic XL
qwen3.5-35b-a3b-q8|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-Q8_0.gguf|37|moe|MoE 35B Q8, near-full precision
nemotron-cascade2-q8|bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF|nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf|31|moe|Nemotron Cascade 2, Mamba-2 hybrid (replaces Nano)
# ── Coding models ─────────────────────────────────────────
qwen3-coder-30b-a3b-q6|unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf|26|moe|Agentic coding MoE, pure Transformer


@@ -143,11 +143,12 @@ Living document tracking what was applied, tested, and the actual results. Each
- **Status**: BLOCKED — llama.cpp PR #20700
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
### 5.3 GPU Clock Reporting
- **Status**: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- **Verdict**: CLOSED — no performance impact
---
@@ -187,7 +188,31 @@ Living document tracking what was applied, tested, and the actual results. Each
| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver** — best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted — Q4_K_XL is strictly better at only +2 GB.
### Coder Model Shootout (2026-03-30)
- **Benchmark**: `data/benchmarks/coder-shootout-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|-------|-------------|-----------|-------------|-------------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
**Analysis**:
- tg speed scales inversely with model size (bandwidth-bound at ~215 GB/s)
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is 25% slower tg but has >70% SWE-bench vs ~50% for the 30B
**Recommended roles**:
- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed
---


@@ -18,13 +18,9 @@ These are the foundational optimizations handled by this repo's scripts. Apply i
sudo make optimize-tuned
```
Switches to `accelerator-performance`: disables higher-latency CPU STOP states, sets CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
**Measured**: +5-8% pp, +2-3% tg.
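The switch can be spot-checked by hand. A minimal sketch, assuming the `tuned` daemon is installed (which `make optimize-tuned` requires anyway):

```shell
# Verify the profile that `make optimize-tuned` applies
tuned-adm active    # should report: Current active profile: accelerator-performance
# The profile forces the performance governor; confirm on one core:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # expect: performance
```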
### 1.2 Kernel Boot Parameters (reboot required)
@@ -32,15 +28,13 @@ Takes effect immediately. Previous profile is saved for rollback.
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See [architecture.md](architecture.md) for the math.
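The 64 GB values in the table can be reproduced with shell arithmetic. This is a sketch of the math only; the ~5 GiB OS reserve is inferred from the published numbers, not necessarily the script's exact formula:

```shell
# Reproduce the 64 GB values (assumed: GTT = total RAM minus ~5 GiB OS
# reserve; ttm.pages_limit = GTT expressed in 4 KiB pages).
total_mib=65536                     # 64 GB system
gtt_mib=$(( total_mib - 5120 ))     # 60416 MiB (~59 GiB)
pages_limit=$(( gtt_mib * 256 ))    # 256 pages of 4 KiB per MiB -> 15466496
echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages_limit}"
```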
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
@@ -48,11 +42,9 @@ Values are computed dynamically based on your system's total physical RAM. The s
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See [bios-vram-guide.md](bios-vram-guide.md) (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot.**
### 1.4 Verify
@@ -61,264 +53,201 @@ make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Measured Impact
| Optimization | pp | tg |
|-------------|----|----|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
---
## Phase 2: System Tuning
These require root but are safe to apply and revert.
All Phase 2 optimizations are applied with a single command:
```bash
sudo make optimize-power
```
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
```bash
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
```
### 2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
**Measured on HP ZBook**: STAPM raised to 81W, but **HP firmware hard-caps PPT SLOW at 70W**. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
**Measured impact**: Qwen3.5-35B-A3B tg1024 went from **~39 t/s → 57 t/s (+46%)**. This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
### 2.2 VM / Sysctl Tuning
Persisted to `/etc/sysctl.d/99-llm-inference.conf`:

```
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```
| Parameter | Default | Set To | Why |
|-----------|---------|--------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weight eviction |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.dirty_background_ratio` | 10 | **10** | Keep background writeback at default |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
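A quick read-only check that the table's values are live (a sketch, not a repo script; needs no root):

```shell
# Compare each expected key=value pair against the running kernel
for kv in vm.swappiness=1 vm.dirty_ratio=40 vm.dirty_background_ratio=10 \
          vm.max_map_count=500000 vm.zone_reclaim_mode=0; do
  key=${kv%%=*}; want=${kv#*=}
  got=$(sysctl -n "$key")
  [ "$got" = "$want" ] && echo "OK   $key=$got" || echo "WARN $key=$got (want $want)"
done
```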
### 2.3 Transparent Huge Pages
Set to `always`. Reduces TLB misses for mmap'd model files. Volatile — add `transparent_hugepage=always` to kernel cmdline for persistence.
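To confirm THP is active and actually backing a loaded model, two read-only checks suffice:

```shell
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: [always] madvise never
grep AnonHugePages /proc/meminfo                  # grows to tens of GB with a model loaded
```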
### 2.4 RADV_PERFTEST=nogttspill
Persisted to `/etc/environment.d/radv-llm.conf`. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
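For reference, the equivalent manual steps (`make optimize-power` already does this; the file path matches the one above):

```shell
# Per-session only (child processes inherit it):
export RADV_PERFTEST=nogttspill
# Persist system-wide, as the repo does:
echo 'RADV_PERFTEST=nogttspill' | sudo tee /etc/environment.d/radv-llm.conf
```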
---
## Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
### 3.1 `-mmp 0` (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty. Always disable.
### 3.2 KV Cache Quantization — use Q4_0
**Measured** (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---------|--------|--------|---------------|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| **q4_0** | **460** | **41.1** | **75% (fastest on Vulkan)** |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
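A serving invocation reflecting this recommendation might look like the following sketch (the model path is a placeholder; the other flags are the ones used throughout this guide):

```shell
llama-server -m model.gguf -ngl 99 -fa 1 --no-mmap \
  --cache-type-k q4_0 --cache-type-v q4_0
```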
### 3.3 Flash Attention (`-fa 1`) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
### 3.4 `ROCBLAS_USE_HIPBLASLT=1` (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
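Taken together, 3.1-3.4 give the baseline ROCm benchmark invocation, roughly (a sketch; model path is a placeholder):

```shell
ROCBLAS_USE_HIPBLASLT=1 llama-bench -m model.gguf -ngl 99 -fa 1 -mmp 0
```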
### 3.5 Backend Selection
**Measured** (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|----------|------------|----------|--------|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | **Vulkan (1.9x)** |
| pp8192 @ 131K | 130 | 84 | **Vulkan (1.5x)** |
| tg32 @ 131K | 22.0 | 8.1 | **Vulkan (2.7x)** |
**Vulkan RADV dominates across all workloads on this hardware.** ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
### 3.6 MoE Batch Size `-b 256` — NOT YET TESTED
Community reports up to +70% pp improvement for MoE models. Default batch (2048) may be too large. Needs benchmarking.
---
## Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that **Vulkan RADV already outperforms ROCm significantly**, the ROI of these ROCm-specific optimizations is unclear.
### 4.1 ROCm Build Flags
```bash
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_UMA=ON \
-DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
### 4.2 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows `matrix cores: KHR_coopmat` — this is likely already active.
---
## Phase 5: Future / Currently Blocked
These optimizations are not available today but are worth tracking.
### 5.1 Speculative Decoding (blocked: llama.cpp PR #20075)
Expected 1.8-2.5x tg speedup for coding tasks. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`, 812 MB) already downloaded. Blocked because Qwen3.5 MoE uses hybrid GatedDeltaNet architecture that breaks llama.cpp's speculative rollback mechanism.
**Track**: [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — fix for hybrid SSM/MoE speculative decoding.
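Once supported, the invocation would presumably pair the main and draft models along the lines of llama-server's existing speculative-decoding flags (a sketch only, untested on this architecture; paths are placeholders):

```shell
llama-server -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -md Qwen3.5-0.8B-Q8_0.gguf --draft-max 16 --draft-min 4 \
  -ngl 99 -fa 1 --no-mmap
```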
### 5.2 Native Multi-Token Prediction (blocked: llama.cpp PR #20700)
Qwen3.5 was trained with built-in MTP heads. No separate draft model needed. Works in vLLM/SGLang today but not llama.cpp.
**Track**: [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — MTP for Qwen3.5 with FastMTP vocabulary trimming.
### 5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, **clpeak confirms clocks reach 2900 MHz under compute load** (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
**Tracking**: ROCm issue #5750 (sysfs reporting only, not a real blocker).
### 5.4 Other Future Items
- **SageAttention**: 2-5x over FlashAttention. No AMD port yet.
- **XDNA NPU** (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- **TurboQuant 3-bit KV** (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- **LLMLingua-2**: 20x prompt compression for agentic/RAG workloads.
---
## Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|----------|-------|-------|
| DRAM bandwidth | **216-233 GB/s** | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | **32 MB, ~1 TB/s** | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | **2900 MHz confirmed** | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | **21.9 TFLOPS** | 20 CUs x 2900 MHz |
| FP32 compute | **12.3 TFLOPS** | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | **70W** (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
**Note on MoE bandwidth**: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
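The note's arithmetic checks out; a sanity-check sketch using the assumed inputs above (~13.5 GB read per token at Q4, ~55 t/s decode):

```shell
# Effective throughput implied by MoE decode speed
awk 'BEGIN {
  eff = 13.5 * 55    # GB read per token x tokens per second
  printf "%.1f GB/s effective, %.1fx the 216 GB/s DRAM ceiling\n", eff, eff / 216
}'
# -> 742.5 GB/s effective, 3.4x the 216 GB/s DRAM ceiling
```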
---
## Performance Summary (all measured, Vulkan RADV, q4_0 KV)
### Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|-------|------|------|--------|--------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: smaller model = faster tokens.
### Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|-------|----------|-------------|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | **~57** | **+46%** |
---
@@ -328,23 +257,19 @@ Understanding what is fixed helps avoid wasted effort.
sudo make rollback # Restores GRUB backup and previous tuned profile
```
Phase 2 rollback:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv-llm.conf`
- BIOS VRAM: revert manually (F10 at boot)
---
## Troubleshooting
If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
## Further Reading
- [Hardware analysis](llama-cpp-optimization-research.md) — llama.cpp flags, backends, quantization deep dive
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines and techniques
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links