Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
Phase 1: Core System
1.1 Tuned Profile: accelerator-performance
- Date: 2026-03-26
- Change: `sudo tuned-adm profile accelerator-performance`
- Benchmark: `data/benchmarks/after-tuned-*`
- Result: +5-8% pp (prompt processing) improvement, +2-3% tg (token generation) improvement
- Verdict: KEEP
1.2 Kernel Boot Parameters
- Date: 2026-03-26
- Change: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: Combined with the BIOS VRAM change, large models now fit in GTT. Peak usage 38.8/59 GiB.
- Verdict: KEEP
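For reference, a sketch of persisting these flags on a GRUB-based distro (file locations and the regeneration command vary by distribution; the `grub2-mkconfig` path below assumes a Fedora-style layout):

```
# /etc/default/grub — append the flags to the existing line:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
```

Then regenerate with `sudo grub2-mkconfig -o /boot/grub2/grub.cfg`, reboot, and confirm with `grep -o 'gttsize=[0-9]*' /proc/cmdline`.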
1.3 BIOS VRAM Reduction (512 MB)
- Date: 2026-03-26
- Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: 31.5 GB freed for OS/GTT. Small models are ~3-8% slower (GTT indirection vs. dedicated VRAM), but the system gained the ability to run 37 GB+ models at 32K+ context. Net positive.
- Trade-off: Small model regression is acceptable given the massive capability gain.
- Verdict: KEEP
Phase 2: System Tuning
2.1 RyzenAdj 85W PPT
- Date: PENDING
- Change: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
- Expected: +12-19% CPU/GPU throughput (community data from the Strix Halo Wiki)
- Benchmark: Not yet run
- Notes: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
- Verdict: PENDING
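RyzenAdj limits reset on reboot (and often after suspend), so persistence means reapplying them at boot. A sketch using a systemd oneshot unit; the unit name and binary path are assumptions, verify with `command -v ryzenadj`:

```ini
# Illustrative unit, e.g. /etc/systemd/system/ryzenadj-85w.service (name is arbitrary)
[Unit]
Description=Apply 85W PPT limits via RyzenAdj

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target
```

Enable with `sudo systemctl daemon-reload && sudo systemctl enable --now ryzenadj-85w.service`.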
2.2 VM Sysctl Tuning
- Date: PENDING
- Change: `vm.swappiness=1`, `vm.dirty_ratio=40`, `vm.max_map_count=500000`
- Expected: Prevent model weight eviction, reduce I/O disruption
- Benchmark: Not yet run
- Verdict: PENDING
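A sketch of persisting these settings via a sysctl drop-in (the file name is arbitrary; values are from the entry above):

```
# /etc/sysctl.d/99-llm-tuning.conf (name is illustrative)
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
```

Apply without rebooting: `sudo sysctl --system`.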
2.3 Transparent Huge Pages
- Date: PENDING
- Change: `transparent_hugepage=always`
- Expected: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
- Benchmark: Not yet run
- Verdict: PENDING
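For a quick A/B test before committing to the boot parameter, THP can be flipped at runtime through sysfs (a sketch; the setting reverts at reboot):

```shell
# Show the current mode; the active value appears in brackets, e.g. "[madvise]".
cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable for the current boot only (root required):
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```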
2.4 RADV_PERFTEST=nogttspill
- Date: PENDING
- Change: `export RADV_PERFTEST=nogttspill`
- Expected: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- Benchmark: Not yet run — needs Vulkan-specific benchmark comparison
- Verdict: PENDING
2.5 amdgpu.noretry=0
- Date: PENDING
- Change: Kernel cmdline `amdgpu.noretry=0`
- Expected: Improved stability under memory pressure
- Notes: Only apply if experiencing GPU page faults or crashes during large model loading
- Verdict: PENDING
Phase 3: Runtime Flags
3.1 KV Cache Quantization
- Date: PENDING (sweep running)
- Change: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- Benchmark: `data/benchmarks/kv-sweep-128k-*` (in progress)
- Expected: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- Verdict: PENDING
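The memory claims are easy to sanity-check with back-of-envelope math: llama.cpp's block formats cost about 16 bits per element for f16, 8.5 for q8_0, and 4.5 for q4_0 (32-element blocks plus a per-block scale). The model dimensions below are illustrative, not taken from any specific model:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_dim * ctx * bits / 8.
# Hypothetical dims: 48 layers, 8 KV heads x 128 head dim, 128K context.
layers=48; kv_dim=$((8 * 128)); ctx=131072
for entry in "f16 32" "q8_0 17" "q4_0 9"; do   # bits doubled for integer math
  set -- $entry
  bytes=$((2 * layers * kv_dim * ctx * $2 / 2 / 8))
  echo "$1: $((bytes / 1024 / 1024)) MiB"
done
```

With these dimensions: f16 24576 MiB, q8_0 13056 MiB (~47% less), q4_0 6912 MiB (~72% less), in line with the ~50%/~75% figures above. Note that a quantized V cache requires flash attention to be enabled in llama.cpp.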
3.2 MoE Batch Size -b 256
- Date: PENDING
- Change: Add `-b 256` to MoE benchmark runs
- Expected: Up to +70% pp improvement for MoE models (community benchmarks)
- Benchmark: Not yet run
- Verdict: PENDING
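A sketch of what the run might look like (the model path is a placeholder; `-b` sets llama-bench's logical batch size):

```shell
# Hypothetical invocation; substitute a real MoE GGUF path.
llama-bench -m models/some-moe.gguf -p 4096 -n 1024 -b 256
```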
Phase 4: Build Optimizations
4.1 rocWMMA Flash Attention
- Date: PENDING
- Change: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- Expected: +96% long-context performance (65K+)
- Notes: Need to check if Donato's toolboxes already include this
- Verdict: PENDING
4.2 rocWMMA Tuned Patch (PR #16827)
- Date: PENDING
- Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
- Verdict: PENDING
Phase 5: Future / Blocked
5.1 Speculative Decoding
- Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- Draft model: `Qwen3.5-0.8B-Q8_0.gguf` (812 MB), downloaded 2026-03-27
- Last checked: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
5.2 Native MTP (Multi-Token Prediction)
- Status: BLOCKED — llama.cpp PR #20700
- Last checked: 2026-03-27 — WIP, not expected to merge soon
5.3 GPU Clock Fix
- Status: BLOCKED — ROCm issue #5750
- Notes: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- Last checked: 2026-03-27
Context Window Benchmarks
64K Context (pp4096/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx64k-*`
- Results: (check logs)
128K Context (pp8192/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx128k-realistic-*`
- Results: (check logs)
256K Context (pp16384/tg1024, MoE models)
- Date: 2026-03-27
- Benchmark: `data/benchmarks/ctx256k-*`
- Results: (check logs)
How to Add Entries
When testing a new optimization:
- Record the date and exact change
- Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
- Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
- Update this log with results and verdict
- If KEEP: document in optimization.md with the measured numbers
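For example, a hypothetical run testing THP (the tag and result paths are placeholders):

```shell
# Placeholder tag and paths; substitute real benchmark directories.
make benchmark ARGS="--tag thp-always"
make benchmark-compare BEFORE=data/benchmarks/baseline AFTER=data/benchmarks/thp-always
```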