Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
Phase 1: Core System
1.1 Tuned Profile: accelerator-performance
- Date: 2026-03-26
- Change: `sudo tuned-adm profile accelerator-performance`
- Benchmark: `data/benchmarks/after-tuned-*`
- Result: +5-8% pp (prompt processing) improvement, +2-3% tg (token generation) improvement
- Verdict: KEEP
1.2 Kernel Boot Parameters
- Date: 2026-03-26
- Change: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: Combined with the BIOS VRAM change, large models now fit in GTT. Peak usage 38.8/59 GiB.
- Verdict: KEEP
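For reference, a sketch of persisting these flags on a GRUB-based distro (file locations and the regeneration command vary by distribution; the `grub2-mkconfig` path below assumes a Fedora-style layout):

```
# /etc/default/grub — append the flags to the existing line:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
```

Then regenerate with `sudo grub2-mkconfig -o /boot/grub2/grub.cfg`, reboot, and confirm with `grep -o 'gttsize=[0-9]*' /proc/cmdline`.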
1.3 BIOS VRAM Reduction (512 MB)
- Date: 2026-03-26
- Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: 31.5 GB freed for OS/GTT. Small models are ~3-8% slower (GTT indirection vs. dedicated VRAM), but the system gained the ability to run 37 GB+ models at 32K+ context. Net positive.
- Trade-off: Small model regression is acceptable given the massive capability gain.
- Verdict: KEEP
Phase 2: System Tuning
2.1 RyzenAdj 85W PPT
- Date: PENDING
- Change: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
- Expected: +12-19% CPU/GPU throughput (community data from the Strix Halo Wiki)
- Benchmark: Not yet run
- Notes: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
- Verdict: PENDING
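RyzenAdj limits reset on reboot (and often after suspend), so persistence means reapplying them at boot. A sketch using a systemd oneshot unit; the unit name and binary path are assumptions, verify with `command -v ryzenadj`:

```ini
# Illustrative unit, e.g. /etc/systemd/system/ryzenadj-85w.service (name is arbitrary)
[Unit]
Description=Apply 85W PPT limits via RyzenAdj

[Service]
Type=oneshot
ExecStart=/usr/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000

[Install]
WantedBy=multi-user.target
```

Enable with `sudo systemctl daemon-reload && sudo systemctl enable --now ryzenadj-85w.service`.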
2.2 VM Sysctl Tuning
- Date: PENDING
- Change: `vm.swappiness=1`, `vm.dirty_ratio=40`, `vm.max_map_count=500000`
- Expected: Prevent model weight eviction, reduce I/O disruption
- Benchmark: Not yet run
- Verdict: PENDING
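A sketch of persisting these settings via a sysctl drop-in (the file name is arbitrary; values are from the entry above):

```
# /etc/sysctl.d/99-llm-tuning.conf (name is illustrative)
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
```

Apply without rebooting: `sudo sysctl --system`.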
2.3 Transparent Huge Pages
- Date: PENDING
- Change: `transparent_hugepage=always`
- Expected: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
- Benchmark: Not yet run
- Verdict: PENDING
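For a quick A/B test before committing to the boot parameter, THP can be flipped at runtime through sysfs (a sketch; the setting reverts at reboot):

```shell
# Show the current mode; the active value appears in brackets, e.g. "[madvise]".
cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable for the current boot only (root required):
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```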
2.4 RADV_PERFTEST=nogttspill
- Date: PENDING
- Change: `export RADV_PERFTEST=nogttspill`
- Expected: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
- Benchmark: Not yet run — needs Vulkan-specific benchmark comparison
- Verdict: PENDING
2.5 amdgpu.noretry=0
- Date: PENDING
- Change: Kernel cmdline `amdgpu.noretry=0`
- Expected: Improved stability under memory pressure
- Notes: Only apply if experiencing GPU page faults or crashes during large model loading
- Verdict: PENDING
Phase 3: Runtime Flags
3.1 KV Cache Quantization
- Date: PENDING (sweep running)
- Change: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
- Benchmark: `data/benchmarks/kv-sweep-128k-*` (in progress)
- Expected: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
- Verdict: PENDING
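The memory claims are easy to sanity-check with back-of-envelope math: llama.cpp's block formats cost about 16 bits per element for f16, 8.5 for q8_0, and 4.5 for q4_0 (32-element blocks plus a per-block scale). The model dimensions below are illustrative, not taken from any specific model:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_dim * ctx * bits / 8.
# Hypothetical dims: 48 layers, 8 KV heads x 128 head dim, 128K context.
layers=48; kv_dim=$((8 * 128)); ctx=131072
for entry in "f16 32" "q8_0 17" "q4_0 9"; do   # bits doubled for integer math
  set -- $entry
  bytes=$((2 * layers * kv_dim * ctx * $2 / 2 / 8))
  echo "$1: $((bytes / 1024 / 1024)) MiB"
done
```

With these dimensions: f16 24576 MiB, q8_0 13056 MiB (~47% less), q4_0 6912 MiB (~72% less), in line with the ~50%/~75% figures above. Note that a quantized V cache requires flash attention to be enabled in llama.cpp.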
3.2 MoE Batch Size -b 256
- Date: PENDING
- Change: Add `-b 256` to MoE benchmark runs
- Expected: Up to +70% pp improvement for MoE models (community benchmarks)
- Benchmark: Not yet run
- Verdict: PENDING
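A sketch of what the run might look like (the model path is a placeholder; `-b` sets llama-bench's logical batch size):

```shell
# Hypothetical invocation; substitute a real MoE GGUF path.
llama-bench -m models/some-moe.gguf -p 4096 -n 1024 -b 256
```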
Phase 4: Build Optimizations
4.1 rocWMMA Flash Attention
- Date: PENDING
- Change: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- Expected: +96% long-context performance (65K+)
- Notes: Need to check if Donato's toolboxes already include this
- Verdict: PENDING
4.2 rocWMMA Tuned Patch (PR #16827)
- Date: PENDING
- Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
- Verdict: PENDING
Phase 5: Future / Blocked
5.1 Speculative Decoding
- Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- Draft model: `Qwen3.5-0.8B-Q8_0.gguf` (812 MB), downloaded 2026-03-27
- Last checked: 2026-03-27 — PR open since 2026-03-03, has ROCm buffer issues
5.2 Native MTP (Multi-Token Prediction)
- Status: BLOCKED — llama.cpp PR #20700
- Last checked: 2026-03-27 — WIP, not expected to merge soon
5.3 GPU Clock Fix
- Status: BLOCKED — ROCm issue #5750
- Notes: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
- Last checked: 2026-03-27
Context Window Benchmarks
64K Context (pp4096/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx64k-*`
- Results: (check logs)
128K Context (pp8192/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx128k-realistic-*`
- Results: (check logs)
256K Context (pp16384/tg1024, MoE models)
- Date: 2026-03-27
- Benchmark: `data/benchmarks/ctx256k-*`
- Results: (check logs)
How to Add Entries
When testing a new optimization:
- Record the date and exact change
- Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
- Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
- Update this log with results and verdict
- If KEEP: document in optimization.md with the measured numbers
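For example, a hypothetical run testing THP (the tag and result paths are placeholders):

```shell
# Placeholder tag and paths; substitute real benchmark directories.
make benchmark ARGS="--tag thp-always"
make benchmark-compare BEFORE=data/benchmarks/baseline AFTER=data/benchmarks/thp-always
```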