fix(benchmark): parse llama-bench output with variable column count

KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting the test and t/s columns to different indices. Parse these from
the end of each row instead of from hardcoded positions. Also change the
KV suffix separator from underscore to dash to avoid regex ambiguity with
type names like q8_0, which themselves contain underscores.

Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
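The end-anchored parsing described above can be sketched in Python. The
column layout and the `kv-` label below are illustrative assumptions for
this sketch, not the repository's actual code:

```python
def parse_bench_row(row: str) -> tuple[str, float]:
    """Extract (test, tokens/sec) from a llama-bench markdown table row.

    Optional columns such as type_k/type_v shift the middle of the row,
    but 'test' and 't/s' are always the last two cells, so we index
    from the end instead of using fixed positions.
    """
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    test, tps = cells[-2], cells[-1]
    # t/s is reported as "mean ± stddev"; keep only the mean.
    return test, float(tps.split("±")[0].strip())

# Row with extra type_k/type_v columns (hypothetical values):
row = "| llama 7B Q4_K | 3.80 GiB | 6.74 B | ROCm | 99 | q8_0 | q8_0 | pp512 | 612.34 ± 1.02 |"
print(parse_bench_row(row))  # ('pp512', 612.34)

# Separator fix: with '_' between label and type, "kv_q8_0" is ambiguous
# because the type name itself contains '_'; a dash splits cleanly.
label = "kv-q8_0"  # hypothetical label format
prefix, ktype = label.split("-", 1)
print(prefix, ktype)  # kv q8_0
```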
Felipe Cardoso
2026-03-27 14:54:19 +01:00
parent 7531f6fa74
commit f92b710492
7 changed files with 2148 additions and 52 deletions


@@ -21,8 +21,13 @@ The most comprehensive community resource for Strix Halo LLM optimization.
## Community
- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Strix Halo Wiki — Power Modes](https://strixhalo.wiki/Guides/Power-Modes-and-Performance) — RyzenAdj sweet spots (85W recommended)
- [Strix Halo Wiki — llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance) — Backend comparison data
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [Framework Community — Compiling vLLM on Strix Halo](https://community.frame.work/t/how-to-compiling-vllm-from-source-on-strix-halo/77241) — Native vLLM build guide
- [Hardware Corner — Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/) — Comprehensive optimization walkthrough
- [Chips and Cheese — Strix Halo Memory Subsystem](https://chipsandcheese.com/p/strix-halos-memory-subsystem-tackling) — Bandwidth measurements (215 GB/s)
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
## Other Strix Halo Repos
@@ -61,6 +66,22 @@ The most comprehensive community resource for Strix Halo LLM optimization.
- [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models
## System Tuning
- [RyzenAdj](https://github.com/FlyGoat/RyzenAdj) — Power management for Ryzen APUs (PPT/TDP control)
- [geohot/ztop](https://github.com/geohot/ztop) — Power monitoring for Strix Halo (discovered 60W HP limits)
- [ROCm Issue #5750](https://github.com/ROCm/ROCm/issues/5750) — GPU clocks stuck at idle on gfx1151
- [Mesa RADV Environment Variables](https://docs.mesa3d.org/envvars.html) — RADV_PERFTEST=nogttspill docs
- [Linux Kernel: amd-pstate](https://docs.kernel.org/admin-guide/pm/amd-pstate.html) — CPU power management
## llama.cpp Optimization
- [llama.cpp Speculative Decoding](https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md) — Draft model setup
- [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — Fix speculative for Qwen3.5 MoE (pending)
- [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — Native MTP for Qwen3.5 (WIP)
- [llama.cpp PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827) — rocWMMA tuned flash attention
- [llama.cpp Issue #12444](https://github.com/ggml-org/llama.cpp/issues/12444) — Hugepage support proposal
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling