# External References
Single source of truth for all external links used across this project.
## AMD Official
- ROCm Strix Halo Optimization Guide — BIOS, kernel params, GTT/TTM configuration
- ROCm System Optimization Index — General ROCm tuning
- ROCm Installation Guide (Linux) — Package installation
- AMD SMI Documentation — GPU monitoring API
- ROCm GitHub — Source and issue tracker
## Strix Halo Toolboxes (Donato Capitella)
The most comprehensive community resource for Strix Halo LLM optimization.
- strix-halo-toolboxes.com — Documentation, benchmarks, guides
- GitHub: kyuz0/amd-strix-halo-toolboxes — Container images, benchmark scripts, VRAM estimator
- Benchmark Results Viewer — Interactive performance charts
## Community
- Strix Halo Wiki — AI Capabilities — Community benchmarks, model compatibility
- Strix Halo Wiki — Power Modes — RyzenAdj sweet spots (85W recommended)
- Strix Halo Wiki — llama.cpp Performance — Backend comparison data
- Level1Techs Forum — HP G1a Guide — Laptop-specific configuration
- Framework Community — GPU Performance Tests — Framework Desktop results
- Framework Community — Compiling vLLM on Strix Halo — Native vLLM build guide
- Hardware Corner — Strix Halo LLM Optimization — Comprehensive optimization walkthrough
- Chips and Cheese — Strix Halo Memory Subsystem — Bandwidth measurements (215 GB/s)
- LLM Tracker — Strix Halo — Centralized performance database
## Other Strix Halo Repos
- pablo-ross/strix-halo-gmktec-evo-x2 — GMKtec EVO X2 optimization
- kyuz0/amd-strix-halo-llm-finetuning — Fine-tuning guides (Gemma-3, Qwen-3)
## Monitoring Tools
- amdgpu_top — AMD GPU monitor with TUI, GUI, and JSON output modes
- nvtop — Cross-vendor GPU monitor
- btop — System resource monitor
## LLM Inference
- llama.cpp — LLM inference engine (Vulkan + ROCm)
- ollama — LLM runtime with model management
- vLLM — High-throughput serving
- llama-benchy — Multi-backend LLM benchmarking
## Qwen3.5 Models (GGUF)
- unsloth/Qwen3.5-35B-A3B-GGUF — Top pick for 64GB Strix Halo (MoE, 3B active parameters)
- unsloth/Qwen3.5-27B-GGUF — Dense 27B
- unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF — Best for agentic/coding
- Qwen3.5 Official — Model family overview
- Unsloth Dynamic 2.0 — Adaptive quantization methodology
- Unsloth Studio — Training + inference UI (beta)
## Agentic Evaluation
- Inspect AI — All-in-one eval framework (HumanEval, BFCL, IFEval, GAIA)
- EvalPlus — HumanEval+ / MBPP+ with native ollama support
- BigCodeBench — 1,140 coding tasks across 139 libraries
- BFCL — Berkeley Function Calling Leaderboard
- SWE-bench — Real GitHub issue resolution
- Qwen-Agent — Optimized agentic framework for Qwen models
## System Tuning
- RyzenAdj — Power management for Ryzen APUs (PPT/TDP control)
- geohot/ztop — Power monitoring for Strix Halo (used to discover HP's 60W power limits)
- ROCm Issue #5750 — GPU clocks stuck at idle on gfx1151
- Mesa RADV Environment Variables — RADV_PERFTEST=nogttspill docs
- Linux Kernel: amd-pstate — CPU power management
## llama.cpp Optimization
- llama.cpp Speculative Decoding — Draft model setup
- llama.cpp PR #20075 — Fix speculative for Qwen3.5 MoE (pending)
- llama.cpp PR #20700 — Native MTP for Qwen3.5 (WIP)
- llama.cpp PR #16827 — rocWMMA tuned flash attention
- llama.cpp Issue #12444 — Hugepage support proposal
## AMD GPU Profiling
- Radeon GPU Profiler (RGP) — Hardware-level Vulkan/HIP profiling
- Radeon GPU Analyzer (RGA) — Offline shader/kernel analysis