strix-halo-optimizations/docs/troubleshooting.md
Felipe Cardoso da2c4c6b8a fix(docs): address review findings — accuracy, consistency, completeness
- architecture.md: fix kernel param math to match actual computed values,
  use cardN placeholder in sysfs paths, clarify system_ram_kb is OS-visible
- benchmarking.md: normalize flags to -ngl 99 / -mmp 0 (matching code),
  add llama-rocm7-nightlies backend
- CLAUDE.md: clarify HSA_OVERRIDE_GFX_VERSION is set in containers not
  scripts, fix lib sourcing description, specify which scripts need root
- detect.sh: document detect_cpu_cores returns threads not cores
- troubleshooting.md: add link to references.md
- README.md: remove unsupported Fedora 42 claim, describe configs/ content

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:44:16 +01:00


Troubleshooting

Firmware: linux-firmware 20251125 Causes ROCm Crashes

Symptoms: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.

Check: rpm -qa | grep linux-firmware

Fix: Downgrade to 20251111 or upgrade to 20260110 or later. After changing firmware, rebuild the initramfs for the running kernel, then reboot:

sudo dracut -f --kver $(uname -r)

The toolkit checks this automatically: make audit reports the firmware status.
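For a manual check outside the toolkit, the version comparison can be sketched as below. This is a minimal sketch, not the logic make audit uses: the function name firmware_status is hypothetical, and only 20251125 is flagged because that is the only release named above.

```shell
#!/bin/sh
# Classify a linux-firmware version string (date-style, e.g. 20251125).
# Assumption: only 20251125 is treated as affected, per the section above.
firmware_status() {
    case "$1" in
        20251125) echo "affected" ;;
        *)        echo "not-known-affected" ;;
    esac
}

# Query the installed package only where rpm is available.
if command -v rpm >/dev/null 2>&1; then
    if ver="$(rpm -q --qf '%{VERSION}\n' linux-firmware 2>/dev/null)"; then
        echo "linux-firmware $ver: $(firmware_status "$ver")"
    else
        echo "linux-firmware is not installed"
    fi
fi
```

If the result is "affected", apply the downgrade or upgrade described above.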

amdgpu_top: Cargo Build Fails (gix-hash error)

Symptoms: error: Please set either the sha1 or sha256 feature flag during cargo install amdgpu_top.

Cause: Rust toolchain version incompatibility with the gix-hash dependency.

Fix: Use the pre-built RPM instead:

make monitor-install

The install script downloads the RPM from GitHub releases, bypassing cargo entirely.

Toolbox GPU Access Failure

Symptoms: llama-cli --list-devices shows no GPU inside a toolbox container.

Check: Confirm that the toolbox was created with the device mappings its backend needs:

  • Vulkan backends need: --device /dev/dri
  • ROCm backends need: --device /dev/dri --device /dev/kfd

Fix: Recreate the toolbox with correct device flags. The refresh-toolboxes.sh script handles this automatically.

Also ensure your user is in the video and render groups (the change takes effect at the next login):

sudo usermod -aG video,render $USER
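Before recreating a toolbox, it can help to confirm on the host that the device nodes exist and that the current user has both groups. The following is a rough sketch; the helper name missing_groups is hypothetical and is not part of refresh-toolboxes.sh.

```shell
#!/bin/sh
# Print which of the required groups (video, render) are absent
# from a space-separated list of group names.
missing_groups() {
    mg_have=" $1 "
    mg_out=""
    for mg_g in video render; do
        case "$mg_have" in
            *" $mg_g "*) ;;                          # group present
            *) mg_out="${mg_out:+$mg_out }$mg_g" ;;  # group missing
        esac
    done
    echo "$mg_out"
}

# Report groups for the current user and the device nodes on this machine.
echo "missing groups: $(missing_groups "$(id -nG)")"
for dev in /dev/dri /dev/kfd; do
    if [ -e "$dev" ]; then echo "$dev present"; else echo "$dev missing"; fi
done
```

An empty "missing groups" line means both memberships are already in place. /dev/kfd only matters for ROCm backends.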

GRUB Changes Not Taking Effect

Symptoms: After make optimize-kernel and reboot, make audit still shows missing params.

Possible causes:

  1. BLS (Boot Loader Spec): Modern Fedora uses BLS entries. The script uses grubby when available, but verify:

    grubby --info=ALL | grep args
    
  2. Wrong GRUB config path: Check which config is actually used:

    cat /proc/cmdline    # what the kernel actually booted with
    cat /etc/default/grub  # what the script modified
    
  3. GRUB not regenerated: Manually regenerate:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
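To see exactly which parameters did not reach the running kernel, the booted command line can be compared against an expected list. This is a sketch under assumptions: missing_params and EXPECTED_PARAMS are hypothetical names, and the parameter list itself must come from what make optimize-kernel applies on your system.

```shell
#!/bin/sh
# Print each expected parameter that is absent from the given command line.
# $1: kernel command line; remaining arguments: expected parameters.
missing_params() {
    mp_cmdline=" $1 "
    shift
    for mp_p in "$@"; do
        case "$mp_cmdline" in
            *" $mp_p "*) ;;        # parameter present
            *) echo "$mp_p" ;;     # parameter missing
        esac
    done
}

# EXPECTED_PARAMS is a hypothetical, space-separated list supplied by the user.
if [ -r /proc/cmdline ]; then
    # Word splitting of EXPECTED_PARAMS is intentional here.
    missing_params "$(cat /proc/cmdline)" ${EXPECTED_PARAMS:-}
fi
```

No output means every listed parameter is present in /proc/cmdline. Any parameter printed was written to a config the bootloader did not use.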
    

Memory Unchanged After BIOS Change

Symptoms: Changed VRAM in BIOS but make audit still shows 32 GiB.

Check:

cat /sys/class/drm/card1/device/mem_info_vram_total

Possible causes:

  • BIOS change not saved (verify by re-entering BIOS)
  • Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
  • Kernel params not applied (VRAM reduction requires kernel params to be useful)
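The card index in the sysfs path is not always card1, so a check that covers every card avoids reading the wrong device. A minimal sketch, assuming mem_info_vram_total reports bytes (as the amdgpu driver does); the helper name bytes_to_mib is hypothetical.

```shell
#!/bin/sh
# Convert a byte count to whole MiB (integer division).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print dedicated VRAM for every DRM card that exposes the amdgpu attribute.
for f in /sys/class/drm/card*/device/mem_info_vram_total; do
    [ -r "$f" ] || continue   # skip when the glob matched nothing
    echo "$f: $(bytes_to_mib "$(cat "$f")") MiB"
done
```

A value of 32768 MiB corresponds to the 32 GiB case in the symptom above.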

Benchmark Failures

Symptoms: make benchmark-baseline reports "FAILED" for some backends.

Common fixes:

  • Ensure model exists: ls data/models/*.gguf
  • Check that the model fits in memory: use a small model (4B) for initial testing
  • Try llama-vulkan-radv first (most stable backend)
  • Check dmesg for GPU errors: dmesg | tail -30
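The first check above can be scripted as a quick pre-flight step before a benchmark run. This is only a sketch: count_gguf is a hypothetical helper, and data/models is the path used in the list above.

```shell
#!/bin/sh
# Count .gguf files in a directory (prints 0 if there are none
# or the directory does not exist).
count_gguf() {
    cg_n=0
    for cg_f in "$1"/*.gguf; do
        if [ -f "$cg_f" ]; then
            cg_n=$((cg_n + 1))
        fi
    done
    echo "$cg_n"
}

echo "models found in data/models: $(count_gguf data/models)"
```

A count of 0 means the benchmark has no model to load, so download one before investigating the backends.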

Rollback

If optimization causes issues:

sudo make rollback

This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See docs/optimization.md for the full rollback procedure.

Further Resources

For external tool documentation, upstream bug trackers, and community resources, see docs/references.md.