Troubleshooting
Firmware: linux-firmware 20251125 Causes ROCm Crashes
Symptoms: Arbitrary crashes, instability, or mysterious failures with ROCm workloads.
Check: rpm -qa | grep linux-firmware
Fix: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware, regenerate the initramfs for the running kernel:
sudo dracut -f --kver $(uname -r)
The toolkit checks this automatically: make audit shows firmware status.
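The manual check above can be sketched as a small script. This is an illustrative sketch, not the toolkit's own audit logic: the function name firmware_status and its output labels are hypothetical, and it flags only the exact release named in this section.

```bash
# Classify a linux-firmware version string (YYYYMMDD).
# Hypothetical helper: only the release named in this document is flagged.
firmware_status() {
    case "$1" in
        20251125) echo "known-bad" ;;
        *)        echo "not-flagged" ;;
    esac
}

# Query the installed package only on systems where rpm is available
# and the package is actually installed.
if command -v rpm >/dev/null 2>&1; then
    if version=$(rpm -q --qf '%{VERSION}\n' linux-firmware 2>/dev/null); then
        echo "linux-firmware ${version}: $(firmware_status "$version")"
    fi
fi
```

A "not-flagged" result only means the version is not the one listed here; it does not guarantee the firmware is free of other problems.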
amdgpu_top: Cargo Build Fails (gix-hash error)
Symptoms: cargo install amdgpu_top fails with "error: Please set either the sha1 or sha256 feature flag".
Cause: Rust toolchain version incompatibility with the gix-hash dependency.
Fix: Use the pre-built RPM instead:
make monitor-install
The install script downloads the RPM from GitHub releases, bypassing cargo entirely.
Toolbox GPU Access Failure
Symptoms: llama-cli --list-devices shows no GPU inside a toolbox container.
Check: Device mappings when creating the toolbox:
- Vulkan backends need: --device /dev/dri
- ROCm backends need: --device /dev/dri --device /dev/kfd
Fix: Recreate the toolbox with correct device flags. The refresh-toolboxes.sh script handles this automatically.
Also ensure your user is in the video and render groups (log out and back in afterwards so the new membership takes effect):
sudo usermod -aG video,render $USER
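Before recreating a toolbox, it can help to confirm the host side first. The following is a minimal sketch, assuming it runs on the host as the user who owns the toolbox; missing_groups is a hypothetical helper, not part of refresh-toolboxes.sh.

```bash
# Print each required group (video, render) that is absent from a
# space-separated list of group names.
missing_groups() {
    for required in video render; do
        case " $1 " in
            *" $required "*) ;;            # present, nothing to report
            *) echo "$required" ;;         # absent
        esac
    done
}

# Report device nodes that do not exist on this machine.
for node in /dev/dri /dev/kfd; do
    [ -e "$node" ] || echo "missing device node: $node"
done

# Report groups the current user is not a member of.
missing_groups "$(id -nG)"
```

No output means both device nodes exist and the current session already has both groups.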
GRUB Changes Not Taking Effect
Symptoms: After make optimize-kernel and reboot, make audit still shows missing params.
Possible causes:
- BLS (Boot Loader Spec): Modern Fedora uses BLS entries. The script uses grubby when available, but verify:
  grubby --info=ALL | grep args
- Wrong GRUB config path: Check which config is actually used:
  cat /proc/cmdline # what the kernel actually booted with
  cat /etc/default/grub # what the script modified
- GRUB not regenerated: Manually regenerate:
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg
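Comparing the booted command line against the parameters you expect narrows down which of the causes above applies. This is a sketch under assumptions: missing_params is a hypothetical helper, and the parameter names in the usage comment are placeholders, since this section does not list the parameters that make optimize-kernel sets.

```bash
# Print every expected parameter that is absent from a kernel command line.
# $1 is the command line; all remaining arguments are expected parameters.
missing_params() {
    cmdline=" $1 "
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;          # exact token found
            *) echo "$param" ;;       # not present in the command line
        esac
    done
}

# Usage with placeholder names (substitute the real parameters):
#   missing_params "$(cat /proc/cmdline)" example.first=1 example.second=2
```

If a parameter appears in /etc/default/grub but is reported missing here, the change was written but never reached the booted entry, which points at the BLS or regeneration causes.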
Memory Unchanged After BIOS Change
Symptoms: Changed VRAM in BIOS but make audit still shows 32 GiB.
Check:
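cat /sys/class/drm/card1/device/mem_info_vram_total
The value is reported in bytes. The card index can differ between systems (card0, card1, ...), so adjust the path if card1 does not exist.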
Possible causes:
- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (VRAM reduction requires kernel params to be useful)
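Because the sysfs value is a raw byte count, a conversion makes it easier to compare with the BIOS setting. The sketch below assumes an amdgpu device and loops over every card entry rather than hard-coding an index; bytes_to_gib is a hypothetical helper name.

```bash
# Convert a byte count to whole GiB (integer division, rounds down).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Report VRAM for every DRM entry that exposes the amdgpu attribute.
# Entries without the attribute (for example connectors) are skipped.
for f in /sys/class/drm/card*/device/mem_info_vram_total; do
    [ -r "$f" ] || continue
    echo "$f: $(bytes_to_gib "$(cat "$f")") GiB"
done
```

For example, a reading of 34359738368 corresponds to 32 GiB.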
Benchmark Failures
Symptoms: make benchmark-baseline reports "FAILED" for some backends.
Common fixes:
- Ensure model exists: ls data/models/*.gguf
- Check model fits in memory: use small models (4B) for initial testing
- Try llama-vulkan-radv first (most stable backend)
- Check dmesg for GPU errors: dmesg | tail -30
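The first item in the list above can be turned into a quick pre-flight check. This is a minimal sketch, assuming models sit directly in data/models relative to the current directory; count_models is a hypothetical helper and not part of the benchmark scripts.

```bash
# Count .gguf files directly inside a directory.
# Prints 0 when the directory is empty or does not exist.
count_models() {
    count=0
    for f in "$1"/*.gguf; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "models found: $(count_models data/models)"
```

A count of 0 means the benchmark has nothing to load, so a "FAILED" result is expected regardless of backend.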
Rollback
If optimization causes issues:
sudo make rollback
This restores the GRUB backup and previous tuned profile. BIOS changes must be reverted manually (F10 at boot). See docs/optimization.md for the full rollback procedure.
Further Resources
For external tool documentation, upstream bug trackers, and community resources, see docs/references.md.