# Troubleshooting
## Firmware: linux-firmware 20251125 Causes ROCm Crashes

**Symptoms**: Random crashes, instability, or otherwise unexplained failures in ROCm workloads.

**Check**: `rpm -qa | grep linux-firmware`

**Fix**: Downgrade to 20251111 or upgrade to 20260110+. After changing firmware, regenerate the initramfs for the running kernel:
```bash
sudo dracut -f --kver "$(uname -r)"
```

The toolkit checks this automatically: `make audit` shows firmware status.

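
For a manual check outside the toolkit, the version comparison can be sketched as below. This is a minimal illustration, not the logic `make audit` uses: `firmware_status` is a hypothetical helper, and the only rule it encodes is the known-bad build named above. Any other date stamp is reported as "not-known-bad" rather than as verified good.

```bash
# Classify a linux-firmware package string such as
# "linux-firmware-20251125-1.fc43.noarch" (the sample name is illustrative).
firmware_status() {
    # Extract the first 8-digit date stamp from the package name.
    stamp=$(printf '%s\n' "$1" | grep -oE '[0-9]{8}' | head -n 1)
    case "$stamp" in
        "")       echo "unknown" ;;
        20251125) echo "known-bad" ;;
        *)        echo "not-known-bad" ;;
    esac
}

# Only query RPM when it is available (for example on Fedora).
if command -v rpm >/dev/null 2>&1; then
    pkg=$(rpm -q linux-firmware 2>/dev/null || true)
    echo "linux-firmware: ${pkg:-not found} -> $(firmware_status "$pkg")"
fi
```

A result of `known-bad` means the downgrade or upgrade described above applies; `unknown` means no date stamp could be parsed from the package name.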
## amdgpu_top: Cargo Build Fails (gix-hash error)

**Symptoms**: `error: Please set either the sha1 or sha256 feature flag` during `cargo install amdgpu_top`.

**Cause**: Rust toolchain version incompatibility with the `gix-hash` dependency.

**Fix**: Use the pre-built RPM instead:
```bash
make monitor-install
```

The install script downloads the RPM from GitHub releases, bypassing cargo entirely.

## Toolbox GPU Access Failure

**Symptoms**: `llama-cli --list-devices` shows no GPU inside a toolbox container.

**Check**: The device mappings passed when the toolbox was created:
- Vulkan backends need: `--device /dev/dri`
- ROCm backends need: `--device /dev/dri --device /dev/kfd`

**Fix**: Recreate the toolbox with the correct device flags. The [refresh-toolboxes.sh](https://github.com/kyuz0/amd-strix-halo-toolboxes) script handles this automatically.

Also ensure your user is in the `video` and `render` groups:
```bash
sudo usermod -aG video,render "$USER"
```

Log out and back in for the new group membership to take effect.

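
The group and device requirements above can be verified on the host with a short script. This is a sketch under assumptions: `missing_groups` is a hypothetical helper, and the check only looks at group names and the existence of the device nodes, not at their permissions.

```bash
# Print each required group that is absent from a space-separated group list.
missing_groups() {
    have=" $1 "
    shift
    for g in "$@"; do
        case "$have" in
            *" $g "*) ;;          # group is present
            *) echo "$g" ;;       # group is missing
        esac
    done
}

# Groups of the current session, as reported by id(1).
missing=$(missing_groups "$(id -nG)" video render)
if [ -n "$missing" ]; then
    echo "Not in group(s):" $missing
else
    echo "video and render membership OK"
fi

# Device nodes that the backends expect to be mapped into the toolbox.
for node in /dev/dri /dev/kfd; do
    if [ -e "$node" ]; then
        echo "$node present"
    else
        echo "$node missing"
    fi
done
```

Because `id -nG` reports the groups of the current session, a group added with `usermod` shows up only after a fresh login.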
## GRUB Changes Not Taking Effect

**Symptoms**: After `make optimize-kernel` and a reboot, `make audit` still reports missing kernel parameters.

**Possible causes**:

1. **BLS (Boot Loader Specification)**: Modern Fedora uses BLS entries. The script uses `grubby` when available, but verify that the boot entries carry the new arguments:
   ```bash
   sudo grubby --info=ALL | grep args
   ```

2. **Wrong GRUB config path**: Check which config is actually used:
   ```bash
   cat /proc/cmdline        # what the kernel actually booted with
   cat /etc/default/grub    # what the script modified
   ```

3. **GRUB not regenerated**: Manually regenerate:
   ```bash
   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
   ```

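
Comparing the running command line against the parameters you expect narrows down which of the causes above applies. The sketch below is illustrative only: `missing_params` is a hypothetical helper, and `EXPECTED` with its `iommu=pt` value is a placeholder, not the toolkit's actual parameter list.

```bash
# Print every expected parameter that is missing from a kernel command line.
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;          # exact parameter found
            *) echo "$p" ;;       # parameter absent
        esac
    done
}

# Placeholder list: substitute the parameters your setup should have.
EXPECTED="iommu=pt"

if [ -r /proc/cmdline ]; then
    # EXPECTED is deliberately unquoted so each parameter becomes one argument.
    missing_params "$(cat /proc/cmdline)" $EXPECTED
fi
```

No output means every listed parameter reached the running kernel. A parameter that is printed here but does appear in `/etc/default/grub` points at a regeneration or BLS problem rather than at the edit itself.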
## Memory Unchanged After BIOS Change

**Symptoms**: Changed VRAM in BIOS but `make audit` still shows 32 GiB.

**Check**: Read the VRAM size the driver reports (the value is in bytes). The card index may differ on your system, so adjust `card1` if needed:
```bash
cat /sys/class/drm/card1/device/mem_info_vram_total
```

**Possible causes**:
- BIOS change not saved (verify by re-entering BIOS)
- Wrong BIOS setting modified (look for "UMA Frame Buffer Size", not "Shared Memory")
- Kernel params not applied (reducing VRAM is only useful together with the kernel params)

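
Since the sysfs value is a raw byte count, converting it makes the comparison with the BIOS setting easier. A minimal sketch, assuming the card index is unknown and that `vram_mib` is a hypothetical helper name:

```bash
# Convert a byte count to whole MiB using integer arithmetic.
vram_mib() {
    echo $(( $1 / 1048576 ))
}

# Try every DRM card instead of hard-coding the index.
for f in /sys/class/drm/card*/device/mem_info_vram_total; do
    [ -r "$f" ] || continue   # glob did not match, or the file is absent
    echo "$f: $(vram_mib "$(cat "$f")") MiB"
done
```

A reading of 32768 MiB corresponds to the 32 GiB mentioned in the symptoms.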
## Benchmark Failures

**Symptoms**: `make benchmark-baseline` reports "FAILED" for some backends.

**Common fixes**:
- Ensure a model exists: `ls data/models/*.gguf`
- Check that the model fits in memory; use a small model (4B) for initial testing
- Try `llama-vulkan-radv` first (most stable backend)
- Check dmesg for GPU errors: `dmesg | tail -30`

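
The first two checks can be combined into a small pre-flight step before running the benchmark. This is a sketch, not part of the toolkit: `count_models` is a hypothetical helper, and `MODEL_DIR` simply mirrors the `data/models` path used above.

```bash
# Count the .gguf files directly inside a directory.
count_models() {
    n=0
    for f in "$1"/*.gguf; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

MODEL_DIR="data/models"
if [ "$(count_models "$MODEL_DIR")" -eq 0 ]; then
    echo "No .gguf models found in $MODEL_DIR"
else
    # Show file sizes so an oversized model is easy to spot.
    ls -lh "$MODEL_DIR"/*.gguf
fi
```

The file size only hints at whether a model will fit; actual memory use also depends on context length and backend.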
## Rollback

If optimization causes issues:
```bash
sudo make rollback
```

This restores the GRUB backup and the previous `tuned` profile. BIOS changes must be reverted manually (F10 at boot). See [docs/optimization.md](optimization.md) for the full rollback procedure.

## Further Resources

For external tool documentation, upstream bug trackers, and community resources, see [docs/references.md](references.md).