Felipe Cardoso 58124cd657 feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00


CLAUDE.md — AI Assistant Context

Optimization toolkit for AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified memory) on Fedora 43. Pure bash scripts with inline Python for JSON handling and GRUB editing. See README.md for user-facing commands.

Architecture

bin/ dispatchers → scripts/ implementations → lib/ shared libraries. Scripts source libs as needed: always common.sh first, then detect.sh if hardware detection is needed, then format.sh if formatted output is needed. Some scripts (e.g., rollback.sh) only need common.sh. Runtime data goes to data/ (gitignored). Full details in docs/architecture.md.
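The sourcing order above can be sketched as a minimal script skeleton. This is illustrative only: `LIB_DIR` resolution and the existence guard are assumptions, not copied from the repo's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical skeleton of a scripts/ entry point; LIB_DIR resolution and
# the existence guard are illustrative, not taken from this repo.
set -euo pipefail
LIB_DIR="${LIB_DIR:-lib}"

source_lib() {                     # source a lib only if it is present
    if [[ -f "$LIB_DIR/$1" ]]; then
        source "$LIB_DIR/$1"
    fi
}

source_lib common.sh   # always first
source_lib detect.sh   # only if hardware detection is needed
source_lib format.sh   # only if formatted output is needed
order="common -> detect -> format"
echo "sourcing order: $order"
```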

Safety Rules

  • scripts/optimize/kernel-params.sh modifies /etc/default/grub — requires root, backs up to data/backups/ first. Always maintain the Python-with-env-vars pattern for GRUB editing (no shell variable interpolation into Python code).
  • scripts/optimize/tuned-profile.sh and rollback.sh require root and save previous state for rollback.
  • data/backups/ contains GRUB backups and tuned profile snapshots — never delete these.
  • Optimization scripts that modify system state (kernel-params.sh, tuned-profile.sh, rollback.sh) check $EUID at the top and exit immediately if not root. Guidance-only scripts (vram-gtt.sh, verify.sh) do not require root.
  • All Python blocks receive data via environment variables (os.environ), never via shell interpolation into Python source. This prevents injection. Do not revert to '''$var''' or "$var" patterns inside Python heredocs.
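The env-var pattern looks roughly like the following sketch (the variable name and message are illustrative, not the repo's actual GRUB-editing code). Because the heredoc delimiter is quoted, the shell never expands anything inside the Python source; the value crosses the boundary only through `os.environ`.

```shell
#!/usr/bin/env bash
# Sketch of the env-var pattern (values illustrative): data reaches Python
# through os.environ, never by interpolating $vars into the Python source,
# so a hostile value cannot be executed as code.
set -euo pipefail

NEW_PARAM='amd_iommu=off'
result="$(NEW_PARAM="$NEW_PARAM" python3 - <<'EOF'
import os
param = os.environ["NEW_PARAM"]    # read, never interpolate
print(f"would append kernel param: {param}")
EOF
)"
echo "$result"
```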

Key Technical Details

  • GPU sysfs: Auto-detected by find_gpu_card() in lib/detect.sh (matches vendor 0x1002). Falls back to first card with mem_info_vram_total.
  • Memory recommendations: recommended_gttsize_mib() in detect.sh computes from total physical RAM = visible RAM + dedicated VRAM (the VRAM is still physical memory). Floor at 1 GiB.
  • Kernel param detection: detect_kernel_param() uses word-boundary-anchored regex to avoid iommu matching amd_iommu.
  • Benchmark invocation: toolbox run -c NAME -- [env ROCBLAS_USE_HIPBLASLT=1] /path/to/llama-bench -ngl 99 -mmp 0 -fa 1 -r N. ENV_ARGS passed as a proper bash array (not string splitting).
  • llama-bench output: Pipe-delimited table. The Python parser reads fixed column indices (parts[8] = test name, parts[9] = t/s), so any upstream format change would break it.
  • ROCm for gfx1151: Scripts set ROCBLAS_USE_HIPBLASLT=1 in benchmark ENV_ARGS. HSA_OVERRIDE_GFX_VERSION=11.5.1 is set inside the toolbox containers (not by our scripts) — needed for ollama and native ROCm builds.
  • Fedora GRUB: Prefers grubby (BLS), falls back to grub2-mkconfig, then grub-mkconfig. All three paths are handled.
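The word-boundary matching used for kernel param detection can be sketched as below. The function name mirrors the doc, but the exact regex and the sample cmdline are assumptions; the real detect_kernel_param() in lib/detect.sh may differ in detail.

```shell
#!/usr/bin/env bash
# Sketch of word-boundary-anchored kernel param matching; regex and sample
# cmdline are illustrative, not copied from lib/detect.sh.
set -euo pipefail

has_kernel_param() {   # $1 = param name, $2 = kernel cmdline
    # (^| ) and ( |$) anchors stop "iommu" matching inside "amd_iommu"
    grep -Eq "(^| )$1(=[^ ]*)?( |\$)" <<<"$2"
}

CMDLINE='amd_iommu=off amdgpu.gttsize=65536'   # illustrative cmdline
iommu=$(has_kernel_param iommu "$CMDLINE" && echo present || echo absent)
amd=$(has_kernel_param amd_iommu "$CMDLINE" && echo present || echo absent)
echo "iommu=$iommu amd_iommu=$amd"
```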

Conventions

  • set -euo pipefail in every executable script
  • snake_case function names, UPPER_CASE for constants and loop variables
  • 4-space indentation, no tabs
  • lib/ files are sourced (no shebang enforcement), but include #!/usr/bin/env bash for editor support
  • Colors gated on [[ -t 1 ]] (disabled when piped)
  • bc used for float math; python3 for JSON and GRUB editing only
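The color-gating convention can be sketched as follows; the variable and function names are illustrative, not the actual names in lib/format.sh. When stdout is redirected or piped, `[[ -t 1 ]]` fails and the escape sequences collapse to empty strings.

```shell
#!/usr/bin/env bash
# Sketch of the color-gating convention: colors only when stdout is a
# terminal, so piped output stays plain. Names are illustrative.
set -euo pipefail

init_colors() {
    if [[ -t 1 ]]; then
        GREEN=$'\033[32m'; RESET=$'\033[0m'
    else
        GREEN=''; RESET=''
    fi
}
init_colors
printf '%sok%s\n' "$GREEN" "$RESET"
```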

Validating Changes

make audit          # Quick check — shows system status with pass/fail indicators
make verify         # 9-point optimization checklist
bin/audit --json | python3 -m json.tool   # Verify JSON output is valid

Agentic Evaluation

Scripts in scripts/agentic/ with dispatcher at bin/agentic. Uses a Python venv at data/venv/. Eval frameworks: inspect-ai (all-in-one), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint (ollama or llama.cpp server). Model catalog at configs/models.conf.
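Pointing an eval framework at the local server typically means exporting the OpenAI-compatible base URL and a placeholder key. The sketch below uses common OpenAI-SDK environment variable names and ollama's default port; none of these values are taken from this repo's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical wiring for targeting a local OpenAI-compatible server; the
# URL, port, and variable names follow common OpenAI-SDK conventions and
# are assumptions, not values from scripts/agentic/.
set -euo pipefail

export OPENAI_BASE_URL="http://localhost:11434/v1"   # ollama's default port
export OPENAI_API_KEY="dummy"                        # local servers ignore the key
echo "evals target: $OPENAI_BASE_URL"
```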

External Resources

All external links are centralized in docs/references.md. Key ones:

  • AMD ROCm Strix Halo guide (kernel params, GTT configuration)
  • Donato Capitella toolboxes (container images, benchmarks, VRAM estimator)
  • Qwen3.5 model family (GGUF quants by Unsloth)
  • Agentic eval frameworks (Inspect AI, EvalPlus, BFCL, BigCodeBench)