feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick), Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00
parent 71053997be
commit 58124cd657
11 changed files with 1354 additions and 16 deletions
--- a/docs/references.md
+++ b/docs/references.md
@@ -43,6 +43,24 @@ The most comprehensive community resource for Strix Halo LLM optimization.
 - [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
 - [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking

+## Qwen3.5 Models (GGUF)
+
+- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF) — Top pick for 64GB Strix Halo (MoE, 3B active)
+- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF) — Dense 27B
+- [unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) — Best for agentic/coding
+- [Qwen3.5 Official](https://github.com/QwenLM/Qwen3.5) — Model family overview
+- [Unsloth Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) — Adaptive quantization methodology
+- [Unsloth Studio](https://unsloth.ai/docs/new/studio) — Training + inference UI (beta)
+
+## Agentic Evaluation
+
+- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) — All-in-one eval framework (HumanEval, BFCL, IFEval, GAIA)
+- [EvalPlus](https://github.com/evalplus/evalplus) — HumanEval+ / MBPP+ with native ollama support
+- [BigCodeBench](https://github.com/bigcode-project/bigcodebench) — 1,140 coding tasks across 139 libraries
+- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) — Berkeley Function Calling Leaderboard
+- [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
+- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models
+
 ## AMD GPU Profiling

 - [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling