feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Benchmark setup: now shows the model catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use
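
As an illustration of the suite names listed above, here is a minimal Python
sketch of how a suite could expand into benchmark names. It is not taken from
scripts/agentic/run-eval.sh: the `SUITES` table and `resolve_suite` helper are
hypothetical, and the lowercase benchmark identifiers are placeholders rather
than real CLI arguments.

```python
# Hypothetical suite table reconstructed from the commit message only.
# The actual run-eval.sh may organise its suites differently.
SUITES = {
    "quick": ["humaneval", "ifeval"],
    "code": ["evalplus", "bigcodebench"],
    "tooluse": ["bfcl"],
}


def resolve_suite(name):
    """Return the benchmark names for a suite; 'full' expands to all suites."""
    if name == "full":
        merged = []
        for benchmarks in SUITES.values():
            for benchmark in benchmarks:
                if benchmark not in merged:  # keep order, skip duplicates
                    merged.append(benchmark)
        return merged
    if name not in SUITES:
        known = ", ".join(sorted(SUITES) + ["full"])
        raise ValueError(f"unknown suite {name!r}; expected one of: {known}")
    return list(SUITES[name])


if __name__ == "__main__":
    for suite in ("quick", "code", "tooluse", "full"):
        print(suite, "->", " ".join(resolve_suite(suite)))
```

Treating `full` as the union of the other suites means a newly added suite is
included in it automatically.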

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Felipe Cardoso
2026-03-26 00:20:23 +01:00
parent 71053997be
commit 58124cd657
11 changed files with 1354 additions and 16 deletions

@@ -43,6 +43,24 @@ The most comprehensive community resource for Strix Halo LLM optimization.
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking

## Qwen3.5 Models (GGUF)

- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF) — Top pick for 64GB Strix Halo (MoE, 3B active)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF) — Dense 27B
- [unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) — Best for agentic/coding
- [Qwen3.5 Official](https://github.com/QwenLM/Qwen3.5) — Model family overview
- [Unsloth Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) — Adaptive quantization methodology
- [Unsloth Studio](https://unsloth.ai/docs/new/studio) — Training + inference UI (beta)
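
A rough way to check whether one of the models listed above fits in a given memory budget is to estimate the size of its weights from the parameter count and the average bits per weight. The sketch below is illustrative only. The `weight_footprint_gb` helper and the 4.5 bits-per-weight figure are assumptions, not values taken from the model cards, and the estimate ignores the KV cache and runtime overhead. For a MoE model the total parameter count (35B), not the active count (3B), is what normally determines the weight footprint, since all experts usually stay resident in memory.

```python
def weight_footprint_gb(total_params_billion, bits_per_weight):
    """Approximate weight size in decimal GB: parameters * bits / 8 bits per byte."""
    return total_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumed average bits per weight for a 4-bit-class quantization (hypothetical).
    ASSUMED_BPW = 4.5
    for name, params_b in (("Qwen3.5-35B-A3B", 35), ("Qwen3.5-27B", 27)):
        size = weight_footprint_gb(params_b, ASSUMED_BPW)
        print(f"{name}: ~{size:.1f} GB of weights at {ASSUMED_BPW} bits/weight")
```

The real size of a GGUF file depends on the quantization mix, so the file size listed on the model page is the better number when it is available.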

## Agentic Evaluation

- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) — All-in-one eval framework (HumanEval, BFCL, IFEval, GAIA)
- [EvalPlus](https://github.com/evalplus/evalplus) — HumanEval+ / MBPP+ with native ollama support
- [BigCodeBench](https://github.com/bigcode-project/bigcodebench) — 1,140 coding tasks across 139 libraries
- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) — Berkeley Function Calling Leaderboard
- [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models

## AMD GPU Profiling

- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling