More

danielhanchen · 2026-02-04T01:00:26 1770166826

Yes sadly that sometimes happens - the issue is Codex CLI / Claude Code were designed for GPT / Claude models specifically, so it'll be hard for OSS models directly to utilize the full spec / tools etc, and might get loops sometimes - I would maybe try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (was planning to make a guide for it as well)

I guess until we see the day OSS models truly utilize Codex / CC very well, then local models will really take off

danielhanchen · 2026-02-03T16:22:47 1770135767

It works reasonably well for general tasks, so we're definitely getting there! Probably Qwen3 CLI might be better suited, but haven't tested it yet.

danielhanchen · 2026-02-03T16:06:12 1770134772

For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next

genpfault · 2026-02-03T18:17:38 1770142658

Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64

llama.cpp command-line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
    --ctx-size 32768

danielhanchen · 2026-02-04T00:58:25 1770166705

Super cool! Also with `--fit on` you don't need `--ctx-size 32768` technically anymore - llama-server will auto determine the max context size!

genpfault · 2026-02-04T03:18:26 1770175106

Nifty, thanks for the heads-up!

halcyonblue · 2026-02-03T18:42:51 1770144171

What am I missing here? I thought this model needs 46GB of unified memory for 4-bit quant. Radeon RX 7900 XTX has 24GB of memory right? Hoping to get some insight, thanks in advance!

coder543 · 2026-02-03T18:52:42 1770144762

MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

Not as good as running the entire thing on the GPU, of course.

lnenad · 2026-02-04T15:10:30 1770217830

Thanks to you I decided to give it a go as well (didn't think I'd be able to run it on 7900xtx) and I must say it's awesome for a local model. More than capable for more straightforward stuff. It uses full VRAM and about 60GBs of RAM, but runs at about 10tok/s and is *very* usable.

bityard · 2026-02-03T19:09:05 1770145745

Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do.

Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?

danielhanchen · 2026-02-04T00:57:43 1770166663

Thanks! Oh Qwen3's own GGUFs also works, but ours are dynamically quantized and calibrated with a reasonably large diverse dataset, whilst Qwen's ones are not - see https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

bityard · 2026-02-04T01:37:36 1770169056

I've read that page before and although it all certainly sounds very impressive, I'm not an AI researcher. What's the actual goal of dynamic quantization? Does it make the model more accurate? Faster? Smaller?

itake · 2026-02-04T03:31:55 1770175915

More accurate and smaller.

quantization = process to make the model smaller (lossy)

dynamic = being smarter about the information loss, so less information is lost

bityard · 2026-02-04T17:31:57 1770226317

Thanks, that makes sense.

ranger_danger · 2026-02-03T16:27:45 1770136065

What is the difference between the UD and non-UD files?

danielhanchen · 2026-02-03T16:29:55 1770136195

UD stands for "Unsloth-Dynamic" which upcasts important layers to higher bits. Non UD is just standard llama.cpp quants. Both still use our calibration dataset.

CamperBob2 · 2026-02-03T17:24:32 1770139472

Please consider authoring a single, straightforward introductory-level page somewhere that explains what all the filename components mean, and who should use which variants.

The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.

danielhanchen · 2026-02-03T17:40:51 1770140451

Oh good idea! In general UD-Q4_K_XL (Unsloth Dynamic 4bits Extra Large) is what I generally recommend for most hardware - MXFP4_MOE is also ok

Keats · 2026-02-03T19:20:08 1770146408

Is there some indication on how the different bit quantization affect performance? IE I have a 5090 + 96GB so I want to get the best possible model but I don't care about getting 2% better perf if I only get 5 tok/s.

mirekrusin · 2026-02-03T20:24:55 1770150295

It takes download time + 1 minute to test speed yourself, you can try different quants, it's hard to write down a table because it depends on your system ie. ram clock etc. if you go out of gpu.

I guess it would make sense to have something like max context size/quants that fit fully on common configs with gpus, dual gpus, unified ram on mac etc.

Keats · 2026-02-03T21:06:00 1770152760

Testing speed is easy yes, I'm mostly wondering about the quality difference between Q6 vs Q8_K_XL for example.

danielhanchen · 2026-02-04T00:57:00 1770166620

I haven't done benchmarking yet (plan to do them), but it should be similar to our post on DeepSeek-V3.1 Dynamic GGUFs: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

segmondy · 2026-02-03T17:57:00 1770141420

The green/yellow/red indicators are based on what you set for your hardware on huggingface.

ranger_danger · 2026-02-03T22:47:35 1770158855

What is your definition of "important" in this context?

danielhanchen · 2026-02-04T00:56:09 1770166569

Oh we wrote about it here: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

CamperBob2 · 2026-02-03T21:08:48 1770152928

Good results with your Q8_0 version on 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tps. Thanks!

Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?

danielhanchen · 2026-02-04T00:55:46 1770166546

Nice! Yes Q8_0 is similar - the others are different since they use a calibration dataset.

MrDrMcCoy · 2026-02-03T22:38:40 1770158320

Still hoping IQuest-Coder gets the same treatment :)

binsquare · 2026-02-03T16:38:17 1770136697

How did you do it so fast?

Great work as always btw!

danielhanchen · 2026-02-03T17:41:15 1770140475

Thanks! :) We're early access partners with them!

bytesandbits · 2026-02-04T05:00:01 1770181201

how are you so fast man

danielhanchen · 2026-01-22T23:33:43 1769124823

Excited to have collabed on this! Thanks electroglyph for the contrib!

danielhanchen · 2026-01-14T03:01:10 1768359670

Love vLLM!

danielhanchen · 2025-12-31T10:15:43 1767176143

Qwen's latest Qwen-Image-2512 is currently the strongest open source model.

To run them locally, we made some GGUFs: https://huggingface.co/unsloth/Qwen-Image-2512-GGUF

danielhanchen · 2025-11-10T05:11:16 1762751476

Love the blog :) If you or folks are looking for junior ML roles on training, RL & distributed training, doors always open!

danielhanchen · 2025-10-30T08:36:56 1761813416

Super agree! Love how uv installs packages in parallel! It made installs 30 seconds from 5 minutes during `uv pip install unsloth`!

danielhanchen · 2025-10-03T07:50:05 1759477805

I made some dynamic GGUFs for the 32B MoE model! Try:

./llama.cpp/llama-cli -hf unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL

Also a support agent finetuning notebook with granite 4: https://colab.research.google.com/github/unslothai/notebooks...

anshumankmr · 2025-10-03T08:34:30 1759480470

You guys are lightning fast. Did you folks have access to the model weights before hand or something, if you don't mind me asking?

danielhanchen · 2025-10-04T11:59:14 1759579154

Oh thanks! Yes sometimes we get early access to some models!

incomingpain · 2025-10-03T12:24:21 1759494261

As always, you're awesome. keep up the great work!

danielhanchen · 2025-10-04T11:59:20 1759579160

Thanks!

danielhanchen · 2025-10-02T22:08:01 1759442881

Made some dynamic GGUFs for those interested! https://huggingface.co/unsloth/granite-4.0-h-small-GGUF (32B Mamba Hybrid + MoE)

CMay · 2025-10-03T02:58:33 1759460313

Thanks! Any idea why I'm getting such poor performance on these new models? Whether Small or Tiny, on my 24GB 7900XTX I'm seeing like 8 tokens/s using the latest llama.cpp with vulkan. Even if it was running 4x faster than this I would be asking why I'm getting so few tokens/s when it sounds like the models are supposed to bring increased inference efficiency.

danielhanchen · 2025-10-04T11:59:54 1759579194

Oh I think its a Vulcan backend issue - someone raised it with me and said the rocm backend is much faster