Holy moly.. even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, that does seem.. like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
This says more about benchmarks than R1, which I do believe is absolutely an impressive model.
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is a fairly prevalent view that Sonnet 3.5 is still the best coding model.
Sonnet's strength was always comprehending the problem and its context. It happened to also be pretty good at generating code, but what actually made it Anthropic's first really useful model was that it understood _what_ to code and how to communicate.
Exactly - it works better in the real world, where there's a lot less context than in a clinical benchmark, and you're just trying to get the answer without writing an essay.
I assume this is because reasoning is easy as long as it's just BAU prediction based on reasoning examples it was trained on. It's only when tackling a novel problem that the model needs to "reason for itself" (try to compose a coherent chain of reasoning). By generating synthetic data (R1 outputs) it's easy to expand the amount of reasoning data in the training set, making more "reasoning" problems just simple prediction that a simple model can support.
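For what it's worth, that recipe is conceptually just supervised fine-tuning on teacher outputs. A toy sketch of building such a dataset, where query_teacher is a hypothetical stand-in for however you'd call R1 (its completions carry the chain of thought in <think>...</think> tags before the answer):

```python
import json

def build_distillation_dataset(prompts, query_teacher, out_path="r1_distill.jsonl"):
    """Collect teacher (R1-style) completions and save them as an SFT dataset.

    query_teacher is a hypothetical callable: prompt -> completion string that
    contains the reasoning in <think>...</think> followed by the final answer.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = query_teacher(prompt)
            # The student is trained to predict the full reasoning trace,
            # which turns "reasoning" into ordinary next-token prediction.
            record = {"prompt": prompt, "completion": completion}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Stubbed teacher for illustration; replace with real R1 calls.
    fake_teacher = lambda p: "<think>Work through the steps...</think> The answer is 42."
    build_distillation_dataset(["What is 6 * 7?"], fake_teacher)
```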
I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
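For reference, the "normal pipeline" here is llama.cpp's HF-to-GGUF converter followed by llama-quantize. A rough sketch (paths are placeholders, and the exact script/binary names and flags can shift between llama.cpp versions):

```python
import subprocess

# Assumes a local llama.cpp checkout and build; paths are placeholders.
MODEL_DIR = "DeepSeek-R1-Distill-Llama-8B"   # HF snapshot with BF16 safetensors
BF16_GGUF = "deepseek-r1-distill-llama-8b-bf16.gguf"
Q4_GGUF = "deepseek-r1-distill-llama-8b-Q4_K_M.gguf"

# 1) Convert the safetensors checkpoint to GGUF, keeping bf16 weights.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True,
)

# 2) Quantize down to Q4_K_M to fit a base M4 Mac mini's RAM budget.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", BF16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)
```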
Edit: Running the DeepSeek-R1-Distill-Llama-8B-Q8_0 gives me about 3t/s and destroys my system performance on the base m4 mini. Trying the Q4_K_M model next.
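If anyone wants to reproduce the t/s numbers, here's a quick sketch using llama-cpp-python (the model path is a placeholder; full Metal offload assumed on Apple Silicon):

```python
import time
from llama_cpp import Llama

# Placeholder path to the quantized GGUF from the step above.
llm = Llama(
    model_path="deepseek-r1-distill-llama-8b-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
    verbose=False,
)

prompt = "Explain why the sky is blue, step by step."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```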
Not trivial as far as imatrix is concerned: we've found it substantially improves Q4 performance for long Ukrainian contexts. I imagine it's similarly effective in various other languages.
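For anyone unfamiliar: the importance matrix is computed over a calibration text and then fed into quantization, so the quantizer preserves the weights that matter most for that distribution. Roughly, using llama.cpp's tools (treat the exact flags as an assumption, they vary a bit by version; paths are placeholders):

```python
import subprocess

# calib.txt should contain representative text for your use case
# (e.g. long Ukrainian documents, per the comment above).
subprocess.run(
    ["llama.cpp/build/bin/llama-imatrix",
     "-m", "deepseek-r1-distill-llama-8b-bf16.gguf",
     "-f", "calib.txt",
     "-o", "imatrix.dat"],
    check=True,
)

# Quantize with the importance matrix instead of plain Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "--imatrix", "imatrix.dat",
     "deepseek-r1-distill-llama-8b-bf16.gguf",
     "deepseek-r1-distill-llama-8b-Q4_K_M-imatrix.gguf",
     "Q4_K_M"],
    check=True,
)
```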
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data and the same architecture and suddenly have a Sonnet-class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
It’s somewhere in between, really. This is a rapidly advancing space, so to some degree, it’s expected that every few months, new bars are being set.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
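The simplest version of that [1] line of work is self-consistency: sample the same question several times and majority-vote the final answers. A toy sketch, with `generate` as a stand-in for whatever small model you're running:

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(prompt: str, generate: Callable[[str], str], n: int = 8) -> str:
    """Sample n completions and return the most common final answer."""
    answers = []
    for _ in range(n):
        completion = generate(prompt)  # assumed to end with "Answer: <x>"
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # Majority vote: more samples -> more compute -> typically higher accuracy.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```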
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
That isn't what it's doing and it's not what distillation is.
The smaller models are distillations; they use the same architecture they were using before.
The compute required for Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B is identical.
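Easy enough to check: the distill ships a standard Llama-3.1-8B config, so per-token FLOPs are the same and only the weights differ. A quick comparison sketch (assumes access to the gated meta-llama repo; field names per transformers' LlamaConfig):

```python
from transformers import AutoConfig

# The distill reuses the Llama-3.1-8B architecture; only the weights differ.
base = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")  # gated repo
distill = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

for field in ("hidden_size", "num_hidden_layers", "num_attention_heads",
              "num_key_value_heads", "intermediate_size", "vocab_size"):
    print(field, getattr(base, field), getattr(distill, field))
```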
In general I agree that this is a rapidly advancing space, but specifically:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet
My point is that the phrase 'according to these benchmarks' is key here, because it's enormously unlikely (and this is upheld by the reviews of people testing these distilled models) that:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is stronger than Claude 3.5 Sonnet
So, if you have two things:
1) Benchmark scores
2) A model that clearly is not actually made that enormously better by the distillation process.
Clearly, clearly, one of those two things is wrong.
Either:
1) The benchmarks are meaningless.
2) People are somehow too stupid to be able to evaluate the 8B models and they really are as good as Claude Sonnet.
...
Which of those seems more likely?
Perhaps I'm biased, or wrong, because I don't care about the benchmark scores, but my experience playing with these distilled models is that they're good, but they're not as good as Sonnet; and that should come as absolutely no surprise to anyone.
Another possible conclusion is that your definition of good, whatever that may be, doesn’t include the benchmarks these models are targeting.
I don’t actually know what they all are, but MATH-500, for instance, is a math problem-solving benchmark that Sonnet is not all that good at.
The benchmarks are targeting specific weaknesses that LLMs generally have from only learning next token prediction and instruction tuning. In fact, benchmarks show there are large gaps in some areas, like math, where even top models don’t perform well.
‘According to these benchmarks’ is key, but not for the reasons you’re expressing.
Option 3
3) It’s key because that’s the hole they’re trying to fill. Realistically, most people in personal usage aren’t using models to solve algebra problems, so the performance on that benchmark isn’t as visible if you aren’t using an LLM for that.
If you look at a larger suite of benchmarks, then I would expect them to underperform compared to Sonnet. It’s no different than sports stats: you can say who is best at one specific part of the game (rebounds, 3-point shots, etc.), and you have a general sense of who is best overall (e.g. LeBron, Jordan), but the best players are not the best at everything, and it’s hard to argue who is the ‘best of the best’ because that depends on what weight you give to the different individual benchmarks they’re good at. And then you also have a lot of players who are good at doing just one thing.