Holy moly.. even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, that does seem.. like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
This says more about benchmarks than R1, which I do believe is absolutely an impressive model.
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is a fairly prevalent view that Sonnet 3.5 is still the best coding model.
Sonnet's strength was always comprehending the problem and its context. It happened to also be pretty good at generating code, but what actually made it Anthropic's first really useful model was that it understood _what_ to code and how to communicate.
Exactly - it works better in the real world, where there's a lot less context than in a clinical benchmark, and you're just trying to get the answer without writing an essay.
I assume this is because reasoning is easy as long as it's just BAU prediction based on reasoning examples it was trained on. It's only when tackling a novel problem that the model needs to "reason for itself" (try to compose a coherent chain of reasoning). By generating synthetic data (R1 outputs) it's easy to expand the amount of reasoning data in the training set, making more "reasoning" problems just simple prediction that a simple model can support.
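For what it's worth, that recipe is conceptually just supervised fine-tuning on teacher outputs. A toy sketch of building such a dataset, where query_teacher is a hypothetical stand-in for however you'd call R1 (its completions carry the chain of thought in <think>...</think> tags before the answer):

```python
import json

def build_distillation_dataset(prompts, query_teacher, out_path="r1_distill.jsonl"):
    """Collect teacher (R1-style) completions and save them as an SFT dataset.

    query_teacher is a hypothetical callable: prompt -> completion string that
    contains the reasoning in <think>...</think> followed by the final answer.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = query_teacher(prompt)
            # The student is trained to predict the full reasoning trace,
            # which turns "reasoning" into ordinary next-token prediction.
            record = {"prompt": prompt, "completion": completion}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Stubbed teacher for illustration; replace with real R1 calls.
    fake_teacher = lambda p: "<think>Work through the steps...</think> The answer is 42."
    build_distillation_dataset(["What is 6 * 7?"], fake_teacher)
```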
I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
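For reference, the "normal pipeline" here is llama.cpp's HF-to-GGUF converter followed by llama-quantize. A rough sketch (paths are placeholders, and the exact script/binary names and flags can shift between llama.cpp versions):

```python
import subprocess

# Assumes a local llama.cpp checkout and build; paths are placeholders.
MODEL_DIR = "DeepSeek-R1-Distill-Llama-8B"   # HF snapshot with BF16 safetensors
BF16_GGUF = "deepseek-r1-distill-llama-8b-bf16.gguf"
Q4_GGUF = "deepseek-r1-distill-llama-8b-Q4_K_M.gguf"

# 1) Convert the safetensors checkpoint to GGUF, keeping bf16 weights.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True,
)

# 2) Quantize down to Q4_K_M to fit a base M4 Mac mini's RAM budget.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", BF16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)
```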
Edit: Running the DeepSeek-R1-Distill-Llama-8B-Q8_0 gives me about 3t/s and destroys my system performance on the base m4 mini. Trying the Q4_K_M model next.
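If anyone wants to reproduce the t/s numbers, here's a quick sketch using llama-cpp-python (the model path is a placeholder; full Metal offload assumed on Apple Silicon):

```python
import time
from llama_cpp import Llama

# Placeholder path to the quantized GGUF from the step above.
llm = Llama(
    model_path="deepseek-r1-distill-llama-8b-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
    verbose=False,
)

prompt = "Explain why the sky is blue, step by step."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} t/s")
```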
Not trivial as far as imatrix is concerned: we've found it substantially improves Q4 performance for long Ukrainian contexts. I imagine it's similarly effective in various other languages.
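For anyone unfamiliar: the importance matrix is computed over a calibration text and then fed into quantization, so the quantizer preserves the weights that matter most for that distribution. Roughly, using llama.cpp's tools (treat the exact flags as an assumption, they vary a bit by version; paths are placeholders):

```python
import subprocess

# calib.txt should contain representative text for your use case
# (e.g. long Ukrainian documents, per the comment above).
subprocess.run(
    ["llama.cpp/build/bin/llama-imatrix",
     "-m", "deepseek-r1-distill-llama-8b-bf16.gguf",
     "-f", "calib.txt",
     "-o", "imatrix.dat"],
    check=True,
)

# Quantize with the importance matrix instead of plain Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "--imatrix", "imatrix.dat",
     "deepseek-r1-distill-llama-8b-bf16.gguf",
     "deepseek-r1-distill-llama-8b-Q4_K_M-imatrix.gguf",
     "Q4_K_M"],
    check=True,
)
```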
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data and the same architecture and suddenly have a Sonnet-class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
It’s somewhere in between, really. This is a rapidly advancing space, so to some degree, it’s expected that every few months, new bars are being set.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
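The simplest version of that [1] line of work is self-consistency: sample the same question several times and majority-vote the final answers. A toy sketch, with `generate` as a stand-in for whatever small model you're running:

```python
import re
from collections import Counter
from typing import Callable

def self_consistency(prompt: str, generate: Callable[[str], str], n: int = 8) -> str:
    """Sample n completions and return the most common final answer."""
    answers = []
    for _ in range(n):
        completion = generate(prompt)  # assumed to end with "Answer: <x>"
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # Majority vote: more samples -> more compute -> typically higher accuracy.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```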
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
That isn't what it's doing and it's not what distillation is.
The smaller models are distillations; they use the same architecture they were using before.
The compute required for Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B is identical.
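Easy enough to check: the distill ships a standard Llama-3.1-8B config, so per-token FLOPs are the same and only the weights differ. A quick comparison sketch (assumes access to the gated meta-llama repo; field names per transformers' LlamaConfig):

```python
from transformers import AutoConfig

# The distill reuses the Llama-3.1-8B architecture; only the weights differ.
base = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")  # gated repo
distill = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

for field in ("hidden_size", "num_hidden_layers", "num_attention_heads",
              "num_key_value_heads", "intermediate_size", "vocab_size"):
    print(field, getattr(base, field), getattr(distill, field))
```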
In general I agree that this is a rapidly advancing space, but specifically:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet
My point is that the phrase 'according to these benchmarks' is key here, because it's enormously unlikely (and this is upheld by the reviews of people testing these distilled models) that:
> the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is stronger than Claude 3.5 Sonnet
So, if you have two things:
1) Benchmark scores
2) A model that clearly is not actually made that enormously better by the distillation process.
Clearly, clearly, one of those two things is wrong.
Either:
1) The benchmarks are meaningless.
2) People are somehow too stupid to be able to evaluate the 8B models and they really are as good as Claude Sonnet.
...
Which of those seems more likely?
Perhaps I'm biased, or wrong, because I don't care about the benchmark scores, but my experience playing with these distilled models is that they're good, but they're not as good as Sonnet; and that should come as absolutely no surprise to anyone.
Another possible conclusion is that your definition of good, whatever that may be, doesn’t include the benchmarks these models are targeting.
I don’t actually know what they all are, but MATH-500, for instance, is a math problem-solving benchmark that Sonnet is not all that good at.
The benchmarks are targeting specific weaknesses that LLMs generally have from only learning next token prediction and instruction tuning. In fact, benchmarks show there are large gaps in some areas, like math, where even top models don’t perform well.
‘According to these benchmarks’ is key, but not for the reasons you’re expressing.
Option 3
3) It’s key because that’s the hole they’re trying to fill. Realistically, most people in personal usage aren’t using models to solve algebra problems, so the performance on that benchmark isn’t as visible if you aren’t using an LLM for that.
If you look at a larger suite of benchmarks, then I would expect them to underperform compared to Sonnet. It’s no different than sports stats: you can say who is best at one specific part of the game (rebounds, 3-point shots, etc.), and you have a general sense of who is best overall (e.g. LeBron, Jordan), but the best players are not the best at everything, and it’s hard to argue who is the ‘best of the best’ because that depends on what weight you give to the different individual benchmarks they’re good at. And then you also have a lot of players who are good at doing just one thing.