
The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.

Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.
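
To make "clean feedback loop" concrete, here's a toy sketch (purely illustrative, not DeepSeek's actual setup): the "policy" is just a probability table over candidate answers, and the reward is a mechanical check against the known result.

    import random

    # Toy illustration of RL with an unambiguous reward (not DeepSeek's method).
    # The "policy" is a probability table over candidate answers to "7 * 8 = ?".
    candidates = ["54", "56", "63", "72"]
    policy = {c: 0.25 for c in candidates}      # start out uniformly unsure

    def reward(answer):
        return 1.0 if answer == "56" else 0.0   # clean, rule-based feedback

    for step in range(2000):
        answer = random.choices(candidates, weights=[policy[c] for c in candidates])[0]
        if reward(answer) > 0:
            # Nudge probability mass toward the rewarded answer, then renormalize.
            for c in candidates:
                target = 1.0 if c == answer else 0.0
                policy[c] += 0.01 * (target - policy[c])
            total = sum(policy.values())
            policy = {c: policy[c] / total for c in candidates}

    print(policy)   # most of the mass ends up on "56"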

What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.

The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.

This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.

The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.




Interestingly, Karpathy made this point last summer: RLHF is barely RL. He said it would be very difficult to apply pure reinforcement learning to open domains. RLHF is a shortcut to fill that gap, but because the reward model is trained on human vibe checks, the LLM can easily game the RM by giving out misleading responses or otherwise gaming the system.

Importantly, the barrier is that open domains are too complex and too ill-defined to have a clear reward function. But if someone cracks that, i.e. creates a way for AI to self-optimize in these messy, subjective spaces, it would completely revolutionize LLMs through pure RL.

Here's the link to the tweet: https://x.com/karpathy/status/1821277264996352246


The whole point of RLHF is to make up for the fact that there is no loss function for a good answer in terms of token ids or their order. A good answer can come in many different forms and shapes.

That’s why all those models fine tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.

In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.

When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.

We should be doing the same when taming LLMs from their pretrained role as document completers into assistants.
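
For what it's worth, here's a toy sketch of what that shaping idea could look like as a reward schedule; the partial-credit rule and thresholds are made up for illustration, not taken from any actual training recipe.

    # Skinner-style shaping as a reward schedule (toy example, numbers invented).
    # Early in training, partial progress earns something; later, only the full
    # behavior ("the lever peck") is reinforced.
    def shaped_reward(response, reference, step, total_steps):
        progress = step / total_steps                   # 0.0 at start, 1.0 at end
        if response.strip() == reference:               # exactly right: always rewarded
            return 1.0
        if progress < 0.5 and reference in response:    # early on, "turning toward the lever" counts
            return 0.5
        return 0.0

    print(shaped_reward("I think maybe 56?", "56", step=100, total_steps=1000))  # 0.5
    print(shaped_reward("I think maybe 56?", "56", step=900, total_steps=1000))  # 0.0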


Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must understand whether its answers and reasoning are correct.


In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases."

Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
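
A minimal sketch of what that kind of rule-based check might look like for the math case (assuming the model is prompted to put its final answer in \boxed{...}; real verifiers also normalize equivalent forms like fractions and whitespace, which this toy skips):

    import re

    def extract_boxed(text):
        """Return the contents of the last \\boxed{...} in a response, or None."""
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)   # nested braces not handled
        return matches[-1].strip() if matches else None

    def math_reward(response, reference):
        """1.0 if the boxed answer matches the reference exactly, else 0.0."""
        answer = extract_boxed(response)
        return 1.0 if answer is not None and answer == reference.strip() else 0.0

    print(math_reward(r"... so the total is \boxed{42}", "42"))   # 1.0
    print(math_reward(r"... so the total is \boxed{41}", "42"))   # 0.0

The code case is analogous: run the generated program against predefined test cases and reward it only if they pass.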


You're totally right that there must be supervision; it's just a matter of how the term is used.

"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.

Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
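
A toy way to see the difference in signal (illustrative only, not real training code):

    prompt = "What is 7 * 8?"               # same prompt in both regimes

    # Supervised fine-tuning: the full expert response is given, so there is a
    # target at every token position -- a dense signal.
    reference = ["The", "answer", "is", "56", "."]
    sft_targets = list(enumerate(reference))       # one (position, target token) pair per step

    # RL: the model free-runs its own response, and the only feedback is a single
    # scalar at the end saying whether the whole attempt earned a reward.
    rollout = ["I", "think", "it", "is", "54", "."]
    reward = 1.0 if "56" in rollout else 0.0       # 0.0 here -- and nothing says which token was wrong

    print(sft_targets)
    print(reward)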


They use other models to judge correctness, and when possible they just have the model output something that can be directly verified, like math equations that can be checked 1:1 against the correct answer.


> the real value is showing you can bootstrap complex reasoning through pure reinforcement.

This made me smile, as I thought (non-snarkily) that's what living beings do.


This! And truthfully, are there really that many corporate domains without "clear success metrics"?


You also need to be able to test your solution and see how successful it is.

In some domains that is harder than in math and code.


True. I think simulations will help a lot in that direction. Imagine being able to do RL a bit like DeepSeek did for R1, but on corporate tasks. https://open.substack.com/pub/transitions/p/deepseek-is-comi...


emphasis on corporate


The MIT licence is for code only



