
Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must understand whether its answers and reasoning are correct.



In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases."

Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
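
For intuition, a rule-based verifier for the math case can be as simple as extracting the boxed answer and comparing strings. Just a minimal sketch of the idea, not the paper's actual checker; the function names and exact-match comparison are my own assumptions:

    import re

    def extract_boxed(response: str) -> str | None:
        # Hypothetical helper: pull the final answer out of \boxed{...}.
        match = re.search(r"\\boxed\{([^}]*)\}", response)
        return match.group(1).strip() if match else None

    def math_reward(response: str, ground_truth: str) -> float:
        # Binary rule-based reward: 1.0 iff the boxed answer matches the
        # known result exactly. (Real verifiers would normalize answers,
        # e.g. "0.5" vs "1/2"; exact match is a simplifying assumption.)
        answer = extract_boxed(response)
        return 1.0 if answer == ground_truth else 0.0

    print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0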


You're totally right that there must be supervision; it's just a matter of how the term is used.

"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.

Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response and reasoning, and only gets feedback at the end on whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long time before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model into the right neighborhood.
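
To make the contrast concrete, here's a toy sketch (the names and lambdas are illustrative, not any particular training stack): in RL the only learning signal is a single scalar after the model has freely generated the whole response, whereas supervised learning would provide a target at every token of an expert answer.

    from typing import Callable

    def rl_episode(
        generate: Callable[[str], str],  # policy: prompt -> full response
        verify: Callable[[str], bool],   # external source of truth
        prompt: str,
    ) -> tuple[str, float]:
        # The model constructs its own response/reasoning freely...
        response = generate(prompt)
        # ...and the only feedback is one scalar at the very end.
        reward = 1.0 if verify(response) else 0.0
        return response, reward

    # Toy usage: until the policy stumbles on a correct answer at least
    # once, every episode returns reward 0.0 and there is nothing to
    # learn from -- hence the supervised warm-up phase.
    _, reward = rl_episode(
        generate=lambda p: r"2 + 2 = \boxed{4}",
        verify=lambda r: r.endswith(r"\boxed{4}"),
        prompt="What is 2 + 2?",
    )
    print(reward)  # 1.0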


They use other models to judge correctness, and when possible they just ask the model to output something that can be directly verified, like math equations that can be checked 1:1 against the known correct answer.
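
And for the code case mentioned upthread, the directly verifiable signal comes from actually running the program. A hedged sketch of that test-case feedback loop (a Python subprocess in place of a real compiler sandbox; the test-case format is an assumption):

    import os
    import subprocess
    import tempfile

    def code_reward(solution: str, tests: list[tuple[str, str]]) -> float:
        # Score a generated program by the fraction of predefined test
        # cases it passes; each test is (stdin input, expected stdout).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution)
            path = f.name
        passed = 0
        try:
            for stdin_text, expected in tests:
                try:
                    result = subprocess.run(
                        ["python3", path], input=stdin_text,
                        capture_output=True, text=True, timeout=5,
                    )
                except subprocess.TimeoutExpired:
                    continue  # non-terminating solution fails the test
                if result.stdout.strip() == expected.strip():
                    passed += 1
        finally:
            os.unlink(path)
        return passed / len(tests)

    # E.g. a generated "add two numbers" solution against two tests:
    solution = "a, b = map(int, input().split())\nprint(a + b)"
    print(code_reward(solution, [("1 2", "3"), ("10 5", "15")]))  # 1.0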



