During the OpenAI Gym era of RL, one of the great selling points was that RL was very approachable for a newcomer: the Gym environments were small and tractable enough that a hobbyist could learn a little bit of RL, try it out on CartPole, and see how it performed. Are there similarly tractable RL tasks/learning environments for LLMs? From the outside, my impression is that you need insane GPU access to even start messing around with these models. Is there something one can do on a normal MacBook Air, for instance, in this LLM x RL domain?
I'm absolutely not versed in RL, but I wanted to understand GRPO, the RL algorithm behind Deepseek's latest model.
I started from a very simple LLM, inspired by Andrej Karpathy's "GPT from scratch" video (https://www.youtube.com/watch?v=kCc8FmEb1nY). Then I added the GRPO algorithm on top of that, which in itself is very simple.
The GRPO project is neat. Would you be willing to do a Karpathy-style explainer, breaking down the algorithm from scratch? It's hard to understand on its own without prior background knowledge.
Find materials on PPO, which should be plentiful since it's the most popular RL algorithm. GRPO works on the same principles; it just makes certain estimates from samples rather than training an auxiliary neural network (the critic/value function) to make them.
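To make that concrete, here's a rough sketch (my own illustrative code, not from the DeepSeek paper) of the part that differs: instead of a critic network, GRPO samples a group of G completions per prompt, scores each with a reward function, and uses the reward's z-score within the group as the advantage. The surrogate objective is then the same PPO-style clipped ratio.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its group of G
    sampled completions -- this replaces PPO's learned value network."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one sample, reused unchanged by GRPO;
    only the advantage estimate above is different."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Toy example: 3 completions for one prompt, scored 1.0, 2.0, 3.0.
advs = group_advantages([1.0, 2.0, 3.0])
```

The advantages are mean-zero within the group, so above-average completions get pushed up and below-average ones pushed down, with no critic to train. (The real objective also averages over tokens and adds a KL penalty toward a reference model, which I've left out here.)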