Show HN: Beating Pokemon Red with RL and <10M Parameters

levocardia · 2025-03-05T19:26:29 1741202789

Really cool work. It seems like some critical areas (team rocket, safari zone) rely on encoding game knowledge into the reward function somehow, which "smuggles in" external intelligence about the game. A lot of these are related to planning, which makes me wonder whether you could "bolt on" an LLM to do things like steer the RL agent, dynamically choose what to reward, or even do some of the planning itself. Do you think there's any low-hanging fruit on this front?

Xelynega · 2025-03-05T20:23:28 1741206208

For well-known games like "Pokemon Red" I wonder how much of that game knowledge would be "smuggled in" by an LLM in its training data if you just replaced the external info in the reward function with it/used it to make up for other deficiencies.

I think they allude to this in their conclusion, but it's less about the low-hanging fruit and more about designing a system to feed game dialogue back into the RL decision-making process in a way that can be mutated as part of the RL (be it an LLM or something else).

drubs · 2025-03-05T19:31:16 1741203076

Wrote about this in the results section. I think there is a way to mix the two and simplify the rewards in the process. A lot of the magic behind getting the agent to teach and use cut probably could have been handled by an LLM.

rvz · 2025-03-05T19:20:05 1741202405

Note: What makes this interesting is that this is a pre-LLM project which shows that in some projects you don't need an "LLM" for this. All you need is just a plain old reinforcement learning algorithm and a deep neural network which is perfect for this.

This is what I want to see more of and goes against the hype of LLMs. What a great RL project.

Meanwhile, "Claude" is still stuck somewhere in the game. Imagine the costs of running that vs this project.

mclau156 · 2025-03-05T19:21:14 1741202474

Claude 3.7 recently failed to finish Pokemon after getting stuck in a corner and deciding it was impossible to get out

xinpw8 · 2025-03-05T19:38:53 1741203533

not our agents. a hierarchical approach would be superior. add rl to claude and it's gg

N_Lens · 2025-03-06T10:56:18 1741258578

Wow nice work. 10M is a tiny model and I suspect this might be the future for specialised work. I can also imagine the progress towards AGI/ASI to have smaller models used as submodules.

brains basically have “modules” like this as well - neuronal columns that handle specialised tasks. For example when you’re driving on the road, the understanding whether the distance between you and the vehicle in front is increasing or decreasing is a finely tuned function of a specialised part of the brain.

novia · 2025-03-05T23:01:58 1741215718

Please stream the gameplay to twitch so people can compare.

tehsauce · 2025-03-06T00:35:42 1741221342

We have a shared community map where you can watch hundreds of agents from multiple people's training runs playing in real time!

https://pwhiddy.github.io/pokerl-map-viz/

Matthyze · 2025-03-06T08:46:25 1741250785

That's amazing. Really awesome work.

novia · 2025-03-07T10:14:26 1741342466

Can you make a twitch stream of a single agent playing?

drubs · 2025-03-08T21:45:43 1741470343

Wouldn't make much sense. We generally train with 288 environments simultaneously. I've been thinking about ways to nicely stream all 288 environments though.

benopal64 · 2025-03-05T19:47:12 1741204032

Incredible work. I am just learning about PyBoy from your project, and it made me think of many fun ways to use that library to play Pokemon autonomously.

xinpw8 · 2025-03-06T04:56:33 1741236993

Very good to hear. Join the pyboy/pokemon discords! https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm

bubblyworld · 2025-03-05T18:57:34 1741201054

What an awesome project! I'm curious - I would have thought that rewarding unique coordinates would be enough to get the agent to (eventually) explore all areas, including the key ones. What did the agents end up doing before key areas got an extra reward?

(and how on earth did you port Pokémon red to a RL environment? O.o)

drubs · 2025-03-05T19:04:04 1741201444

The environments wouldn't concentrate enough in the Rocket Hideout beneath Celadon Game Corner. The agent would have the player wander the world reward hacking. With wild battles enabled, the environments would end up in Lavender Tower fighting Gastly.

> (and how on earth did you port Pokémon red to a RL environment? O.o)

Read and find out :)

bubblyworld · 2025-03-06T05:27:12 1741238832

Thanks haha, I kept reading =D I see, so it's not just that you have to visit the key areas, they need to show up in the episodes enough to provide a signal for training.
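To make the idea concrete, here is a minimal sketch of a unique-coordinate reward with an extra bonus for key areas. It is not the authors' reward function: the class name, the bonus values, and the map id `7` are all hypothetical, chosen only for illustration.

```python
from typing import FrozenSet, Set, Tuple

Coord = Tuple[int, int, int]  # (map_id, x, y); this layout is an assumption


class ExplorationReward:
    """Pays out the first time a coordinate is seen in an episode.

    Hypothetical sketch: names and bonus values are illustrative only.
    """

    def __init__(
        self,
        base_bonus: float = 1.0,
        key_map_bonus: float = 5.0,
        key_maps: FrozenSet[int] = frozenset(),
    ) -> None:
        self.base_bonus = base_bonus
        self.key_map_bonus = key_map_bonus
        self.key_maps = key_maps
        self.seen: Set[Coord] = set()

    def reset(self) -> None:
        # Forget visited coordinates at the start of each episode.
        self.seen.clear()

    def __call__(self, map_id: int, x: int, y: int) -> float:
        coord = (map_id, x, y)
        if coord in self.seen:
            return 0.0  # revisits earn nothing
        self.seen.add(coord)
        # New tiles on key maps are worth more than new tiles elsewhere.
        return self.key_map_bonus if map_id in self.key_maps else self.base_bonus


# Usage: map id 7 stands in for a key area such as a hideout floor.
reward = ExplorationReward(key_maps=frozenset({7}))
trace = [(0, 1, 1), (0, 1, 1), (0, 1, 2), (7, 0, 0)]
rewards = [reward(*step) for step in trace]
```

With a flat bonus, every new tile anywhere in the world is equally attractive, so wandering is as good as progressing. The larger bonus on key maps is one simple way to make those areas show up in episodes more often.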

drubs · 2025-03-06T05:46:26 1741239986

wegfawefgawefg · 2025-03-05T22:56:23 1741215383

you dont port it you wrap it. you can put anything in an rl environment. usually emulators are done with bizhawk, and some lua. worst case theres ffi or screen capture.
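A rough sketch of what "wrapping" means here, assuming a gym-style `reset`/`step` interface. `ToyEmulator` is a hypothetical stand-in, not the PyBoy or BizHawk API; a real wrapper would call into the emulator binding where the toy class moves a point on a grid.

```python
from typing import Dict, Set, Tuple

Position = Tuple[int, int]


class ToyEmulator:
    """Hypothetical stand-in for an emulator binding (not a real library API)."""

    BUTTONS = ("up", "down", "left", "right", "a", "b")
    SIZE = 10  # the toy world is a 10x10 grid

    def __init__(self) -> None:
        self.x, self.y, self.frame = 0, 0, 0

    def load_state(self) -> None:
        # A real emulator would restore a save state here.
        self.x, self.y, self.frame = 0, 0, 0

    def press(self, button: str) -> None:
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}.get(button, (0, 0))
        self.x = min(max(self.x + dx, 0), self.SIZE - 1)
        self.y = min(max(self.y + dy, 0), self.SIZE - 1)

    def tick(self, frames: int) -> None:
        self.frame += frames

    def read_position(self) -> Position:
        # A real wrapper would read this from emulator memory.
        return (self.x, self.y)


class EmulatorEnv:
    """Wraps an emulator object behind reset()/step()."""

    def __init__(self, emulator: ToyEmulator,
                 frames_per_action: int = 24, max_steps: int = 100) -> None:
        self.emulator = emulator
        self.frames_per_action = frames_per_action
        self.max_steps = max_steps
        self.steps = 0
        self.visited: Set[Position] = set()

    def reset(self) -> Position:
        self.emulator.load_state()
        self.steps = 0
        self.visited = {self.emulator.read_position()}
        return self.emulator.read_position()

    def step(self, action: int) -> Tuple[Position, float, bool, Dict]:
        self.emulator.press(ToyEmulator.BUTTONS[action])
        self.emulator.tick(self.frames_per_action)
        pos = self.emulator.read_position()
        reward = 0.0 if pos in self.visited else 1.0  # new-tile bonus
        self.visited.add(pos)
        self.steps += 1
        return pos, reward, self.steps >= self.max_steps, {}


env = EmulatorEnv(ToyEmulator(), max_steps=3)
first_obs = env.reset()
right_result = env.step(3)  # press "right"
left_result = env.step(2)   # press "left", back onto a visited tile
```

The game itself is never modified. The environment only sends button presses, advances frames, and reads state back out, which is why the same pattern works with Lua scripting, FFI, or screen capture.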

bubblyworld · 2025-03-06T05:28:19 1741238899

Right, my thought was that this would be way too slow for episode rollout (versus an accelerated implementation in jax or something), but I guess not!

wegfawefgawefg · 2025-03-07T10:37:16 1741343836

well thats the golden issue with rl, sample efficiency. it is env bounded, so you want an architecture that extracts the max possible information from each collected sample, avoiding catastrophic forgetting, prioritizing samples according to relevance

drubs · 2025-03-05T23:32:27 1741217547

My first version of this project 5 years ago involved a python-lua named pipe using Bizhawk actually. No clue where that code went

modeless · 2025-03-05T18:52:42 1741200762

Can't Pokemon be beaten by almost random play?

drdeca · 2025-03-05T22:38:03 1741214283

Judging by the “pi plays Pokemon Sapphire”, uh, not in a reasonable amount of time? It’s been at it for over 3 years, hasn’t gotten a gym badge yet, mostly stays in the starting town.

tehsauce · 2025-03-05T19:32:14 1741203134

It's impossible to beat with random actions or brute force, but you can get surprisingly far. It doesn't take too long to get halfway through route 1, but even with insane compute you'll never make it even to viridian forest.

VertanaNinjai · 2025-03-05T18:57:57 1741201077

It can be brute forced if that’s what you mean. It has a fairly low difficulty curve and these old games have a grid system for movement and action selections. That’s why they’re pointing out the lower parameter amount and CPU. The point I took away is doing more with less.

xinpw8 · 2025-03-05T19:14:48 1741202088

It definitely cannot be beaten using random inputs. It doesn't even get out of Pallet Town after billions of steps. We tested...

fancyswimtime · 2025-03-05T23:07:50 1741216070

the game has been beaten by fish

cjbillington · 2025-03-06T10:01:18 1741255278

Based on the other examples of random inputs not being sufficient, I dare say the fish-based attempt may have been fraudulent.

xinpw8 · 2025-03-06T04:58:27 1741237107

dyor we only tested it with a pufferfish, courtesy of puffer.ai / pufferlib RL library. i promise it doesn't work with random inputs.

gusgus01 · 2025-03-06T05:29:24 1741238964

I'm not sure if you're just making a play on words, but I believe the commenter was talking about the streamer who sets up their fishtank to map to inputs and then lets their fish "play games". They beat pokemon sapphire supposedly. https://www.polygon.com/2020/11/9/21556590/fish-pokemon-sapp...

bloomingkales · 2025-03-05T21:26:24 1741209984

The win condition of the game is the entire state of the game configured in a certain way. So there exist a lot of win conditions, you just have to do a search.

xinpw8 · 2025-03-07T19:19:39 1741375179

not sure what you mean..details?

kerkeslager · 2025-03-05T21:17:08 1741209428

Are there any uses for AI yet that aren't either:

1. Doing things humans do for fun.

2. Doing things that AI is horribly terrible at.

?

drubs · 2025-03-05T22:24:52 1741213492

There's a ton of applications for AI. Back when I was at Spotify, I co-authored Basic Pitch (https://basicpitch.spotify.com/), an audio-to-midi library. There are a ton of uses for AI outside of what's heavily publicized.

sadeshmukh · 2025-03-05T21:21:02 1741209662

Medical field, spotting things

Autonomous drones

Financial fraud detection

Scheduling of trains/buses/etc

I personally do like chatbots but you probably don't

xinpw8 · 2025-03-06T05:00:00 1741237200

the only chatbot for me is smarterchild

bigfishrunning · 2025-03-06T16:56:09 1741280169

I feel like that sentence aged me.

xinpw8 · 2025-03-07T19:20:47 1741375247

Thank you. Because I was just shaking my head "kids these days"

throwaway314155 · 2025-03-06T00:19:05 1741220345

Awesome! Why do you think the reward for reading signs helped? I'm assuming the model doesn't gain the ability to read and understand english just from RL, so what purpose does it serve other than to maybe waste ticks on signs that ultimately don't need to be read?

drubs · 2025-03-06T00:36:33 1741221393

It's silly, but signs were a way to incentivize the agent to explore deeper into the Safari Zone among other areas.

jononor · 2025-03-05T18:26:03 1741199163

Very nice! Nice to see demonstrations of reinforcement learning being used to solve non-trivial tasks.

differintegral · 2025-03-05T19:28:36 1741202916

This is very cool, congrats!

I wonder, does anyone have a sense of the approximate raw number of button presses required to beat the game? Mostly curious to see how that compares to the parameter count.

tarentel · 2025-03-05T20:31:44 1741206704

I imagine < 10000. https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen... and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe this is something like 200k and is a slightly different game. Quite a bit less than 10m either way.

worble · 2025-03-05T18:39:10 1741199950

Heads up, clicking "Next Page" just takes you to an empty screen, you have to use the navigation links on the left if you want to read past the first screen.

drubs · 2025-03-05T18:42:02 1741200122

Thanks for the heads up. I just pushed a fix.

worble · 2025-03-05T18:55:22 1741200922

I think you fixed the one below the puffer.ai image, but not the one above Authors.

drubs · 2025-03-05T19:01:08 1741201268

and...fixed!

xinpw8 · 2025-03-05T19:15:57 1741202157

i am sorry for my awful qa on the site :((((((((((((

bee_rider · 2025-03-05T18:52:36 1741200756

Ah, very neat.

Maybe some day the “rival” character in Pokemon can be played by a RL system, haha. That way you can have a “real player (simulated)” for your rival.

xinpw8 · 2025-03-05T19:17:54 1741202274

a cool idea, except that battling actually doesn't even matter to the ai. if you look at what the agent is doing during a battle, it is sort of spamming options + picking damaging attacks. it would be a stretch to say that agents were 'good' at battling...

wegfawefgawefg · 2025-03-05T22:57:38 1741215458

if youve done the work to make the rival rl based and have the ability to go around youd probably have added basic battle controls

xinpw8 · 2025-03-06T05:06:17 1741237577

as it stands, battling is wholly unimportant to completing the game, as long as the agents can eventually complete the trainer battles mandatory for plot advancement. it's funny because everyone thinks about battling when they think about pokemon. my first fn i wrote, back when we were still bumping around pallet town, was a battle reward function. it was trash and didn't work and was over-complicated. the crux of the problem is exploration over a vast, open-world map, and completion of the sundry storyline tasks at distal parts of said map in the correct sequence without the policy collapsing and without agents overfitting to, say, overworld loops.

wegfawefgawefg · 2025-03-06T12:18:54 1741263534

you missed my point.

I know all about rl. Ive read go-explore 1/2, and I have personally implemented intrinsic curiosity.

I was just commenting on what the other person said, which is that it would be cool to have the npcs be agents that battle and train too, to which you said they could not be made to, to which I say, we have the technology. :)

drubs · 2025-03-06T14:52:42 1741272762

Sounds cool to me.

KeplerBoy · 2025-03-06T21:48:35 1741297715

Really missing the arxiv link. The whole page reads like the arxiv link should be in the next paragraph, but it never appeared.

xinpw8 · 2025-03-05T18:28:06 1741199286

This is a first-in-world, isn't it?

nimish · 2025-03-05T22:25:37 1741213537

Considering how many things are less complicated than Pokemon, this is very cool

endofreach · 2025-03-05T23:04:07 1741215847

> Pokémon Red takes 25 hours on average for a new player to complete.

Seriously? I've never really played video games, but i remember spending so much time on pokemon red when i was young. Not sure if i ever really finished more than once. But i'm pretty sure i must have played for more than 50h or so before even getting close to finishing. My memory might trick me though.

Not sure which pokemon version it was, but i got so hooked trying to get this "secret" pokemon which was just a bunch of pixels. Some kind of bug (of the game, not the type of pokemon). You had to do specific things in a park and other things and then surf up and down x-times on the right shore of an island... or something like that. I had no idea how it worked and got so hooked, i must have spent most of my playing time on things like that.

Oh boy, memories...

ludicity · 2025-03-05T23:17:13 1741216633

It definitely took me way more than 25 hours as a kid to beat Pokemon Blue! But I was so young that I didn't understand that "Oak: Hello!" meant that someone called Oak was talking.

The glitched Pokemon you're talking about is Missingno by the way! I remember surfing up and down Cinnabar Island to do the same thing.

xinpw8 · 2025-03-06T05:01:58 1741237318

i had to look up how to do cut. like, i was hard-stuck.

endofreach · 2025-03-06T06:22:58 1741242178

Awesome! Missingno was what i meant. Thank you!

Uehreka · 2025-03-06T03:59:00 1741233540

There’s a guy on Youtube named JRose11 who is on a quest to beat Pokemon Red with all 151 of the original Pokemon individually. He’s about 100 Pokemon in at this point. He doesn’t use crazy speedrunning tactics (he wants to approximate a normal-ish playthrough) but because he knows exactly where to go, what to do and what’s skippable almost all of his runs are under 10 hours (many are under 6 and he did it with Mewtwo in just under 2).

oreally · 2025-03-06T05:11:32 1741237892

The estimates seem to be today's reported numbers based off howlongtobeat. Back in the day it was intended to last 60 hours iirc.

mclau156 · 2025-03-05T19:26:27 1741202787

Could you have used the decompilations of pokemon on github? https://github.com/pret/pokered

drubs · 2025-03-05T19:32:32 1741203152

There's an entire section on how the decompilations were used :)

mclau156 · 2025-03-05T20:02:20 1741204940

Ok sorry, I thought maybe there was a chance that the decomp project could be edited in a way that would create a ROM that made RL easier, but it seems like it just came in handy for looking up values, along with the GB ASM tutorial. The alternative in my thought process is re-creating pokemon red in a modern language, which you also mentioned.

xinpw8 · 2025-03-05T19:36:53 1741203413

if you helped with pret then god bless you