> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work
> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)
> - Action space is limited
> - It is far from being a real game engine and has a long way to go, but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...
I similarly am surprised at how fast they are progressing. I wrote this piece a few months ago about how I think steering world model output is the next realm of AAA gaming:
But even when I wrote that I thought things were still a few years out. I facetiously said that Rockstar would be nerd-sniped on GTA6 by a world model, which sounded crazy a few months ago. But seeing the progress already made since GameNGen and knowing GTA6 is still a year away... maybe it will actually happen.
> Rockstar would be nerd-sniped on GTA6 by a world model
I'm having trouble parsing your meaning here.
GTA isn't really a "drive on the street simulator", is it? There is deliberate creative and artistic vision that makes the series so enjoyable to play even decades after release, despite the graphics quality becoming more dated every year by AAA standards.
Are you saying someone would "vibe model" a GTAish clone with modern graphics that would overtake the actual GTA6 in popularity? That seems extremely unlikely to me.
I don't _really_ mean it obviously but I think a key component of what makes something like GTA compelling is that fully modeled world you move around in. These things take what amounts to hundreds if not thousands of man-years to create "traditionally", and the fact someone can now prompt to life a city or any other environment with similar (or better) fidelity is a massive change in how we think about creative content production.
GTA6 will not actually be nerd-sniped, but it's easy to see how a lot of what makes the game defensible is being rapidly commoditized.
GTA VI's story mode won't be surpassed by a world model, but the fucking around and blowing things up part conceivably could, and that's how people are spending their time in GTA. I don't see a world model providing the framing needed to contextualize the mayhem, thereby making it fun, anytime soon myself, but down the line? Maybe.
They will then learn the bitter lesson that convincing the GenAI to create something that brings your vision to life is impossible. It's a real talent to even be able to define for yourself what your vision is, and then to have artists achieve it visually in any medium is a process of back and forth between people with their own interpretations evolving the idea into something even better and cohesive.
GenAI will never get there because it can't, by design. It can riff on what was, and it can please the prompter, but it cannot challenge anyone creatively. No current LLMs can, either. I'll eat my hat if this is wrong in ten years, but it won't be.
It will generate refined slop ad nauseam, and that will train people's brains into spotting said slop faster using less energy. And then it'll be shunned.
bro, how you could get the very precise and predictable editing bro that you have in a regular game engine bro. also bro, empty pretty world with nothing to do bro is lame bro
Probably depends on how you engage with GTA. “Drive on the street simulator” along with arrays of weapons and explosions is the majority of my hours in GTA.
I despise the creative and artistic vision of GTA online, but I’m clearly in a minority there gauging by how much money they’ve made off it.
I assumed the opposite because I haven't heard about GTA's story in ages, but could be a sampling bias. It's hand-wavy, but last I recall most of the microtransactions didn't show up in single player (like if you bought a car, you couldn't use it in single player) so the people spending money on it are doing it for online, not the story.
I didn't think the story was earth-shattering; it was fine, but no Baldur's Gate.
Edit: In retrospect, the characters were fairly iconic. I still distinctly remember Trevor.
The future of games was MMORPGs and RPG-ization in general as other genres adopted progression systems. But the former two are simply too expensive and risky even today for AAA to develop. Which brings us to another point: the problem with Western AAA is more about high levels of risk aversion, which is what's really feeding the lack of imagination. And that's more to do with the economics of opportunity cost relative to the S&P 500.
Anyways, crafting pretty looking worlds is one thing, but you still need to fill them in with something worth doing, and that's something we haven't really figured out. That's one of the reasons why the sandbox MMORPG was developed as opposed to "themeparks". The underlying systems, the backend, is the real meat here. At most, world models right now replace 3D artists and animators, but I would not say that is the real bottleneck relative to one's own limitations.
> Which brings us to another point: the problem with Western AAA is more about high levels of risk aversion, which is what's really feeding the lack of imagination.
Maybe I’m misinterpreting what you’re saying here, but 2021 to the present has seen a glut of some of the best titles ever made, by pretty much any measure.
I'm trying to wrap my head around this since we're still seeing text spit out slowly (I mean slowly as in thousands of tokens a second).
I'm starting to think some of the names behind LLMs/GenAI are cover names for aliens and any actual humans involved have signed an NDA that comes with millions of dollars and a death warrant if disobeyed.
Even as a layman and AI skeptic, to me this entirely matches my expectations, and something like this seemed like it was basically inevitable as of the first demos of video rendering responding to user input (a year ago? maybe?).
Not to detract from what has been done here in any way, but it all seems entirely consistent with the types of progress we have seen.
It's also no surprise to me that it's from Google, who I suspect is better situated than any of its AI competitors, even if it is sometimes slow to show progress publicly.
> It's basically what every major AI lab head is saying from the start.
I suppose it depends what you count as "the start". The idea of AI as a real research project has been around since at least the 1950s. And I'm not a programmer or computer scientist, but I'm a philosophy nerd and I know debates about what computers can or can't do started around then. One side of the debate was that it awaited new conceptual and architectural breakthroughs.
I also think you can look at, say, Ted Talks on the topic, with guys like Jeff Hawkins presenting the problem as one of searching for conceptual breakthroughs, and I think similar ideas of such a search have been at the center of Douglas Hofstadter's career.
I think in all those cases, they would have treated "more is different" like an absence of nuance, because there was supposed to be a puzzle to solve (and in a sense there is, and there has been, in terms of vector space and back propagation and so on, but it wasn't necessarily clear that physics could "pop out" emergently from such a foundation).
When they say "the start", I think they mean the start of the current LLM era (circa 2017). The main story of this time has been a rejection of the idea that major conceptual breakthroughs and complex architectures are needed to achieve intelligence. Instead, it's better to focus on simple, general-purpose methods that can scale to massive amounts of data and compute (i.e. the Bitter Lesson [1]).
Oof ... to call other people's decades of research into directed machine learning "a colossal waste of researcher's time" is indeed a rather toxic point of view unsurprisingly causing a bitter reaction in scientists/researchers.
Even if his broader point might be valid (about the most fruitful directions in ML), calling something a "bitter lesson" while insulting a whole field of science is ... something.
Also as someone involved in early RL, he should know better.
It's akin to us sending a rocket to space and immediately discovering a wormhole. Sure, there's a lot of science about what's out there, but to discover all this in our first few trips to orbit ...
Joscha Bach postulates that what we call consciousness must be something rather simple, an emergent property present in all sufficiently complex biological organisms.
We don't inherit any software, so cognitive function must bootstrap itself from its underlying structure alone.
I wonder, though. Many animal species just "know" how to perform certain complex actions without being taught the way humans have to be taught. Building a nest, for example.
If you say that this is emergent from the "underlying structure alone", doesn't this mean that it would still be "inherited" software (though in this case, maybe we think of it like punch cards)?
I’ve seen different figures for information content of DNA but they’re all mostly misleading. What we actually inherit is much more. We are the result of an unpacking algorithm starting from a single cell over time, so our information content should at the very least include the entirety of the cell (which is probably impossible to calculate). Additionally, in a more general sense, arbitrarily complex behavior can be derived from very simple mathematics, e.g. cellular automata. With sufficient complex dynamics (which for us are given by the laws of physics), even very small information changes lead to vastly different “emergent behavior”, whatever that means. One could improperly say that part of the information is included in the laws of physics itself.
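For what it's worth, the cellular-automaton point is easy to see concretely. Here's a tiny Rule 110 run (a standard toy example, nothing to do with biology per se), where an 8-entry rule table "unpacks" into arbitrarily complex structure:

```python
# Rule 110: each cell's next state is a lookup on (left, self, right).
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(cells):
    n = len(cells)
    return [RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

cells = [0] * 63 + [1]  # a single live cell as the "seed"
for _ in range(30):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```

The whole "genome" here is eight bits plus the update rule, yet the unfolding pattern is rich enough to be Turing-complete.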
A biological example that I like: the neural structures for vision develop almost fully formed from the very beginning. The state of our network at initialization is effectively already functional. I’m not sure to which extent this is true for humans, but it is certainly true for simpler organisms like flies. The way cells achieve this is through some extremely simple growth rules as the structure is being formed for the first time. Different kinds of cells behave almost independently of each other, and it just so happens that the final structure is a perfectly functional eye. I’ve seen animations of this during a conference talk and it was one of the most fascinating things I’ve ever seen. It truly shows how the complexity of a biological organism is just billions of times any human technology. And at the same time, it’s a beautiful illustration of the lack of intelligent design. It’s like watching a Lego assemble by just shaking the pieces.
Problems like this will turn out to have simple solutions. Once we get past the idea of "inherited instinct" (obvious nonsense and easily proved to be so) the solution will be easier to see.
An example that might be useful: dragonflies lay their eggs in water. Since a dragonfly has like a 4-bit CPU you might be amazed at how it manages to get all the processing required to identify a body of water from a distance into its tiny mind, and also marvel at what sort of JPEG+++ encoding must be used to convey what water looks like from generation to generation.
But they don't do that at all: instead they have eyes that are sensitive to polarized light. The surface of water polarizes reflected light. So do things like polished gravestones. So dragonflies will lay their eggs on gravestones too.
One I like to ponder is: beavers building dams. Do they have an encoded algorithm that knows they need to dam the river to have a place to live, by gnawing on trees, carrying them to the right place on the river bed, etc.? Nope, certainly they don't have that. Perhaps they have teeth that grow so long that they hurt, motivating the animal to gnaw on something solid to wear them down. The only solid thing they have available is a tree.
A similar phenomenon was demonstrated with deep neural networks nearly a decade ago. You optimize the architecture using randomized weights instead of optimizing the weights. You can still optimize the weights in a separate additional step to improve performance.
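That sounds like the weight-agnostic neural networks line of work. A toy sketch of the idea (my own simplification, not the original method): score wiring patterns under random shared weights, keep the best topology, and only then optionally train the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: XOR. Score a *topology* (binary connection masks) by averaging
# accuracy over many random shared weights, so good scores reflect the wiring
# rather than any particular trained weight values.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(mask_in, mask_out, w):
    h = np.tanh((X * w) @ mask_in)                 # 2 inputs -> 4 hidden units
    return 1 / (1 + np.exp(-(h * w) @ mask_out))   # 4 hidden -> 1 output

def score(mask_in, mask_out, trials=20):
    accs = [((forward(mask_in, mask_out, rng.uniform(-2, 2)) > 0.5) == y).mean()
            for _ in range(trials)]
    return float(np.mean(accs))

# Search over wiring patterns with the weights left random...
best = max(((rng.integers(0, 2, (2, 4)), rng.integers(0, 2, 4))
            for _ in range(200)), key=lambda m: score(*m))
print("best topology, scored with random weights:", score(*best))
# ...and only as a separate, optional step would you train this topology's weights.
```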
I’ve always said that animals have short term and long term memory via the hippocampus, and then there’s supragenerational memory stored in DNA - behaviors that are learned over many generations and passed down via genetics.
The emergent property theory seems logical, but I'm also partial to the quantum-tunneling-miasma theory which basically posits that there could be something fairly complex going on, and we just lack the ability to observe/measure it in our current physics. (Although I have difficulty coherently separating this theory from faith-based beliefs)
> We don't inherit any software, so cognitive function must bootstrap itself from its underlying structure alone.
Hardware and software, as metaphors applied to biology, I think are better understood as a continuum than a binary, and if we don't inherit any software (is that true?), we at least inherit assembly code.
> we don't inherit any software (is that true?), we at least inherit assembly code
To stay with the metaphor, DNA could rather be understood as firmware that runs on the cell. What I mean by software is the 'mind' that runs on a collection of cells. Things like language, thoughts and ideas.
There is also a second level of software that runs not on a single mind alone, but on a collection of minds, to form cliques or societies. But this is not encoded in genes, but in memes.
I think we have some notion of a proto-grammar or ability to linguistically conceptualize, probably at the level of some primordial conceptual units that are more fundamental than language, thoughts and ideas in the concrete forms we generally understand them to have.
I think it's like Chomsky said, that we don't learn this infrastructure for understanding language any more than a bird "learns" their feathers. But I might be losing track of what you're suggesting is software in the metaphor. I think I'm broadly on board with your characterization of DNA, the mind and memes generally though.
Lemme start by saying this is objectively amazing. But I just really wouldn't call it a breakthrough.
We had one breakthrough a couple of years ago with GPT-3, where we found that neural networks / transformers + scale does wonders.
Everything else has been a smooth continuous improvement. Compare today's announcement to the Genie-2[1] release less than a year ago.
The speed is insane, but not surprising if you put it in the context of how fast AI is advancing. Again, nothing _new_. Just absurdly fast continuous progress.
Why wouldn't it? I have yet to hear one convincing argument for how our brain isn't working as a function of probable next best actions. When you look at how amoebas work, and at animals that are somewhere between them and us in intelligence, and then at us, it is a very similar kind of progression we see with current LLMs, from almost no state of the world to a pretty solid one.
It is truly remarkable. Even if you expected this to happen eventually, like I did, it seems like your timeline assumptions end up being 2-10x the time it actually takes to progress.
It makes me think that Stargate might actually lead to AGI/ASI
A neural net can produce information outside of its original data set, but it is all directly derived from that initial set. There are fundamental information constraints here. You cannot use a neural net to generate wholly new, original, full-quality training data for itself from its existing data set.
You can use a neural net to generate data, and you can train a net on that data, but you'll end up with something which is no good.
Humans are dependent on their input data (through lifetime learning and, perhaps, information encoded in the brain from evolution), and yet they can produce out of distribution information. How?
There is an uncountably large number of models that perfectly replicate the data they're trained on; some generalize out of distribution much better. Something like dreaming might be a form of regularization: experimenting with simpler structures that perform equally well on training data but generalize better (e.g. by discovering simple algorithms that reproduce the data equally well as pure memorization but require simpler neural circuits than the memorizing circuits).
Once you have those better generalizing circuits, you can generate data that not only matches the input data in quality but potentially exceeds it, if the priors built into the learning algorithm match the real world.
Humans produce out-of-distribution data all the time, yet if you had a teacher making up facts and teaching them to your kids, you would probably complain.
I might be misunderstanding your comment, so sorry if so. Robots have sensors and RL is a thing: they can collect real-world data, then process and consolidate real-world experiences during downtime (or in real time), run simulations to prepare for scenarios, and update models based on the day's collected data. The way I saw it that I thought was impressive was that the robot understood the scene but didn't know how the scene would respond to its actions, so it gens videos of the possible scenarios, then picks the best ones and models its actuation based on its "imagination".
This is definitely one of the potential issues that might happen to embodied agents/robots/bodies trained on the "world model". As we are training a model for the real world based on a model that simulates the real world, the glitches in the world simulator model will be incorporated into the training. There will be edge cases due to this layered "overtraining", where a robot/agent/body will expect Y to happen but X will happen, causing unpredictable behaviour. I assume that a generic world agent will be able to autocorrect, but this could also lead to dangerous issues.
E.g. if the simulation has enough videos of firefighters breaking glass, where it seems to shatter instantaneously, and in the world sim it always breaks, a firefighter robot might get into a problem when confronted with unbreakable glass, as it expects it to break as always, leading to a loop of trying to shatter the glass instead of performing another action.
The benefit of these AI-generated simulation models as a training mechanism is that they help add robustness without requiring a large training set. The recombinations can generate wider areas of the space to explore and learn with, while using a smaller basis space.
To pick an almost trivial example, let's say OCR digit recognition. You'll train on the original data-set, but also on information-preserving skews and other transforms of that data set to add robustness (stretched numbers, rotated numbers, etc.). The core operation here is taking a smallset in some space (original training data) and producing some bigset in that same space (generated training data).
For simple things like digit recognition, we can imagine a lot of transforms as simple algorithms, but one can consider more complex problems and realize that an ML model would be able to do a good job of learning how to generate bigset candidates from the smallset.
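A minimal sketch of the smallset-to-bigset step for digits, assuming torchvision; the specific transform parameters are just illustrative:

```python
from torchvision import datasets, transforms

# Label-preserving transforms: each epoch the model sees freshly skewed,
# shifted, and stretched digits generated from the original "smallset".
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10,            # slight rotations
                            translate=(0.1, 0.1),  # small shifts
                            scale=(0.9, 1.1),      # mild stretching
                            shear=5),
    transforms.ToTensor(),
])

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=augment)
```

A learned generative model is just the fancier version of this: it proposes bigset candidates that simple hand-written transforms can't express.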
We are miles away from the fundamental constraint. We know that our current training methodologies are scandalously data inefficient compared to human/animal brains. Augmenting observations with dreams has long been theorized to be (part of) the answer.
> current training methodologies are scandalously data inefficient compared to human/animal brains
Are you sure? I've been ingesting boatloads of high definition multi-sensory real-time data for quite a few decades now, and I hardly remember any of it. Perhaps the average quality/diversity of LLM training data has been higher, but they sure remember a hell of a lot more of it than I ever could.
It is possible - for example, getting a blob of physics data, fitting a curve then projecting the curve to theorise what would happen in new unseen situations. The information constraints don't limit the ability to generate new data in a specific domain from a small sample; indeed it might be possible to fully comprehend the domain if there is an underlying process it can infer. It is impossible to come up with wildly unrelated domains though.
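A toy version of that, assuming nothing fancier than numpy: fit a curve to a few noisy free-fall measurements and project it to an unseen time.

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])                      # seconds
d = 0.5 * 9.81 * t**2 + np.random.normal(0, 0.05, t.shape)   # noisy distances

coeffs = np.polyfit(t, d, deg=2)          # infer the underlying quadratic law
print("predicted distance at t=3s:", np.polyval(coeffs, 3.0))
```

The projection is only as good as the inferred underlying process, which is exactly the point: new data within a domain is cheap, wildly unrelated domains are not.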
Approximately speaking, you have a world model and an agent model. You continue to train the world model using data collected by the robot day-to-day. The robot "dreams" by running the agent model against the world model instead of moving around in the real world. Dreaming for thousands of (simulated) hours is much more efficient than actually running the physical hardware for thousands of wall clock hours.
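A minimal sketch of that loop (my own toy, not how Genie or any real robot stack does it): learn a dynamics model from a little real experience, then plan entirely inside the learned model. Here "dreaming" is simplified to random-shooting planning rather than training an agent.

```python
import numpy as np

rng = np.random.default_rng(0)

def real_step(x, a):                    # the true dynamics, unknown to the agent
    return x + 0.8 * a + rng.normal(0, 0.01)

# 1. Collect a small amount of expensive real experience.
data, x = [], 0.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    x_next = real_step(x, a)
    data.append((x, a, x_next))
    x = x_next

# 2. Fit a simple dynamics model x' ~ w0*x + w1*a  (the "world model").
X = np.array([[s, a] for s, a, _ in data])
Y = np.array([s_next for _, _, s_next in data])
w, *_ = np.linalg.lstsq(X, Y, rcond=None)

# 3. "Dream": evaluate hundreds of candidate plans inside the learned model,
#    with no real robot time spent at all.
def imagined_return(actions, x0=0.0, target=1.0):
    x = x0
    for a in actions:
        x = w[0] * x + w[1] * a
    return -abs(x - target)

candidates = rng.uniform(-1, 1, size=(500, 5))
best = max(candidates, key=imagined_return)
print("planned action sequence:", np.round(best, 2))
```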
Give it tool access, let it formulate its own experiments, etc.
The only question here is whether it becomes a / the singularity because of this, gets stuck in some local minimum, or achieves random perfection and random local minimum locations.
Humans can learn from visualising situations and thinking through different scenarios. I don't see why AI / robots can't do similar. In fact I think quite a lot of training for things like Tesla self driving is done in simulation.
"Consciousness" is an overloaded thought killer that swerves all conversation into obfuscated semantic arguments. One person will be talking about 'internality' and self-image (in the testable, mechanical sense that you could argue Chain of Thought models already have in a petty way) and the other will be grappling with the concept of qualia and the ineffable nature of human experience.
That's not even a devil's advocate, many other animals clearly have consciousness, at least if we're not solipsistic. There have been many very dangerous precedents in medicine where people have been declared "brain dead" only to awake and remember.
Since consciousness is closely linked to being a moral patient, it is all the more important to err on the side of caution when denying qualia to other beings.
AI has traditionally been driven by "metaphor-driven development" where people assume the brain has system X, program something they give the same name, and then assume because they've given it that name it must work because it works in the brain.
This is generally a bad idea, but a few of the results like "neural networks" did work out… eventually.
"World model" is another example of a metaphor like this. They've assumed that humans have world models (most likely not true), and that if they program something and call it a "world model" it will work the same way (definitely not true) and will be beneficial (possibly true).
(The above critique comes from Phil Agre and David Chapman.)
I'm invested in a startup that is doing something unrelated to robotics, but they're spending a lot of time in Shenzhen. I keep a very close eye on robotics and was talking to their CTO about what he is seeing in China: versions of this are already being implemented.
And these are consumer options, affordable to you and me, not only to some military. If those are the commonly available options... there may be way more advanced stuff that we haven't seen.
This stuff is old tech, and has nothing to do with transformers. The Boston Dynamics style robot dogs are always shown in marketing demos like the one you linked, in what are secretly very controlled environments. Let me know when I can order one that will bring the laundry downstairs for my wife.
I asked for real examples from someone who claimed to have first hand experience, not more marketing bullshit
You don’t ask people to speak how you want, you simply only invite people who already have a history of speaking how you want. This phenomenon is explained in detail in Noam Chomsky’s work around mass media (e.g. the NY Times doesn’t tell its editors what to do exactly, but only hires editors who already want to say what the NY Times wants, or who have a certain world view). The same can be applied to social media reviews. Invite the person who gives glowing reviews all the time.
Do you know where Noam makes that argument? I've been trying to figure out where I picked it up years ago. I'd like to revisit it to deepen my understanding. It's a pretty universal insight.
"I don't say you're self-censoring - I'm sure you believe everything you're saying; but what I'm saying is, if you believed something different, you wouldn't be sitting where you're sitting." -- Noam Chomksy to Andrew Marr
It's a shame the interviewer didn't quite grasp that point and dig a little deeper into it. Listening to it again I'm reminded of "The master's tools will never dismantle the master's house".
Though this is often associated with his and Herman's "Propaganda Model," Chomsky has also commented that the same appears in scholarly literature, despite the overt propaganda forces of ownership and advertisement being absent:
> What I don't think this technology will do is replace game engines. I just don't see how you could get the very precise and predictable editing you have in a regular game engine from anything like the current model. The real advantage of game engines is how they allow teams of game developers to work together, making small and localized changes to a game project.
I've been thinking about this a while and it's obvious to me:
Put Minecraft (or something similar) under the hood. You just need data structures to encode the world and to enable mutation, location, and persistence.
If the model is given additional parameters such as a "world mesh", then it can easily persist where things are, what color or texture they should be, etc.
That data structure or server can be running independently on CPU-bound processes. Genie or whatever "world model" you have is just your renderer.
It probably won't happen like this due to monopolistic forces, but a nice future might be a future where you could hot swap renderers between providers yet still be playing the same game as your friends - just with different looks and feels. Experiencing the world differently all at the same time. (It'll probably be winner take all, sadly, or several independent vertical silos.)
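Something like this split, roughly (all names hypothetical; just a sketch of the idea, not anyone's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    kind: str                              # "block", "npc", "door", ...
    position: tuple                        # world coordinates
    texture_hint: str = ""                 # prompt-like hint for a neural renderer

@dataclass
class WorldState:
    entities: list = field(default_factory=list)

    def tick(self, dt):
        pass                               # deterministic game logic, physics, persistence

class Renderer:                            # swappable: classic rasterizer or world model
    def render(self, world, camera_pose):
        raise NotImplementedError

class ClassicRasterizer(Renderer):
    def render(self, world, camera_pose):
        return f"raster frame of {len(world.entities)} entities"

class WorldModelRenderer(Renderer):        # e.g. a Genie-like model behind an API
    def render(self, world, camera_pose):
        return f"generated frame conditioned on {len(world.entities)} entities"

def game_loop_step(world, renderer, camera_pose):
    world.tick(1 / 60)                     # authoritative state lives on the CPU
    return renderer.render(world, camera_pose)
```

Hot-swapping renderers is then just picking a different Renderer implementation against the same WorldState, which is exactly the "same game, different look and feel" scenario.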
If I were Tim Sweeney at Epic Games, I'd immediately drop all work on Unreal Engine and start looking into this tech. Because this is going to shore them up on both the gaming and film fronts.
As a renderer, given a POV, lighting conditions, and a world mesh, it might be a very, very good system. Sort of a tight MCP connection to the world-state.
I think in this context, it could be amazing for game creation.
I’d imagine you would provide item descriptions to vibe-code objects and behavior scripts, set up some initial world state (maps) populated with objects made of objects - hierarchically vibe-modeled - make a few renderings to give inspirational world-feel and textures, and vibe-tune the world until you had the look and feel you want. Then once the textures and models and world were finalised, they would be used as the rendering context.
I think this is a place where there are enough feedback loops and enough supervision that, with decent tools along these lines, you could 100x the efficiency of game development.
It would blow up the game industry, but also spawn a million independent one or two person studios producing some really imaginative niche experiences that could be much, much more expansive (like a AAA title) than the typical indie-studio product.
> you could 100x the efficiency of game development.
> It would blow up the game industry, but also spawn a million independent one or two person studios producing some really imaginative niche experiences that could be much, much more expansive (like a AAA title) than the typical indie-studio product.
All video games become Minecraft / Roblox / VRChat. You don't need AAA studios. People can make and share their own games with friends.
Scary realization: YouTube becomes YouGame and Google wins the Internet forever.
I haven’t checked on Roblox recently, but afaik it doesn’t really allow complete creative freedom or the ability to have a picture and say “make the world look like this, and make the character textures match the vibe” and have it happen. Don’t they still have a unified world experience or can you really customize things that deeply now?
Can you make a basically indistinguishable copy of other games in Roblox? If so, that’s pretty cool, even without AI integration.
Roblox can't beat Google in AI. Roblox has network effects with users, but on an old school tech platform where users can't magic things into existence.
I've seen Roblox's creative tools, even their GenAI tools, but they're bolted on. It's the steam powered horse problem.
Don't put the world state into the model. Use the model as a renderer of whatever objects the "engine" throws at it.
Use the CPU and RAM for world state, then pass it off to the model to render.
Regardless of how this is done, Unreal Engine with all of its bells and whistles is toast. That C++ pile of engineering won't outdo something this flexible.
How many watts and how much capital does it take to run this model? How many watts and how much capital does it take to run unity or unreal? I suspect there's a huge discrepancy here, among other things.
I think this puts Epic Games, Nintendo, and the whole lot into a very tough spot if this tech takes off.
I don't see how Unreal Engine, with its voluminous and labyrinthine tomes of impenetrable legacy C++ code, survives this. Unreal Engine is a mess, gamers are unhappy about it, and it's a PITA to develop with. I certainly hate working with it.
The Innovator's Dilemma is fast approaching the entire gaming industry, and they don't even see it coming, it's happening so fast.
Exciting that building games could become as easy as having the idea itself. I'm imagining something like VRChat or Roblox or Fortnite, but where new things are simply spoken into existence.
It's absolutely terrifying that Google has this much power.
I played around with Diamond WM on my 3090 machine. I also ran fast SDXL-turbo and LCM models with ControlNets paired with a 3D game prototype I threw together. The results were very compelling, and I was just one person hacking things together.
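For anyone curious, the frame re-skinning part of that kind of hack is only a few lines with the stock diffusers SDXL-Turbo image-to-image pipeline (the ControlNet conditioning is left out here, and the game frame is a placeholder):

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Placeholder for a frame rendered by the 3D prototype.
frame = Image.new("RGB", (512, 512), "gray")

styled = pipe(
    "ruined cyberpunk alley, volumetric fog, film grain",
    image=frame,
    num_inference_steps=2,   # turbo/LCM-style models only need a few steps
    strength=0.5,            # keep the underlying geometry recognizable
    guidance_scale=0.0,      # SDXL-Turbo is trained to run without CFG
).images[0]
```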
This is 100% going to happen on-device. It's just a matter of time.
It is plausible to run a full simulation the old-fashioned way and render it in real time with a diffusion model.
It is not currently, or in the near term, realistic to make a video game where a meaningful portion of the simulation is part of the model.
There will probably be a few interactive model-first experiences. But they'll be popular as short novelties, not as meaningful or long experiences.
A simple question to consider is how would you adjust a set of simple tunables in a model-first simulator? For example giving the player more health, making enemies deal 2x damage, increasing move speed, etc etc. You can not.
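To make the contrast concrete, this is what those tunables look like in a conventional engine: plain data you can edit precisely and predictably (a toy illustration, obviously).

```python
from dataclasses import dataclass

@dataclass
class Tunables:
    player_max_health: int = 100
    enemy_damage_multiplier: float = 1.0
    move_speed: float = 5.0

# One-line, perfectly predictable edits:
config = Tunables(player_max_health=200,        # more health
                  enemy_damage_multiplier=2.0,  # enemies deal 2x damage
                  move_speed=7.5)               # faster movement
```

In a model-first simulator the equivalent knob is smeared across billions of weights, with no guaranteed handle to turn.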
Reality is not composed of words, syntax, and semantics. A human model is.
Other human models are sensory only, no language.
So vision learning and energy models that capture the energy needed to achieve a visual, audio, or physical robotics behavior are the only real goal.
Software is for those who read the manual with their new NES game. Where are the words inside us?
Statistical physics of energy to make a machine draw the glyphs of language, not opinionated clustering of language, is what will close the keyboard and mouse input loop. We're replicating human work habits. Those are real physical behaviors, not just descriptions in words.
> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264