I think your timeline is off, at least for a tech demo.
This model already runs at 24fps, and I bet it could be made to run at >75fps by scaling hardware and distilling/quantizing the model so it only works in certain environments.
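Rough back-of-envelope on that claim (the speedup factors below are just assumptions, not measurements):

```python
# Illustrative arithmetic only: assumed speedup factors, not benchmarks.
base_fps = 24
quantization_gain = 1.7    # assumed gain from int8/fp8 inference
distillation_gain = 2.0    # assumed gain from a smaller, environment-specific student
print(base_fps * quantization_gain * distillation_gain)  # ~81 fps
```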
The two-eye problem seems pretty trivial to me: add another image decoding head whose sole task is decoding the other eye. Training data for this can be gathered in bulk from simulated 3D scenes, or by running existing 2D data (e.g. YouTube videos) through slow mono-to-stereo models. This should add minimal latency, since it's a parallel head rather than additional sequential layers.
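Roughly what I have in mind, as a hand-wavy PyTorch sketch (module names and shapes are made up; the real model's decoder obviously looks different):

```python
import torch
import torch.nn as nn

class StereoDecoder(nn.Module):
    """Hypothetical: two sibling decoder heads reading the same frame latent."""
    def __init__(self, latent_dim: int = 512, out_channels: int = 3):
        super().__init__()
        def make_head():
            return nn.Sequential(
                nn.ConvTranspose2d(latent_dim, 256, 4, stride=2, padding=1),
                nn.GELU(),
                nn.ConvTranspose2d(256, out_channels, 4, stride=2, padding=1),
            )
        self.left_head = make_head()   # existing decoder, unchanged
        self.right_head = make_head()  # new head, trained on simulated/converted stereo data

    def forward(self, latent: torch.Tensor):
        # Both heads consume the same latent, so they can run in parallel;
        # network depth (and thus per-frame latency) stays roughly the same.
        return self.left_head(latent), self.right_head(latent)
```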
If you can train the model to accept WASD + mouse input, head tracking is not very different: it's just another conditioning signal. I think with enough effort we could probably build a VR experience on top of this today. Getting it onto affordable hardware could be a totally different story, but it's certainly not decades away.
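Concretely, something like this (again, hypothetical names; the point is that a 6-DoF head pose is just a few more numbers in the action vector):

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Hypothetical action conditioning: keyboard + mouse + 6-DoF head pose."""
    def __init__(self, keyboard_dim=4, mouse_dim=2, head_pose_dim=6, embed_dim=256):
        super().__init__()
        # Head pose (x, y, z, yaw, pitch, roll) is just extra columns
        # concatenated onto the controls the model already conditions on.
        self.proj = nn.Linear(keyboard_dim + mouse_dim + head_pose_dim, embed_dim)

    def forward(self, wasd, mouse, head_pose):
        return self.proj(torch.cat([wasd, mouse, head_pose], dim=-1))
```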
Maybe I'm missing something though!