> We also change the architecture of
the model to reduce the KV-cache memory that tends to explode with long context.
This is key (pun not intended). It's one thing to run these models locally; it's a totally different game when you need longer context.
Sure, the new M3 Ultra can fit a Q4 DeepSeek R1 in URAM, but as soon as you want usable context like 64k+, the tokens/s (t/s) and prompt processing (PP) speeds quickly become prohibitive.
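For a sense of scale, here's a rough back-of-envelope sketch of why the cache "explodes": with vanilla multi-head attention, KV-cache size grows linearly in context length, layer count, and head count. The dimensions below are illustrative assumptions for a large dense-attention model, not DeepSeek R1's actual layout (its MLA design exists precisely to compress this cache):

```python
# Back-of-envelope KV-cache size for a transformer with standard
# multi-head attention. All dimensions are assumptions for illustration,
# not DeepSeek R1's actual architecture (which uses MLA to shrink the cache).

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    # 2x for keys and values, cached per layer, per head, per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)

# Hypothetical large-model dimensions, fp16 cache, 64k context:
gib = kv_cache_bytes(num_layers=61, num_kv_heads=128, head_dim=128,
                     context_len=64_000) / 2**30
print(f"~{gib:.0f} GiB of KV cache at 64k context")  # ~238 GiB
```

That's on top of the weights themselves, which is why long context eats whatever URAM headroom the quantized model leaves you.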
Speaking of the M3 Ultra, I really wish Apple had given this beast of a machine more memory bandwidth. It's got a lot of "energy" but not a lot of "power" to actually use that energy.