Of course, but every token generated by a 100B model is going to take minimally 100B FLOPS, and if this is being used as an IDE typing assistant then there are going to be a lot of tokens generated.
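As a rough back-of-envelope (a sketch, not measured numbers: the ~2 FLOPs per parameter per token rule of thumb and the usage figures are assumptions for illustration):

    # Sketch: inference FLOPs for an IDE completion assistant.
    # Assumes ~2 FLOPs per parameter per generated token (dense decoder-only
    # model, ignoring attention/KV-cache overhead); usage numbers are made up.
    params = 100e9            # 100B-parameter model
    flops_per_token = 2 * params

    tokens_per_completion = 30      # assumed: short inline suggestion
    completions_per_dev_day = 2000  # assumed: heavy IDE usage

    daily_flops = flops_per_token * tokens_per_completion * completions_per_dev_day
    print(f"{daily_flops:.2e} FLOPs per developer per day")  # ~1.2e16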
If there is a common shift to using additional runtime compute to improve quality of output, such as OpenAI's o1, then the FLOPs required go up massively (OpenAI has said it takes an exponential increase in FLOPs/cost to generate linear gains in quality).
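One way to formalize that claim (my reading of it, not OpenAI's exact statement): if quality grows roughly with the log of compute, then each fixed quality increment multiplies the compute bill by a constant factor:

    Q(C) \approx a + b \log C
    \quad\Longrightarrow\quad
    \frac{C(Q + \Delta Q)}{C(Q)} = e^{\Delta Q / b}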
So, while costs will of course decrease, those $20-30K NVIDIA chips are going to be kept burning power, and are not going to pay for themselves ...
This may end up like the shift to cloud computing: it sounds good in theory (save the cost of running your own data center), but corporate America balks when the bill comes in. It may well be that the endgame for corporate AI is to run free tools from the likes of Meta (or open source) in their own data center, or maybe even locally on "AI PCs".
Which is why the work to improve the results of small models is so important. Running a 3B or even 1B model as a typing assistant and reserving the 100B model for refactoring is a lot more viable.
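A minimal sketch of what that split could look like (the model names, task labels, and routing rule are placeholders I made up, not a specific product):

    # Sketch: route cheap, latency-sensitive requests to a small local model
    # and reserve the large hosted model for heavyweight tasks.
    SMALL_MODEL = "local-3b-code"    # assumed: runs on a workstation / "AI PC"
    LARGE_MODEL = "hosted-100b"      # assumed: expensive cloud model

    def pick_model(task: str) -> str:
        # Keystroke-level completions happen constantly, so they get the small
        # model; occasional refactors can afford the 100B model's cost/latency.
        if task in ("inline_completion", "next_token", "docstring"):
            return SMALL_MODEL
        return LARGE_MODEL

    print(pick_model("inline_completion"))  # local-3b-code
    print(pick_model("refactor_module"))    # hosted-100b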
> but every token generated by a 100B model is going to take minimally 100B FLOPS
Drop the S, I think. There’s no time dimension.
And FLOPs are a general-purpose measure, meaning you can do any operation with them. Hardware optimizations for ML can deliver the same 100B computations faster and cheaper by not being completely general-purpose. It's the same way ray-tracing acceleration works: it does not use the same amount of compute as ray tracing on a general-purpose CPU.
Sure, ANN computations are mostly multiplication (or multiply-and-add): multiply an ANN input by a weight (parameter) and accumulate, parallelized into matrix multiplication, which is the basic operation supported by accelerators like GPUs and TPUs.
Still, even with modern accelerators it's a lot of computation, and it's what drives the price per token of larger models vs smaller ones.
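To make the multiply-accumulate point concrete, here's a toy fully-connected layer as a matrix multiply, with the usual ~2·m·n FLOP count per input vector (NumPy just for illustration; an accelerator fuses the multiply and add into MAC/tensor-core units):

    import numpy as np

    # Toy fully-connected layer: each output is a dot product of the input
    # with one weight row, i.e. multiply-accumulate, batched as a matmul.
    n_in, n_out = 4096, 4096
    x = np.random.randn(n_in).astype(np.float32)        # layer input
    W = np.random.randn(n_out, n_in).astype(np.float32) # weights (parameters)

    y = W @ x  # one multiply and one add per weight => ~2 * n_out * n_in FLOPs

    flops = 2 * n_out * n_in
    print(f"~{flops:,} FLOPs for this single layer")  # ~33,554,432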