Of course, but every token generated by a 100B model is going to take minimally 100B FLOPS, and if this is being used as an IDE typing assistant then there are going to be a lot of tokens generated.
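As a rough back-of-envelope (a sketch, not measured numbers: the ~2 FLOPs per parameter per token rule of thumb and the usage figures are assumptions for illustration):

    # Sketch: inference FLOPs for an IDE completion assistant.
    # Assumes ~2 FLOPs per parameter per generated token (dense decoder-only
    # model, ignoring attention/KV-cache overhead); usage numbers are made up.
    params = 100e9            # 100B-parameter model
    flops_per_token = 2 * params

    tokens_per_completion = 30      # assumed: short inline suggestion
    completions_per_dev_day = 2000  # assumed: heavy IDE usage

    daily_flops = flops_per_token * tokens_per_completion * completions_per_dev_day
    print(f"{daily_flops:.2e} FLOPs per developer per day")  # ~1.2e16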
If there is a common shift to using additional runtime compute to improve quality of output, such as OpenAI's o1, then the FLOPs required go up massively (OpenAI has said it takes an exponential increase in FLOPs/cost to generate linear gains in quality).
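One way to formalize that claim (my reading of it, not OpenAI's exact statement): if quality grows roughly with the log of compute, then each fixed quality increment multiplies the compute bill by a constant factor:

    Q(C) \approx a + b \log C
    \quad\Longrightarrow\quad
    \frac{C(Q + \Delta Q)}{C(Q)} = e^{\Delta Q / b}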
So, while costs will of course decrease, those $20-30K NVIDIA chips are going to be kept burning power, and are not going to pay for themselves ...
This may end up like the shift to cloud computing: it sounds good in theory (save the cost of running your own data center), but corporate America balks when the bill comes in. It may well be that the endgame for corporate AI is to run free tools from the likes of Meta (or open source) in their own data center, or maybe even locally on "AI PCs".
Which is why the work to improve the results of small models is so important. Running a 3B or even 1B model as a typing assistant and reserving the 100B model for refactoring is a lot more viable.
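A minimal sketch of what that split could look like (the model names, task labels, and routing rule are placeholders I made up, not a specific product):

    # Sketch: route cheap, latency-sensitive requests to a small local model
    # and reserve the large hosted model for heavyweight tasks.
    SMALL_MODEL = "local-3b-code"    # assumed: runs on a workstation / "AI PC"
    LARGE_MODEL = "hosted-100b"      # assumed: expensive cloud model

    def pick_model(task: str) -> str:
        # Keystroke-level completions happen constantly, so they get the small
        # model; occasional refactors can afford the 100B model's cost/latency.
        if task in ("inline_completion", "next_token", "docstring"):
            return SMALL_MODEL
        return LARGE_MODEL

    print(pick_model("inline_completion"))  # local-3b-code
    print(pick_model("refactor_module"))    # hosted-100b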
> but every token generated by a 100B model is going to take minimally 100B FLOPS
Drop the S, I think. There’s no time dimension.
And FLOPs are a general-purpose measure, meaning you can do any operation with them. Hardware optimizations for ML can deliver the same 100B computations faster and cheaper by not being completely general-purpose. It's the same way ray-tracing acceleration works: it does not use the same amount of compute as ray tracing on a general-purpose CPU.
Sure, ANN computations are mostly multiplication (or multiply-and-add): multiply an ANN input by a weight (parameter) and accumulate, parallelized into matrix multiplication, which is the basic operation supported by accelerators like GPUs and TPUs.
Still, even with modern accelerators it's a lot of computation, and it's what drives the price per token of larger models vs smaller ones.
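To make the multiply-accumulate point concrete, here's a toy fully-connected layer as a matrix multiply, with the usual ~2·m·n FLOP count per input vector (NumPy just for illustration; an accelerator fuses the multiply and add into MAC/tensor-core units):

    import numpy as np

    # Toy fully-connected layer: each output is a dot product of the input
    # with one weight row, i.e. multiply-accumulate, batched as a matmul.
    n_in, n_out = 4096, 4096
    x = np.random.randn(n_in).astype(np.float32)        # layer input
    W = np.random.randn(n_out, n_in).astype(np.float32) # weights (parameters)

    y = W @ x  # one multiply and one add per weight => ~2 * n_out * n_in FLOPs

    flops = 2 * n_out * n_in
    print(f"~{flops:,} FLOPs for this single layer")  # ~33,554,432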