Most models can run inference on merely borderline-consumer hardware.
Even for the fancy models, where you need to buy compute (rails) costing about as much as a new car, the power draw is ~700W[0] while running inference at 50 tokens/second.
But!
The constraint with current hardware isn't compute; the models are mostly constrained by RAM bandwidth. A back-of-the-envelope estimate says that if, e.g., Apple took the compute already in their iPhones and re-engineered the chips to have 256 GB of RAM and enough bandwidth not to be constrained by it, models of that size could run locally for a few minutes before hitting thermal limits (because it's a phone), but we're still only talking one-or-two-digit watts.
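A rough sketch of that back-of-the-envelope arithmetic (the model size, quantisation and bandwidth figures below are illustrative assumptions, not anything Apple has announced):

```python
# Back-of-the-envelope: a memory-bandwidth-bound decoder has to stream
# (roughly) every model weight through the chip once per generated token.
# All figures are illustrative assumptions.

model_params = 200e9        # a model big enough to want ~256 GB of RAM
bytes_per_param = 1.0       # assume 8-bit quantisation
model_bytes = model_params * bytes_per_param

bandwidth_bytes_per_s = 400e9   # hypothetical "enough bandwidth" figure

tokens_per_second = bandwidth_bytes_per_s / model_bytes
print(f"~{tokens_per_second:.1f} tokens/s")   # ~2.0 tokens/s at 400 GB/s
```

The point is just that tokens/second scales as bandwidth divided by model size; the compute barely enters into it.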
> if, e.g., Apple took the compute already in their iPhones and re-engineered the chips to have 256 GB of RAM and enough bandwidth not to be constrained by it, models of that size could run locally for a few minutes before hitting thermal limits (because it's a phone), but we're still only talking one-or-two-digit watts.
That hardware cost Apple tens of billions to develop, and what you're talking about in terms of "just the hardware needed" is so far beyond consumer hardware it's funny. Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), practically the same thing for Android phones.
I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.
Apple don't sell M4 chips separately, but the general best guess I've seen is that they're in the $120 range as a cost to Apple. Certainly it can't exceed the list price of the cheapest Mac mini with one (US$599).
As bleeding-edge tech, those are expensive transistors, but even so, 10 of those chips would have enough transistors for 256 GB of RAM plus all the compute each chip already has. Actual RAM is much cheaper than that.
10x the price of the cheapest Mac mini is $6k… but you could then save $400 by getting a Mac Studio with 256 GB of RAM instead. The maximum power consumption of that desktop (in the configuration with double the RAM, 512 GB) is 270 W, and that's an absolute upper bound: during inference you're probably only using a fraction of the compute, because inference is RAM limited, not compute limited.
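To put that 270 W ceiling next to the ~700W / 50 tokens-per-second rig from the top of the thread (the Mac Studio's actual throughput isn't measured here, so only the rig gets a real energy-per-token number):

```python
# Energy per generated token for the big rig quoted upthread, plus the
# Mac Studio's spec-sheet power ceiling. The Studio's tokens/s is unknown
# here, so its figure is only a bound on power draw, not on efficiency.
rig_watts, rig_tokens_per_s = 700, 50
print(f"rig: {rig_watts / rig_tokens_per_s:.0f} J/token")   # 14 J/token

studio_max_watts = 270
# Whatever throughput the Studio actually reaches, it cannot spend more
# than 270 joules per second across all the tokens it produces.
```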
But regardless, I'd like to emphasise that these chips aren't even trying to be good at LLMs. Not even Apple's Neural Engine is really trying to do that: NPUs (the Neural Engine included) are all focused on what AI looked like it was going to be several years back, not on what current models are actually like today. (And given how fast this moves, it's not even clear to me that they were wrong, or that they should be optimised for what current models look like today.)
> Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), practically the same thing for Android phones.
That sounds exceptionally low even for budget laptops. The only examples I can find are in the sub-€300 budget range or refurbished devices.
As for phones, there's currently very little market for this; the limit isn't that it's an inconceivable engineering challenge. Same deal as thermal imaging cameras in this regard.
> I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.
This has been standard database tooling for a while already: vector databases, RAG, etc.
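For the retrieval half at least, here's a minimal sketch of the sort of thing that tooling packages up (the embedding model and documents are placeholders; any small local embedding model would do):

```python
# Minimal local semantic search: embed documents once, then rank them
# against an embedded query by cosine similarity. Vector databases and
# RAG pipelines are essentially this plus indexing and storage.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, runs fine on CPU

docs = [
    "How to replace a MacBook battery",
    "Thermal limits of passively cooled phone SoCs",
    "A recipe for sourdough bread",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                  # cosine similarity (unit-length vectors)
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

print(search("why does my phone throttle when it gets hot"))
```

The embedding/search side is comparatively cheap; it's the generative model on top that runs into the RAM and bandwidth limits discussed above.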
Look at computer systems that cost $2000 or less: they're useless at, for example, running LLM coding assistants locally. A minimal subscription to a cloud service unfortunately beats them, and even more expensive systems that can run larger models run them too slowly to be productive. Yes, you can chat with them and perform tasks slowly on low-cost hardware, but that's all. Put local LLMs in your IDE and they either slow you down or just don't work.
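The same bandwidth arithmetic as above explains why (both numbers are ballpark assumptions about a machine in that price range, not benchmarks):

```python
# Rough decode-speed estimate for a mid-size local model on a typical
# sub-$2000 machine. Both figures are ballpark assumptions.
model_gigabytes = 20          # e.g. a ~30B-parameter model at 4-5 bit quantisation
laptop_bandwidth_gb_s = 100   # dual-channel DDR5-class memory bandwidth

print(f"~{laptop_bandwidth_gb_s / model_gigabytes:.0f} tokens/s")   # ~5 tokens/s
# Tolerable for chat, painful for an IDE assistant that streams long
# completions or runs several of them per edit.
```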