I have a Dell 7490 (Intel 8350U CPU) that I paid $250 for, and I have no trouble running 13B models through a custom interactive interface I wrote as a hobby project in an afternoon. It can still get a lot better. I made it async the following day and it's even more fun.
Most people's problem is watching the AI type; it isn't instant. But not all (or even most) applications need to be instant. You can also avoid that by having it return everything at once instead of streaming.
Local absolutely can scale. All kinds of fun things can be done on a machine with 16GB of RAM, or 8GB if you work harder.
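The streaming vs. return-everything-at-once distinction is easy to sketch. This is a minimal illustration, not my actual interface: `generate_tokens` is a hypothetical stand-in for a local model backend (something like llama.cpp would yield real decoded tokens here), and the two consumers show the two display styles.

```python
import asyncio

async def generate_tokens(prompt: str):
    # Hypothetical stand-in for a local model's token stream; a real
    # backend would yield each token as it is decoded.
    for tok in ["Local ", "models ", "can ", "stream ", "tokens."]:
        await asyncio.sleep(0)  # yield control, as a real decode step would
        yield tok

async def stream_reply(prompt: str) -> str:
    # Streaming style: show each token the moment it arrives.
    parts = []
    async for tok in generate_tokens(prompt):
        print(tok, end="", flush=True)
        parts.append(tok)
    print()
    return "".join(parts)

async def full_reply(prompt: str) -> str:
    # Return-at-once style: collect silently, display in one shot.
    return "".join([tok async for tok in generate_tokens(prompt)])

if __name__ == "__main__":
    print(asyncio.run(full_reply("hi")))
```

Same total wait either way; the only difference is whether the user watches the tokens land or stares at nothing until the whole reply is ready.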
> Most people's problem is watching the AI type; it isn't instant. But not all (or even most) applications need to be instant. You can also avoid that by having it return everything at once instead of streaming.
Funny, for me it's the complete opposite. I created an interface in Matrix that does just that: returns everything at once. But the lag annoys me more than the slow typing in the regular chat interface. The slow typing helps keep me focused on the conversation; without it, my mind starts wandering while it waits.