If one wants to play with LLMs locally, it is surprisingly hard to find out what one's existing hardware can actually do, partly because most existing documentation either assumes large amounts of cloud compute or is written by startups hoping to sell their own services.
So, given someone has a decent gaming PC with a CUDA-compatible GPU (only Nvidia, I guess?), what can they do with it when it comes to LLMs? What parameter-size models can be loaded at various VRAM sizes, for inference, fine-tuning and training respectively?
Let's say the VRAM sizes are 8 GB, 12 GB, 16 GB and 24 GB, which seem to be the most common in the 40x0 series of GPUs. If system RAM matters, what can be done with 16 GB, 32 GB, 64 GB and beyond?
Full precision for most models is 16 bits, i.e. two bytes per parameter. This is a rule of thumb and there's other overhead, but generally you can load a 7B model in ~14GB of VRAM or system RAM at full precision. Usually, though, precision is reduced after training to improve speed and memory usage: loading a model at 8-bit precision means you can fit a 13B model in ~13GB of (V)RAM. You can go even lower, with 4 bits being common and 3 or 2 bits available for the most popular large models.
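If you want to sanity-check a specific card, a quick back-of-the-envelope calculation is enough. A minimal sketch, where the 20% overhead factor and the example model/bit-width combinations are my own rough assumptions (real usage also depends on context length and the KV cache):

```python
# Rough (V)RAM needed for inference only. The 20% overhead factor and the
# example sizes below are assumptions, not exact figures.

def inference_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """bits / 8 = bytes per parameter; scale by parameter count and overhead."""
    return params_billion * 1e9 * (bits / 8) * overhead / (1024 ** 3)

for params in (7, 13, 33, 70):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: ~{inference_gb(params, bits):.1f} GB")
```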
GPT4All has the least setup friction but also a pretty limited interface. Last I checked, though, it won't run on the GPU if you're on Windows. Building and installing llama.cpp from source is quite painless and gives you more options. text-generation-webui has a lot more options still, in exchange for a few more installation steps. Those are the top three I'd recommend for ease of use.
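If you'd rather script against llama.cpp than use its CLI, the llama-cpp-python bindings are a thin wrapper around it. A minimal sketch, assuming you've already downloaded a quantized GGUF file (the filename and layer count below are placeholders, not a recommendation):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (with CUDA support enabled)

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path to a quantized model
    n_gpu_layers=35,  # how many layers to offload to the GPU; 0 = CPU only
    n_ctx=2048,       # context window size
)

out = llm("Q: How much VRAM does an 8-bit 13B model need? A:", max_tokens=48)
print(out["choices"][0]["text"])
```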
Edit: all the numbers above are for inference. For training numbers, https://huggingface.co/docs/transformers/perf_train_gpu_one is pretty approachable. tl;dr: most commonly 8 bytes per parameter, but optimizations are possible with tradeoffs in complexity and accuracy.
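If I'm reading that guide right, the 8 bytes per parameter is the AdamW optimizer state, with weights and gradients coming on top of it, so a very rough full fine-tuning estimate (ignoring activations, batch size and sequence length) is something like:

```python
# Back-of-the-envelope memory for a full fine-tune with AdamW. The per-part
# byte counts are assumptions based on the linked Hugging Face guide;
# activations are deliberately left out, so treat the result as a lower bound.

def full_finetune_gb(params_billion: float,
                     optimizer_bytes: int = 8,  # AdamW: two fp32 moments per parameter
                     weight_bytes: int = 4,     # fp32 master weights
                     grad_bytes: int = 4) -> float:
    p = params_billion * 1e9
    return p * (optimizer_bytes + weight_bytes + grad_bytes) / (1024 ** 3)

for params in (1, 3, 7):
    print(f"{params}B full fine-tune: ~{full_finetune_gb(params):.0f} GB before activations")
```

That's why even a 7B model is out of reach for full fine-tuning on a single consumer GPU without the optimizations (LoRA, quantized training, gradient checkpointing, etc.) that the guide goes into.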