If one wants to play with LLMs locally, it is surprisingly hard to find out what one's existing hardware can actually do, partly because most existing documentation either assumes large amounts of cloud compute or is written by startups hoping to sell their own services.
So, given someone has a decent gaming PC with a CUDA-compatible GPU (only Nvidia, I guess?), what can they do with it when it comes to LLMs? What parameter-size models can be loaded at various VRAM sizes, for inference, fine-tuning and training respectively?
Let's say the VRAM sizes are 8 GB, 12 GB, 16 GB and 24 GB, which seem to be the most common in the 40x0 series of GPUs. If system RAM matters, what can be done with 16 GB, 32 GB, 64 GB and beyond?
Full precision for most models is 16 bits, i.e. two bytes per parameter. This is a rule of thumb and there's other overhead, but generally you can load a 7B model in ~14GB of VRAM or system RAM at full precision. Usually, though, precision is reduced after training to improve speed and memory usage: loading a model at 8-bit precision means you can fit a 13B model in ~13GB of (V)RAM. You can go even lower, with 4 bits being common and 3 or 2 bits available for the most popular large models.
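If you want to sanity-check a specific card, a quick back-of-the-envelope calculation is enough. A minimal sketch, where the 20% overhead factor and the example model/bit-width combinations are my own rough assumptions (real usage also depends on context length and the KV cache):

```python
# Rough (V)RAM needed for inference only. The 20% overhead factor and the
# example sizes below are assumptions, not exact figures.

def inference_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """bits / 8 = bytes per parameter; scale by parameter count and overhead."""
    return params_billion * 1e9 * (bits / 8) * overhead / (1024 ** 3)

for params in (7, 13, 33, 70):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: ~{inference_gb(params, bits):.1f} GB")
```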
GPT4All has the least setup friction but also a pretty limited interface. Last I checked, though, it won't run on the GPU if you're on Windows. Building and installing llama.cpp from source is quite painless and gives you more options. text-generation-webui has a lot more options still, in exchange for a few more installation steps. Those are the top three I'd recommend for ease of use.
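If you'd rather script against llama.cpp than use its CLI, the llama-cpp-python bindings are a thin wrapper around it. A minimal sketch, assuming you've already downloaded a quantized GGUF file (the filename and layer count below are placeholders, not a recommendation):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (with CUDA support enabled)

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path to a quantized model
    n_gpu_layers=35,  # how many layers to offload to the GPU; 0 = CPU only
    n_ctx=2048,       # context window size
)

out = llm("Q: How much VRAM does an 8-bit 13B model need? A:", max_tokens=48)
print(out["choices"][0]["text"])
```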
Edit: all the numbers above are for inference. For training numbers, https://huggingface.co/docs/transformers/perf_train_gpu_one is pretty approachable. tl;dr: most commonly 8 bytes per parameter, but optimizations are possible with tradeoffs in complexity and accuracy.
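If I'm reading that guide right, the 8 bytes per parameter is the AdamW optimizer state, with weights and gradients coming on top of it, so a very rough full fine-tuning estimate (ignoring activations, batch size and sequence length) is something like:

```python
# Back-of-the-envelope memory for a full fine-tune with AdamW. The per-part
# byte counts are assumptions based on the linked Hugging Face guide;
# activations are deliberately left out, so treat the result as a lower bound.

def full_finetune_gb(params_billion: float,
                     optimizer_bytes: int = 8,  # AdamW: two fp32 moments per parameter
                     weight_bytes: int = 4,     # fp32 master weights
                     grad_bytes: int = 4) -> float:
    p = params_billion * 1e9
    return p * (optimizer_bytes + weight_bytes + grad_bytes) / (1024 ** 3)

for params in (1, 3, 7):
    print(f"{params}B full fine-tune: ~{full_finetune_gb(params):.0f} GB before activations")
```

That's why even a 7B model is out of reach for full fine-tuning on a single consumer GPU without the optimizations (LoRA, quantized training, gradient checkpointing, etc.) that the guide goes into.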