Ask HN: What are the capabilities of consumer grade hardware to work with LLMs?
12 points by HexDecOctBin on Aug 4, 2023
If one wants to play with LLMs locally, it is very difficult to find out what one's existing hardware can do – partly because most existing documentation either uses the maximal amount of cloud compute, or is written by startups hoping to sell their own services.

So, given someone has a decent gaming PC with a CUDA-compatible GPU (only Nvidia, I guess?), what can they do with it when it comes to LLMs? What parameter size models can be loaded for various VRAM sizes – for inference, fine tuning and training respectively?

Let's say the VRAM sizes are 8 GB, 12 GB, 16 GB and 24 GB, which seem to be the most common in the 40x0 series of GPUs. If system RAM matters, what can be done with 16 GB, 32 GB, 64 GB and beyond?




GPU will be faster if you can fit the data in VRAM. If not, using CPU and system RAM works fine but is slower. It's even possible to load layers from disk, but this is very very slow.
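If you want that "spill over automatically" behaviour, a minimal sketch with Hugging Face transformers + accelerate (the model id and prompt here are just placeholders, not something from this thread):

    # Sketch only: let accelerate place layers on GPU first, then CPU RAM, then disk.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM repo id you can access works

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",         # fill VRAM first, then system RAM, then spill to disk
        offload_folder="offload",  # scratch dir for layers that don't fit in RAM
        torch_dtype="auto",        # keep the checkpoint's native (usually fp16) precision
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))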

Full precision for most models is 16 bits. That means two bytes per parameter. This is a rule of thumb and there's other overhead, but generally, you can load a 7B model in ~14GB of VRAM or system RAM at full precision. But usually, to improve speed and memory usage, precision is reduced after training. Loading a model at 8-bit precision means you can fit a 13B model in ~13GB of (V)RAM. You can go even lower, with 4 bits being common and 3 or 2 bits available for the most popular large models.
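To make that arithmetic concrete, a quick back-of-the-envelope helper (real usage is a bit higher once you add the KV cache and runtime overhead):

    # params * bits / 8 = bytes; divide by 1e9 for GB as used above.
    def approx_gb(params_billion, bits_per_param):
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for params in (7, 13, 33, 70):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit ~ {approx_gb(params, bits):.1f} GB")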

GPT4all has the least setup friction but also a pretty limited interface. Last I checked though, if you're on Windows, it will not run on GPU. Building and installing llama.cpp from source is quite painless and has more options. Installing text-generation-webui has a lot more options in exchange for a few more steps in the install. Those are the top 3 I would recommend for ease of use.
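If you'd rather drive llama.cpp from Python, the llama-cpp-python bindings are another low-friction route; a sketch, with the model path and layer count as placeholders for whatever you downloaded:

    # Assumes llama-cpp-python was built with GPU support; n_gpu_layers=0 runs pure CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # any quantized GGUF/GGML file
        n_gpu_layers=35,  # how many layers to offload to VRAM
        n_ctx=2048,       # context window
    )

    out = llm("Q: Name three planets.\nA:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])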

Edit: all above numbers are for inference. For training numbers, https://huggingface.co/docs/transformers/perf_train_gpu_one is pretty approachable. tl;dr most commonly 8 bytes per parameter, but optimizations are possible with tradeoffs in complexity and accuracy.
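Rough arithmetic for that training rule of thumb (activations excluded, since those scale with batch size and sequence length; the 8 bytes/parameter figure is the one above, not an exact number):

    def train_gb(params_billion, bytes_per_param=8):
        # Per-parameter cost from the rule of thumb above; tweak for your optimizer/precision.
        return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

    for p in (1, 3, 7, 13):
        print(f"{p}B model: ~{train_gb(p):.0f} GB of training state")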


Thank you. Just to confirm, is this for inference?

Are there any similar heuristics for fine-tuning an LLM on personal data?


I agree; I've definitely seen way more information about running image synthesis models like Stable Diffusion locally than about running LLMs. It's counterintuitive to me that Stable Diffusion takes less RAM than an LLM, especially considering it still needs the word vectors. Goes to show I know nothing.

I guess it comes down to the requirement of a very high-end GPU (or multiple GPUs), which makes it impractical for most people vs just running it in Colab or something.

Tho there are some efforts:

https://github.com/cocktailpeanut/dalai


Well, if you looked at the AI world in 2019, the models were mostly less RAM-intensive (typically 0.8-2 GB of RAM for non-large BERT/T5, for example). And computer vision models would always max out whatever you threw at them.

Things twist and turn. Text requires a large number of parameters to be useful. If we as humans required 8000x8000 images to discern objects, image models would require more too. It's quite anthropomorphic. That's really the core of it.


Even the README of this project is lacking in details. They cover sizes of system RAM, but not of VRAM. Are they running the models on CPU exclusively? How does that compare to running them on GPUs? What are the corresponding numbers for GPUs?

Not to mention more meta topics, like which is preferred for inference - CPU or GPU? What are the corresponding numbers for fine-tuning or training at various model sizes? And so on.


The short practical answer for inference would be:

- You go to https://huggingface.co/TheBloke

- You select the model you are interested in

- You check which quantizations would fit your available resources.
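For that last step, one way to eyeball what's available without downloading anything (the repo id is just an example of a TheBloke upload):

    # The quantization level is encoded in the filename, e.g. q4_0, q5_K_M, q8_0.
    from huggingface_hub import list_repo_files

    for f in list_repo_files("TheBloke/Llama-2-7B-GGML"):  # example repo
        if f.endswith(".bin"):
            print(f)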


But what if you are in the market to buy some hardware? Having some rules of thumb won't hurt.


Personally I'd go as high as I could afford in this order: 4090 > 3090 > 4080 > 4070 TI > 4070 > 3080 TI > 3080

I have no idea how the 4060 TI 16GB performs

Models change from week to week, so just aim for the basics (VRAM, Tensor cores (generation!), and memory bandwidth).


I use 'em on my CPU (5900X and 128GB RAM) since my 2080 Ti is a bit too small. GPT4All works great. Can load up big-ass models no problem. The speed isn't an issue for me.


This will run on a PC with 16GB of RAM (possibly less). It's a quantized 7B model.

docker run -it --rm ghcr.io/purton-tech/mpt-7b-chat



