For someone jumping back on the local LLM train after being out of the loop for two years: what's currently the best local web-server solution for hosting models myself on a GPU (RTX 3080) Linux server? Preferably with support for multimodal image input and LaTeX rendering of the output.
I don't really care about "full kitchen sink" apps with 100 plugins for every existing cloud AI service and so on. I just want to run the released models the way they're intended, behind a web server.
Preemptively adding for us AMD users: it's pretty seamless to get Ollama working with ROCm, and if your card is a bit below the waterline (the lowest officially supported is the 6800 XT; I bought a 6750 XT), you can use a community patch that enables it for your card anyway:
As someone who is technical but not (yet!) proficient at building from source, I specifically recommend the method where you grab the patched rocblas.dll for your card model and replace the one Ollama is using.
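If you want to confirm the swap actually worked and Ollama is genuinely running on the GPU rather than silently falling back to CPU, a quick sanity check is to load a model and then query the /api/ps endpoint, which reports how much of the model is resident in VRAM. A minimal sketch, assuming Ollama is on its default port 11434 and you've already pulled a model such as gemma3:12b:

```python
import requests

OLLAMA = "http://localhost:11434"  # default Ollama port; adjust if yours differs
MODEL = "gemma3:12b"               # any model you have pulled locally

# Force the model to load by running a tiny non-streaming generation.
requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": MODEL, "prompt": "hi", "stream": False},
    timeout=300,
).raise_for_status()

# /api/ps lists currently loaded models and how much of each sits in VRAM.
ps = requests.get(f"{OLLAMA}/api/ps", timeout=10).json()
for m in ps.get("models", []):
    size, vram = m["size"], m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0
    print(f"{m['name']}: {vram / 1e9:.1f} GB of {size / 1e9:.1f} GB in VRAM ({pct:.0f}%)")
    # ~100% in VRAM means the patch is doing its job; ~0% means CPU fallback.
```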
What's the benefit of the container over installing it as a tool with uv? It seems like extra work to get it up and running with a GPU, and if you're on a Mac, the container slows down your models.
For that GPU, the best Gemma 3 model you'll be able to run with GPU-only inference is the 4-bit quantized 12B parameter model: https://ollama.com/library/gemma3:12b
You could offload some of the layers to the CPU and run the 4-bit 27B model instead, but inference would be much slower.
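Rough numbers: the 4-bit 12B weights are around 8 GB, which fits the 3080's 10 GB with a little headroom for context, while the 4-bit 27B is around 17 GB, hence the CPU offload. And since the original question asked about multimodal input: gemma3:12b accepts images, which you can pass to Ollama's chat endpoint as base64 strings. A rough sketch, again assuming the default port and a hypothetical local test.png:

```python
import base64
import requests

OLLAMA = "http://localhost:11434"   # default Ollama port
IMAGE = "test.png"                  # hypothetical local image; use any file you have

# Ollama's /api/chat accepts images as a list of base64 strings on a message.
with open(IMAGE, "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{OLLAMA}/api/chat",
    json={
        "model": "gemma3:12b",
        "messages": [
            {"role": "user", "content": "Describe this image.", "images": [img_b64]}
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```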