For someone jumping back on the local LLM train after being out of the loop for two years: what's currently the best local web-server solution for hosting models myself on a GPU (RTX 3080) Linux server? Preferably with support for multimodal image input and LaTeX rendering of the output.
I don't really care about "full kitchen sink" apps with 100 plugins for every existing cloud AI service and so on. I just want to run the released models the way they're intended, behind a web server.
Preemptively adding for us AMD users: it's pretty seamless to get Ollama working with ROCm, and if your card is a bit below the waterline (the lowest officially supported is the 6800 XT; I bought a 6750 XT), you can use a community patch that enables it for your card anyway:
As someone who is technical but not (yet!) proficient at building from source, I specifically recommend the method where you grab the patched rocblas.dll for your card model and replace the one Ollama is using.
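If you want to confirm the swap actually worked and Ollama is genuinely running on the GPU rather than silently falling back to CPU, a quick sanity check is to load a model and then query the /api/ps endpoint, which reports how much of the model is resident in VRAM. A minimal sketch, assuming Ollama is on its default port 11434 and you've already pulled a model such as gemma3:12b:

```python
import requests

OLLAMA = "http://localhost:11434"  # default Ollama port; adjust if yours differs
MODEL = "gemma3:12b"               # any model you have pulled locally

# Force the model to load by running a tiny non-streaming generation.
requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": MODEL, "prompt": "hi", "stream": False},
    timeout=300,
).raise_for_status()

# /api/ps lists currently loaded models and how much of each sits in VRAM.
ps = requests.get(f"{OLLAMA}/api/ps", timeout=10).json()
for m in ps.get("models", []):
    size, vram = m["size"], m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0
    print(f"{m['name']}: {vram / 1e9:.1f} GB of {size / 1e9:.1f} GB in VRAM ({pct:.0f}%)")
    # ~100% in VRAM means the patch is doing its job; ~0% means CPU fallback.
```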
What's the benefit of the container over installing it as a tool with uv? It seems like extra work to get it up and running with a GPU, and if you're on a Mac, the container slows down your models.
For that GPU, the best Gemma 3 model you'll be able to run with GPU-only inference is the 4-bit quantized 12B parameter model: https://ollama.com/library/gemma3:12b
You could offload some of the layers to the CPU and run the 4-bit 27B model instead, but inference would be much slower.
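Rough numbers: the 4-bit 12B weights are around 8 GB, which fits the 3080's 10 GB with a little headroom for context, while the 4-bit 27B is around 17 GB, hence the CPU offload. And since the original question asked about multimodal input: gemma3:12b accepts images, which you can pass to Ollama's chat endpoint as base64 strings. A rough sketch, again assuming the default port and a hypothetical local test.png:

```python
import base64
import requests

OLLAMA = "http://localhost:11434"   # default Ollama port
IMAGE = "test.png"                  # hypothetical local image; use any file you have

# Ollama's /api/chat accepts images as a list of base64 strings on a message.
with open(IMAGE, "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{OLLAMA}/api/chat",
    json={
        "model": "gemma3:12b",
        "messages": [
            {"role": "user", "content": "Describe this image.", "images": [img_b64]}
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```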