Hacker News
Cost of self hosting Llama-3 8B-Instruct (lytix.co)
245 points by veryrealsid on June 14, 2024 | 183 comments


Instead of using AWS, another approach involves self-hosting the hardware as well. Even after factoring in energy, this does dramatically lower the price.

Assuming we want to mirror our setup in AWS, we’d need 4x NVidia Tesla T4s. You can buy them for about $700 on eBay.

Add in $1,000 to setup the rest of the rig and you have a final price of around:

$2,800 + $1,000 = $3,800

This whole exercise assumes that you're using the Llama 3 8b model. At full fp16 precision that will fit in one 3090 or 4090 GPU (the int8 version will too, and run faster, with very little degradation.) Especially if you're willing to buy GPU hardware from eBay, that will cost significantly less.

I have my home workstation with a 4090 exposed as a vLLM service to an AWS environment where I access it via reverse SSH tunnel.
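
For anyone curious, the tunnel itself is basically a one-liner. A minimal sketch (host name, user, and port are placeholders; vLLM's OpenAI-compatible server listens on 8000 by default):

    # run on the workstation: forward port 8000 on the AWS box back to the local vLLM server
    ssh -N -R 8000:localhost:8000 ubuntu@my-aws-host

    # code on the AWS box can then call http://localhost:8000/v1/... as if vLLM were local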


Why did this only occur to me recently? You can self-host a k8s cluster and expose the services using a $5 DigitalOcean droplet. The droplet and k8s services are point-to-point connected using Tailscale. Performance is perfectly fine, it keeps your skillset sharp, and you’re self-hosting!


You can also just directly connect to containers using Tailscale if it's just for internal use. That is, having an internally addressable `https://container_name` on your tailnet per container if you want. This way I can set up Immich, for example, and it's just on my tailnet at `https://immich` without the need for a reverse proxy, etc...

https://tailscale.com/blog/docker-tailscale-guide
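
Roughly, the sidecar pattern from that guide looks like this (image tag, env vars, the auth key, and the app image are assumptions/placeholders; check the linked post for the exact compose file):

    # run a tailscale sidecar that joins the tailnet as "immich"
    docker run -d --name=immich-ts \
      -e TS_AUTHKEY=tskey-auth-XXXX \
      -e TS_HOSTNAME=immich \
      -v ts-immich-state:/var/lib/tailscale \
      tailscale/tailscale

    # run the app in the sidecar's network namespace so it's reachable as "immich" on the tailnet
    docker run -d --name=immich --network=container:immich-ts my-immich-image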


And you can use Tailscale Funnel to serve it publicly. No need to pay for a cloud instance.

https://tailscale.com/kb/1223/funnel
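
Something along these lines, for example (the Funnel CLI syntax has changed across Tailscale releases, so treat this as a sketch and check `tailscale funnel --help` on your version):

    # expose a local service to the public internet via Tailscale's relays
    tailscale funnel 8000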


I essentially do this with my homelab.


Whoa, so you have code running in AWS making use of your local hardware via what is called a reverse SSH tunnel? I will have to look into how that works, that's pretty powerful if so. I have a mac mini that I use for builds and deploys via FTP/SFTP and was going to look into setting up "messaging" via that pipeline to access local hardware compute through file messages lol, but reverse SSH tunnel sounds like it'll be way better for directly calling executables rather than needing to parse messages from files first.


Look into Nebula (or Tailscale if you trust third parties). I have all my workstations and servers on a mesh network that appears as a single /24 that is end to end encrypted, mutually authenticated and works through/behind NAT. I can spawn a vhost on any server that reverse proxies an API to any port on any machine.

It’s been an absolute gamechanger.


Why do you have to trust a third party?

It's end-to-end encrypted, and with tailnet lock enabled, nodes cannot be added without the user's permission.


Well, one example, depending on your threat model—their privacy policy states that they retain info and comply with subpoenas.

There's also potential for malicious updates to compromise a network (as there is with most software unless you're auditing the source for each update).

E2EE is only as meaningful as where the keys reside, and how easily those keys are abused.


That’s interesting!

The metadata is generally public information, I don’t care about that.

The malicious updates and key abuse are more concerning. That's true for all software, though, and probably better handled at the OS level, like on iOS.

The VPN could steal the keys, but that’s a lawsuit!


Are the keys not already kept on their own infra?


No, private keys don't leave users' devices. This is the case in all such products.

But with a malicious update, they could ship them to their infra, targeting some users. The product then becomes malware!


The idea of "user's permission" is determined by Tailscale and/or the OIDC provider. I don't know anything about "tailnet lock"; perhaps it is a new mitigation for this issue?

I didn’t start with tailscale because the only way you could log into it was with Google or GitHub or something. I don’t trust Microsoft or Google with auth for my internal network. I thought about running Headscale but Nebula was faster/easier for me.


Yes, Microsoft and Google will not be able to authenticate new devices onto your network if you enable tailnet lock. An existing node in your network has to sign them.


Is there any resource that goes into more detail about how to setup all this?


https://github.com/slackhq/nebula

The docs are good. When creating the initial CA, make absolutely sure you set the CA expiration to 10-30 years; the default is 1 year, which means your whole setup explodes in a year without warning.
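
Something like this, for example (flags per the nebula-cert docs; names and IPs are placeholders):

    # create the CA with a ~20 year lifetime instead of the 1-year default
    nebula-cert ca -name "my homelab" -duration 175200h

    # then sign each host against it
    nebula-cert sign -name "workstation" -ip "192.168.100.2/24"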


@sneak, can you comment on your experience with nebula vs Tailscale?


Whooooaaa that is mind-blowing. Thanks for sharing. <3


why either of these over plain wireguard if you're not provisioning accounts?


WireGuard doesn't do NAT punching and isn't mesh; it's p2p only.

Totally different use case.


I feel like WireGuard definitely does NAT punching, unless I misunderstand you. I've been doing this sort of thing to have my phone and desktop on the same "LAN" all the time so I can Moonlight in from anywhere (among other things), and they're definitely NATted.


True. I do NAT punching with a UPnP client on the server side.


I use my Mac mini exactly as described by the parent post but using ollama as the server. Super easy setup, and obv ChatGPT can guide you through it.


Unfortunately my mac mini isn't beefy enough to run ollama, it's the base model m1 from a couple years ago lol. But it's very powerful for builds, deploys, and some computation via scripts. Now I'm curious to check out how much memory the newest ones support for potentially using ollama on it haha. Thanks!


Mine is also an M1. Just use llama3; it's 8B, quantized by default.


I will try it out, curious to see how it will work with 8gb of memory haha. Thanks for the heads up!


Do you happen to have any handy guides/docs/references for absolute beginners to follow?


The absolute easiest way is https://github.com/Mozilla-Ocho/llamafile

Just download a single file and run it.
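
The whole flow is basically this (the URL and file name below are placeholders; grab whichever .llamafile you want from the project's releases page):

    curl -LO https://example.com/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    chmod +x Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    ./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile

By default it starts a local llama.cpp server with a web UI you can open in the browser.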


Ollama is not as powerful as llama.cpp or raw pytorch, but it is almost zero effort to get started.

brew install ollama; ollama serve; ollama pull dolphin-llama3:8b-v2.9-q5_K_M; ollama run dolphin-llama3:8b-v2.9-q5_K_M

https://ollama.com/library/dolphin-llama3:8b-v2.9-q5_K_M

(It may need to be Q4 or Q3 instead of Q5 depending on how the RAM shakes out. But the Q5_K_M quantization (k-quantization is the term) is generally the best balance of size vs performance vs intelligence if you can run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of course even better but you’re nowhere near fitting that on 8gb.)

https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

Dolphin-llama3 is generally more compliant and I'd recommend that over just the base model. It's been fine-tuned to filter out the dumb "sorry I can't do that" refusals, and it turns out this also increases the quality of the results (by limiting the space you're generating in, you also limit the quality of the results).

https://erichartford.com/uncensored-models

https://arxiv.org/abs/2308.13449

Most of the time you will want to look for an "instruct" model, if it doesn't have the instruct suffix it'll normally be a "fill in the blank" model that finishes what it thinks is the pattern in the input, rather than generate a textual answer to a question. But ollama typically pulls the instruct models into their repos.

(Sometimes you will see this even with instruct models, especially if they're misconfigured. When llama3 non-dolphin first came out I played with it and I'd get answers that looked like Stack Overflow or Quora format responses with "scores" etc., either as the full output or mixed in. Presumably a misconfigured model, or they pulled in a non-instruct model, or something.)

Dolphin-mixtral:8x7b-v2.7 is where things get really interesting imo. I have 64gb and 32gb machines, and so far the Q6 and Q4_K_M are the best options for those machines. Dolphin-llama3 is reasonable, but dolphin-mixtral gives richer, better responses.

I'm told there's better stuff available now, but I'm not sure what a good choice would be for 64gb and 32gb if not mixtral.

Also, just keep an eye on r/LocalLLaMA in general, that's where all the enthusiasts hang out.


Ollama is llama.cpp plus Docker. If you can do without Docker, it's faster.


No, the ollama default quantisation is 4 bit


I meant 8b -> 8billion rather than 70b


Ah sorry!


Using tailscale might be a better and easier solution.


Using Tailscale can make the networking setup much easier; I really like their service for things like this (or curling another dev's locally running server).


You can also check if you have ipv6. I have tried both, but prefer directly connecting home.


I don't know enough about networking and that level of configuration to do it confidently and safely yet. I'd rather rely on using credentials where the only real access is some limited command line executables or file transfer, rather than exposing more of the hardware directly to the network for a direct connection. I do have interest in learning this much, but I find that my current FTP/SFTP approach has more guard rails than a direct connection. Do you agree with this or am I just not understanding enough about ipv6 and direct connections home?


You get the equivalent setup in ipv6 by having your home modem or router deny inbound connections.

I think port forwarding configuration is a pain that does not offer value over just poking a hole in your firewall to do an authenticated connection over ssh.


I do the same thing with cloudflare tunnels and managing the cloudflare tunnel process and the llama.cpp server with systemd on my home internet.

Have a 13B running on a 3070 with 16 gpu layers and the rest running off CPU.

Performs okay, but way cheaper than renting a GPU on the cloud.
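
For reference, a rough sketch of that setup (tunnel name, hostname, and model path are placeholders; the llama.cpp server binary is called `server` in older builds and `llama-server` in newer ones):

    # one-time tunnel setup, then run it
    cloudflared tunnel create llama
    cloudflared tunnel route dns llama llama.example.com
    cloudflared tunnel run llama &

    # llama.cpp server with 16 layers offloaded to the GPU, the rest on CPU
    ./llama-server -m ./models/13b-q4_K_M.gguf -ngl 16 --port 8080

In practice you'd wrap both commands in systemd units as described above, so they restart on boot or failure.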


Read the other comment and also immediately thought of Cloudflare tunnel instead - is there a reason you chose that? Wondering if I should do the same with my old Titan XP (probably slower than your 3070 but it does have 12gb of vram)


No reason other than that Cloudflare is quite a nice reverse proxy and I already use it for managing my DNS records. Thus far, I've not noticed any issues with it, other than when my home wifi goes down.


I dropped $5k on an A6000 and I can run llama3:70b day and night for the price of my electricity bill.

I’ve gone through hundreds of millions, maybe billions, of tokens in the past year.

This article is just “cloud is expensive” 101. Nothing new.


1B of tokens for Gemini Flash (which is on par with llama3-70b in my experience or even better sometimes) with 2:1 input-output would cost ~600 bucks (ignoring the fact they offer 1M tokens a day for free now). Ignoring electricity you'd break even in >8 years. You can find llama3-70b for ~same prices if you're interested in the specific model.


I answered the financial thinking in another reply, but another factor is I need to know if the model today is exactly the same as tomorrow for reliable scientific benchmarking.

I need to tell if a change I made was impactful, but if the model just magically gets smarter or dumber at my tasks with no warning then I can't tell whether I made an improvement or a regression.

Whereas the model on my GPU doesn’t change unless I change it. So it’s one less variable and LLM are black box to start with.

I may be wrong for Gemini, but my impression is all the companies are constantly tweaking the big models. I know GPT on Monday is not always the same GPT on Thursday for example.


I've worked professionally over the last 12 months hosting quite a few foundation models and fine tuned LLMs on our own hardware, aws + azure vms and also a variety of newer "inference serving" type services that are popping up everywhere.

I don't do any work with the output, I'm just the MLOps guy (ahem, DevOps).

You mention expense but on a purely financial basis I find any of these hosted solutions really hard to justify against GPT 3.5 turbo prices, including building your own rig. $5k + electricity is loads of 3.5 Turbo tokens.

Of course none of the data scientists or researchers I work with want to use that though - it's not their job to host these things or worry about the costs.


So my main motivation is not so much to have the lowest cost, but to have the most predictable cost.

Knowing up front this is my fixed ML budget gives me peace of mind and gives me room to try stupid ideas without worrying about it.

Whereas doing it in the cloud you can a) get slammed with some crazy bill by accident, b) have to think more about what resources testing an idea will take, or conversely c) get GPU FOMO and think "if I just upgrade a level all my problems will be solved".

It works for me; everybody's mileage varies, but personally I like to budget, spend, and then totally focus on my goals and not my cloud spend.

I’m also from the pre-cloud era, so doing stuff on my own bare metal is second nature.


Super cool, thanks for sharing. Do you mind sharing what you used the hundreds of millions (or billions) of tokens on?


Doing really nuanced classification of documents at very large scale. Needle in the haystack type problems.


Is this at 4-bit quantization? And how many tokens per second is the output?


I’m doing non-interactive tasks, but in terms of the A6000 running llama3 70b in chat mode it’s as usable as any of the commercial offerings in terms of speed. I read quickly and it’s faster than I read.


Hows your ROI?


Absolutely phenomenal.


Are you using it for trading?


Nope, powers some low-level infrastructure-ish stuff.


Yea, for any hobbyist, indie developer, etc. I think it'd be ridiculous to not first try running one of these smaller (but decently powerful) open source models on your own hardware at home.

Ollama makes it dead simple just to try it out. I was pleasantly surprised by the tokens/sec I could get with Llama 3 8B on a 2021 M1 MBP. Now I need to try it on my gaming PC I never use. Would be super cool to just have an LLM server on my local network for me and the fam. Exciting times.
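
If you do set that up, ollama can be exposed to the LAN just by changing its bind address (the IP below is a placeholder for the gaming PC's address):

    # on the gaming PC: listen on all interfaces instead of localhost only
    OLLAMA_HOST=0.0.0.0 ollama serve

    # from any other device on the network
    curl http://192.168.1.50:11434/api/generate -d '{"model": "llama3", "prompt": "hello"}'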


How is inference latency for coding use cases on a local 3090 or 4090 compared to say, hitting the GPT-4o API?


I assume the characteristics would be pretty different, since your local hardware can keep the context loaded in memory, unlike APIs which I'm guessing have to re-load it for each query/generation?


If you integrate with existing tooling, it won’t do this optimization. Unless of course you really go crazy with your setup.


Setting one launch flag on llama.cpp server hardly qualifies as going crazy with one's setup.


Yeah but this article is terrible. First it talks about naively copy-pasting code to get “a seeming 10x speed-up” and then “This ended up being incorrect way of calculating the tokens used.”

I would not bank on anything in this article. It might as well have been written by a tiny Llama model.


This is great advice. I used to run my dev stuff on AWS, then built a small 6 server proxmox cluster in my basement, 300 cores, 1tb memory, 12tb ssd storage for about 3k usd. I don’t even want to know what it would cost to run a similar config on AWS. You can get cheap ddr4 servers on eBay all day.


Came here to say this. No way you need to spend more than $1500 to run L3 8B at FP16. And you can get near-identical performance at Q8 for even less.

I'm guessing actual break-even time is less than half that, so maybe 2 years.


Furthermore, the AWS estimates are also really poorly done. Using EKS this way is really inefficient, and a better comparison would be AWS Bedrock Haiku which averages $0.75/M tokens: https://aws.amazon.com/bedrock/pricing/

This whole post makes OpenAI look like a better deal than it actually is.


I was getting that sense too. It would not be difficult to build a desktop machine with a 4090 for around $2500. I run Llama-3 8b on my 4090, and it runs well. Plus side is I can play games with the machine too :)


Nvidia's EULA prevents you from using consumer gaming GPUs in datacenters, so 4xxx cards are a non-starter for any service use cases.

EDIT: TOS -> EULA per comments below


What about on prem? Like, my small business needs an LLM. Can I put a 3090 in a box in a closet?

What if I’m a business and I’m selling LLMs in a box for you to put on a private network?

What constitutes a data center according to the ToS? Is it enforceable if you never agree to the ToS (buying through eBay?)


Don't listen to this person. They have no idea what they're talking about.

No one cares about this TOS provision. I know both startups and large businesses that violate it as well as industry datacenters and academic clusters. There are companies that explicitly sell you hardware to violate it. Heck, Nvidia will even give you a discount when you buy the hardware to violate it in large enough volume!

You do you.


In a previous AI wave hosters like OVH and Hetzner started offering servers with GTX 1080 at prices other hosters with datacenter-grade GPUs couldn't possibly compete with - and VRAM wasn't as big of a deal back then. That's who this clause targets.

If you don't rent out servers or VMs, Nvidia doesn't care. They aren't Oracle.


By using the drivers you agree to their TOS. So yes, it applies even on your private network.


The customer limitation described in the EULA is exactly this:

> No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

- https://www.nvidia.com/content/DriverDownloads/licence.php?l...

There's no further elaboration on what "datacenter" means here, and it's a fair argument to say that a closet with one consumer-GPU-enriched PC is not a "datacenter deployment". The odds that Nvidia would pursue a claim against an individual or small business who used it that way is infinitesimal.

So both the ethical issue (it's a fair-if-debatable read of the clause) and the practical legal issue (Nvidia wouldn't bother to argue either way) seem to say one needn't worry about it.

The clause is there to deter at-scale commercial service providers from buying up the consumer card market.


That never stopped the crypto farmers...


They also weren't selling the usage of the cards.


There are no nvidia police, they literally cannot stop you from doing this.


It's not in a data center, it's in his home.


How would they even know?


Nvidia terms of what?


Parent commenter used the wrong word. It’s the EULA that prevents it.

Regardless, it is true that it is a problem.

https://www.reddit.com/r/MachineLearning/comments/ikrk4u/d_c...


Llama-3 is one of the models provided by AWS Bedrock which offers pay as you go pricing. I'm curious how it would break down on that.

LLAMA 8B on Bedrock is $0.40 per 1M input tokens and $0.60 per 1M output tokens which is a lot cheaper than OpenAI models.

Edit: to add to that, as technical people we tend to discount the value of our own time. Bedrock and the OpenAI API are both very easy to integrate with and get started on. How long did this server take to build? How much time does it take to maintain and make sure all the security patches are applied each month? How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail? That's got to be part of the break-even equation.


> How long did this server take to build?

About 3 days [from 0 and iterating multiple times to the final solution]

> How much time does it take to maintain and make sure all the security patches are applied each month?

A lot

> How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail?

All really good points. The exercise to self-host is really just to see what is possible, but I completely agree that self-hosting makes little to no sense unless you have a business case that can justify it.

Not to mention that if you sign customers with SLAs and then end up having downtime, that would put even more pressure on your self-hosted hardware.


Just to bounce off this a little. If you are looking to fine-tune using an on demand service it seems Amazon Sagemaker can do it at seemingly decent prices:

https://aws.amazon.com/sagemaker/pricing/

I'd love to hear someones experience using this as I want to make an RPG rules bot tied to a specific ruleset as a project but I fear AWS as it might bankrupt me!


In my experience SageMaker was relatively straightforward for fine-tuning models that could fit on a single instance, but distributed training still requires a good bit of detailed understanding of how things work under the covers. SageMaker Jumpstart includes some pretty easy out-of-the-box configurations for fine-tuning models that are a good starting point. They will incorporate some basic quantization and other cost-savings techniques to help reduce the total compute time.

To help control costs, you can choose pretty conservative settings in terms of how long you want to let the model train for. Once that iteration is done and you have a model artifact saved, you can always pick back up and perform more rounds of training using the previous checkpoint as a starting point.


Groq also has pay-as-you-go pricing for Llama 3 8B at only $0.05/$0.08, and it is very fast.


Groq is actually allowing you to pay now and get real service?


The option to pay is still listed as coming soon, but I also see pricing information in the settings page, so maybe it actually is coming somewhat sooner. I’m seeing $0.05/1M input and $0.10/1M output for llama3 8B, which is not exactly identical to what the previous person quoted.

Either way, I wish Groq would offer a real service to people willing to pay.


I found the .05/.08 here: https://wow.groq.com/


tl;dr: no-ish, it's getting better but still not there.

I don't really get it, only thing I can surmise is it'd be such a no-brainer in various cases, that if they tried supporting it as a service, they'd have to cut users. I've seen multiple big media company employees begging for some sort of response on their discord.


didn't know they finally turned on pricing plans


> How long did this server take to build? How much time does it take to maintain and make sure all the security patches are applied each month? How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail? That's got to be part of the break-even equation.

All of these are the kinds of things that people say to non-technical people to try to sell cloud. It's all fluff.

Do you really think that cloud computing doesn't have security issues, or crashes, or data loss, or that it doesn't involve lots of administration? Thinking that we don't know any better is both disingenuous and a bit insulting.


I've managed fleets of over 100k instances on cloud providers; even with all the excellent features available through APIs, managing instances can quickly get tricky.

Tbh, your comment is kind of insulting and belittles how far we've come ahead in infrastructure management.

The cloud is probably more secure than a set of janky servers that you have running in your basement. You can totally automate away 0-days and CVEs, and get access to better security primitives.


> The cloud is probably more secure than a set of janky servers that you have running in your basement.

Apples/Oranges. Your janky cloud[0] is less secure than the servers in my basement, because I'm a mostly competent sysadmin. Cloud lets you trade some operational concerns for higher costs, but not all of them.

[0] If you can assume servers run by somebody who doesn't know how to do it properly, obviously I can assume the same about cloud configuration. Have fun with your leaked API keys.


If my comment is insulting, I apologize. That was not my intention. My intention was to say that writing sales speak in a technical discussion is insulting to those of us who know better.

However, you've now gone out of your way to try to be insulting. You know nothing about me, yet you want to suggest that the cloud is more secure than my servers, and that my servers are "janky"?

Please try a little harder to engage in reasonable discourse.


Almost by definition you cannot automate away 0-days.


I've managed both data centers and cloud and IMHO, no, it is not fluff. To take it in order:

> doesn't have security issues

It sure does, but the matrix of responsibility is very different when it is a hosted service. Note: I am making these comments about Bedrock, which is serverless, not in relation to EC2.

> It crashes

Absolutely, but the recovery profile is not even close to the same. Unless you have a full time person with physical access to your server who can go press buttons.

> data loss

I'm going to shift this one a tiny bit. What about hardware loss? You need backups regardless. On the cloud when a HDD dies you provision a new one. On premise you need to have the replacement there and ready to swap out (unless you want to wait for shipping). Same with all the other components. So you basically need to buy two of everything. If you have a fleet of servers that's not too bad since presumably they aren't going to all fail on the same component at the same time. But for a single server it is literally double the cost.

> doesn't involve lots of administration

Again, this is in relation to Bedrock, which is a managed serverless environment. So there is literally no administration aside from provisioning and securing access to the resource. You'd have a point if this were running on EC2 or EKS, but that's not what my post was about.

> Thinking that we don't know any better is both disingenuous and a bit insulting.

I'm not saying cloud is perfect in any way; like all things it requires tradeoffs. But quite frankly, I find your dismissing my 25 years of experience (a third of it working in real data centers, including a top-50 internet company at the time) as "fluff" to be "disingenuous and a bit insulting".


> Unless you have a full time person with physical access to your server who can go press buttons.

Every colo facility I’ve used offers “remote hands”. If you need a button pressed or a disk swapped, they will do it, with a fee structure and response time that varies depending on one’s arrangement with the operator. But it’s generally both inexpensive and fast.

> What about hardware loss? You need backups regardless. On the cloud when a HDD dies you provision a new one. On premise you need to have the replacement there and ready to swap out (unless you want to wait for shipping).

Two of everything may still be cheaper than expensive cloud services. But there’s an obvious middle ground: a service contract that guarantees you spare parts and a technician with a designated amount of notice. This service is widely available and reasonably priced. (Don’t believe the listed total prices on the web sites of big name server vendors — they negotiate substantial discounts, even in small quantities.)


> But it’s generally both inexpensive and fast.

I guess inexpensive is relative. I've been on cloud for a while so I'm not sure what the going rates are for "remote hands" and most of my experience is with on-premise vs co-lo.

> Two of everything may still be cheaper than expensive cloud services.

That is true. Everything has tradeoffs. Though in the OPs case I think the math is relatively clear. With Open AIs pricing he calculated the break even at 5 years just for the hardware and electricity. Assuming that calculation is right, two of everything would up that to 7+ years, at which point... a lot can happen in 7 years.


> I guess inexpensive is relative. I've been on cloud for a while so I'm not sure what the going rates are for "remote hands" and most of my experience is with on-premise vs co-lo.

At a low end facility, I’ve usually paid between $0 and $50 per remote hands incident. The staff was friendly and competent, and I had no complaints. The price list goes a bit higher, but I haven’t needed those services at that facility.


You could have gotten rid of the middle paragraph. It’s not fluff. These are valid technical points. Issues most companies would rather (reasonably) pay to not have to deal with.

And do you really think you can offer better security and uptime than AWS? Not impossible but very expensive if you’re managing everything from your own hardware. You clearly vastly underestimate all that AWS is taking care of.


Self hosting means hosting it yourself, not running it on Amazon. I think the distinction the author intends to make is between running something that can't be hosted elsewhere, like ChatGPT, versus running Llama-3 yourself.

Overlooking that, the rest of the article feels a bit strange. Would we really have a use case where we can make use of those 157 million tokens a month? Would we really round $50 of energy cost to $100 a month? (Granted, the author didn't include power for the computer) If we buy our own system to run, why would we need to "scale your own hardware"?

I get that this is just to give us an idea of what running something yourself would cost when comparing with services like ChatGPT, but if so, we wouldn't be making most of the choices made here such as getting four NVIDIA Tesla T4 cards.

Memory is cheap, so running Llama-3 entirely on CPU is also an option. It's slower, of course, but it's infinitely more flexible. If I really wanted to spend a lot of time tinkering with LLMs, I'd definitely do this to figure out what I want to run before deciding on GPU hardware, then I'd get GPU hardware that best matches that, instead of the other way around.


> Self hosting means hosting it yourself, not running it on Amazon.

No. I googled "self hosting", read the first few definitions, and they agree with the article, not you. E.g., wikipedia -- https://en.wikipedia.org/wiki/Self-hosting_(web_services)


I would say that is "cloud hosted", which is obviously very expensive compared to running on hardware you own (assuming you own a computer and a GPU). That was the comparison I was interested in, the fact that renting a computer is more expensive than the OpenAI API is not a surprising result.


The very first definition from the link you provide is:

> Self-hosting is the practice of running and maintaining a website or service using a private web server, instead of using a service outside of someone's own control.

Hosting anything on Amazon is not "using a private web server" and is the very definition of using "a service outside of someone's own control".

The fact that the rest of the article talks about "enabled users to run their own servers on remote hardware or virtual machines" is just wrong. It's not "their own servers", and we don't have "more control over their data, privacy" when it's literally in the possession of others.


The second sentence is however :

> The practice of self-hosting web services became more feasible with the development of cloud computing and virtualization technologies, which enabled users to run their own servers on remote hardware or virtual machines. The first public cloud service, Amazon Web Services (AWS), was launched in 2006, offering Simple Storage Service (S3) and Elastic Compute Cloud (EC2) as its initial products.[3]

The mystery deepens


I hate when terms get diluted like this. "Self hosted", to me, means you own the physical machine. This reminds me of how "air-gapped server" now means a route configuration vs. an actual gap of air, with no physical connection, between two networks. It really confuses things.


That's a bit of newspeak

I think we generally understand IaaS, PaaS, and SaaS to be hosted offerings, managed and unmanaged...

https://duckduckgo.com/?q=iaas+paas+saas&ia=web


3 year commit pricing with Jetstream + Maxtext on TPU v5e is $0.25 per million tokens.

On demand pricing put it at about $0.45 per million tokens.

Source: We use TPUs at scale at https://osmos.io

Google Next 2024 session going into detail: https://www.youtube.com/watch?v=5QsM1K9ahtw

https://github.com/google/JetStream

https://github.com/google/maxtext


For pytorch users: checkout the sister project: https://github.com/google/jetstream-pytorch/blob/main/benchm...


I wonder how long NVIDIA can justify its current market cap once people realize just how cheap it is to run inference on these models given that LLM performance is plateauing, LLM's as a whole are becoming commoditized, and compute demand for training will drop off a cliff sooner than people expect.


> LLM performance is plateauing

It’s a wee bit early to call this. Let’s see what the top labs release in the next year or two, yeah?

GPT-4 was released only 15 months ago, which was about 3 years after GPT-3 was released.

These things don’t happen overnight, and many multi-year efforts are currently in the works, especially starting last year.


As someone else points out, training is slightly more involved, but I also find that these smaller models are next to worthless compared to the larger ones.

There are probably some situations where it suffices to use a small model, but for most purposes, I'd prefer to use the state of the art, and I'm eager for that state to progress a little more.


> but for most purposes, I'd prefer to use the state of the art

I'm guessing in the future, we'll see a lot more automatic inference "on our behalf", and you won't care or notice if its using Llama5:3b or whatever comes out then.

I'm betting that in a few years we'll see LLMs baked into a ton of stuff - lots of "simple" things mostly, like email summary/rewording, and similar "light touch" use cases. Maybe light multi-modal work like photo labeling. Probably expanded to run in every other SaaS application doing god knows what. For those, the small models would be more than enough, and much cheaper to run in large volumes.

I'm guessing "chat with a bot to ask questions" will be a small amount of the inference that happens on our behalf, but will use the valuable SOTA model use case.


I partially believe that the real race for many tech players is actually AGI (and ASI later), and until that problem is solved the hardware arms race will keep being part of it.

Not only is big tech part of it, but billion-dollar startups are popping up everywhere from China to the US and the Middle East.


Unless we hit another AI winter. We might get to the point where the hardware just can't give better returns and have to wait another 20 years for the next leap forward. We're still orders of magnitude away from the human brain.


It's actually about training, not inference. You can't do training on commodity GPUs, but yeah, once someone figures that out, Nvidia could crash.


Nvidia doesn’t obviously have a strong inference play right now for a widely-deployed small model. For a model that really needs a 4090, maybe. But for a model that can run on a Coral chip or an M1/M2/M3 or whatever Intel or AMD’s latest little AI engines can do? This market has plenty of players, and Nvidia doesn’t seem to be anywhere near the lead except insofar as it’s a little bit easier to run the software on CUDA.


I know; my point is that when training demand decreases, people will realize that inference does not make up the difference.


Yeah the big question I’m struggling with is exactly when training demand will fall if at all


Well, every other tech company is writing checks to prove they have the chops to make an LLM. From IBM to Databricks to the big guys like Google. Tons of companies made one just to show off to investors or their CEO, or just because they wanted skin in the game. But that probably won't continue forever. We've already seen certain orgs that seem to outshine others, and if they can't catch up, they may just settle for using the open-access models or APIs instead.

At some point everyone will realize that it is becoming a commodity, and it is very expensive to train; then only those with either a structural advantage to lower price (e.g. Google) or a true goal of being on the high end/SOTA of the market (OpenAI, Anthropic) will keep going.


You're talking about training the foundation models. But what about all the fine tuning on non-public business data that will be necessary to make gen AI useful for actual business processes?

I'm finding it difficult to estimate the size of this workload compared to continued training of foundation models. Perhaps it depends on whether there are new architectural breakthroughs that require retraining of foundation models.

And what about non-language tasks such as interpreting video and 3D sensory data? This is potentially huge, but between huge peaks there is often a valley of unknowable depth and breadth.


I have yet to hear a use-case for “fine tuning from business data” that relied on large models to succeed. Once again, I’m skeptical the average business will need this.

Yes, video probably requires a lot of GPUs to train. And a lot of source material to train against. And a use case. Which again, most companies don't have.

Model development is clearly here to stay, and clearly valuable. Models from every other company, either foundation or fine tuned - I’m not sure that emperor is wearing many clothes any time soon.


Every research lab is focused on new architectures that would reduce training costs.


Yeah, we essentially need Hadoop for LLM training.


> I wonder how long NVIDIA can justify its current market cap once people realize just how cheap it is to run inference on these models given that LLM performance is plateauing

The next wave driving demand could be actual new products developed on LLMs. There are very few use cases currently well developed besides chatbots, but the potential is very large.


Agreed with the sentiments here that this article gets a lot of the facts wrong, and I'll add one: the cost for electricity when self-hosting is dramatically lower than the article says. The math assumes that each of the Tesla T4s will be using their full TDP (70W each) 24 hours a day, 7 days a week. In reality, GPUs throttle down to a low power state when not in use. So unless you're conversing with your LLM literally 24 hours a day, it will be using dramatically less power. Even when actively doing inference, my GPU doesn't quite max out its power usage.

Your self-hosted LLM box is going to use maybe 20-30% of the power this article suggests it will.

Source: I run LLMs at home on a machine I built myself.
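
Easy to verify on any Nvidia box, for what it's worth:

    # spot-check what the cards actually draw at idle vs. under load
    nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv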


Surprised no comments are pointing out that the analysis is pretty far off simply due to the fact that the author runs with batch size of 1. The cost being 100x - 1000x what API providers are charging should be a hint that something is seriously off, even if you expect some of these APIs to be subsidized.


No way you need $3,800 to run an 8B model. 3090 and a basic rig is enough.

That being said, the difference between OpenAI and AWS cost ($1 vs $17) is huge. Is OpenAI just operating at a massive loss?

Edit: Turns out AWS is actually cheaper if you don't use the terrible setup in this article, see comments below.


AWS's pricing is just ridiculous. Their 1-year reserve pricing for an 8x H100 or A100 instance (p4/p5) costs just as much as buying the machine outright with tens of thousands left over for the NVIDIA enterprise license and someone to manage them (per instance!). Their on demand pricing is even more insane - they're charging $3.x/hr for six year old cards.


What about the cost of the power and cooling to run the machine (a lot!), and the staff to keep it running?


That's why I said "and someone to manage them". The difference is in the tens of thousands of dollars per instance. The savings from even a dozen instances is enough to pay for someone to manage them full time, and that's just for the first year. Year 2 and 3 you're saving six figures per instance so you'd be able to afford one person per machine to hand massage them like some fancy kobe beef.

A100 TDP is 400W so assuming 4kW for the whole machine, that's a little more than $5k/year at $0.15/kWh. Again, the difference is in the tens of thousands per instance. Even at 50% utilization over three years, if you need more than a dozen machines it's much cheaper to buy them outright, especially on credit.


I thought it was generally known they were operating at a loss?

Even with the subs and API charges, they still let people use ChatGPT for free with no monetization options. Sure, they are collecting the data for training, but that's hard to quantify the value of.


I mean, no; I came to scan the comments quickly after reading, because there's a lot of bad info you can walk away with from the post. It's sort of starting from scratch on hosting LLMs.

If you keep reading past there, they get it down significantly. The 8 tkn/s number AWS was evaluated on is really funny; that's about what you'd get on last year's iPhone, and it's not because Apple's special, it's because there's barely any reasonable optimization being done here. No batching, float32 weights (8-bit is guaranteed indistinguishable from 32-bit, 5-bit tests as definitely indistinguishable in blind tests, 4-bit arguably is indistinguishable).


You're right. In fact, using EKS at all is silly when AWS offers their Bedrock service with Claude Haiku (rated #19 on Chat Arena vs. ChatGPT3.5-Turbo at #37) for a much lower cost of $0.75/M tokens (averaging input and output like OP does)[0].

So in reality AWS is cheaper for a much better model if you don't go with a wildly suboptimal setup.

[0] https://aws.amazon.com/bedrock/pricing/


Yeah, what? A GV100 is $1600 on eBay and you can run a 3.8bpw quant of 70B llama at some decent number of tk/s; if you buy two you can run a 70B 5bpw quant with 32k context, no problem.


A single synchronous request is not a good way to understand cost here unless your workload is truly singular tiny requests. Chatgpt handles many requests in parallel and this article's 4 GPU setup certainly can handle more too.

It is miraculous that the cost comparison isn't worse given how adversarial this test is.

Larger requests, concurrent requests, and request queueing will drastically reduce cost here.


Great mix of napkin math and proper analysis, but what strikes me most is how cheap LLM access is. For it being relatively bleeding edge, us splitting hairs on < $20/M tokens is remarkable itself, and something tech people should be thrilled about.


Smacks of the "starving kids in Africa" fallacy, you could make the same argument that tech people should be thrilled for current thing being available at $X for X = $2/$20/$200/$2000...


The T4 is a six year old card. A much better comparison would be a 3090, 4090, A10, A100, etc.


>initial server cost of $3,800

Not following?

Llama 8B is like 17ish gigs. You can throw that onto a single 3090 off ebay. 700 for the card and another 500 for some 2nd hand basic gaming rig.

Plus you don't need a 4 slot PCIE mobo. Plus it's a gen4 pcie card (vs gen3). Plus skipping the complexity of multi-GPU. And wouldn't be surprised if it ends up faster too (everything in one GPU tends to be much faster in my experience, plus 3090 is just organically faster 1:1)

Or if you're feeling extra spicy you can do same on a 7900XTX (inference works fine on those & it's likely that there will be big optimisation gains in next months).


> Llama 8B is like 17ish gigs. You can throw that onto a single 3090 off ebay

Someone correct me if I'm wrong, but I've always thought you needed enough VRAM to have at least double the model size so that the GPU has enough VRAM for the calculated values from the model. So that 17 GB model requires 34 GB of RAM.

Though you can quantize to fp8/int8 with surprisingly little negative effect and then run that 17 GB model with 17 GB of VRAM.


No, you don't need that much

Here is a calculator (if you have a GPU you want to use EXL2, otherwise GGUF) https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

Also model quantisation goes a long way with surprisingly little loss in quality.


You do need more if you use larger context sizes though. It can really blow up to multiple times the model size even for 128k context.

Edit: Oh I see that calculator you linked shows that too. My information was more trial and error, thanks for the calculator link.
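
A back-of-envelope way to see why context dominates, assuming Llama 3 8B's published shape (32 layers, 8 KV heads, head dim 128) at fp16 - double-check those numbers against the model card:

    echo $(( 2 * 32 * 8 * 128 * 2 ))        # KV-cache bytes per token of context -> 131072 (~128 KiB)
    echo $(( 131072 * 8192 / 1024**3 ))     # GiB of KV cache at 8k context   -> 1
    echo $(( 131072 * 131072 / 1024**3 ))   # GiB of KV cache at 128k context -> 16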


These costs don't line up with my own experiments using vLLM on EKS for hosting small to medium sized models. For small (under 10B parameters) models on g5 instances, with prefix caching and an agent style workload with only 1 or a small number of turns per request, I saw on the order of tens of thousands of tokens/second of prefill (due to my common system prompts) and around 900 tokens/second of output.

I think this worked out to around $1/million tokens of output and orders of magnitude less for input tokens, and before reserved instances or other providers were considered.
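
For anyone wanting to reproduce something similar, the serving side is roughly this (the model name is an example; the flag is --enable-prefix-caching in current vLLM releases, but check your version):

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-prefix-caching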


Interesting, I think how the model runs makes a big difference and I plan to re-run this experiment with different models and different ways of running the model.


Does anyone know the impact of the prompt size in terms of throughput? If I'm only generating 10 tokens, does it matter if my initial prompt is 10 tokens or 8000 tokens? How much does it matter?


I just bought a $1099 MacBook Air M3, I get about 10 tokens/s for a q5 quant. Doesn't even get hot, and I can take it with me on the plane. It's really easy to install ollama.


Until January this year I mostly used Google Colab for both LLMs and deep learning projects. In January I spent about $1800 getting Apple Silicon M2Pro 32G. When I first got it, I was only so-so happy with the models I could run. Now I am ecstatically happy with the quality of the models I can run on this hardware.

I sometimes use Groq Llama3 APIs (so fast!) or OpenAI APIs, but I mostly use my 32G M2 system.

The article calculates cost of self-hosting, but I think it is also good taking into account how happy I am self hosting on my own hardware.


I own an 8 GPU cluster that I built for super cheap, < $4,000: 180gb VRAM, 7x 24gb + 1x 24gb. There are tons of models I run that aren't hosted by any provider. The only way to run them is to host myself. Furthermore, the author gets 39 tokens in 6 seconds. For llama3-8b, I get almost 80 tk/s, and if running in parallel, can easily get up to 800 tk/s. Most users at home infer only one at a time because they are doing chat or role play. If you are doing more serious work, you will most likely have multiple inferences running at once. When working with smaller models, it's not unusual to have 4-5 models loaded at once with multiple inferences going. I have about 2tb of models downloaded, I don't have to shuffle data back and forth to the cloud, etc. To each their own; the author's argument is made today by many on why you should host in the cloud. Yet if you are not flush with cash and are a little creative, it's far cheaper to run your own server than to rent in the cloud.

To run Llama 3 8B, a new $300 3060 12gb will do; it will load fine in Q8 GGUF. If you must load in fp16 and cash is a problem, a $160 P40 will do. If performance is desired, a used 3090 for ~$650 will do.


I am looking into renting a Hetzner GEX44 dedicated server to run a couple of models on with Ollama. I haven't done the arithmetic yet, but I wouldn't be surprised to see a 100x cost decrease compared to the OpenAI APIs (granted, the models I'll run on the GEX44 machine will be less powerful).


What kind of setup were you able to do for so cheap? I'd love to be able to do more locally. I have access to a single RTX A5000 at work, but it is often not enough for what I'm wanting to do, and I end up renting cloud GPU.



Interested in your sub-$4k 8 GPU setup. Care to elaborate a bit or do you have a write up somewhere?


check my reply above. 2 xeon cpus, 40 cores, 88 lanes, 128gb drive, fast nvme 2tb SSD.


Curious how the older Titan XP with 12GB Vram might compare


I agree with most of the criticisms here, and will add on one more: while it is generally true that you can’t beat “serverless” inference pricing for LLMs, production deployments often depend on fine-tuned models, for which these providers typically charge much more to host. That’s where the cost (and security, etc.) advantage for running on dedicated hardware comes in.


The energy costs in the Bay Area are double the reported 24c rate, so energy alone would be around $100-ish a month instead of $50-ish.


Except that the article assumes that the GPUs would be using their max TDP all the time, which is incorrect. GPUs will throttle down to 5-20w (depending on the specific GPU). So your actual power consumption is going to be much, much lower, unless you’re literally using your LLM 24/7.


Unless you are in Santa Clara with Silicon Valley Power rates.

https://www.siliconvalleypower.com/residents/rates-and-fees


Yeah, agreed. Some of the areas we have access to were 16c (PA) and up to 24c (NYC); we doubled that cost in the analysis because of things like this.


llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine. I can't remember the TPS I got there, but it was well above the 6 mentioned in the article.


Interesting, I got very different results depending on how I ran the model, will definitely give this a try!

edit: Actually could you share how long it took to make a query? One of our issues is we need it to respond in a fast time frame


I checked some logs from my past experiments: prompt processing went at about 400 tps over a ~3k token query, so about 7 seconds to process it, and then the generation speed was about 28 tokens/s.


deepinfra.com hosts Llama 3 8b for 8 cents per 1m tokens. I'm not sure it's the cheapest but it's pretty cheap. There may be even cheaper options.

(Haven't used it in production, thinking to use it for side projects).


does aws not have lower vcpu and memory instances with multiple T4s? because with 192gbs of memory and 24 cores, you're paying for a ton of resources you won't be using if you're only running inference.


This is a good way to do the math. But honestly, how many products actually have 100% utilisation? I did some math a few months ago, mostly on the basis of active users: what would the % difference be if you have 1k to 10k users/mo? You can run this as low as $0.3K/mo on serverless GPUs and $0.7K/mo on EC2.

The pricing is outdated now.

Here is the piece: https://www.inferless.com/learn/unraveling-gpu-inference-cos...


There's also the option of platforms such as BentoML (I have no affiliation) that offer usage-based pricing so you can at least take the 100% utilization assumption off the table. I'm not sure how the price compares to EKS.

https://www.bentoml.com/


If we care about cost efficiency when running LLMs, the most important things are:

1. Don't use AWS, because it's one of the most expensive cloud providers

2. Use quantized models, because they offer the best output quality per money spent, regardless of the budget

This article, on the other hand, focuses exclusively on running an unquantized model on AWS...


This is another one of those "I used this for 5 minutes and found this out" naive posts which add nothing useful.

Check out the host-LLMs-at-home crowd. One tool to look at is llama.cpp. Model compression is one of the first techniques for successfully running models on low-capacity hardware.
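
If you want to do the compression step yourself rather than downloading pre-quantized weights, llama.cpp ships a tool for it (the binary was called `quantize` in older builds, `llama-quantize` in newer ones; file names here are placeholders):

    ./llama-quantize ./models/llama-3-8b-instruct-f16.gguf ./models/llama-3-8b-instruct-q4_K_M.gguf Q4_K_M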


There's some dodgy maths

>( 100 / 157,075,200 ) * 1,000,000 = $0.000000636637738

Should be $0.64 per 1M tokens, so still expensive.


being 6 orders of magnitude off in your cost calculation isn't great.

groq costs about that for llama 3 70b (which is a monumentally better model) and 1/10th of that for llama 3 8b


Groq doesn’t currently have a paid API that one can sign up for.


Yup. True. Should say "will" - currently free but heavily rate-limited. Together AI looks to be about $0.30 / 1M tokens, as another price comparison. Which you can pay for.


I’ve used llama3 on my work laptop with ollama. It wrote an amazing pop song about k-nearest neighbours in the style of PJ and Duncan’s ‘Let’s Get Ready to Rhumble’ called ‘Let’s Get Ready to Classify’ For everything else it’s next to useless.


Ggml Q8 models on ollama can run on much cheaper hardware without losing much performance.


With dstack you can either utilize multiple affordable cloud GPU providers at once to get the cheapest GPU offer, or use your own cluster of on-prem servers; dstack supports both. Disclaimer: I'm a core contributor to dstack.


Up until not too long ago I assumed that self-hosting an LLM would come at an outrageous cost. I have a bunch of problems with LLMs in general. The major one is that all LLMs (even OpenAI's) will produce output that gives anyone a great sense of confidence, only to be later slapped across the face with the harsh reality: for anything involving serious reasoning, chances are the response you got was largely bullshit. The second is that I do not entirely trust those companies with my data, be it OpenAI, Microsoft, GitHub, or any other.

That said, a while ago there was this[1] thread on here which helped me snatch a brand new, unboxed p40 for peanuts. Really, the cost was 2 or 3 jars of good quality peanut butter. Sadly it's still collecting dust since although my workstation can accommodate it, cooling is a bit of an issue - I 3D printed a bunch of hacky vents but I haven't had the time to put it all together.

The reason why I went this road was phi-3, which blew me away by how powerful, yet compact it is. Again, I would not trust it with anything big, but I have been using it for sifting through a bunch of raw, unstructured text and extract data from it and it's honestly done wonders. Overall, depending on your budget and your goal, running an llm in your home lab is a very appealing idea.

[1] https://news.ycombinator.com/item?id=39477848


Hetzner GPU servers at $200/month for an RTX 4000 with 20GB seem competitive. Anyone have experience with what kind of token throughput you could get with that?


Running 13b code llama on my m1 macbook pro as I type this...


What do you use it for? What problems does it solve?


Half-OT: can I shard Llama3 and run it on multiple wasm processes?


this is not what I consider self hosting but ok

I would like to compare the costs vs hardware on prem, so this helps with one side of the equation


? you can run llama 3 8b with a 3060


Yeah, or you can get a gpu server with 20GB VRAM on hetzner for ~200 EUR per month. Runpod and DigitalOcean are also quite competitive on prices if you need a different GPU.

AWS is stupidly expensive.


Expensive in general but combine some decent tooling and spot instances and it can be insanely cheap.

The latest Nvidia L4 GPU (24GB) instances are currently less than 15c/hr spot.

T4s are around 20c per hour spot, though they are smaller and slower.

I've been provisioning hundreds of these at a time to do large batch jobs at a fraction of the price of commercial solutions (i.e. 10-100x cheaper).

Any problem that fits in a smaller GPU and can be expressed as a batch job using spot instances can be done very cheaply on AWS.
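
The request itself is nothing exotic; a rough sketch with the AWS CLI (the AMI ID is a placeholder; g6 instances carry L4s, g4dn carry T4s):

    aws ec2 run-instances \
      --instance-type g6.xlarge \
      --image-id ami-0123456789abcdef0 \
      --instance-market-options 'MarketType=spot' \
      --count 1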


Kind of a ridiculous approach, especially for this model. Use together.ai, fireworks.ai, RunPod serverless, any serverless. Or use ollama with the default quantization, will work on many home computers, including my gaming laptop which is about 5 years old.



