Hacker News
Cost of self hosting Llama-3 8B-Instruct (lytix.co)
245 points by veryrealsid on June 14, 2024 | 183 comments


Instead of using AWS, another approach involves self-hosting the hardware as well. Even after factoring in energy, this does dramatically lower the price.

Assuming we want to mirror our setup in AWS, we’d need 4x NVidia Tesla T4s. You can buy them for about $700 on eBay.

Add in $1,000 to setup the rest of the rig and you have a final price of around:

$2,800 + $1,000 = $3,800

This whole exercise assumes that you're using the Llama 3 8b model. At full fp16 precision that will fit in one 3090 or 4090 GPU (the int8 version will too, and run faster, with very little degradation.) Especially if you're willing to buy GPU hardware from eBay, that will cost significantly less.

I have my home workstation with a 4090 exposed as a vLLM service to an AWS environment where I access it via reverse SSH tunnel.
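
For anyone curious, the tunnel itself is basically a one-liner. A minimal sketch (host name, user, and port are placeholders; vLLM's OpenAI-compatible server listens on 8000 by default):

    # run on the workstation: forward port 8000 on the AWS box back to the local vLLM server
    ssh -N -R 8000:localhost:8000 ubuntu@my-aws-host

    # code on the AWS box can then call http://localhost:8000/v1/... as if vLLM were local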


Why did this only occur to me recently? You can self-host a k8s cluster and expose the services using a $5 DigitalOcean droplet. The droplet and k8s services are point-to-point connected using Tailscale. Performance is perfectly fine, it keeps your skillset sharp, and you’re self-hosting!


You can also just directly connect to containers using Tailscale if it's just for internal use. That is, having an internally addressable `https://container_name` on your tailnet per container if you want. This way I can set up Immich, for example, and it's just on my tailnet at `https://immich` without the need for a reverse proxy, etc...

https://tailscale.com/blog/docker-tailscale-guide
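
Roughly, the sidecar pattern from that guide looks like this (image tag, env vars, the auth key, and the app image are assumptions/placeholders; check the linked post for the exact compose file):

    # run a tailscale sidecar that joins the tailnet as "immich"
    docker run -d --name=immich-ts \
      -e TS_AUTHKEY=tskey-auth-XXXX \
      -e TS_HOSTNAME=immich \
      -v ts-immich-state:/var/lib/tailscale \
      tailscale/tailscale

    # run the app in the sidecar's network namespace so it's reachable as "immich" on the tailnet
    docker run -d --name=immich --network=container:immich-ts my-immich-image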


And you can use Tailscale Funnel to serve it publicly. No need to pay for a cloud instance.

https://tailscale.com/kb/1223/funnel
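
Something along these lines, for example (the Funnel CLI syntax has changed across Tailscale releases, so treat this as a sketch and check `tailscale funnel --help` on your version):

    # expose a local service to the public internet via Tailscale's relays
    tailscale funnel 8000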


I essentially do this with my homelab.


Whoa, so you have code running in AWS making use of your local hardware via what is called a reverse SSH tunnel? I will have to look into how that works, that's pretty powerful if so. I have a mac mini that I use for builds and deploys via FTP/SFTP and was going to look into setting up "messaging" via that pipeline to access local hardware compute through file messages lol, but reverse SSH tunnel sounds like it'll be way better for directly calling executables rather than needing to parse messages from files first.


Look into Nebula (or Tailscale if you trust third parties). I have all my workstations and servers on a mesh network that appears as a single /24 that is end to end encrypted, mutually authenticated and works through/behind NAT. I can spawn a vhost on any server that reverse proxies an API to any port on any machine.

It’s been an absolute gamechanger.


Why do you have to trust a third party?

It's end-to-end encrypted, and with tailnet lock enabled, nodes cannot be added without the user's permission.


Well, one example, depending on your threat model—their privacy policy states that they retain info and comply with subpoenas.

There's also potential for malicious updates to compromise a network (as there is with most software unless you're auditing the source for each update).

E2EE is only as meaningful as where the keys reside, and how easily those keys are abused.


That’s interesting!

The metadata is generally public information, I don’t care about that.

The malicious updates and key abuse are more concerning. That's true for all software, though, and probably better handled at the OS level, like on iOS.

The VPN could steal the keys, but that’s a lawsuit!


Are the keys not already kept on their own infra?


No, private keys don't leave users' devices. This is the case in all such products.

But with a malicious update, they could ship them to their infra, targeting some users. The product then becomes malware!


The idea of "user's permission" is determined by Tailscale and/or the OIDC provider. I don't know anything about "tailnet lock"; perhaps it is a new mitigation for this issue?

I didn’t start with tailscale because the only way you could log into it was with Google or GitHub or something. I don’t trust Microsoft or Google with auth for my internal network. I thought about running Headscale but Nebula was faster/easier for me.


Yes, Microsoft and Google will not be able to authenticate new devices onto your network if you enable tailnet lock. An existing node in your network has to sign them.


Is there any resource that goes into more detail about how to setup all this?


https://github.com/slackhq/nebula

The docs are good. When creating the initial CA, make absolutely sure you set the CA expiration to 10-30 years; the default is 1 year, which means your whole setup explodes in a year without warning.
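
Something like this, for example (flags per the nebula-cert docs; names and IPs are placeholders):

    # create the CA with a ~20 year lifetime instead of the 1-year default
    nebula-cert ca -name "my homelab" -duration 175200h

    # then sign each host against it
    nebula-cert sign -name "workstation" -ip "192.168.100.2/24"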


@sneak, can you comment on your experience with nebula vs Tailscale?


Whooooaaa that is mind-blowing. Thanks for sharing. <3


why either of these over plain wireguard if you're not provisioning accounts?


WireGuard doesn't do NAT punching and isn't mesh; it's p2p only.

Totally different use case.


I feel like WireGuard definitely does NAT punching, unless I misunderstand you. I've been doing this sort of thing to have my phone and desktop on the same "LAN" all the time so I can Moonlight in from anywhere (among other things), and they're definitely NATted.


True. I do NAT punching with a UPnP client on the server side.


I use my Mac mini exactly as described by the parent post but using ollama as the server. Super easy setup, and obv ChatGPT can guide you through it.


Unfortunately my mac mini isn't beefy enough to run ollama, it's the base model m1 from a couple years ago lol. But it's very powerful for builds, deploys, and some computation via scripts. Now I'm curious to check out how much memory the newest ones support for potentially using ollama on it haha. Thanks!


Mine is also an M1. Just use llama3; it's 8B, quantized by default.


I will try it out, curious to see how it will work with 8gb of memory haha. Thanks for the heads up!


Do you happen to have any handy guides/docs/references for absolute beginners to follow?


The absolute easiest way is https://github.com/Mozilla-Ocho/llamafile

Just download a single file and run it.
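
The whole flow is basically this (the URL and file name below are placeholders; grab whichever .llamafile you want from the project's releases page):

    curl -LO https://example.com/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    chmod +x Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
    ./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile

By default it starts a local llama.cpp server with a web UI you can open in the browser.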


Ollama is not as powerful as llama.cpp or raw pytorch, but it is almost zero effort to get started.

brew install ollama; ollama serve; ollama pull dolphin-llama3:8b-v2.9-q5_K_M; ollama run dolphin-llama3:8b-v2.9-q5_K_M

https://ollama.com/library/dolphin-llama3:8b-v2.9-q5_K_M

(It may need to be Q4 or Q3 instead of Q5 depending on how the RAM shakes out. But the Q5_K_M quantization (k-quantization is the term) is generally the best balance of size vs performance vs intelligence if you can run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of course even better but you’re nowhere near fitting that on 8gb.)

https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

Dolphin-llama3 is generally more compliant and I'd recommend that over just the base model. It's been fine-tuned to filter out the dumb "sorry I can't do that" refusals, and it turns out this also increases the quality of the results (by limiting the space you're generating in, you also limit the quality of the results).

https://erichartford.com/uncensored-models

https://arxiv.org/abs/2308.13449

Most of the time you will want to look for an "instruct" model, if it doesn't have the instruct suffix it'll normally be a "fill in the blank" model that finishes what it thinks is the pattern in the input, rather than generate a textual answer to a question. But ollama typically pulls the instruct models into their repos.

(Sometimes you will see this even with instruct models, especially if they're misconfigured. When llama3 non-dolphin first came out I played with it and I'd get answers that looked like Stack Overflow or Quora format responses with "scores" etc., either as the full output or mixed in. Presumably a misconfigured model, or they pulled in a non-instruct model, or something.)

Dolphin-mixtral:8x7b-v2.7 is where things get really interesting imo. I have 64gb and 32gb machines, and so far the Q6 and Q4_K_M are the best options for those machines. Dolphin-llama3 is reasonable, but dolphin-mixtral gives richer, better responses.

I'm told there's better stuff available now, but I'm not sure what a good choice would be for 64gb and 32gb if not mixtral.

Also, just keep an eye on r/LocalLLaMA in general, that's where all the enthusiasts hang out.


Ollama is llama.cpp plus Docker. If you can do without Docker, it's faster.


No, the ollama default quantisation is 4 bit


I meant 8b -> 8billion rather than 70b


Ah sorry!


Using tailscale might be a better and easier solution.


Using Tailscale can make the networking setup much easier; I really like their service for things like this (or curling another dev's locally running server).


You can also check if you have ipv6. I have tried both, but prefer directly connecting home.


I don't know enough about networking and that level of configuration to do it confidently and safely yet. I'd rather rely on using credentials where the only real access is some limited command line executables or file transfer, rather than exposing more of the hardware directly to the network for a direct connection. I do have interest in learning this much, but I find that my current FTP/SFTP approach has more guard rails than a direct connection. Do you agree with this or am I just not understanding enough about ipv6 and direct connections home?


You get the equivalent setup in ipv6 by having your home modem or router deny inbound connections.

I think port forwarding configuration is a pain that does not offer value over just poking a hole in your firewall to do an authenticated connection over ssh.


I do the same thing with cloudflare tunnels and managing the cloudflare tunnel process and the llama.cpp server with systemd on my home internet.

Have a 13B running on a 3070 with 16 gpu layers and the rest running off CPU.

Performs okay, but way cheaper than renting a GPU on the cloud.
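
For reference, a rough sketch of that setup (tunnel name, hostname, and model path are placeholders; the llama.cpp server binary is called `server` in older builds and `llama-server` in newer ones):

    # one-time tunnel setup, then run it
    cloudflared tunnel create llama
    cloudflared tunnel route dns llama llama.example.com
    cloudflared tunnel run llama &

    # llama.cpp server with 16 layers offloaded to the GPU, the rest on CPU
    ./llama-server -m ./models/13b-q4_K_M.gguf -ngl 16 --port 8080

In practice you'd wrap both commands in systemd units as described above, so they restart on boot or failure.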


Read the other comment and also immediately thought of Cloudflare tunnel instead - is there a reason you chose that? Wondering if I should do the same with my old Titan XP (probably slower than your 3070 but it does have 12gb of vram)


No reason other than that Cloudflare is quite a nice reverse proxy and I already use it for managing my DNS records. Thus far, I've not noticed any issues with it, other than when my home wifi goes down.


I dropped $5k on an A6000 and I can run llama3:70b day and night for the price of my electricity bill.

I’ve gone through hundreds of millions, maybe billions, of tokens in the past year.

This article is just “cloud is expensive” 101. Nothing new.


1B of tokens for Gemini Flash (which is on par with llama3-70b in my experience or even better sometimes) with 2:1 input-output would cost ~600 bucks (ignoring the fact they offer 1M tokens a day for free now). Ignoring electricity you'd break even in >8 years. You can find llama3-70b for ~same prices if you're interested in the specific model.


I answered the financial thinking in another reply, but another factor is I need to know if the model today is exactly the same as tomorrow for reliable scientific benchmarking.

I need to tell if a change I made was impactful, but if the model just magically gets smarter or dumber at my tasks with no warning then I can't tell whether I made an improvement or a regression.

Whereas the model on my GPU doesn’t change unless I change it. So it’s one less variable and LLM are black box to start with.

I may be wrong for Gemini, but my impression is all the companies are constantly tweaking the big models. I know GPT on Monday is not always the same GPT on Thursday for example.


I've worked professionally over the last 12 months hosting quite a few foundation models and fine tuned LLMs on our own hardware, aws + azure vms and also a variety of newer "inference serving" type services that are popping up everywhere.

I don't do any work with the output, I'm just the MLOps guy (ahem, DevOps).

You mention expense but on a purely financial basis I find any of these hosted solutions really hard to justify against GPT 3.5 turbo prices, including building your own rig. $5k + electricity is loads of 3.5 Turbo tokens.

Of course none of the data scientists or researchers I work with want to use that though - it's not their job to host these things or worry about the costs.


So my main motivation is not so much to have the lowest cost, but to have the most predictable cost.

Knowing up front this is my fixed ML budget gives me peace of mind and gives me room to try stupid ideas without worrying about it.

Whereas doing it in the cloud you can a) get slammed with some crazy bill by accident, b) have to think more about what resources testing an idea will take, or conversely c) get GPU FOMO and think "if I just upgrade a level all my problems will be solved".

It works for me; everybody's mileage varies, but personally I like to budget, spend, and then totally focus on my goals and not my cloud spend.

I’m also from the pre-cloud era, so doing stuff on my own bare metal is second nature.


Super cool, thanks for sharing. Do you mind sharing what you used the hundreds of millions (or billions) of tokens on?


Doing really nuanced classification of documents at very large scale. Needle in the haystack type problems.


Is this at 4-bit quantization? And how many tokens per second is the output?


I’m doing non-interactive tasks, but in terms of the A6000 running llama3 70b in chat mode it’s as usable as any of the commercial offerings in terms of speed. I read quickly and it’s faster than I read.


Hows your ROI?


Absolutely phenomenal.


Are you using it for trading?


Nope, powers some low-level infrastructure-ish stuff.


Yea, for any hobbyist, indie developer, etc. I think it'd be ridiculous to not first try running one of these smaller (but decently powerful) open source models on your own hardware at home.

Ollama makes it dead simple just to try it out. I was pleasantly surprised by the tokens/sec I could get with Llama 3 8B on a 2021 M1 MBP. Now I need to try it on my gaming PC I never use. Would be super cool to just have an LLM server on my local network for me and the fam. Exciting times.
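
If you do set that up, ollama can be exposed to the LAN just by changing its bind address (the IP below is a placeholder for the gaming PC's address):

    # on the gaming PC: listen on all interfaces instead of localhost only
    OLLAMA_HOST=0.0.0.0 ollama serve

    # from any other device on the network
    curl http://192.168.1.50:11434/api/generate -d '{"model": "llama3", "prompt": "hello"}'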


How is inference latency for coding use cases on a local 3090 or 4090 compared to say, hitting the GPT-4o API?


I assume the characteristics would be pretty different, since your local hardware can keep the context loaded in memory, unlike APIs which I'm guessing have to re-load it for each query/generation?


If you integrate with existing tooling, it won’t do this optimization. Unless of course you really go crazy with your setup.


Setting one launch flag on llama.cpp server hardly qualifies as going crazy with one's setup.


Yeah but this article is terrible. First it talks about naively copy-pasting code to get “a seeming 10x speed-up” and then “This ended up being incorrect way of calculating the tokens used.”

I would not bank on anything in this article. It might as well have been written by a tiny Llama model.


This is great advice. I used to run my dev stuff on AWS, then built a small 6 server proxmox cluster in my basement, 300 cores, 1tb memory, 12tb ssd storage for about 3k usd. I don’t even want to know what it would cost to run a similar config on AWS. You can get cheap ddr4 servers on eBay all day.


Came here to say this. No way you need to spend more than $1500 to run L3 8B at FP16. And you can get near-identical performance at Q8 for even less.

I'm guessing actual break-even time is less than half that, so maybe 2 years.


Furthermore, the AWS estimates are also really poorly done. Using EKS this way is really inefficient, and a better comparison would be AWS Bedrock Haiku which averages $0.75/M tokens: https://aws.amazon.com/bedrock/pricing/

This whole post makes OpenAI look like a better deal than it actually is.


I was getting that sense too. It would not be difficult to build a desktop machine with a 4090 for around $2500. I run Llama-3 8b on my 4090, and it runs well. Plus side is I can play games with the machine too :)


Nvidia's EULA prevents you from using consumer gaming GPUs in datacenters, so 4xxx cards are a non-starter for any service use cases.

EDIT: TOS -> EULA per comments below


What about on prem? Like, my small business needs an LLM. Can I put a 3090 in a box in a closet?

What if I’m a business and I’m selling LLMs in a box for you to put on a private network?

What constitutes a data center according to the ToS? Is it enforceable if you never agree to the ToS (buying through eBay?)


Don't listen to this person. They have no idea what they're talking about.

No one cares about this TOS provision. I know both startups and large businesses that violate it as well as industry datacenters and academic clusters. There are companies that explicitly sell you hardware to violate it. Heck, Nvidia will even give you a discount when you buy the hardware to violate it in large enough volume!

You do you.


In a previous AI wave hosters like OVH and Hetzner started offering servers with GTX 1080 at prices other hosters with datacenter-grade GPUs couldn't possibly compete with - and VRAM wasn't as big of a deal back then. That's who this clause targets.

If you don't rent out servers or VMs, Nvidia doesn't care. They aren't Oracle.


By using the drivers you agree to their TOS. So yes, it applies even on your private network.


The customer limitation described in the EULA is exactly this:

> No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

- https://www.nvidia.com/content/DriverDownloads/licence.php?l...

There's no further elaboration on what "datacenter" means here, and it's a fair argument to say that a closet with one consumer-GPU-enriched PC is not a "datacenter deployment". The odds that Nvidia would pursue a claim against an individual or small business who used it that way is infinitesimal.

So both the ethical issue (it's a fair-if-debatable read of the clause) and the practical legal issue (Nvidia wouldn't bother to argue either way) seem to say one needn't worry about it.

The clause is there to deter at-scale commercial service providers from buying up the consumer card market.


That never stopped the crypto farmers...


They also weren't selling the usage of the cards.


There are no nvidia police, they literally cannot stop you from doing this.


It's not in a data center, it's in his home.


How would they even know?


Nvidia terms of what?


Parent commenter used the wrong word. It’s the EULA that prevents it.

Regardless, it is true that it is a problem.

https://www.reddit.com/r/MachineLearning/comments/ikrk4u/d_c...


Llama-3 is one of the models provided by AWS Bedrock which offers pay as you go pricing. I'm curious how it would break down on that.

LLAMA 8B on Bedrock is $0.40 per 1M input tokens and $0.60 per 1M output tokens which is a lot cheaper than OpenAI models.

Edit: to add to that, as technical people we tend to discount the value of our own time. Bedrock and the OpenAI API are both very easy to integrate with and get started on. How long did this server take to build? How much time does it take to maintain and make sure all the security patches are applied each month? How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail? That's got to be part of the break-even equation.


> How long did this server take to build?

About 3 days [from 0 and iterating multiple times to the final solution]

> How much time does it take to maintain and make sure all the security patches are applied each month?

A lot

> How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail?

All really good points. The exercise to self-host is really just to see what is possible, but I completely agree that self-hosting makes little to no sense unless you have a business case that can justify it.

Not to mention that if you sign customers with SLAs and then end up having downtime, that would put even more pressure on your self-hosted hardware.


Just to bounce off this a little. If you are looking to fine-tune using an on demand service it seems Amazon Sagemaker can do it at seemingly decent prices:

https://aws.amazon.com/sagemaker/pricing/

I'd love to hear someones experience using this as I want to make an RPG rules bot tied to a specific ruleset as a project but I fear AWS as it might bankrupt me!


In my experience SageMaker was relatively straightforward for fine-tuning models that could fit on a single instance, but distributed training still requires a good bit of detailed understanding of how things work under the covers. SageMaker Jumpstart includes some pretty easy out-of-the-box configurations for fine-tuning models that are a good starting point. They will incorporate some basic quantization and other cost-savings techniques to help reduce the total compute time.

To help control costs, you can choose pretty conservative settings in terms of how long you want to let the model train for. Once that iteration is done and you have a model artifact saved, you can always pick back up and perform more rounds of training using the previous checkpoint as a starting point.


Groq also has pay-as-you-go pricing for Llama 3 8B at only $0.05/$0.08, and it is very fast.


Groq is actually allowing you to pay now and get real service?


The option to pay is still listed as coming soon, but I also see pricing information in the settings page, so maybe it actually is coming somewhat sooner. I’m seeing $0.05/1M input and $0.10/1M output for llama3 8B, which is not exactly identical to what the previous person quoted.

Either way, I wish Groq would offer a real service to people willing to pay.


I found the .05/.08 here: https://wow.groq.com/


tl;dr: no-ish, it's getting better but still not there.

I don't really get it, only thing I can surmise is it'd be such a no-brainer in various cases, that if they tried supporting it as a service, they'd have to cut users. I've seen multiple big media company employees begging for some sort of response on their discord.


didn't know they finally turned on pricing plans


> How long did this server take to build? How much time does it take to maintain and make sure all the security patches are applied each month? How often does it crash and how much time will be needed to recover it? Do you keep spare parts on hand / how much is the cost of downtime if you have to wait to get a replacement part in the mail? That's got to be part of the break-even equation.

All of these are the kinds of things that people say to non-technical people to try to sell cloud. It's all fluff.

Do you really think that cloud computing doesn't have security issues, or crashes, or data loss, or that it doesn't involve lots of administration? Thinking that we don't know any better is both disingenuous and a bit insulting.


I've managed fleets of over 100k instances on cloud providers; even with all the excellent features available through APIs, managing instances can quickly get tricky.

Tbh, your comment is kind of insulting and belittles how far we've come ahead in infrastructure management.

The cloud is probably more secure than a set of janky servers that you have running in your basement. You can totally automate away 0-days and CVEs, and get access to better security primitives.


> The cloud is probably more secure than a set of janky servers that you have running in your basement.

Apples/Oranges. Your janky cloud[0] is less secure than the servers in my basement, because I'm a mostly competent sysadmin. Cloud lets you trade some operational concerns for higher costs, but not all of them.

[0] If you can assume servers run by somebody who doesn't know how to do it properly, obviously I can assume the same about cloud configuration. Have fun with your leaked API keys.


If my comment is insulting, I apologize. That was not my intention. My intention was to say that writing sales speak in a technical discussion is insulting to those of us who know better.

However, you've now gone out of your way to try to be insulting. You know nothing about me, yet you want to suggest that the cloud is more secure than my servers, and that my servers are "janky"?

Please try a little harder to engage in reasonable discourse.


Almost by definition you cannot automate away 0-days.


I've managed both data centers and cloud and IMHO, no, it is not fluff. To take it in order:

> doesn't have security issues

It sure does, but the matrix of responsibility is very different when it is a hosted service. Note: I am making these comments about Bedrock, which is serverless, not in relation to EC2.

> It crashes

Absolutely, but the recovery profile is not even close to the same. Unless you have a full time person with physical access to your server who can go press buttons.

> data loss

I'm going to shift this one a tiny bit. What about hardware loss? You need backups regardless. On the cloud when a HDD dies you provision a new one. On premise you need to have the replacement there and ready to swap out (unless you want to wait for shipping). Same with all the other components. So you basically need to buy two of everything. If you have a fleet of servers that's not too bad since presumably they aren't going to all fail on the same component at the same time. But for a single server it is literally double the cost.

> doesn't involve lots of administration

Again, this is in relation to Bedrock, which is a managed serverless environment. So there is literally no administration aside from provisioning and securing access to the resource. You'd have a point if this were running on EC2 or EKS, but that's not what my post was about.

> Thinking that we don't know any better is both disingenuous and a bit insulting.

I'm not saying cloud is perfect in any way; like all things it requires tradeoffs. But quite frankly, I find your dismissing my 25 years of experience (a third of it working in real data centers, including a top-50 internet company at the time) as "fluff" to be "disingenuous and a bit insulting".


> Unless you have a full time person with physical access to your server who can go press buttons.

Every colo facility I’ve used offers “remote hands”. If you need a button pressed or a disk swapped, they will do it, with a fee structure and response time that varies depending on one’s arrangement with the operator. But it’s generally both inexpensive and fast.

> What about hardware loss? You need backups regardless. On the cloud when a HDD dies you provision a new one. On premise you need to have the replacement there and ready to swap out (unless you want to wait for shipping).

Two of everything may still be cheaper than expensive cloud services. But there’s an obvious middle ground: a service contract that guarantees you spare parts and a technician with a designated amount of notice. This service is widely available and reasonably priced. (Don’t believe the listed total prices on the web sites of big name server vendors — they negotiate substantial discounts, even in small quantities.)


> But it’s generally both inexpensive and fast.

I guess inexpensive is relative. I've been on cloud for a while so I'm not sure what the going rates are for "remote hands" and most of my experience is with on-premise vs co-lo.

> Two of everything may still be cheaper than expensive cloud services.

That is true. Everything has tradeoffs. Though in the OPs case I think the math is relatively clear. With Open AIs pricing he calculated the break even at 5 years just for the hardware and electricity. Assuming that calculation is right, two of everything would up that to 7+ years, at which point... a lot can happen in 7 years.


> I guess inexpensive is relative. I've been on cloud for a while so I'm not sure what the going rates are for "remote hands" and most of my experience is with on-premise vs co-lo.

At a low end facility, I’ve usually paid between $0 and $50 per remote hands incident. The staff was friendly and competent, and I had no complaints. The price list goes a bit higher, but I haven’t needed those services at that facility.


You could have gotten rid of the middle paragraph. It’s not fluff. These are valid technical points. Issues most companies would rather (reasonably) pay to not have to deal with.

And do you really think you can offer better security and uptime than AWS? Not impossible but very expensive if you’re managing everything from your own hardware. You clearly vastly underestimate all that AWS is taking care of.


Self hosting means hosting it yourself, not running it on Amazon. I think the distinction the author intends to make is between running something that can't be hosted elsewhere, like ChatGPT, versus running Llama-3 yourself.

Overlooking that, the rest of the article feels a bit strange. Would we really have a use case where we can make use of those 157 million tokens a month? Would we really round $50 of energy cost to $100 a month? (Granted, the author didn't include power for the computer) If we buy our own system to run, why would we need to "scale your own hardware"?

I get that this is just to give us an idea of what running something yourself would cost when comparing with services like ChatGPT, but if so, we wouldn't be making most of the choices made here such as getting four NVIDIA Tesla T4 cards.

Memory is cheap, so running Llama-3 entirely on CPU is also an option. It's slower, of course, but it's infinitely more flexible. If I really wanted to spend a lot of time tinkering with LLMs, I'd definitely do this to figure out what I want to run before deciding on GPU hardware, then I'd get GPU hardware that best matches that, instead of the other way around.


> Self hosting means hosting it yourself, not running it on Amazon.

No. I googled "self hosting", read the first few definitions, and they agree with the article, not you. E.g., wikipedia -- https://en.wikipedia.org/wiki/Self-hosting_(web_services)


I would say that is "cloud hosted", which is obviously very expensive compared to running on hardware you own (assuming you own a computer and a GPU). That was the comparison I was interested in, the fact that renting a computer is more expensive than the OpenAI API is not a surprising result.


The very first definition from the link you provide is:

> Self-hosting is the practice of running and maintaining a website or service using a private web server, instead of using a service outside of someone's own control.

Hosting anything on Amazon is not "using a private web server" and is the very definition of using "a service outside of someone's own control".

The fact that the rest of the article talks about "enabled users to run their own servers on remote hardware or virtual machines" is just wrong. It's not "their own servers", and we don't have "more control over their data, privacy" when it's literally in the possession of others.


The second sentence is however :

> The practice of self-hosting web services became more feasible with the development of cloud computing and virtualization technologies, which enabled users to run their own servers on remote hardware or virtual machines. The first public cloud service, Amazon Web Services (AWS), was launched in 2006, offering Simple Storage Service (S3) and Elastic Compute Cloud (EC2) as its initial products.[3]

The mystery deepens


I hate when terms get diluted like this. "Self hosted", to me, means you own the physical machine. This reminds me of how "air-gapped server" now means a route configuration vs. an actual gap of air, with no physical connection, between two networks. It really confuses things.


That's a bit of newspeak

I think we generally understand IaaS, PaaS, and SaaS to be hosted offerings, managed and unmanaged...

https://duckduckgo.com/?q=iaas+paas+saas&ia=web


3 year commit pricing with Jetstream + Maxtext on TPU v5e is $0.25 per million tokens.

On demand pricing put it at about $0.45 per million tokens.

Source: We use TPUs at scale at https://osmos.io

Google Next 2024 session going into detail: https://www.youtube.com/watch?v=5QsM1K9ahtw

https://github.com/google/JetStream

https://github.com/google/maxtext


For pytorch users: checkout the sister project: https://github.com/google/jetstream-pytorch/blob/main/benchm...


I wonder how long NVIDIA can justify its current market cap once people realize just how cheap it is to run inference on these models given that LLM performance is plateauing, LLM's as a whole are becoming commoditized, and compute demand for training will drop off a cliff sooner than people expect.


> LLM performance is plateauing

It’s a wee bit early to call this. Let’s see what the top labs release in the next year or two, yeah?

GPT-4 was released only 15 months ago, which was about 3 years after GPT-3 was released.

These things don’t happen overnight, and many multi-year efforts are currently in the works, especially starting last year.


As someone else points out, training is slightly more involved, but I also find that these smaller models are next to worthless compared to the larger ones.

There are probably some situations where it suffices to use a small model, but for most purposes, I'd prefer to use the state of the art, and I'm eager for that state to progress a little more.


> but for most purposes, I'd prefer to use the state of the art

I'm guessing in the future, we'll see a lot more automatic inference "on our behalf", and you won't care or notice if its using Llama5:3b or whatever comes out then.

I'm betting that in a few years we'll see LLMs baked into a ton of stuff - lots of "simple" things mostly, like email summary/rewording, and similar "light touch" use cases. Maybe light multi-modal work like photo labeling. Probably expanded to run in every other SaaS application doing god knows what. For those, the small models would be more than enough, and much cheaper to run in large volumes.

I'm guessing "chat with a bot to ask questions" will be a small amount of the inference that happens on our behalf, but will use the valuable SOTA model use case.


I partially believe that the real race for many tech players is actually AGI (and ASI later), and until that problem is solved the hardware arms race will keep being part of it.

Not only is big tech part of it, but billion-dollar startups are popping up everywhere from China to the US and the Middle East.


Unless we hit another AI winter. We might get to the point where the hardware just can't give better returns and have to wait another 20 years for the next leap forward. We're still orders of magnitude away from the human brain.


It's actually about training, not inference. You can't do training on commodity GPUs, but yeah, once someone figures that out, Nvidia could crash.


Nvidia doesn’t obviously have a strong inference play right now for a widely-deployed small model. For a model that really needs a 4090, maybe. But for a model that can run on a Coral chip or an M1/M2/M3 or whatever Intel or AMD’s latest little AI engines can do? This market has plenty of players, and Nvidia doesn’t seem to be anywhere near the lead except insofar as it’s a little bit easier to run the software on CUDA.


I know; my point is that when training demand decreases, people will realize that inference does not make up the difference.


Yeah the big question I’m struggling with is exactly when training demand will fall if at all


Well, every other tech company is writing checks to prove they have the chops to make an LLM. From IBM to Databricks to the big guys like Google. Tons of companies made one just to show off to investors or their CEO, or just because they wanted skin in the game. But that probably won't continue forever. We've already seen certain orgs that seem to outshine others, and if they can't catch up, they may just settle for using the open-access models or APIs instead.

At some point everyone will realize that it is becoming a commodity, and it is very expensive to train; then only those with either a structural advantage to lower price (e.g. Google) or a true goal of being on the high end/SOTA of the market (OpenAI, Anthropic) will keep going.


You're talking about training the foundation models. But what about all the fine tuning on non-public business data that will be necessary to make gen AI useful for actual business processes?

I'm finding it difficult to estimate the size of this workload compared to continued training of foundation models. Perhaps it depends on whether there are new architectural breakthroughs that require retraining of foundation models.

And what about non-language tasks such as interpreting video and 3D sensory data? This is potentially huge, but between huge peaks there is often a valley of unknowable depth and breadth.


I have yet to hear a use-case for “fine tuning from business data” that relied on large models to succeed. Once again, I’m skeptical the average business will need this.

Yes, video probably requires a lot of GPUs to train. And a lot of source material to train against. And a use case. Which again, most companies don't have.

Model development is clearly here to stay, and clearly valuable. Models from every other company, either foundation or fine tuned - I’m not sure that emperor is wearing many clothes any time soon.


Every research lab is focused on new architectures that would reduce training costs.


Yeah, we essentially need Hadoop for LLM training.


> I wonder how long NVIDIA can justify its current market cap once people realize just how cheap it is to run inference on these models given that LLM performance is plateauing

The next wave driving demand could be actual new products developed on LLMs. There are very few use cases currently well developed besides chatbots, but the potential is very large.


Agreed with the sentiments here that this article gets a lot of the facts wrong, and I'll add one: the cost for electricity when self-hosting is dramatically lower than the article says. The math assumes that each of the Tesla T4s will be using their full TDP (70W each) 24 hours a day, 7 days a week. In reality, GPUs throttle down to a low power state when not in use. So unless you're conversing with your LLM literally 24 hours a day, it will be using dramatically less power. Even when actively doing inference, my GPU doesn't quite max out its power usage.

Your self-hosted LLM box is going to use maybe 20-30% of the power this article suggests it will.

Source: I run LLMs at home on a machine I built myself.
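
Easy to verify on any Nvidia box, for what it's worth:

    # spot-check what the cards actually draw at idle vs. under load
    nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv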


Surprised no comments are pointing out that the analysis is pretty far off simply due to the fact that the author runs with batch size of 1. The cost being 100x - 1000x what API providers are charging should be a hint that something is seriously off, even if you expect some of these APIs to be subsidized.


No way you need $3,800 to run an 8B model. 3090 and a basic rig is enough.

That being said, the difference between OpenAI and AWS cost ($1 vs $17) is huge. Is OpenAI just operating at a massive loss?

Edit: Turns out AWS is actually cheaper if you don't use the terrible setup in this article, see comments below.


AWS's pricing is just ridiculous. Their 1-year reserve pricing for an 8x H100 or A100 instance (p4/p5) costs just as much as buying the machine outright with tens of thousands left over for the NVIDIA enterprise license and someone to manage them (per instance!). Their on demand pricing is even more insane - they're charging $3.x/hr for six year old cards.


What about the cost of the power and cooling to run the machine (a lot!), and the staff to keep it running?


That's why I said "and someone to manage them". The difference is in the tens of thousands of dollars per instance. The savings from even a dozen instances is enough to pay for someone to manage them full time, and that's just for the first year. Year 2 and 3 you're saving six figures per instance so you'd be able to afford one person per machine to hand massage them like some fancy kobe beef.

A100 TDP is 400W so assuming 4kW for the whole machine, that's a little more than $5k/year at $0.15/kWh. Again, the difference is in the tens of thousands per instance. Even at 50% utilization over three years, if you need more than a dozen machines it's much cheaper to buy them outright, especially on credit.


I thought it was generally known they were operating at a loss?

Even with the subs and API charges, they still let people use ChatGPT for free with no monetization options. Sure, they are collecting the data for training, but that's hard to quantify the value of.


I mean, no; I came to scan the comments quickly after reading, because there's a lot of bad info you can walk away with from the post. It's sort of starting from scratch on hosting LLMs.

If you keep reading past there, they get it down significantly. The 8 tkn/s number AWS was evaluated on is really funny; that's about what you'd get on last year's iPhone, and it's not because Apple's special, it's because there's barely any reasonable optimization being done here. No batching, float32 weights (8-bit is guaranteed indistinguishable from 32-bit, 5-bit tests as definitely indistinguishable in blind tests, 4-bit arguably is indistinguishable).


You're right. In fact, using EKS at all is silly when AWS offers their Bedrock service with Claude Haiku (rated #19 on Chat Arena vs. ChatGPT3.5-Turbo at #37) for a much lower cost of $0.75/M tokens (averaging input and output like OP does)[0].

So in reality AWS is cheaper for a much better model if you don't go with a wildly suboptimal setup.

[0] https://aws.amazon.com/bedrock/pricing/


Yeah, what? A GV100 is $1600 on eBay and you can run a 3.8bpw quant of 70B llama at some decent number of tk/s; if you buy two you can run a 70B 5bpw quant with 32k context, no problem.


A single synchronous request is not a good way to understand cost here unless your workload is truly singular tiny requests. Chatgpt handles many requests in parallel and this article's 4 GPU setup certainly can handle more too.

It is miraculous that the cost comparison isn't worse given how adversarial this test is.

Larger requests, concurrent requests, and request queueing will drastically reduce cost here.


Great mix of napkin math and proper analysis, but what strikes me most is how cheap LLM access is. For it being relatively bleeding edge, us splitting hairs on < $20/M tokens is remarkable itself, and something tech people should be thrilled about.


Smacks of the "starving kids in Africa" fallacy, you could make the same argument that tech people should be thrilled for current thing being available at $X for X = $2/$20/$200/$2000...


The T4 is a six year old card. A much better comparison would be a 3090, 4090, A10, A100, etc.


>initial server cost of $3,800

Not following?

Llama 8B is like 17ish gigs. You can throw that onto a single 3090 off ebay. 700 for the card and another 500 for some 2nd hand basic gaming rig.

Plus you don't need a 4 slot PCIE mobo. Plus it's a gen4 pcie card (vs gen3). Plus skipping the complexity of multi-GPU. And wouldn't be surprised if it ends up faster too (everything in one GPU tends to be much faster in my experience, plus 3090 is just organically faster 1:1)

Or if you're feeling extra spicy you can do same on a 7900XTX (inference works fine on those & it's likely that there will be big optimisation gains in next months).


> Llama 8B is like 17ish gigs. You can throw that onto a single 3090 off ebay

Someone correct me if I'm wrong, but I've always thought you needed enough VRAM to have at least double the model size so that the GPU has enough VRAM for the calculated values from the model. So that 17 GB model requires 34 GB of RAM.

Though you can quantize to fp8/int8 with surprisingly little negative effect and then run that 17 GB model with 17 GB of VRAM.


No, you don't need that much

Here is a calculator (if you have a GPU you want to use EXL2, otherwise GGUF) https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

Also model quantisation goes a long way with surprisingly little loss in quality.


You do need more if you use larger context sizes though. It can really blow up to multiple times the model size even for 128k context.

Edit: Oh I see that calculator you linked shows that too. My information was more trial and error, thanks for the calculator link.
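
A back-of-envelope way to see why context dominates, assuming Llama 3 8B's published shape (32 layers, 8 KV heads, head dim 128) at fp16 - double-check those numbers against the model card:

    echo $(( 2 * 32 * 8 * 128 * 2 ))        # KV-cache bytes per token of context -> 131072 (~128 KiB)
    echo $(( 131072 * 8192 / 1024**3 ))     # GiB of KV cache at 8k context   -> 1
    echo $(( 131072 * 131072 / 1024**3 ))   # GiB of KV cache at 128k context -> 16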


These costs don't line up with my own experiments using vLLM on EKS for hosting small to medium sized models. For small (under 10B parameters) models on g5 instances, with prefix caching and an agent style workload with only 1 or a small number of turns per request, I saw on the order of tens of thousands of tokens/second of prefill (due to my common system prompts) and around 900 tokens/second of output.

I think this worked out to around $1/million tokens of output and orders of magnitude less for input tokens, and before reserved instances or other providers were considered.
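
For anyone wanting to reproduce something similar, the serving side is roughly this (the model name is an example; the flag is --enable-prefix-caching in current vLLM releases, but check your version):

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Meta-Llama-3-8B-Instruct \
      --enable-prefix-caching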


Interesting, I think how the model runs makes a big difference and I plan to re-run this experiment with different models and different ways of running the model.


Does anyone know the impact of the prompt size in terms of throughput? If I'm only generating 10 tokens, does it matter if my initial prompt is 10 tokens or 8000 tokens? How much does it matter?


I just bought a $1099 MacBook Air M3, I get about 10 tokens/s for a q5 quant. Doesn't even get hot, and I can take it with me on the plane. It's really easy to install ollama.


Until January this year I mostly used Google Colab for both LLMs and deep learning projects. In January I spent about $1800 getting Apple Silicon M2Pro 32G. When I first got it, I was only so-so happy with the models I could run. Now I am ecstatically happy with the quality of the models I can run on this hardware.

I sometimes use Groq Llama3 APIs (so fast!) or OpenAI APIs, but I mostly use my 32G M2 system.

The article calculates cost of self-hosting, but I think it is also good taking into account how happy I am self hosting on my own hardware.


I own an 8 GPU cluster that I built for super cheap, < $4,000: 180gb VRAM, 7x 24gb + 1x 24gb. There are tons of models I run that aren't hosted by any provider. The only way to run them is to host myself. Furthermore, the author gets 39 tokens in 6 seconds. For llama3-8b, I get almost 80 tk/s, and if running in parallel, can easily get up to 800 tk/s. Most users at home infer only one at a time because they are doing chat or role play. If you are doing more serious work, you will most likely have multiple inferences running at once. When working with smaller models, it's not unusual to have 4-5 models loaded at once with multiple inferences going. I have about 2tb of models downloaded, I don't have to shuffle data back and forth to the cloud, etc. To each their own; the author's argument is made today by many on why you should host in the cloud. Yet if you are not flush with cash and are a little creative, it's far cheaper to run your own server than to rent in the cloud.

To run Llama 3 8B, a new $300 3060 12gb will do; it will load fine in Q8 GGUF. If you must load in fp16 and cash is a problem, a $160 P40 will do. If performance is desired, a used 3090 for ~$650 will do.


I am looking into renting a Hetzner GEX44 dedicated server to run a couple of models on with Ollama. I haven't done the arithmetic yet, but I wouldn't be surprised to see a 100x cost decrease compared to the OpenAI APIs (granted, the models I'll run on the GEX44 machine will be less powerful).


What kind of setup were you able to do for so cheap? I'd love to be able to do more locally. I have access to a single RTX A5000 at work, but it is often not enough for what I'm wanting to do, and I end up renting cloud GPU.



Interested in your sub-$4k 8 GPU setup. Care to elaborate a bit or do you have a write up somewhere?


check my reply above. 2 xeon cpus, 40 cores, 88 lanes, 128gb drive, fast nvme 2tb SSD.


Curious how the older Titan XP with 12GB Vram might compare


I agree with most of the criticisms here, and will add on one more: while it is generally true that you can’t beat “serverless” inference pricing for LLMs, production deployments often depend on fine-tuned models, for which these providers typically charge much more to host. That’s where the cost (and security, etc.) advantage for running on dedicated hardware comes in.


The energy costs in the Bay Area are double the reported 24c rate, so energy alone would be around $100-ish a month instead of $50-ish.


Except that the article assumes that the GPUs would be using their max TDP all the time, which is incorrect. GPUs will throttle down to 5-20w (depending on the specific GPU). So your actual power consumption is going to be much, much lower, unless you’re literally using your LLM 24/7.


Unless you are in Santa Clara with Silicon Valley Power rates.

https://www.siliconvalleypower.com/residents/rates-and-fees


Yeah, agreed. Some of the areas we have access to were 16c (PA) and up to 24c (NYC); we doubled that cost in the analysis because of things like this.


llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine. I can't remember the TPS I got there, but it was well above the 6 mentioned in the article.


Interesting, I got very different results depending on how I ran the model, will definitely give this a try!

edit: Actually could you share how long it took to make a query? One of our issues is we need it to respond in a fast time frame


I checked some logs from my past experiments: prompt processing went at about 400 tps over a ~3k token query, so about 7 seconds to process it, and then the generation speed was about 28 tokens/s.


deepinfra.com hosts Llama 3 8b for 8 cents per 1m tokens. I'm not sure it's the cheapest but it's pretty cheap. There may be even cheaper options.

(Haven't used it in production, thinking to use it for side projects).


does aws not have lower vcpu and memory instances with multiple T4s? because with 192gbs of memory and 24 cores, you're paying for a ton of resources you won't be using if you're only running inference.


This is a good way to do the math. But honestly, how many products actually have 100% utilisation? I did some math a few months ago, mostly on the basis of active users: what would the % difference be if you have 1k to 10k users/mo? You can run this as low as $0.3K/mo on serverless GPUs and $0.7K/mo on EC2.

The pricing is outdated now.

Here is the piece: https://www.inferless.com/learn/unraveling-gpu-inference-cos...


There's also the option of platforms such as BentoML (I have no affiliation) that offer usage-based pricing so you can at least take the 100% utilization assumption off the table. I'm not sure how the price compares to EKS.

https://www.bentoml.com/


If we care about cost efficiency when running LLMs, the most important things are:

1. Don't use AWS, because it's one of the most expensive cloud providers

2. Use quantized models, because they offer the best output quality per money spent, regardless of the budget

This article, on the other hand, focuses exclusively on running an unquantized model on AWS...


This is another one of those "I used this for 5 minutes and found this out" naive posts which add nothing useful.

Check out the host-LLMs-at-home crowd. One tool to look at is llama.cpp. Model compression is one of the first techniques for successfully running models on low-capacity hardware.
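
If you want to do the compression step yourself rather than downloading pre-quantized weights, llama.cpp ships a tool for it (the binary was called `quantize` in older builds, `llama-quantize` in newer ones; file names here are placeholders):

    ./llama-quantize ./models/llama-3-8b-instruct-f16.gguf ./models/llama-3-8b-instruct-q4_K_M.gguf Q4_K_M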


There's some dodgy maths

>( 100 / 157,075,200 ) * 1,000,000 = $0.000000636637738

Should be $0.64 per 1M tokens, so still expensive.


being 6 orders of magnitude off in your cost calculation isn't great.

groq costs about that for llama 3 70b (which is a monumentally better model) and 1/10th of that for llama 3 8b


Groq doesn’t currently have a paid API that one can sign up for.


Yup. True. Should say "will" - currently free but heavily rate-limited. Together AI looks to be about $0.30 / 1M tokens, as another price comparison. Which you can pay for.


I’ve used llama3 on my work laptop with ollama. It wrote an amazing pop song about k-nearest neighbours in the style of PJ and Duncan’s ‘Let’s Get Ready to Rhumble’ called ‘Let’s Get Ready to Classify’ For everything else it’s next to useless.


Ggml Q8 models on ollama can run on much cheaper hardware without losing much performance.


With dstack you can either utilize multiple affordable cloud GPU providers at once to get the cheapest GPU offer, or use your own cluster of on-prem servers; dstack supports both. Disclaimer: I'm a core contributor to dstack.


Up until not too long ago I assumed that self-hosting an LLM would come at an outrageous cost. I have a bunch of problems with LLMs in general. The major one is that all LLMs (even OpenAI's) will produce output that gives anyone a great sense of confidence, only to be later slapped across the face with the harsh reality: for anything involving serious reasoning, chances are the response you got was largely bullshit. The second is that I do not entirely trust those companies with my data, be it OpenAI, Microsoft, GitHub, or any other.

That said, a while ago there was this[1] thread on here which helped me snatch a brand new, unboxed p40 for peanuts. Really, the cost was 2 or 3 jars of good quality peanut butter. Sadly it's still collecting dust since although my workstation can accommodate it, cooling is a bit of an issue - I 3D printed a bunch of hacky vents but I haven't had the time to put it all together.

The reason why I went this road was phi-3, which blew me away by how powerful, yet compact it is. Again, I would not trust it with anything big, but I have been using it for sifting through a bunch of raw, unstructured text and extract data from it and it's honestly done wonders. Overall, depending on your budget and your goal, running an llm in your home lab is a very appealing idea.

[1] https://news.ycombinator.com/item?id=39477848


Hetzner GPU servers at $200/month for an RTX 4000 with 20GB seem competitive. Anyone have experience with what kind of token throughput you could get with that?


Running 13b code llama on my m1 macbook pro as I type this...


What do you use it for? What problems does it solve?


Half-OT: can I shard Llama3 and run it on multiple wasm processes?


this is not what I consider self hosting but ok

I would like to compare the costs vs hardware on prem, so this helps with one side of the equation


? you can run llama 3 8b with a 3060


Yeah, or you can get a gpu server with 20GB VRAM on hetzner for ~200 EUR per month. Runpod and DigitalOcean are also quite competitive on prices if you need a different GPU.

AWS is stupidly expensive.


Expensive in general but combine some decent tooling and spot instances and it can be insanely cheap.

The latest Nvidia L4 GPU (24GB) instances are currently less than 15c/hr spot.

T4s are around 20c per hour spot, though they are smaller and slower.

I've been provisioning hundreds of these at a time to do large batch jobs at a fraction of the price of commercial solutions (i.e. 10-100x cheaper).

Any problem that fits in a smaller GPU and can be expressed as a batch job using spot instances can be done very cheaply on AWS.
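
The request itself is nothing exotic; a rough sketch with the AWS CLI (the AMI ID is a placeholder; g6 instances carry L4s, g4dn carry T4s):

    aws ec2 run-instances \
      --instance-type g6.xlarge \
      --image-id ami-0123456789abcdef0 \
      --instance-market-options 'MarketType=spot' \
      --count 1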


Kind of a ridiculous approach, especially for this model. Use together.ai, fireworks.ai, RunPod serverless, any serverless. Or use ollama with the default quantization, will work on many home computers, including my gaming laptop which is about 5 years old.



