I mean, at some point someone has to buy them to be able to offer services on them to others.
Renting comes with certain limitations owners don't have.
And some people have too much money to not invest in fun.
This build is 3 kVA max. That's about a third of what a current-gen EV draws while charging, only 15% of what an original Tesla Model S with dual chargers could pull, and about equal to a standard American oven. This is much more polite to the grid than, say, a couple of tea kettles, or especially a reasonably sized electric tankless water heater.
This article was written or rewritten via your model, right?
The last paragraphs feel totally like AI.
Anyway, I'd like a follow-up on the curating, cleaning, and training part, which is far more interesting than how to select hardware, something we've been doing for over 25 years.
The bottleneck for training most model sizes is VRAM, and since each 4090 has 24 GB VRAM, that's 96 GB VRAM total. The article mentions that it can train LLMs from scratch up to 1 billion hyperparameters, which tracks.
Nowadays that's not a lot: a single H100 that you can now rent has 80 GB VRAM, and doesn't have the technical overhead of handling work across GPUs.
You should be able to train/full-fine-tune (i.e. full weight updates, not LoRA) a much larger model with 96GB of VRAM. I generally have been able to do a full fine-tune (which is equivalent to training a model from scratch) of 34B parameter models at full bf16 using 8XA100 servers (640GB of VRAM) if I enable gradient checkpointing, meaning a 96GB VRAM box should be able to handle models of up to 5B parameters. Of course if you use LoRA, you should be able to go much larger than this, depending on your rank.
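Back-of-the-envelope, the sizing works out roughly like this (my assumptions: bf16 weights and gradients, fp32 AdamW moments, and gradient checkpointing keeping activation overhead small):

    # Rough VRAM estimate for a full fine-tune in bf16 with AdamW.
    # Assumptions (not exact): bf16 weights (2 B/param), bf16 grads (2 B/param),
    # fp32 AdamW moments (8 B/param), gradient checkpointing keeping activation
    # memory to a modest fraction of the total.
    def training_vram_gb(params_billion, activation_overhead=0.15):
        bytes_per_param = 2 + 2 + 8      # weights + grads + optimizer states
        base = params_billion * 1e9 * bytes_per_param / 1024**3
        return base * (1 + activation_overhead)

    for p in (1, 5, 13, 34):
        print(f"{p}B params -> ~{training_vram_gb(p):.0f} GB")

With these assumptions ~5B lands around ~65 GB and 34B around ~440 GB, which is roughly consistent with both the 640 GB A100 servers and a ~5B model on 96 GB total (sharded across the four cards with something like FSDP/ZeRO).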
Is there a reason you used hyperparameters rather than parameters? I was going to politely correct the terminology, but you seem to have been in AI for some time, so either it was a mistype or I am misunderstanding what you are referencing.
People who are making quick social media posts while taking a casual walk outside, on websites that don't make it easy to edit posts, and who are not expecting to be nitpicked about it.
Overall, it's something I've seen very often on social media and less technical articles about LLMs. OpenAI would fall into the "almost" category.
It's okay to say that you mistyped or whatever while taking a casual walk outside, on websites that don't make it easy to edit posts, where you don't expect to be nitpicked about it. Throwing in that everyone uses them interchangeably, however, is just profoundly wrong on every level.
I wasn't nitpicking. It is a HUGE distinction, and I pointed it out specifically because people pick up on terminology, so people who might not know better will go forward and just drop in the more super duper hyperparameter, not realizing that it makes them look like they don't know what they're talking about. As I said in the other post, no one who knows anything uses them interchangeably. It is just completely wrong.
Again, I've heard and used the terminology "model hyperparameter" in place of "model parameter", and I've also heard "model parameter" in place of "model hyperparameter", because not every human interaction is a paper on arXiv and the terms are obviously very similar. The context of the term is what matters in the end (as demonstrated by other comments picking up on my intended meaning), and society will not crumble if either term is used incorrectly in casual conversation. No one intentionally uses the wrong term, but as jokingly said in another comment, "when you get really deep into model training, it can seem like there are a billion hyperparameters you have to worry about."
I appreciate being corrected, but you are the one who asked for my opinion based on my extensive time in AI, you can choose to believe it or not.
I doubt the VRAM just adds up into one pool. I think that's a feature reserved for their NVLinked HPC-series cards. In fact, without NVLink, I don't see how you'd connect them together to compute a single task in a performant and efficient way.
How long does training a 1B or 500M model take approximately on the 4-GPU setup? Or does that dramatically depend on the training data? I didn’t see that info on your pages.
This is a decent bird's-eye view, thanks. Could you expand on it to show how long it took, and what model you produced? What did you train it for? The post seems to suggest it's for diffusion purposes?
On a tangent, if I wished to fine-tune one of those medium sized models like Gemma2 9B or Llama 3.2 Vision 11B, what kind of hardware would I need and how would I go about it?
I see a lot of guides, but most focus on getting the toolchain up and running, and there's not much talk about what kind of dataset I need to do a good fine-tune.
> As for dataset, depending on your task, you need image / text pairs.
I guess the main question is, do you just prepare training data as if you were training from scratch, or are there particularities to fine-tuning that should be considered?
In several cases I've been wanting better prompt adherence.
Llama 3.2 Vision, for example, is very strictly trained to output a summary at the end, which I find difficult to get it to stop doing.
Another one is that when given a math problem and asked to generate some code that computes the result, most models output code fine but insist on doing calculations themselves even if the prompt explicitly says they shouldn't. As expected, sometimes these intermediate calculations are incorrect, and hence I don't want the LLM to do that when the produced code would handle it perfectly. If the input prompt contains "four times five" I want the model to generate "4 * 5" rather than "20", consistently.
I've been curious to see if I could tune them to adhere better to the kind of prompts I would be giving.
For Llama 3.2 Vision I've also been curious whether I can get it to focus on different details when asked to describe certain images. In many cases it is great but sometimes misses some key aspects.
As for the input training material, that's exactly what I'm trying to figure out. I feel a lot of the guides are like that "how to draw an owl" meme[1], leaving out some crucial aspects of the whole process. Obviously I need input prompts and expected answers, but how many, how much variation on each example, and do I need to include data it was already trained on to avoid overfitting or something like that? None of the guides I've found so far touch on these aspects.
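From what I can piece together, the data usually ends up as simple prompt/response pairs in a JSONL file, something like this (the field names and file name are just my guesses; every toolchain seems to want its own schema):

    import json

    # Hypothetical examples targeting the behaviors described above:
    # symbolic math instead of pre-computed constants, no trailing summary.
    examples = [
        {
            "prompt": "Write Python that computes four times five plus two.",
            "response": "result = 4 * 5 + 2\nprint(result)",
        },
        {
            "prompt": "Describe this image, focusing on the text on the sign.",
            "response": "A wooden sign reading 'Trail closed past this point'.",
        },
    ]

    with open("finetune_data.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

My impression is that a few hundred to a few thousand focused pairs, mixed with some general-purpose data so the model doesn't forget everything else, is a common starting point, but that's exactly the kind of detail I wish the guides spelled out.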
Nice writeup, but I feel that for most people, the software side of training models should be more interesting and accessible.
For one, "full" GPU utilization, with one GPU or many, remains an open topic in training workflows. Spending effort there, while renting from the cloud, is more accessible and fruitful to me than fine-tuning for marginal improvements.
this course was a nice source of inspiration - https://efficientml.ai/ - and i highly recommend looking into this to see what to do next with whatever hardware you have to work with.
Let's talk riser cables. I keep encountering issues with riser connectors that claim to support PCIe 4.0 but seem to have sub-par performance. They work fine with the GPUs and NICs I tested them with, but attaching an NVMe drive causes all kinds of issues and prevents the machine from booting. I guess NVMe isn't as tolerant of elevated bit-error rates.
That just doesn't inspire a lot of confidence in those risers, so now I'm contemplating MCIO risers.
NVMe sits over PCIe. I'd be more inclined to believe they're playing games with their voltage levels to lower power consumption on mobile/embedded (not based on anything, but I wouldn't be surprised). Or, if you're then going through an M.2 adapter, something to do with that.
Why are people downvoting this? Yes, you really do need a dedicated circuit to run this type of machine. You will trip your circuit breaker if you don't have sufficient wattage on the line to run something rated for this power draw.
Commercial setups are not appropriate for typical 15 amp circuit loads.
Further, if you can afford to build this, you can afford to purchase at least the Romex, an AFCI circuit breaker, and raceway, and run it into whatever room in the house you plan on operating this in.
His power supplies are 2x 1500 W. That puts it at 3 kW max, which is more than a 20A circuit can provide (2400W).
The standard outlet is typically rated at 15 amps or 1800W. And the 15A breaker is on one circuit. You can get 20A circuits but they need to be wired for it, and replacing the breaker won't cut it.
Assuming his GPU is ~450W (his number) and power supplies are 80% efficient, well that means he's pulling close to ~2400 watts which is super close to the limit of a 20A circuit.
4 * 450 / 0.80 efficiency = 2250W.
That doesn't include the power consumed by the CPU or motherboard or other things on that circuit. But a 170W CPU would easily push this over the 2400W provided by a 20A circuit.
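Putting the arithmetic in one place (my rough assumptions: ~100W for fans/drives/everything else, 80% PSU efficiency, and the 80% continuous-load derating as I understand the NEC rule):

    gpus, gpu_w = 4, 450          # the author's numbers
    cpu_w, other_w = 170, 100     # CPU plus fans/drives/etc. (rough guesses)
    psu_efficiency = 0.80         # conservative; a Platinum unit does better

    wall_draw = (gpus * gpu_w + cpu_w + other_w) / psu_efficiency
    print(f"wall draw ~{wall_draw:.0f} W")           # ~2590 W

    for amps in (15, 20):
        capacity = amps * 120
        continuous = capacity * 0.8                  # continuous-load derating
        print(f"{amps}A @ 120V: {capacity} W peak, {continuous:.0f} W continuous")

Either way you slice it, it doesn't fit comfortably on a single standard 120V circuit.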
Thanks, good to know. Perhaps it is different for diffusion; with LLMs, layers are generally split across GPUs, meaning inference has to finish on one GPU before the activations can be passed across the layer split.
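If you want to see that split concretely, with Hugging Face transformers it looks something like this (a minimal sketch; the model ID is just an example and you'd still need enough total VRAM):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-3.1-8B"   # example; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" places consecutive layers on different GPUs, so a
    # forward pass moves activations GPU0 -> GPU1 -> ... in sequence
    # (pipeline-style), rather than pooling the cards into one big GPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )
    print(model.hf_device_map)   # shows which layers landed on which GPU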
Why not 3090s? Same VRAM and cheaper. With both setups you'd be limited to 1B. By contrast, you can run 4-bit quants of Llama 70B on two {3,4}090s, and it's still pretty lobotomized by modern standards.
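Quick napkin math on why the 4-bit 70B fits (assuming roughly 4.5 effective bits per weight once quantization scales are counted, plus a few GB for KV cache):

    params = 70e9
    bits_per_weight = 4.5            # 4-bit quant plus scales/zero-points
    weights_gb = params * bits_per_weight / 8 / 1024**3
    kv_cache_gb = 4                  # rough allowance for a few thousand tokens
    print(f"~{weights_gb + kv_cache_gb:.0f} GB needed vs 48 GB across two 24 GB cards")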
You can also train your own model even without GPUs. Just depends on parameter size.
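For example, a toy character-level model trains fine on a CPU; a minimal sketch just to illustrate the point (a tiny bigram model, a few thousand parameters):

    import torch
    import torch.nn as nn

    text = "hello world, hello hacker news. " * 200
    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in text])

    # A tiny bigram language model: next-char logits come straight from an
    # embedding table, so CPU is plenty.
    model = nn.Embedding(len(chars), len(chars))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

    for step in range(200):
        i = torch.randint(0, len(data) - 1, (64,))
        logits = model(data[i])                     # (64, vocab)
        loss = nn.functional.cross_entropy(logits, data[i + 1])
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final loss: {loss.item():.2f}")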
Thanks for sharing. Have you prodded the model with various inputs and written an article that shows various output examples? I'd love to get an idea of what sort of "end product" 4x 4090s is capable of producing.
You might find more information here helpful https://sabareesh.com/posts/llm-intro/
But I am still in the process of evaluating the post-training process with RL. RLHF is almost a mirage that shows what is possible, but not the full capability of what the model can do.
If you are willing and able to put together the type of system described in the OP (a workstation-class PC, with multiple discrete GPUs and often multiple power supplies), a Mac never makes sense. There are hardware options available at essentially every price point that beat (in some cases drastically) the performance and memory capacity of a Mac.
And I say this at the risk of being called pedantic, but a cluster of Mac minis would have zero VRAM.
You can get 4060 Ti 16GB cards for ~$450, or 4070 Ti 16GB for ~$850, instead of the $2.5k for a 4090. I wonder how well 4 of those cards would perform. The 4060 Ti TDP is 165W instead of 450W for the 4090. The 4070 looks like the best tradeoff for cost/power/etc, though. You could probably set up an 8-card 4070 Ti 16GB system for less than the 4-card 4090 system.
The 4060 Ti is hampered by having a narrow memory bus; there are various benchmarks out there, here[1][2] are some examples, and here's[3] one which tests dual 4060 Tis.
The 4090's compute per watt is the best (on paper) among the 4060 Ti, 4070 Ti, and 4090. Best bang for the buck, though, looks like the 4070 Ti 16GB. I've been eyeing that one for a new dual-card training rig.
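For what it's worth, here are the ballpark numbers I've been working from (published FP32 specs and rough street prices; the 16 GB 4070 Ti is the "Super" variant, I believe), and they back up both claims:

    # Approximate published specs and street prices; treat as ballpark only.
    cards = {
        # name: (VRAM GB, FP32 TFLOPS, TDP W, approx price USD)
        "4060 Ti 16GB":       (16, 22.1, 165,  450),
        "4070 Ti Super 16GB": (16, 44.1, 285,  850),
        "4090":               (24, 82.6, 450, 2500),
    }

    for name, (vram, tflops, tdp, price) in cards.items():
        print(f"{name}: {tflops/tdp:.3f} TFLOPS/W, {tflops/price*1000:.1f} TFLOPS per $1k")

The 4090 wins on TFLOPS per watt, while the 4070 Ti 16GB edges out the 4060 Ti on TFLOPS per dollar.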
Can someone say definitively that I can just use two independent PSUs? One for some of the GPUs, and one for the remaining GPUs plus the motherboard and SATA? No additional hardware?
Is anyone else concerned with the power usage of recent AI? Computational efficiency doesn't seem to be a strong point... And for what benefit? IMO the usefulness payoff is too low
I clarified this a bit more in the article. But basically: "Well, this may not directly provide benefit, but because this is a consumer-grade card, these features enable support for more advanced capabilities such as bfloat16 and even float8 training, plus the sheer number of CUDA cores."
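Concretely, bf16 mixed precision on the 4090 looks something like this in PyTorch (a minimal sketch; float8 would additionally need something like NVIDIA's Transformer Engine, which isn't shown here):

    import torch

    model = torch.nn.Linear(4096, 4096).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")

    # Ada (RTX 40xx) cards support bfloat16 natively, so mixed-precision
    # training can use bf16 without the loss scaling that fp16 usually needs.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()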
The GPU rental market is fairly reasonable. There's lots of companies doing it. (I work at one of them). 4x 4090 can be fetched for around $0.40/hour on some platforms ... about $1.20 on others depending on how available you want it. Regardless, all in, you can do an average 10-or-so-day train for < $500.
If you want on-prem, wait a few months. The supply of 5000-series cards (probably announced at CES in a few days) should push more 4000-series cards onto the market and, maybe, for a bit, over-supply and push the price down.
Nvidia stopped manufacturing the 4000 series a few months ago because they don't have endless factories. Those resources were reallocated to the 5000 series, which pushed the price for the 4000 series up to the ridiculous place it is now (about $2,000 on eBay).
I think the current appetite for crypto and ai is big enough to consume all 4000 and 5000 series cards to a point of scarcity (even 3090s are still fetching about $1000) but there should be a window where things aren't crazy expensive coming up.
There's no evidence supply will continually outstrip demand unless something unusual happens.
Some suppliers have support for it, some don't. They either use Docker or KVM, and it depends on how clever their hosting software is. We can do it, but that's a recent thing. It's really hit or miss.
By the way, for other people reading this, the main player in the "rentable gamer GPU" space is salad.com, who 6 months ago cut a deal with civitai (https://blog.salad.com/civitai-salad/). They're trying to capture enterprise customers to use the extra cycles on teenagers' gaming rigs.
The industry is full of effectively "imitation companies" right now. For instance, runpod, quickpod, simplepod and clore are the ones cloning us at vast right now.
We see them in our Discord, they try to snipe away customers, they get into our comment threads on Reddit and Twitter with self-promotion, they clone our features ... these are the ferocious wild-west days of this industry. I've even gotten personal emails from a few who I guess scanned their database looking for registration addresses from other companies in the space.
There are even companies like primeintellect which are trying to become the market of markets - but they have their own program - it's clearly a play to snipe other companies' customers by funneling them through some interface where they'll eventually push out the other companies and promote their own instances.
Then there are interesting insider-hype players with their own infra, like sfcompute, who are trying to pretend they invented interruptible instances and somehow get a bunch of people treating them like innovators. The resellable contracts they talk about are a pretty common feature, especially via the host's programmatic command-line controller; it's just usually tucked deep in the documentation. They're doing effectively a re-prioritization play.
I guess my angle is "highest integrity possible". It's certainly a gamble - scammy companies sometimes capture a market then become unscammy - I'll hold my tongue but there's plenty of examples.
Wow, I question the ethical side of this comment. It starts praising a company as if it were an unrelated entity, then quietly switches to "us", then makes implications about competing entrepreneurial efforts being scams without any evidence. And "clones" (as if everyone knew about them - I didn't until about a year into my own effort, for instance).
There's also the hypocrisy of complaining about competitors jumping in on "their threads" in a comment on a competitor thread.
Is your electricity free? Some of these cards probably cost about $0.10/hr to run ... depending on your card, electricity rate, etc.
It's probably somewhere between 12 months and never, depending on how the market shakes out. Maybe 2 years is a good estimate ... really, if power is cheap/free and the machine is on and idle anyway, then it's free money - that's the way to look at it.
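Back-of-the-envelope, with made-up but plausible numbers (a ~$2,000 card, ~$0.40/hr rental income, ~$0.10/hr electricity, 50% utilization):

    card_cost = 2000          # rough 4090 street price
    rent_per_hr = 0.40        # what a single card might fetch (varies a lot)
    power_per_hr = 0.10       # electricity while under load
    utilization = 0.5         # fraction of hours actually rented

    net_per_month = (rent_per_hr - power_per_hr) * utilization * 24 * 30
    print(f"~${net_per_month:.0f}/month -> payback in ~{card_cost/net_per_month:.0f} months")

That lands around a year and a half with those assumptions; halve the utilization or the rate and you're well past two years.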
There's a lot of competition in the "Airbnb for GPUs" space, so if you don't like us, the number of players is around 12 or so globally. We're probably either #2 or #3. Companies don't really disclose these things, so it's hard to know.
Some people probably list on more than one platform. There may be some host management software somewhere that helps with that. I haven't actually checked.
I'd be happy to talk more about these privately. Some are better than others, and I've got no interest in posting less than charitable things about our competitors publicly, regardless of how accurate I think they are. My email is in my profile.
I was torn between building a rig and using the cloud, but for some reason I wanted to get hands-on. So yes, you can always rent them for a fraction of the cost.
Also factor in (at least in Southern California) electricity prices and how long the rig is on. Not as bad as the initial build cost, but running costs will add up over time.
The last time I checked, a modern Threadripper build is a bit over $10,000. So if you have the budget for that but need something GPU-oriented instead, then I could see that being a reasonable option.
Depends on the application. In Bitcoin mining it famously was not an issue at all; manufacturers came up with the weirdest motherboards featuring many x1 PCIe slots. Look up the Biostar TB360-BTC PRO 2.0 if you want to see a curiosity.
In Deep Learning it depends on your sharding strategy.
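For example, with plain data parallelism each GPU holds a full copy of the model, and the inter-GPU traffic is mostly one gradient all-reduce per step, so narrow PCIe links tend to hurt less than they would with tensor or pipeline parallelism. A minimal PyTorch DDP sketch (launched with torchrun; the Linear layer is just a stand-in for a real model):

    # launch: torchrun --nproc_per_node=4 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
    model = DDP(model, device_ids=[rank])        # full replica on each GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()                              # gradients all-reduced here
    opt.step()
    dist.destroy_process_group()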
The best build I have seen so far had 6x4090's. Video: https://www.youtube.com/watch?v=C548PLVwjHA
An interesting choice to go with 256GB of DDR5 ECC; if spending so much on the 6x 4090s, might as well try to hit 1 TB of RAM as well. The cost of this... not even sure. Astronomical.