Hacker News | zackangelo's comments

This uses Nvidia’s CUDA snapshot API under the hood, but you have to pair it with a host-side snapshot as well. Modal uses gVisor for this, which has notoriously high overhead.

Does anyone know of a more efficient alternative if you’re running a trusted container?


Post author here: there are other projects that create a proxy for CUDA calls and use the log of CUDA operations to checkpoint/restore or live-migrate tasks. We haven’t used them. I don’t believe they are very popular or used outside of specific orgs.

This is the only API available for snapshotting NVIDIA GPU memory, afaik.

As for needing to combine it with a host memory snapshot step, this is required because CUDA sessions need to be mapped to a host process, so you need to snapshot both things in order for the program to be restored correctly.

CRIU is another project that uses the same technique (CUDA snapshot + host memory snapshot). Unlike CRIU, our snapshots work at the function level, so we’re able to take snapshots after functions have been initialized (including GPU memory), which makes Modal cold boots fast. With CRIU alone, you’d have to implement that entire process yourself.
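Roughly, the flow looks like this (a sketch only, not Modal’s implementation; the pid and paths are made up, and exact flags may differ across versions of cuda-checkpoint and CRIU):

    import subprocess

    PID = 12345            # hypothetical: the process that owns the CUDA context
    IMG_DIR = "/tmp/ckpt"  # hypothetical: where CRIU writes its image files

    # 1. Move GPU state (device memory, CUDA context) into host memory.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(PID)], check=True)

    # 2. Snapshot the host process, which now also contains the GPU state.
    subprocess.run(
        ["criu", "dump", "--tree", str(PID), "--images-dir", IMG_DIR,
         "--shell-job", "--leave-stopped"],
        check=True,
    )

    # Restore is the mirror image: criu restore, then toggle cuda-checkpoint
    # again to move the saved state back onto the GPU.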


Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200Gbps.


That doesn't answer the question, which was how to get a high-speed interconnect between a Mac and a DGX Spark. The most likely solution would be a Thunderbolt PCIe enclosure and a 100Gb+ NIC, and passive DAC cables. The tricky part would be macOS drivers for said NIC.


You’re right, I misunderstood.

I’m not sure it would be of much utility, because this would presumably be for tensor-parallel workloads. In that case you want the ranks in your cluster to be uniform, or else everything will be forced to run at the speed of the slowest rank.

You could run pipeline parallelism, but I’m not sure it’d be that much better than what we already have.



No, you use tensor parallelism in both cases.

The way it typically works in an attention block: smaller portions of the Q, K, and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all reduce" is computed which combines the outputs of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding a portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.).
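A minimal PyTorch-style sketch of that layout, in case it's useful (the class and parameter names are made up, and it assumes torch.distributed is already initialized with one rank per node):

    import torch
    import torch.distributed as dist

    class ShardedAttention(torch.nn.Module):
        def __init__(self, d_model, n_heads, world_size):
            super().__init__()
            assert n_heads % world_size == 0
            self.local_heads = n_heads // world_size
            self.head_dim = d_model // n_heads
            local_dim = self.local_heads * self.head_dim
            # Each rank holds only its slice of the Q/K/V projections (column parallel)...
            self.q = torch.nn.Linear(d_model, local_dim, bias=False)
            self.k = torch.nn.Linear(d_model, local_dim, bias=False)
            self.v = torch.nn.Linear(d_model, local_dim, bias=False)
            # ...and its slice of the output projection (row parallel).
            self.o = torch.nn.Linear(local_dim, d_model, bias=False)

        def forward(self, x):
            b, t, _ = x.shape
            q = self.q(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
            k = self.k(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
            v = self.v(x).view(b, t, self.local_heads, self.head_dim).transpose(1, 2)
            # Attention runs independently on this rank's heads; the KV cache for
            # these heads also lives only on this rank.
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
            y = y.transpose(1, 2).reshape(b, t, -1)
            out = self.o(y)
            # Combine the partial output projections from all ranks.
            dist.all_reduce(out, op=dist.ReduceOp.SUM)
            return out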


I usually call it "head parallelism" (which is a type of tensor parallelism, but aimed at small clusters and specific to attention). That is what you described: sharding the input tensor by the number of heads and sending it to the respective Q, K, V shards. Each shard can do its Q/K/V projections, RoPE, QK norm, whatever, and attention entirely locally. The out projection is also done in that shard, but then an all-reduce sum among the shards is needed to broadcast the final out projection to every participating shard, after which they carry on to do whatever else themselves.

What I am asking, however, is whether that will speed up decoding as linearly as it does prefilling.


Right, my comment was mostly about decoding speed. For prefill you can get a speedup, but there you're less latency-bound.

In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 across 4 machines. In that case you are memory-bandwidth bound, so sharding the model and KV cache 4 ways means each machine only needs to access 1/4 as much memory.
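Rough back-of-envelope version of that argument (the byte counts and bandwidth below are made-up placeholders, not our measured numbers):

    def decode_tokens_per_sec(model_bytes, kv_cache_bytes, bw_bytes_per_sec, n_devices=1):
        # At batch size 1, each generated token streams the (sharded) weights and
        # KV cache from memory once, so per-token latency ~ bytes_read / bandwidth.
        bytes_per_device = (model_bytes + kv_cache_bytes) / n_devices
        return bw_bytes_per_sec / bytes_per_device

    one = decode_tokens_per_sec(140e9, 10e9, 400e9, n_devices=1)
    four = decode_tokens_per_sec(140e9, 10e9, 400e9, n_devices=4)
    print(four / one)  # 4x in the ideal case; all-reduce latency is what pulls it down to ~3.5x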


Oh! That's great to hear. Congrats! Now, I want to get the all-to-all primitives ready in s4nnc...


What 1T parameter base model have you seen from any of those labs?


It's MoE; each expert tower can be branched from some smaller model.


That's not how MoE works; you need to train the FFNs directly, or else the FFN gate would have no clue how to activate the experts.


Wouldn't you be able to test NCCL if you had two of these?


What kind of NCCL testing are you thinking about? Always curious what’s hardest to validate in people’s setups.


Not with Mac Studio(s), but yes: multi-host NCCL over RoCE with two DGX Sparks, or over PCI with one.


Just a bit of feedback:

> Instead of one brittle giant, we orchestrate a Mixture of Experts…

“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It’s definitely not using smaller specialized models for individual tasks. Experts in an MoE model are actually routed to on a per-token basis, not on a per-task or per-generation basis.

I know it’s tempting to co-opt this term because it would fit nicely for what you’re trying to do but it just adds confusion.
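If it helps, here's a tiny, deliberately naive sketch of per-token routing (the class and sizes are made up, and a real implementation would batch the dispatch):

    import torch

    class TinyMoE(torch.nn.Module):
        def __init__(self, d_model, n_experts, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = torch.nn.Linear(d_model, n_experts, bias=False)  # the router
            self.experts = torch.nn.ModuleList(
                torch.nn.Sequential(
                    torch.nn.Linear(d_model, 4 * d_model),
                    torch.nn.GELU(),
                    torch.nn.Linear(4 * d_model, d_model),
                )
                for _ in range(n_experts)
            )

        def forward(self, x):                      # x: (num_tokens, d_model)
            scores = self.gate(x)                  # (num_tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            # Each *token* is dispatched to its own top-k experts; there is no
            # notion of a per-task or per-prompt expert.
            for t in range(x.shape[0]):
                for w, e in zip(weights[t], idx[t]):
                    out[t] += w * self.experts[int(e)](x[t])
            return out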


I hear you, and that's valid. "Mixture of models" is probably a better phrase. We are constantly moving between AI experts and very skilled developers who use OpenAI endpoints and call it AI, so we are constantly working on finding the correct language. This was a miss though; will do better :)


Because it depends on how much better “best” is. If it’s only incrementally better than open source models that have other advantages, why would you bother?

OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.


Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open sourced and I believe it uses a fairly standard MHA architecture with MoE.


I feel a bit silly for not noticing this before. Over the last year or so I've often wondered when ssh added protocol-level support for session resume. I'd open my laptop on a new network and everything would be ready to go. But of course, it's nothing to do with ssh, it's just that I started using tailscale.


And really, they didn't even do anything special. This was a killer reason we loved WireGuard at our company, and we pitched heavily to the company that acquired us (which wanted us to switch to their VPN appliance) to keep it around.


The main thing a big-company IT admin wants is control over the users. At a previous company, we would ship really crappy software, by our own admission, to "enterprise" customers, and all we had to do to keep them happy was give them a fancy control panel that made them feel like kings.

Yes, flattery works; pandering to ego works. Too bad you can only push it so far... at some point, the CTO/CEO notices.


Agreed. In this company the IT team was being spread thin without their budget being increased so Tailscale was the obvious solution here, but a non-starter for them. "We already pay for a VPN. Let's just use that."

We managed to survive with our solution for a while thanks to it being super simple and "free" aside from the instance running WireGuard. Last I heard (I left), they shut it all down a few years ago.


Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.
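For example, something like this (sm_90 is just an illustrative Hopper target; substitute the compute capability of your actual GPU):

    nvcc -gencode arch=compute_90,code=sm_90 -c kernel.cu -o kernel.o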


High-performance GPU code typically uses nonportable features that are not supported across generations.

