The direct-to-network stuff is very cool and useful in this scenario. I should point out that for those of you experimenting with deep learning, you probably won't be writing your own code from scratch. There are various open source libraries (pylearn2, torch, caffe, others) that make things a lot easier when you're getting started. They still have something of a learning curve though.
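To make "from scratch" concrete: here's roughly what the libraries save you from writing. This is a toy sketch I put together for illustration (not code from any of those libraries) — a one-hidden-layer network trained with plain gradient descent on XOR, with none of the GPU kernels, optimizers, or layer abstractions a real library provides.

```python
# Toy from-scratch neural net: one hidden layer, sigmoid activations,
# trained on XOR with plain batch gradient descent. Illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.randn(2, 8) * 0.5      # input -> hidden weights
b1 = np.zeros((1, 8))           # hidden biases
W2 = rng.randn(8, 1) * 0.5      # hidden -> output weights
b2 = np.zeros((1, 1))           # output bias

def forward(X):
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

_, out0 = forward(X)
loss0 = np.mean((out0 - y) ** 2)   # mean squared error before training

lr = 1.0
for _ in range(5000):
    h, out = forward(X)
    # backprop of the squared error through both sigmoid layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.sum(axis=0, keepdims=True) / len(X)
    W1 -= lr * (X.T @ d_h) / len(X)
    b1 -= lr * d_h.sum(axis=0, keepdims=True) / len(X)

_, out = forward(X)
loss = np.mean((out - y) ** 2)     # should be lower after training
print(loss0, loss)
```

Even this tiny version needs hand-derived gradients for every layer, which is exactly the bookkeeping the libraries automate.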
I should caution also that not all of the libraries work equally well with the newest or oldest GPUs, so the model of GPU you buy still makes a big difference. And it should be NVIDIA--the deep learning community has largely standardized on their hardware. This is a state of affairs that is constantly changing.
Pertinent self promotion: my company (http://www.ersatzlabs.com) provides a cloud GPU deep learning solution, which I'd argue is an even easier way to get started with deep learning, particularly in visualization and prototyping phases.
But anyway, if anyone's curious about deep learning and just getting their feet wet, I'm always happy to talk about it, my email is in my profile.
This article contains good tips for building a GPU cluster with RDMA. One thing I would like to add is that there are two types of GPUDirect depending on the CUDA version. Earlier CUDA versions supported GPUDirect through a staging copy in CPU memory, while newer CUDA supports "true" GPUDirect directly between the RDMA device and GPU memory. However, some chipsets do not support the "true" GPUDirect very well, and two of our old machines had up to 20x throughput asymmetry with GPUDirect (that is, send was much slower than recv). There are several papers that discuss this limitation. Our work, GPUnet[1], overcame this performance issue with GPUDirect by using fairly recent chipsets, but you can probably imagine our pain when we saw around 150MB/s throughput with GPUDirect when ~3GB/s was the expected figure.
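As a quick sanity check on those numbers (using 1 GB = 1000 MB for simplicity; the exact convention doesn't change the conclusion), the observed send throughput implies the ~20x asymmetry mentioned above:

```python
# Rough slowdown factor implied by the numbers in the comment above:
# ~150 MB/s observed send throughput vs ~3 GB/s expected.
expected_mb_s = 3 * 1000   # ~3 GB/s expected, as 1 GB = 1000 MB
observed_mb_s = 150        # observed with GPUDirect on the older chipsets
slowdown = expected_mb_s / observed_mb_s
print(slowdown)  # -> 20.0
```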
I'm curious about the author's experience with ML. He mentioned one of the Kaggle competitions, and from my understanding most people doing Kaggle use R, Python, or some other language that provides a large degree of support for ML-type tasks.
I wonder if the author also uses those and CUDA/GPU makes up a relatively small part of his solutions, or whether it's largely done at such a low level. It'd also be interesting to see how some of the other people who place highly in Kaggle competitions do their coding.
I mainly use python and sklearn for Kaggle competitions for my initial models. If I understand the problem better I use some of my own deep learning solutions in python (built on gnumpy and cudamat). However, sometimes my own C++/CUDA implementations come in handy, especially if the data set is large.
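A typical first step in that "initial models" workflow — before any sklearn model, let alone deep learning — is a trivial baseline to sanity-check the evaluation pipeline. The function below is purely illustrative (my naming, not the commenter's actual code):

```python
# A majority-class baseline: the simplest "initial model" you can run
# to sanity-check a competition pipeline before trying anything cleverer.
from collections import Counter

def majority_baseline(train_labels, n_test):
    """Predict the most common training label for every test row."""
    most_common, _count = Counter(train_labels).most_common(1)[0]
    return [most_common] * n_test

train_labels = [0, 1, 0, 0, 1, 0]   # toy class labels
preds = majority_baseline(train_labels, n_test=3)
print(preds)  # -> [0, 0, 0]
```

Any real model then has to beat this baseline's score to be worth keeping.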
Other Kaggle competitors that use deep learning mostly use python libraries like pylearn2 and torch7 for their deep learning models (which are also built on CUDA/C++).
In general it is not so easy to use deep learning on problems other than object recognition. So yes, I do not use deep learning in all of my Kaggle competitions, simply because it is hard to get it to work well. Using several different simple models and then ensembling them often yields better results for the time invested.
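The ensembling mentioned above can be as simple as averaging the predicted probabilities of a few diverse models. This is a hedged sketch of that idea, not the commenter's actual pipeline:

```python
# Simple model averaging: combine per-row probability predictions from
# several diverse-but-simple models by taking their mean.
def average_ensemble(predictions):
    """predictions: list of per-model probability lists, one prob per row."""
    n_models = len(predictions)
    n_rows = len(predictions[0])
    return [sum(p[i] for p in predictions) / n_models for i in range(n_rows)]

model_a = [0.9, 0.2, 0.6]   # e.g. a logistic regression's probabilities
model_b = [0.7, 0.4, 0.8]   # e.g. gradient-boosted trees
model_c = [0.8, 0.3, 0.7]   # e.g. a small neural net
ensembled = average_ensemble([model_a, model_b, model_c])
print(ensembled)  # each row is the mean of the three models' outputs
```

Averaging tends to help most when the individual models make different kinds of errors, which is why simple-but-diverse often beats one heavily tuned model per unit of effort.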
I was recently surprised to learn that they make server systems with up to eight PCIe x16 slots.
We were looking at this particular beastie [1] to host some Nvidia Tesla K40s for some simulation software. It would be a very expensive box, but the sim software costs a lot more.
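For a rough sense of why eight x16 slots is notable, here's a back-of-the-envelope aggregate host-GPU bandwidth estimate, assuming PCIe 3.0 (the Tesla K40's interface) at roughly 0.985 GB/s of usable throughput per lane per direction:

```python
# Back-of-the-envelope aggregate PCIe bandwidth for an 8-slot x16 box.
# Assumes PCIe 3.0: ~0.985 GB/s usable per lane per direction
# (8 GT/s with 128b/130b encoding).
per_lane_gb_s = 0.985
lanes_per_slot = 16
slots = 8
per_slot = per_lane_gb_s * lanes_per_slot   # ~15.8 GB/s per direction per slot
aggregate = per_slot * slots                # ~126 GB/s across all eight slots
print(round(per_slot, 2), round(aggregate, 1))
```

In practice the slots usually share a smaller pool of CPU lanes through PCIe switches, so the aggregate figure is an upper bound, not a sustained rate.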