Processing half float data on the CPU will be tricky without hardware support. There's plenty of CPU processing needed for deep learning, for capturing and preparing datasets.
Half precision floating point is really something that should have been in hardware a long time ago. There are other applications besides deep learning that could benefit from the dynamic range of floating point, but don't need 32 bits. For example, imaging and audio.
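To make the storage argument concrete, here's a minimal NumPy sketch (NumPy's `float16` follows the IEEE 754 binary16 format) showing that half precision keeps floating-point dynamic range while halving memory versus `float32`:

```python
import numpy as np

# float16 keeps floating-point dynamic range (normal max ~65504)
# while halving storage versus float32 -- handy for imaging and audio buffers.
samples = np.random.randn(1_000_000).astype(np.float32)
half = samples.astype(np.float16)

print(samples.nbytes)            # 4000000 bytes
print(half.nbytes)               # 2000000 bytes
print(np.finfo(np.float16).max)  # 65504.0
```

The trade-off is precision: binary16 has only a 10-bit significand, which is why it suits perceptual data better than, say, accounting.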
Logarithmic number systems could offer even greater improvement (100x over a GPU), if you're willing to live with more errors and/or invest some time in fixing them.
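The core idea behind a logarithmic number system is worth sketching: store each value as a sign plus the log of its magnitude, so multiplication becomes cheap addition of the stored logs (addition is the expensive, approximate operation). The function names below are just illustrative:

```python
import math

# Toy logarithmic number system: a value x is stored as (sign, log2|x|).
def to_lns(x):
    return (x < 0, math.log2(abs(x)))

def lns_mul(a, b):
    sa, la = a
    sb, lb = b
    return (sa != sb, la + lb)   # multiply = add the logs, XOR the signs

def from_lns(v):
    s, l = v
    return -(2.0 ** l) if s else 2.0 ** l

print(from_lns(lns_mul(to_lns(3.0), to_lns(-4.0))))  # approximately -12.0
```

A real LNS implementation also needs lookup tables or interpolation for addition, which is where the extra error comes from.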
Seems like a great idea, and the last couple pages sound like they were making good progress toward commercializing it. But the document's from 2010. I wonder if they're still working on it.
Haven't read the article, but wanted to ask a question: Once the data is mined and the lessons learnt, isn't that it? You don't need the massive learning algorithm, just the little 'rules reading' algorithm?
Is this a thing? Something that takes a convolutional neural network and spits out a little app?
If I understand what you are talking about, then yes.
For example, offline speech recognition in Android phones uses a CNN that was trained on Google's GPU cluster[1]
[1] I can't find a good reference for this; slide 31 of the deck below is outdated, but gets the idea across (you aren't training a 2.7M-parameter CNN on a phone). http://www.cs.nyu.edu/~eugenew/asr13/lecture_14.pdf
Most of the interesting things neural networks do involve online learning.
For a simple non-NN analogy, think of spam detection: you want to be able to correct misclassifications, such that it won't make them again. This takes more effort than it sounds: it's not enough to just add the one piece of weighted evidence to the corpus and re-run the algorithm, because that won't necessarily make the filter spit out the correct answer on the original message. Just like there's overfitting, there's also underfitting, and the naive training method results in underfitting.
Thus, what you tend to want is something more like a garbage collector: a process constantly running in a background thread, gradually retraining the system. At any given point, the system will answer questions using an MVCC-like point-in-time view of its beliefs, while those beliefs are getting played with and re-evaluated elsewhere.
Also, every time the NN changes its mind, there will be a subset of non-training samples it has already seen, that it will now classify differently than it originally did when it saw them. Continuously going back and amending its judgements on these is usually helpful.
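A minimal sketch of that "garbage collector" pattern, under assumed names (`ServingModel`, `correct`, `retrain_step` are all hypothetical): queries always hit an immutable snapshot, while accumulated corrections are folded in off the serving path and a new snapshot is swapped in atomically.

```python
import threading

class ServingModel:
    """Serve from a point-in-time snapshot; retrain off the hot path."""

    def __init__(self, model):
        self._model = model          # current read-only snapshot (a callable)
        self._feedback = []          # corrections awaiting retraining
        self._lock = threading.Lock()

    def classify(self, x):
        # Point-in-time view of the model's beliefs; never blocks on training.
        return self._model(x)

    def correct(self, x, label):
        # Record a misclassification to be folded in later.
        with self._lock:
            self._feedback.append((x, label))

    def retrain_step(self, train_fn):
        # Typically run periodically in a background thread.
        with self._lock:
            batch, self._feedback = self._feedback, []
        if batch:
            # train_fn builds a new model from the old one plus feedback;
            # rebinding the attribute is the atomic snapshot swap.
            self._model = train_fn(self._model, batch)
```

`train_fn` is a placeholder for whatever actual retraining you do; the point is only the snapshot-and-swap structure.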
I think it depends on the algorithm. Something like siri trains on a larger data set with better hardware and uses a reduced version on the phone hardware. I think most neural networks can be reduced and compressed into a 'read only' format with limited or no new training, but it's a secondary problem.
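One common way to get that reduced 'read only' format is weight quantization. A rough NumPy sketch (the symmetric int8 scheme here is just one illustrative choice, not any particular library's method):

```python
import numpy as np

# Quantize float32 weights to int8 plus one scale factor: a 4x size
# reduction, at the cost of a bounded per-weight rounding error.
def quantize(w):
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize(weights)
print(np.max(np.abs(dequantize(q, scale) - weights)))  # small rounding error
```

Inference then runs on the compact weights with no training machinery shipped at all.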
The memory chips have been moved closer to the GPU, from inches to mm away (aside: inches vs mm? Imperial vs metric? What is this craziness? Have they taken a leaf out of the plywood industry's book, where width and height are measured in feet and inches and depth in mm... I mean, what the flipping heck?)
Where was I? Oh, yes, moving a memory chip a few cm makes the data bus faster? I thought that the speed of light (or electrons in this case) was so quick that a few cm wouldn't make any difference?
1. The speed of electrons in wires is quite slow. Electrical signals travel much faster than electrons.
2. Light travels ~30 cm per 1 GHz clock cycle. That's approaching the point where we have to worry about it, but...
3. The speed of light constrains latency, not bandwidth. GPUs care much more about bandwidth. The advantage of using shorter wires is it allows increasing bandwidth more easily.
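The back-of-the-envelope arithmetic behind point 2:

```python
# Distance light travels in one clock cycle, in vacuum.
c = 3.0e8  # m/s, speed of light
for f_ghz in (1, 2, 4):
    cycle = 1.0 / (f_ghz * 1e9)  # seconds per cycle
    print(f"{f_ghz} GHz: {c * cycle * 100:.1f} cm per cycle")
```

Signals in copper traces propagate at roughly half to two-thirds of c, so the real per-cycle distance budget is even tighter than these numbers suggest.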
Very generally, both the achievable bus clock speed and the bus power (per bit) are governed by the parasitic capacitance formed between any two bus wires, which is proportional to their length.
This is because every new bit on the bus needs to charge this parasitic capacitance, and that takes time and power.
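The scaling is easy to see with illustrative numbers (all values below are assumed for the sake of the example, not measured): dynamic power goes as C·V²·f and the charge time as the RC constant, and both are linear in C, hence in wire length.

```python
# Illustrative numbers only: power and charge time both scale linearly
# with the parasitic capacitance C, which scales with wire length.
C = 2e-12   # 2 pF parasitic capacitance (assumed, roughly a few cm of trace)
V = 1.2     # signalling voltage, volts
f = 1e9     # toggle rate, Hz
R = 50.0    # driver output resistance, ohms (assumed)

power = C * V**2 * f   # worst-case dynamic power per line, watts
tau = R * C            # RC time constant limiting the toggle rate, seconds

print(power * 1e3, "mW per bit line")
print(tau * 1e12, "ps time constant")
```

Halve the wire length and you halve C, which halves both the power per line and the RC constant, so you can either clock faster or spend the savings on more, wider bus lines.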
I'm not an electrical engineer, but I think that one difference can be in the required drive voltage and power for the bus. Remember that the GPU needs to drive the bus, i.e. it must apply sufficient voltage to make the lines reach the required voltages, and do so quickly. This can become easier if you know the distance is smaller; the driver can be faster since the losses in the line will be smaller too (and less risk of it picking up interference).
It's probably a differential drive which makes things more complicated in the details, but I think in general the above is true anyway.
Component pin spacing is nearly always in decimal fractions of an inch. As others have commented, it improves latency at GHz speeds. The optimal approach for this is stacked dies, like the Raspberry Pi, although there it's done for cost and area reasons instead.
I was at an Nvidia presentation where they made a big deal about just being 'on-silicon' rather than having to go through copper connects (to a bus external to the chip). Each join, corner, etc adds to capacitance on the wire, which then leads to delays (alternatively power consumption, and heat).
I doubt I'm the only one to confuse the Pascal architecture (successor to Maxwell) with the Pascal language.
This will make all of the stuff in Caffe (C++?), Torch (C) and Theano (Python) faster by inclusion of cuDNN, a low-level CUDA-optimized library of deep neural network primitives.
Also: I was under the impression that single precision is fine for most deep learning applications, and double precision doesn't even have good support in most libraries, but I guess it depends on the use case.
FLOP-wise that makes sense. But for deep learning, the big deal is the 12 GB of GPU-local memory, which has enormous bandwidth (and can store more of your dataset / parameters at once). The largest concern with GPU processing is keeping the GPU adequately fed with data - and avoiding round-trips of blobs of data with the CPU helps a lot.
Oh I agree, and the article talks plenty about that topic as well. For me the temptation with Titan X is primarily the "laziness" of a) not having to manually parallelize across AWS instances and b) not needing to squeeze models into 4-6 GB.