Processing half float data on the CPU will be tricky without hardware support. There's plenty of CPU processing needed for deep learning, for capturing and preparing datasets.
Half precision floating point is really something that should have been in hardware a long time ago. There are other applications besides deep learning that could benefit from the dynamic range of floating point, but don't need 32 bits. For example, imaging and audio.
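To make the storage argument concrete, here's a minimal NumPy sketch (NumPy's `float16` follows the IEEE 754 binary16 format) showing that half precision keeps floating-point dynamic range while halving memory versus `float32`:

```python
import numpy as np

# float16 keeps floating-point dynamic range (normal max ~65504)
# while halving storage versus float32 -- handy for imaging and audio buffers.
samples = np.random.randn(1_000_000).astype(np.float32)
half = samples.astype(np.float16)

print(samples.nbytes)            # 4000000 bytes
print(half.nbytes)               # 2000000 bytes
print(np.finfo(np.float16).max)  # 65504.0
```

The trade-off is precision: binary16 has only a 10-bit significand, which is why it suits perceptual data better than, say, accounting.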
Logarithmic number systems could offer even greater improvement (100x over a GPU), if you're willing to live with more errors and/or invest some time in fixing them.
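The core idea behind a logarithmic number system is worth sketching: store each value as a sign plus the log of its magnitude, so multiplication becomes cheap addition of the stored logs (addition is the expensive, approximate operation). The function names below are just illustrative:

```python
import math

# Toy logarithmic number system: a value x is stored as (sign, log2|x|).
def to_lns(x):
    return (x < 0, math.log2(abs(x)))

def lns_mul(a, b):
    sa, la = a
    sb, lb = b
    return (sa != sb, la + lb)   # multiply = add the logs, XOR the signs

def from_lns(v):
    s, l = v
    return -(2.0 ** l) if s else 2.0 ** l

print(from_lns(lns_mul(to_lns(3.0), to_lns(-4.0))))  # approximately -12.0
```

A real LNS implementation also needs lookup tables or interpolation for addition, which is where the extra error comes from.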
Seems like a great idea, and the last couple pages sound like they were making good progress toward commercializing it. But the document's from 2010. I wonder if they're still working on it.
Haven't read the article, but wanted to ask a question: Once the data is mined and the lessons learnt, isn't that it? You don't need the massive learning algorithm, just the little 'rules reading' algorithm?
Is this a thing? Something that takes a convolutional neural network and spits out a little app?
If I understand what you are talking about, then yes.
For example, offline speech recognition in Android phones uses a CNN that was trained on Google's GPU cluster[1]
[1] I can't find a good reference for this; slide 31 of the deck below is outdated, but gets the idea across (you aren't training a 2.7M-parameter CNN on a phone). http://www.cs.nyu.edu/~eugenew/asr13/lecture_14.pdf
Most of the interesting things neural networks do involve online learning.
For a simple non-NN analogy, think of spam detection: you want to be able to correct misclassifications, such that it won't make them again. This takes more effort than it sounds: it's not enough to just add the one piece of weighted evidence to the corpus and re-run the algorithm, because that won't necessarily make the filter spit out the correct answer on the original message. Just like there's overfitting, there's also underfitting, and the naive training method results in underfitting.
Thus, what you tend to want is something more like a garbage collector: a process constantly running in a background thread, gradually retraining the system. At any given point, the system will answer questions using an MVCC-like point-in-time view of its beliefs, while those beliefs are getting played with and re-evaluated elsewhere.
Also, every time the NN changes its mind, there will be a subset of non-training samples it has already seen, that it will now classify differently than it originally did when it saw them. Continuously going back and amending its judgements on these is usually helpful.
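A minimal sketch of that "garbage collector" pattern, under assumed names (`ServingModel`, `correct`, `retrain_step` are all hypothetical): queries always hit an immutable snapshot, while accumulated corrections are folded in off the serving path and a new snapshot is swapped in atomically.

```python
import threading

class ServingModel:
    """Serve from a point-in-time snapshot; retrain off the hot path."""

    def __init__(self, model):
        self._model = model          # current read-only snapshot (a callable)
        self._feedback = []          # corrections awaiting retraining
        self._lock = threading.Lock()

    def classify(self, x):
        # Point-in-time view of the model's beliefs; never blocks on training.
        return self._model(x)

    def correct(self, x, label):
        # Record a misclassification to be folded in later.
        with self._lock:
            self._feedback.append((x, label))

    def retrain_step(self, train_fn):
        # Typically run periodically in a background thread.
        with self._lock:
            batch, self._feedback = self._feedback, []
        if batch:
            # train_fn builds a new model from the old one plus feedback;
            # rebinding the attribute is the atomic snapshot swap.
            self._model = train_fn(self._model, batch)
```

`train_fn` is a placeholder for whatever actual retraining you do; the point is only the snapshot-and-swap structure.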
I think it depends on the algorithm. Something like siri trains on a larger data set with better hardware and uses a reduced version on the phone hardware. I think most neural networks can be reduced and compressed into a 'read only' format with limited or no new training, but it's a secondary problem.
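One common way to get that reduced 'read only' format is weight quantization. A rough NumPy sketch (the symmetric int8 scheme here is just one illustrative choice, not any particular library's method):

```python
import numpy as np

# Quantize float32 weights to int8 plus one scale factor: a 4x size
# reduction, at the cost of a bounded per-weight rounding error.
def quantize(w):
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize(weights)
print(np.max(np.abs(dequantize(q, scale) - weights)))  # small rounding error
```

Inference then runs on the compact weights with no training machinery shipped at all.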
The memory chips have been moved closer to the GPU, from inches to mm away (aside: inches vs mm? Imperial vs metric? What is this craziness? Have they taken a leaf out of the plywood industry's book, where width and height are measured in feet and inches and depth in mm... I mean, what the flipping heck?)
Where was I? Oh, yes, moving a memory chip a few cm makes the data bus faster? I thought that the speed of light (or electrons in this case) was so quick that a few cm wouldn't make any difference?
1. The speed of electrons in wires is quite slow. Electrical signals travel much faster than electrons.
2. Light travels ~30 cm per 1 GHz clock cycle. That's approaching the point where we have to worry about it, but...
3. The speed of light constrains latency, not bandwidth. GPUs care much more about bandwidth. The advantage of using shorter wires is it allows increasing bandwidth more easily.
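The back-of-the-envelope arithmetic behind point 2:

```python
# Distance light travels in one clock cycle, in vacuum.
c = 3.0e8  # m/s, speed of light
for f_ghz in (1, 2, 4):
    cycle = 1.0 / (f_ghz * 1e9)  # seconds per cycle
    print(f"{f_ghz} GHz: {c * cycle * 100:.1f} cm per cycle")
```

Signals in copper traces propagate at roughly half to two-thirds of c, so the real per-cycle distance budget is even tighter than these numbers suggest.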
Very generally, both the achievable bus clock speed and the bus power (per bit) are governed by the parasitic capacitance formed between any two bus wires, which is proportional to their length.
This is because every new bit on the bus needs to charge this parasitic capacitance, and that takes time and power.
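The scaling is easy to see with illustrative numbers (all values below are assumed for the sake of the example, not measured): dynamic power goes as C·V²·f and the charge time as the RC constant, and both are linear in C, hence in wire length.

```python
# Illustrative numbers only: power and charge time both scale linearly
# with the parasitic capacitance C, which scales with wire length.
C = 2e-12   # 2 pF parasitic capacitance (assumed, roughly a few cm of trace)
V = 1.2     # signalling voltage, volts
f = 1e9     # toggle rate, Hz
R = 50.0    # driver output resistance, ohms (assumed)

power = C * V**2 * f   # worst-case dynamic power per line, watts
tau = R * C            # RC time constant limiting the toggle rate, seconds

print(power * 1e3, "mW per bit line")
print(tau * 1e12, "ps time constant")
```

Halve the wire length and you halve C, which halves both the power per line and the RC constant, so you can either clock faster or spend the savings on more, wider bus lines.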
I'm not an electrical engineer, but I think that one difference can be in the required drive voltage and power for the bus. Remember that the GPU needs to drive the bus, i.e. it must apply sufficient voltage to make the lines reach the required voltages, and do so quickly. This can become easier if you know the distance is smaller; the driver can be faster since the losses in the line will be smaller too (and less risk of it picking up interference).
It's probably a differential drive which makes things more complicated in the details, but I think in general the above is true anyway.
Component pin spacing is nearly always in decimal fractions of an inch. As others have commented, it improves latency at GHz speeds. The optimal approach for this is stacked dies, like the Raspberry Pi, although there it's done for cost and area reasons instead.
I was at an Nvidia presentation where they made a big deal about just being 'on-silicon' rather than having to go through copper connects (to a bus external to the chip). Each join, corner, etc adds to capacitance on the wire, which then leads to delays (alternatively power consumption, and heat).
I doubt I'm the only one to confuse the Pascal architecture (successor to Maxwell) with the Pascal language.
This will make all of the stuff in Caffe (C++?), Torch (C) and Theano (Python) faster by inclusion of cuDNN, a low-level CUDA-optimized library of deep neural network primitives.
Also: I was under the impression that single precision is fine for most deep learning applications, and double precision doesn't even have good support in most libraries, but I guess it depends on the use case.
FLOP-wise that makes sense. But for deep learning, the big deal is the 12 GB of GPU-local memory, which has enormous bandwidth (and can store more of your dataset / parameters at once). The largest concern with GPU processing is keeping the GPU adequately fed with data - and avoiding round-trips of blobs of data with the CPU helps a lot.
Oh I agree, and the article talks plenty about that topic as well. For me the temptation with Titan X is primarily the "laziness" of a) not having to manually parallelize across AWS instances and b) not needing to squeeze models into 4-6 GB.