I did my thesis on this topic (at the time we were searching for the lower bound of ALU precision needed to run NNs on zero-power devices).
It's interesting: NNs degrade at about 6 bits, and that's mostly because the transfer function becomes stable and the training gets stuck in local minima more often.
We built a two-step training methodology: first you train at 16-bit precision, finding the absolute minimum, then retrain at 6-bit precision, and the NN basically learns to cope with the precision loss on its own.
The funny part is, the fewer bits you had, the more robust the network became, because error correction became a normal part of its transfer function.
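A minimal sketch of that two-step idea (all names and the toy XOR task are mine, not from the original work): train a tiny net at full precision first, then continue training from that solution while snapping the weights to a coarse uniform grid after every update, so the network adapts to the quantized representation.

```python
import numpy as np

def quantize(w, bits=6, w_max=4.0):
    # Uniform quantizer: clip to [-w_max, w_max] and round to 2**bits - 1 levels.
    step = 2.0 * w_max / (2 ** bits - 1)
    return np.clip(np.round(w / step) * step, -w_max, w_max)

def train_xor(bits=None, epochs=5000, lr=1.0, seed=0, w_init=None):
    # Tiny 3-4-1 MLP (inputs + bias column) on XOR. If `bits` is set,
    # weights are re-quantized after every update; gradients are still
    # computed in full precision (a straight-through-style scheme).
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    y = np.array([[0], [1], [1], [0]], float)
    Xb = np.hstack([X, np.ones((4, 1))])            # input + bias
    if w_init is None:
        W1 = rng.normal(0, 1, (3, 4))               # 2 inputs + bias -> 4 hidden
        W2 = rng.normal(0, 1, (5, 1))               # 4 hidden + bias -> 1 output
    else:
        W1, W2 = (w.copy() for w in w_init)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        h = sig(Xb @ W1)
        hb = np.hstack([h, np.ones((4, 1))])
        out = sig(hb @ W2)
        d2 = (out - y) * out * (1 - out)            # output-layer delta
        d1 = (d2 @ W2[:-1].T) * h * (1 - h)         # hidden-layer delta
        W2 -= lr * hb.T @ d2
        W1 -= lr * Xb.T @ d1
        if bits is not None:
            W1, W2 = quantize(W1, bits), quantize(W2, bits)
    return (W1, W2), float(np.mean((out - y) ** 2))

# Step 1: find a good minimum at full precision.
weights, mse_full = train_xor()
# Step 2: retrain from that solution with 6-bit weights.
_, mse6 = train_xor(bits=6, w_init=weights)
```

The key detail is that step 2 starts from the full-precision solution rather than from scratch, so the coarse grid only has to hold a minimum that already exists.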
We couldn't make the network converge at 4 bits, however. We tried different transfer functions, but ran out of time before getting meaningful results (each function needs its own back-propagation adjustment, and things like that take time; I'm not a mathematician :D)
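The "back-propagation adjustment" each transfer function needs is just its own derivative in the delta rule. A small sketch (function names are mine) pairing an atan-based transfer function with the derivative backprop would use:

```python
import numpy as np

# atan squashed into (0, 1), usable as a sigmoid-like transfer function.
def atan_transfer(z):
    return np.arctan(z) / np.pi + 0.5

# Its derivative, needed by backprop: d/dz [atan(z)/pi + 1/2] = 1 / (pi * (1 + z^2)).
def atan_transfer_deriv(z):
    return 1.0 / (np.pi * (1.0 + z * z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Generic per-layer backprop delta: error signal times f'(pre-activation).
# Swapping transfer functions means swapping both entries of the pair.
def layer_delta(err, z, deriv):
    return err * deriv(z)
```

Swapping in a new transfer function without also swapping in its derivative is exactly the kind of thing that silently breaks training, which is why each one takes time to validate.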
I had similar empirical results on one of my PhD projects for medical image classification. With small data sets, we got better results on 8-bit data sets compared to 16-bit. We viewed it as a form of regularization that was extremely effective on smaller data sets with a lot of noise (x-rays in this case).
When using 8-bit weights, what kind of mapping do you do? Do you map the 8-bit range into -10 to 10? Do you have more precision near zero or is it a linear mapping?
Don't know about him, but I was working with −8 to 8 for inputs and −4 to 4 for weights. The atan transfer function maps quite well, and there's no need to oversaturate the next layer.
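For the linear case the question above asks about, the mapping is just a fixed step size; a sketch of mapping signed 8-bit codes onto the −4 to 4 weight range described here (function names are mine, and a log/µ-law-style mapping would be the alternative that gives more precision near zero):

```python
import numpy as np

# Linear (uniform) mapping between signed 8-bit codes and weights in
# [-4, 4]: step = 4 / 128 = 0.03125 per code.
W_MAX = 4.0
STEP = W_MAX / 128.0

def code_to_weight(code):
    # code: int in [-128, 127] -> weight in [-4.0, ~3.97]
    return code * STEP

def weight_to_code(w):
    # Nearest-code rounding, clipped to the representable range.
    return int(np.clip(round(w / STEP), -128, 127))
```

With a uniform mapping the resolution is the same everywhere in the range, which is why the choice of weight range (−4 to 4 vs. −10 to 10) matters: a wider range buys headroom at the cost of a coarser step.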