
An interesting work, with some to-be-addressed questions:

1. The paper only covers the GEMM part, with small-scale experiments (CIFAR-10/100). It does not cover convolution, nor the GEMM components of more popular networks such as Transformer/BERT.

2. It is still an approximation method, meaning potential accuracy loss. So I think this method is less attractive for training acceleration; it may be more useful as a complementary method for inference acceleration.

3. No results are evaluated on GPUs equipped with Tensor Cores. I am a bit curious: since modern AI accelerators (including NVIDIA GPUs) all incorporate Tensor Cores, which support GEMM acceleration by design, what is the added value of the approximation method described in this paper?


Great observations. I see this paper as the first in a three-part series. The second part is specializing it for convolution (which has additional structure to exploit), and the third is hooking these approximate ops into deep neural nets the way people currently do with scalar quantization / pruning.

I'm not optimistic about beating tensor cores when running on GPUs, at least until/unless we get similar hardware support.*

Barring better hardware support, the killer app is probably CPU inference--once there are Conv implementations and the necessary GPU kernels to train the network.

*Aside: this support would be pretty doable since the kernels look almost identical to GEMM kernels--you just need a multiplex-add rather than a multiply-add. On an x86 machine, all it would take is a vpshufb-add and a 4-bit unpack instruction.
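To make the multiplex-add structure concrete, here is a rough NumPy sketch of this style of approximate GEMM: encode each row of the input as small prototype indices, precompute per-prototype inner products with the weight matrix, then replace multiply-adds with lookup-adds. All sizes and names are illustrative, and an exact nearest-prototype encoder stands in for the paper's fast hash-based one:

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, C, K = 32, 8, 4, 16        # input dim, output cols, codebooks, prototypes (4-bit codes)
d = D // C                       # dimension of each subspace

A = rng.standard_normal((100, D))        # tall input matrix
B = rng.standard_normal((D, M))          # small weight matrix
protos = rng.standard_normal((C, K, d))  # per-subspace prototypes (learned, in practice)

# Encode: replace each subvector of A with the index of its nearest prototype.
# (Exact nearest-neighbor here; the paper uses a fast hash-based encoder.)
codes = np.empty((A.shape[0], C), dtype=np.uint8)
for c in range(C):
    sub = A[:, c * d:(c + 1) * d]
    dists = ((sub[:, None, :] - protos[c][None]) ** 2).sum(-1)
    codes[:, c] = dists.argmin(1)

# Precompute lookup tables: inner product of every prototype with every column of B.
tables = np.einsum('ckd,cdm->ckm', protos, B.reshape(C, d, M))

# Approximate A @ B with table lookups and adds -- no multiplies in the inner loop.
approx = tables[np.arange(C), codes].sum(axis=1)
exact = A @ B
print("correlation with exact GEMM:", np.corrcoef(approx.ravel(), exact.ravel())[0, 1])
```

The final gather-and-sum is exactly the loop that a vpshufb-style shuffle-add would accelerate: with 4-bit codes, each lookup table row fits in a SIMD register and the "multiplex" is a byte shuffle.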


If it works well for inference, it could enable fast inference on devices that don't have good tensor cores or GPUs.


> I think this method is less attractive to training acceleration scenario

The proposed hash-based encoding function is not differentiable, so it doesn't appear this method can be used for training at all.

I’m not aware of any hash functions that are analytically differentiable, so to support efficient back-propagation I suspect that some fundamental changes to this method would be necessary.


You could still optimize the prototypes, so fine-tuning with this in place would be possible (see, e.g., [1]). But we don't yet have data on how well this would work using our exact method, how early in training you could do the op replacement, etc.
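A toy illustration of why prototype optimization remains possible (a hypothetical setup, not the paper's exact procedure): once the non-differentiable assignments are frozen, the reconstruction error is quadratic in the prototypes, so each prototype can be optimized in closed form (the k-means M-step) or by SGD during fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 200, 8, 16
A_sub = rng.standard_normal((N, d))      # one subspace of the input data
codes = rng.integers(0, K, size=N)       # fixed (non-differentiable) assignments

# With assignments frozen, sum_n ||A_sub[n] - P[codes[n]]||^2 is quadratic in P,
# so each prototype's optimum is just the mean of its assigned rows.
P = np.zeros((K, d))
for k in range(K):
    mask = codes == k
    if mask.any():
        P[k] = A_sub[mask].mean(axis=0)

err_opt = ((A_sub - P[codes]) ** 2).sum()
err_rand = ((A_sub - rng.standard_normal((K, d))[codes]) ** 2).sum()
print("optimized prototypes:", err_opt, " random prototypes:", err_rand)
```

The same observation lets gradients flow to the prototypes during end-to-end fine-tuning, even though no gradient flows through the assignment step itself.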

[1] http://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun...



