If you’re willing, I’d love your insight on the “why one might want to do this”.
Conceptually I understand embedding quantization, and I have some hint of why it works for things like wav2vec - human phonemes are (somewhat) finite, so forcing the representation to be finite makes sense - but I feel like there's a level of detail I'm missing about what's really going on, and about when quantization helps or harms, that I haven't been able to glean from papers.
Quantization also works as regularization; it stops the neural network from being able to use arbitrarily complex internal rules.
But it's really only useful if you absolutely need a discrete embedding space for some sort of downstream usage. VQ-VAEs can be difficult to get to converge, and they have problems stemming from the approximation of the gradient, like codebook collapse.
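To make the gradient issue concrete, here's a minimal sketch of a VQ-VAE-style quantization layer in PyTorch. The names (`VectorQuantizer`, `num_codes`, `dim`, `beta`) are my own for illustration, not from any paper's reference code; it follows the structure of the original VQ-VAE losses but is just a sketch, not a drop-in implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        # The discrete "vocabulary": a learnable table of code vectors.
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # weight on the commitment loss

    def forward(self, z):
        # z: (batch, dim) continuous encoder outputs.
        # Snap each vector to its nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)  # (batch, num_codes)
        idx = dists.argmin(dim=1)                     # discrete codes
        z_q = self.codebook(idx)                      # quantized vectors

        # argmin is non-differentiable, so the gradient is approximated
        # with the straight-through estimator: copy gradients from the
        # quantized output back to z as if quantization were identity.
        z_q_st = z + (z_q - z).detach()

        # Codebook loss pulls entries toward encoder outputs; the
        # commitment loss keeps the encoder close to the code it chose.
        loss = F.mse_loss(z_q, z.detach()) \
             + self.beta * F.mse_loss(z, z_q.detach())
        return z_q_st, idx, loss
```

Note how only the nearest entry for each input gets any codebook-loss gradient: entries that stop winning the argmin stop being updated entirely, which is one way the codebook collapses onto a handful of codes.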
Maybe it helps to point out that the first version of DALL-E (of 'baby daikon radish in a tutu walking a dog' fame) used the same trick, but applied it to image patches.