I'm not sure if you're understanding me correctly.
Attention is generally length invariant. You take some transformation of the hidden representations (and/or inputs) at each time step, and then you normalize over all the transformed values to get weights that sum to one. No part of this is constrained by length.
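To make that concrete, here's a minimal NumPy sketch. The dot-product scoring and the dimensions are arbitrary choices of mine, not anything from a specific paper; the only point is that nothing in the computation fixes the number of time steps T:

```python
import numpy as np

def attention(query, hidden_states):
    """Dot-product attention over however many hidden states you have.

    query:         (d,)   the vector used to score the hidden states
    hidden_states: (T, d) one vector per time step; T can be anything
    """
    scores = hidden_states @ query            # (T,) one score per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: weights sum to one
    return weights @ hidden_states            # (d,) weighted combination

# The same function works unchanged for any sequence length:
q = np.random.randn(8)
print(attention(q, np.random.randn(5, 8)).shape)   # (8,) with T = 5
print(attention(q, np.random.randn(50, 8)).shape)  # (8,) with T = 50
```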
For CNNs, any network that has pooling has the potential to be length/dimension invariant. Whether it actually is depends on both the architectural design and implementation details (e.g. some implementations define the pooling operation over a fixed window, say 9x9, when you could define the same pooling operation over a variable-dimension window).
Length/dimension invariance isn't a special or novel property. In the case of attention it's built in. In the case of CNNs, the convolutions are not length invariant, but depending on the architecture, the pooling operations are (or can be modified to be).
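Here's a toy sketch of the pooling point (random, untrained filters and a naive loop convolution, purely to show the shapes): if you pool over the whole feature map instead of a fixed window like 9x9, the output no longer depends on the input dimensions.

```python
import numpy as np

def conv_feature_maps(image, n_filters=4, k=3):
    """Toy 'valid' convolution: the spatial output size depends on the input size."""
    H, W = image.shape
    filters = np.random.randn(n_filters, k, k)
    out = np.zeros((n_filters, H - k + 1, W - k + 1))
    for f in range(n_filters):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[f, i, j] = np.sum(image[i:i+k, j:j+k] * filters[f])
    return out

def global_pool(feature_maps):
    """Pool over the whole spatial extent rather than a fixed window,
    so the result is (n_filters,) regardless of the input dimensions."""
    return feature_maps.max(axis=(1, 2))

print(global_pool(conv_feature_maps(np.random.randn(28, 28))).shape)  # (4,)
print(global_pool(conv_feature_maps(np.random.randn(64, 48))).shape)  # (4,)
```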
In order to get a variable-length context, you need to add some machinery to some forms of attention. For example, in jointly learning to align and translate, the attention is certainly not invariant to the number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over that fixed number of context vectors. You cannot train on images with 5 annotations/context vectors and expect anything to transfer to a setting with 10 annotations. That's why I would be interested in a specific paper to solidify what you're saying.
>For example, in jointly learning to align and translate, the attention is certainly not invariant to the number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over that fixed number of context vectors
That's not true.
You compute an attention weight across however many context steps you have by computing an interaction between the current decoder hidden state and every encoder hidden state, and normalizing over all of them via a softmax. There is no constraint whatsoever on a fixed context length or a fixed number of context vectors. See section 3.1 in the paper.
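Here's a rough sketch of that computation with made-up dimensions and randomly initialized parameters (not trained weights from the paper), in the spirit of the alignment model in section 3.1: score every encoder state against the previous decoder state, softmax over however many scores that gives you, and take the weighted sum.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Arbitrary sizes and random parameters, just to show the shapes.
d_dec, d_enc, d_att = 16, 16, 32
W_a = np.random.randn(d_att, d_dec)   # applied to the decoder state s_{i-1}
U_a = np.random.randn(d_att, d_enc)   # applied to each encoder state h_j
v_a = np.random.randn(d_att)

def context(s_prev, encoder_states):
    """Additive (alignment-model) attention: score every encoder state h_j
    against the previous decoder state, softmax over however many there are."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)    # e_ij
                  for h_j in encoder_states])
    alpha = softmax(e)                        # length T_x, sums to one
    return alpha @ np.stack(encoder_states)   # context vector c_i

s = np.random.randn(d_dec)
print(context(s, list(np.random.randn(5, d_enc))).shape)   # (16,) with T_x = 5
print(context(s, list(np.random.randn(37, d_enc))).shape)  # (16,) with T_x = 37
```

Nothing in this has to be retrained or reshaped when the number of encoder states changes; the softmax just normalizes over whatever is there.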
I will be happy to discuss and clarify over email.