It's interesting that optimizing a computation that can be described in a single line of math takes so much work. It took forever even to discover FlashAttention, and in the ~6 years since transformers were invented, thousands of papers have worked on making it faster.
Attention(Q,K,V) = Softmax(Q*K^T/sqrt(d_k))*V
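For reference, that one-liner transcribed literally into PyTorch (single head, no batch, no masking), just to make concrete what the "single line of math" is:

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # scores: (seq_q, d_k) @ (d_k, seq_k) -> (seq_q, seq_k), scaled by sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        # softmax over the keys, then weight the values
        return F.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(128, 64)   # (seq_len, d_k)
    out = attention(q, k, v)           # (128, 64)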
FlexAttention seems to have found the right abstraction for the task.
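A rough sketch of what that abstraction looks like, based on my reading of the torch.nn.attention.flex_attention API in PyTorch 2.5+; the score_mod here (causal mask plus a toy distance penalty) is made up for illustration, not taken from the library's examples:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    B, H, S, D = 2, 8, 256, 64                      # batch, heads, seq, head_dim
    q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

    # The abstraction: a callable that edits each attention score before softmax.
    def score_mod(score, b, h, q_idx, kv_idx):
        score = score + (kv_idx - q_idx) * 0.1       # toy positional bias
        return torch.where(q_idx >= kv_idx, score, float("-inf"))  # causal mask

    # flex_attention = torch.compile(flex_attention)  # normally compiled into a fused kernel
    out = flex_attention(q, k, v, score_mod=score_mod)   # (B, H, S, D)

The point of the abstraction is that masking, biases, and other score tweaks are just Python callables, and the compiler fuses them into the attention kernel instead of requiring a hand-written variant for each combination.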
Yeah, because the math strips the whole thing down to "I have data, I apply an operation to it," while in reality we deal with multi-head attention, grouped-query attention, and positional encodings.
And that's all without taking into account the broadcasting over the batch dimension. (A small sketch of what that looks like is below.)
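To make that concrete, here is a minimal sketch (sizes and layout are made up, positional encoding still left out) of what the "real" computation looks like once you add the batch dimension, multiple heads, and a grouped-query layout:

    import torch
    import torch.nn.functional as F

    # Grouped-query attention: 8 query heads share 2 K/V heads, plus a batch dim.
    B, Hq, Hkv, S, D = 2, 8, 2, 128, 64
    q = torch.randn(B, Hq, S, D)
    k = torch.randn(B, Hkv, S, D)
    v = torch.randn(B, Hkv, S, D)

    # Expand K/V so every group of Hq // Hkv query heads reads the same K/V head
    k = k.repeat_interleave(Hq // Hkv, dim=1)        # (B, Hq, S, D)
    v = v.repeat_interleave(Hq // Hkv, dim=1)

    # The same "one line of math", now broadcast over the batch and head dims
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / D ** 0.5
    out = torch.einsum("bhqk,bhkd->bhqd", F.softmax(scores, dim=-1), v)
    print(out.shape)                                  # torch.Size([2, 8, 128, 64])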