It's interesting that optimizing a computation that can be described in a single line of math takes so much work. It took forever even to discover FlashAttention, and in the ~6 years since transformers were invented, thousands of papers have worked on making it faster.
Attention(Q,K,V) = Softmax(Q*K^T/sqrt(d_k))*V
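For reference, that one-liner transcribed literally into PyTorch (single head, no batch, no masking), just to make concrete what the "single line of math" is:

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # scores: (seq_q, d_k) @ (d_k, seq_k) -> (seq_q, seq_k), scaled by sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        # softmax over the keys, then weight the values
        return F.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(128, 64)   # (seq_len, d_k)
    out = attention(q, k, v)           # (128, 64)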
FlexAttention seems to have found the right abstraction for the task.
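A rough sketch of what that abstraction looks like, based on my reading of the torch.nn.attention.flex_attention API in PyTorch 2.5+; the score_mod here (causal mask plus a toy distance penalty) is made up for illustration, not taken from the library's examples:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    B, H, S, D = 2, 8, 256, 64                      # batch, heads, seq, head_dim
    q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

    # The abstraction: a callable that edits each attention score before softmax.
    def score_mod(score, b, h, q_idx, kv_idx):
        score = score + (kv_idx - q_idx) * 0.1       # toy positional bias
        return torch.where(q_idx >= kv_idx, score, float("-inf"))  # causal mask

    # flex_attention = torch.compile(flex_attention)  # normally compiled into a fused kernel
    out = flex_attention(q, k, v, score_mod=score_mod)   # (B, H, S, D)

The point of the abstraction is that masking, biases, and other score tweaks are just Python callables, and the compiler fuses them into the attention kernel instead of requiring a hand-written variant for each combination.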
Yeah, because the math strips the whole thing down to "I have data, I apply an operation to it," while in reality we deal with multi-head attention, grouped-query attention, and positional encodings.
And that's all without taking into account the broadcasting over the batch dimension. (A small sketch of what that looks like is below.)
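To make that concrete, here is a minimal sketch (sizes and layout are made up, positional encoding still left out) of what the "real" computation looks like once you add the batch dimension, multiple heads, and a grouped-query layout:

    import torch
    import torch.nn.functional as F

    # Grouped-query attention: 8 query heads share 2 K/V heads, plus a batch dim.
    B, Hq, Hkv, S, D = 2, 8, 2, 128, 64
    q = torch.randn(B, Hq, S, D)
    k = torch.randn(B, Hkv, S, D)
    v = torch.randn(B, Hkv, S, D)

    # Expand K/V so every group of Hq // Hkv query heads reads the same K/V head
    k = k.repeat_interleave(Hq // Hkv, dim=1)        # (B, Hq, S, D)
    v = v.repeat_interleave(Hq // Hkv, dim=1)

    # The same "one line of math", now broadcast over the batch and head dims
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / D ** 0.5
    out = torch.einsum("bhqk,bhkd->bhqd", F.softmax(scores, dim=-1), v)
    print(out.shape)                                  # torch.Size([2, 8, 128, 64])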