
It's interesting that optimizing a computation that can be described in a single line of math takes so much work. It took years just to discover FlashAttention, and in the six years since transformers were invented, thousands of papers have worked on making attention faster.

Attention(Q,K,V) = Softmax(Q*K^T/sqrt(d_k))*V
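
For reference, a minimal sketch of that one line in plain PyTorch (single head, no masking; tensor names and sizes are just illustrative):

    import math
    import torch

    def attention(q, k, v):
        # q, k, v: (seq_len, d_k); scores: (seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(8, 64)
    out = attention(q, k, v)   # (8, 64)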

FlexAttention seems to have found the right abstraction for the task.
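
As a rough illustration of that abstraction: FlexAttention keeps the standard softmax(Q*K^T/sqrt(d_k))*V kernel and lets you pass a small score_mod function that rewrites individual scores, e.g. a relative positional bias. A sketch, assuming the torch.nn.attention.flex_attention API from recent PyTorch releases:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    def relative_bias(score, b, h, q_idx, kv_idx):
        # add a bias that depends only on the query/key distance
        return score + (q_idx - kv_idx)

    q = k = v = torch.randn(1, 4, 128, 64)   # (batch, heads, seq, head_dim)
    out = flex_attention(q, k, v, score_mod=relative_bias)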



Yea, because the math has stripped the whole thing down to: I have data, I do operations on it. While in reality we deal with multi-head attention, grouped-query attention, and positional encodings.

That's all without taking into account the broadcasting done over the batch dimension.
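
To make that concrete, here is what the shapes look like once batch and head dimensions enter; the one-line formula is still there, but every matmul broadcasts over (batch, heads). Names and sizes are illustrative:

    import math
    import torch

    B, H, S, D = 2, 8, 128, 64               # batch, heads, seq len, head dim
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    scores = q @ k.transpose(-2, -1) / math.sqrt(D)   # (B, H, S, S)
    out = torch.softmax(scores, dim=-1) @ v           # (B, H, S, D)

    # grouped-query attention: fewer K/V heads, expanded to match the Q heads
    kv = torch.randn(B, 2, S, D).repeat_interleave(H // 2, dim=1)   # (B, H, S, D)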


I would agree with this. For example, how would you represent causal attention in the standard equation?
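
One common answer is to fold a mask matrix into the same equation: Attention(Q,K,V) = Softmax(Q*K^T/sqrt(d_k) + M)*V, where M is 0 on and below the diagonal and -inf above it, so each query only attends to earlier positions. A sketch:

    import math
    import torch

    def causal_attention(q, k, v):
        s = q.size(-2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # M: 0 where the key is at or before the query, -inf for future keys
        mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)
        return torch.softmax(scores + mask, dim=-1) @ v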


This is true of even just matrix multiplication (A*B), of which attention has two.



