Hi,
We know that cu_seqlens exists for compute efficiency when training over multiple variable-length samples, and that the attention mask can be derived from cu_seqlens. The alternative would be to cut the packed (cumulative) sequence into a batch of separate sequences and pad the empty positions with zeros, but that wastes training efficiency, since compute is spent on meaningless padding tokens.
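For reference, my mental model of the packed call path is something like the sketch below. I am assuming the flash_attn_varlen_func interface with cu_seqlens built as the exclusive prefix sum of the per-sample lengths; argument names and shapes are to the best of my understanding and may differ across versions, so please correct me if I am misreading it:

```python
# A minimal sketch of the packed (un-padded) call path, assuming the
# flash_attn_varlen_func interface from the flash-attn package.
import torch
from flash_attn import flash_attn_varlen_func

# Three variable-length samples packed back to back, with no padding tokens.
seqlens = torch.tensor([3, 5, 2], dtype=torch.int32, device="cuda")
total_tokens = int(seqlens.sum())            # 10
nheads, headdim = 8, 64

# cu_seqlens is the exclusive prefix sum of the lengths: [0, 3, 8, 10].
cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)

# q, k, v are laid out as (total_tokens, nheads, headdim): all samples concatenated.
q = torch.randn(total_tokens, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# The kernel presumably uses cu_seqlens to locate each sample's start/end, so
# attention never crosses sample boundaries and no padding tokens are computed.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()), max_seqlen_k=int(seqlens.max()),
    causal=True,
)
print(out.shape)  # (total_tokens, nheads, headdim), still packed
```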
I am not very familiar with the implementation details of flash-attn, so I am curious: where can I find the implementation, or an explanation of the mechanism, by which flash-attn computes attention directly over the packed cumulative sequence and still produces separate per-sample results?
Thanks!