Hi,
We know that cu_seqlens exists for compute efficiency when training over multiple variable-length samples, and that the attention mask can be derived from cu_seqlens. The alternative would be to cut the packed (cumulative) sequence into a batch of separate sequences and pad the empty positions with zeros, but that wastes training efficiency, since compute is spent on meaningless padding tokens.
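For reference, my mental model of the packed call path is something like the sketch below. I am assuming the flash_attn_varlen_func interface with cu_seqlens built as the exclusive prefix sum of the per-sample lengths; argument names and shapes are to the best of my understanding and may differ across versions, so please correct me if I am misreading it:

```python
# A minimal sketch of the packed (un-padded) call path, assuming the
# flash_attn_varlen_func interface from the flash-attn package.
import torch
from flash_attn import flash_attn_varlen_func

# Three variable-length samples packed back to back, with no padding tokens.
seqlens = torch.tensor([3, 5, 2], dtype=torch.int32, device="cuda")
total_tokens = int(seqlens.sum())            # 10
nheads, headdim = 8, 64

# cu_seqlens is the exclusive prefix sum of the lengths: [0, 3, 8, 10].
cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)

# q, k, v are laid out as (total_tokens, nheads, headdim): all samples concatenated.
q = torch.randn(total_tokens, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# The kernel presumably uses cu_seqlens to locate each sample's start/end, so
# attention never crosses sample boundaries and no padding tokens are computed.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seqlens.max()), max_seqlen_k=int(seqlens.max()),
    causal=True,
)
print(out.shape)  # (total_tokens, nheads, headdim), still packed
```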
I am not very familiar with the implementation details of flash-attn, so I am curious: where can I find the implementation, or an explanation of the mechanism, by which flash-attn computes attention directly over the packed cumulative sequence and still produces separate per-sample results?
Thanks!