In the SM90 implementation of sparse attention for decoding, LayoutKV is currently set to INTERLEAVE. Specifically, after the KV cache is loaded from global memory (HBM), the data is dequantized and stored in shared memory (SMEM) using this interleaved layout.
In this implementation, 4 threads cooperate to load a single token, so one warp covers 8 tokens. The 8 four-thread groups access scattered (non-contiguous) addresses in global memory, yet their stores target contiguous addresses in SMEM.
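To make the access pattern concrete, here is a minimal host-side model of the lane-to-address mapping described above (4 lanes per token, 16B per lane access, 8 tokens per warp). The token indices, the 16B access width, and the flat per-token layout are illustrative assumptions for this sketch, not the actual kernel code:

```python
# Hypothetical model of one warp's first memory transaction
# (assumed parameters, not the real kernel's constants).
LANES_PER_TOKEN = 4
BYTES_PER_ACCESS = 16
TOKEN_BYTES = 576          # quantized KV data + scales for one token

def gmem_address(lane, token_indices, kv_base=0):
    """Global address touched by `lane` on its first 16B access."""
    group = lane // LANES_PER_TOKEN   # which of the 8 tokens in this warp
    sub = lane % LANES_PER_TOKEN      # position within that token
    token = token_indices[group]      # sparse: selected tokens are scattered
    return kv_base + token * TOKEN_BYTES + sub * BYTES_PER_ACCESS

def smem_address(lane, smem_base=0):
    """SMEM destination: the warp writes one contiguous region."""
    return smem_base + lane * BYTES_PER_ACCESS

# Sparse selection: 8 arbitrary token indices handled by one warp.
tokens = [7, 42, 3, 99, 15, 63, 28, 81]
gaddrs = [gmem_address(lane, tokens) for lane in range(32)]
saddrs = [smem_address(lane) for lane in range(32)]

# Each 4-lane group reads one contiguous 64B segment, but the 8 groups
# start at scattered addresses, so the warp can need up to 8 separate
# global transactions; the SMEM side stays fully contiguous.
print(sorted({a // 64 for a in gaddrs}))   # 8 distinct 64B segments
print(saddrs[:4])                          # contiguous SMEM offsets
```

Note that each lane still needs several iterations to cover all 576B of its token (4 lanes x 16B = 64B per round); the model above shows only the first round, which is enough to see the 8-way scatter in global memory.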
I have the following questions regarding this design:
Why was the INTERLEAVE layout chosen over SW128? Was this decision primarily intended to reduce CUDA-core ALU overhead for address calculation, or are there specific bank-conflict considerations on SM90?
Why not utilize all 32 threads in a warp to load the 576B of a single token? Is this constraint due to the complexity of handling the 16B floating-point scales during dequantization, or is it related to memory-coalescing optimization?
Will issuing 8 random global addresses within one warp lead to very poor performance?