[Question] Questions regarding INTERLEAVE vs. SW128 Layouts for SM90 Sparse Attention Decode #166

@pengwubj

Description

In the SM90 implementation of sparse attention for decoding, the LayoutKV is currently set to INTERLEAVE. Specifically, after loading the KV cache from global memory (HBM), the data is dequantized and stored in Shared Memory (SMEM) using this interleaved layout.

In this implementation, 4 threads cooperate to load a single token, so the eight 4-thread groups within a warp access different addresses in global memory while writing to contiguous addresses in SMEM.

I have the following questions regarding this design:

Why was the INTERLEAVE layout chosen over SW128? Is this decision primarily intended to reduce CUDA core ALU overhead for address calculation, or are there specific bank conflict considerations for SM90?

Why not utilize all 32 threads in a warp to load the 576B of a single token? Is this constraint due to the complexity of handling the 16B floating-point scales during dequantization, or is it related to optimizing memory coalescing?

Will 8 random global addresses within one warp lead to very bad performance?
