In the SM90 implementation of sparse attention for decoding, LayoutKV is currently set to INTERLEAVE. Specifically, after the KV cache is loaded from global memory (HBM), the data is dequantized and stored in shared memory (SMEM) using this interleaved layout.
In this implementation, 4 threads cooperate to load a single token, so one warp covers 8 tokens. The 8 four-thread groups access scattered (non-contiguous) addresses in global memory, yet their stores target contiguous addresses in SMEM.
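To make the access pattern concrete, here is a minimal host-side model of the lane-to-address mapping described above (4 lanes per token, 16B per lane access, 8 tokens per warp). The token indices, the 16B access width, and the flat per-token layout are illustrative assumptions for this sketch, not the actual kernel code:

```python
# Hypothetical model of one warp's first memory transaction
# (assumed parameters, not the real kernel's constants).
LANES_PER_TOKEN = 4
BYTES_PER_ACCESS = 16
TOKEN_BYTES = 576          # quantized KV data + scales for one token

def gmem_address(lane, token_indices, kv_base=0):
    """Global address touched by `lane` on its first 16B access."""
    group = lane // LANES_PER_TOKEN   # which of the 8 tokens in this warp
    sub = lane % LANES_PER_TOKEN      # position within that token
    token = token_indices[group]      # sparse: selected tokens are scattered
    return kv_base + token * TOKEN_BYTES + sub * BYTES_PER_ACCESS

def smem_address(lane, smem_base=0):
    """SMEM destination: the warp writes one contiguous region."""
    return smem_base + lane * BYTES_PER_ACCESS

# Sparse selection: 8 arbitrary token indices handled by one warp.
tokens = [7, 42, 3, 99, 15, 63, 28, 81]
gaddrs = [gmem_address(lane, tokens) for lane in range(32)]
saddrs = [smem_address(lane) for lane in range(32)]

# Each 4-lane group reads one contiguous 64B segment, but the 8 groups
# start at scattered addresses, so the warp can need up to 8 separate
# global transactions; the SMEM side stays fully contiguous.
print(sorted({a // 64 for a in gaddrs}))   # 8 distinct 64B segments
print(saddrs[:4])                          # contiguous SMEM offsets
```

Note that each lane still needs several iterations to cover all 576B of its token (4 lanes x 16B = 64B per round); the model above shows only the first round, which is enough to see the 8-way scatter in global memory.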
I have the following questions regarding this design:
Why was the INTERLEAVE layout chosen over SW128? Was this decision primarily intended to reduce CUDA-core ALU overhead for address calculation, or are there specific bank-conflict considerations on SM90?
Why not utilize all 32 threads in a warp to load the 576B of a single token? Is this constraint due to the complexity of handling the 16B floating-point scales during dequantization, or is it related to memory-coalescing optimization?
Will issuing 8 random global addresses within one warp lead to very poor performance?