
[Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel #16693


Merged: 1 commit merged into vllm-project:main on Apr 16, 2025

Conversation

@DefTruth (Contributor) commented Apr 16, 2025

Fix a potential CUDA graph break for the merge_attn_states kernel. A CUDA graph error related to merge_state was observed in sglang (sgl-project/sglang#5404) and fixed in sgl-project/sglang#5419. Since merge_attn_states is a fundamental kernel used in many scenarios, it is safer to bind its launch to the current CUDA stream, as CUDA graph capture requires. This binding does not affect performance.
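For illustration, here is a minimal sketch of the stream-binding pattern described above, written as a PyTorch C++/CUDA extension. The kernel and function names (`merge_rows_kernel`, `merge_rows`) are hypothetical placeholders, not the actual vLLM code; the point is only that the launch passes `at::cuda::getCurrentCUDAStream()` instead of implicitly using the legacy default stream.

```cuda
// Hypothetical sketch: launch a kernel on PyTorch's current CUDA stream so
// that the launch is recorded correctly during CUDA graph capture.
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

__global__ void merge_rows_kernel(const float* a, const float* b,
                                  float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a[i] + b[i];  // placeholder for the real merge math
}

void merge_rows(torch::Tensor a, torch::Tensor b, torch::Tensor out) {
  const int n = static_cast<int>(out.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  // Key point: query the current stream and pass it as the 4th launch
  // parameter instead of defaulting to the legacy stream (stream 0).
  const at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
  merge_rows_kernel<<<blocks, threads, 0, stream.stream()>>>(
      a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
}
```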


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@houseroad (Collaborator) left a comment


Looks good.

Could you paste the test plan in the PR description?

@houseroad added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Apr 16, 2025
@DefTruth (Contributor, Author) commented Apr 16, 2025

> Looks good.
>
> Could you paste the test plan in the PR description?

@houseroad The local unit tests for this kernel passed:

python3 -m pytest -s test_merge_attn_states.py
INFO 04-16 12:59:54 [__init__.py:239] Automatically detected platform cuda.
/usr/local/lib/python3.10/dist-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.12, pytest-8.3.3, pluggy-1.5.0
rootdir: /workspace/dev/vipshop/vllm
configfile: pyproject.toml
plugins: anyio-4.9.0, langsmith-0.3.18, forked-1.6.0, shard-0.1.2, buildkite-test-collector-0.1.9, mock-3.14.0, asyncio-0.24.0, rerunfailures-14.0
asyncio: mode=strict, default_loop_scope=None
collected 648 items
Running 648 items in this shard

test_merge_attn_states.py
NUM_TOKENS:256, NUM_HEADS:4, HEAD_SIZE:32, DTYPE: torch.float32, Device: NVIDIA L20
 Torch time: 0.164301ms
Triton time: 0.062115ms
  CUDA time: 0.018589ms, Performance: 3.34154x
----------------------------------------------------------------------------------------------------
Output all match, max abs diff:
(Triton vs Torch) : 4.76837158203125e-07
  (CUDA vs Torch) : 2.384185791015625e-07
  (CUDA vs Triton): 4.76837158203125e-07
----------------------------------------------------------------------------------------------------
Output LSE all match, max abs diff:
(Triton vs Torch) : 1.1920928955078125e-07
  (CUDA vs Torch) : 0.0
  (CUDA vs Triton): 1.1920928955078125e-07
----------------------------------------------------------------------------------------------------
All output values test passed! All inf values are correctly replaced with -inf.
----------------------------------------------------------------------------------------------------
.
NUM_TOKENS:512, NUM_HEADS:4, HEAD_SIZE:32, DTYPE: torch.float32, Device: NVIDIA L20
 Torch time: 0.179862ms
Triton time: 0.059288ms
  CUDA time: 0.017562ms, Performance: 3.37600x
----------------------------------------------------------------------------------------------------
Output all match, max abs diff:
(Triton vs Torch) : 4.76837158203125e-07
  (CUDA vs Torch) : 2.384185791015625e-07
  (CUDA vs Triton): 4.76837158203125e-07
----------------------------------------------------------------------------------------------------
Output LSE all match, max abs diff:
(Triton vs Torch) : 2.384185791015625e-07
  (CUDA vs Torch) : 0.0
  (CUDA vs Triton): 2.384185791015625e-07
----------------------------------------------------------------------------------------------------
All output values test passed! All inf values are correctly replaced with -inf.
----------------------------------------------------------------------------------------------------
.
NUM_TOKENS:613, NUM_HEADS:4, HEAD_SIZE:32, DTYPE: torch.float32, Device: NVIDIA L20
 Torch time: 0.165590ms
Triton time: 0.058877ms
  CUDA time: 0.020069ms, Performance: 2.93375x
----------------------------------------------------------------------------------------------------
Output all match, max abs diff:
(Triton vs Torch) : 4.76837158203125e-07
  (CUDA vs Torch) : 2.384185791015625e-07
  (CUDA vs Triton): 4.76837158203125e-07
----------------------------------------------------------------------------------------------------
Output LSE all match, max abs diff:
(Triton vs Torch) : 2.384185791015625e-07
  (CUDA vs Torch) : 0.0
  (CUDA vs Triton): 2.384185791015625e-07
----------------------------------------------------------------------------------------------------
All output values test passed! All inf values are correctly replaced with -inf.
----------------------------------------------------------------------------------------------------
// ......
============================================================================ 648 passed in 123.44s (0:02:03) =============================================================================

Some performance results:

| tokens | heads | head size | dtype    | device | Torch     | Triton    | CUDA      | speedup (CUDA vs Triton) |
|--------|-------|-----------|----------|--------|-----------|-----------|-----------|--------------------------|
| 256    | 4     | 32        | float32  | L20    | 0.16430ms | 0.06212ms | 0.01859ms | 3.3415x |
| 512    | 4     | 32        | float32  | L20    | 0.17986ms | 0.05929ms | 0.01756ms | 3.3760x |
| 613    | 4     | 32        | float32  | L20    | 0.16559ms | 0.05888ms | 0.02007ms | 2.9337x |
| 1024   | 4     | 32        | float32  | L20    | 0.16312ms | 0.05745ms | 0.01756ms | 3.2713x |
| 1536   | 4     | 32        | float32  | L20    | 0.16476ms | 0.05872ms | 0.01874ms | 3.1334x |
| 4096   | 4     | 32        | float32  | L20    | 0.16675ms | 0.06226ms | 0.01705ms | 3.6520x |
| 256    | 8     | 32        | float32  | L20    | 0.19216ms | 0.05703ms | 0.01746ms | 3.2658x |
| 256    | 32    | 256       | bfloat16 | L20    | 0.16143ms | 0.05386ms | 0.01649ms | 3.2657x |
| 512    | 32    | 256       | bfloat16 | L20    | 0.18058ms | 0.05392ms | 0.01684ms | 3.2010x |
| 613    | 32    | 256       | bfloat16 | L20    | 0.19149ms | 0.05704ms | 0.01736ms | 3.2855x |
| 1024   | 32    | 256       | bfloat16 | L20    | 0.33562ms | 0.06523ms | 0.01916ms | 3.4053x |
| 1536   | 32    | 256       | bfloat16 | L20    | 0.50728ms | 0.07685ms | 0.02422ms | 3.1729x |
| 4096   | 32    | 256       | bfloat16 | L20    | 1.32142ms | 0.32629ms | 0.30771ms | 1.0604x |
| 256    | 48    | 256       | bfloat16 | L20    | 0.16998ms | 0.05412ms | 0.01736ms | 3.1181x |
| 512    | 48    | 256       | bfloat16 | L20    | 0.21401ms | 0.06036ms | 0.01720ms | 3.5087x |
| 613    | 48    | 256       | bfloat16 | L20    | 0.29475ms | 0.06297ms | 0.01803ms | 3.4921x |
| 1024   | 48    | 256       | bfloat16 | L20    | 0.50677ms | 0.07680ms | 0.02417ms | 3.1778x |
| 1536   | 48    | 256       | bfloat16 | L20    | 0.79488ms | 0.20915ms | 0.16789ms | 1.2458x |
| 4096   | 48    | 256       | bfloat16 | L20    | 1.91892ms | 0.52148ms | 0.45199ms | 1.1537x |
| 256    | 64    | 256       | bfloat16 | L20    | 0.18099ms | 0.05652ms | 0.01726ms | 3.2747x |
| 512    | 64    | 256       | bfloat16 | L20    | 0.33525ms | 0.06589ms | 0.01947ms | 3.3851x |
| 613    | 64    | 256       | bfloat16 | L20    | 0.42850ms | 0.06983ms | 0.02104ms | 3.3190x |

For CUDA graph, I only tested it on sglang; see sgl-project/sglang#5419. The attention path for prefill/chunked-prefill in vLLM appears to run in eager mode, and CUDA graph is currently enabled for decoding only, which is why vLLM does not hit the same CUDA graph error for the merge_attn_states kernel that sglang did. But since merge_attn_states is a fundamental kernel used in many scenarios, it is better to bind it to the current CUDA stream, as CUDA graph capture requires. This binding does not affect performance. (A minimal capture sketch follows the quoted code below.)

# Whether or not if cuda graph is enabled.
# Cuda-graph is currently enabled for decoding only.
# TODO(woosuk): Move `use_cuda_graph` out since it's unrelated to attention.
use_cuda_graph: bool
# New for MLA (compared to FlashAttention)
# Input positions for rotary embeddings since for MLA the rotary
# position embeddings are applied inside the attention backend
input_positions: torch.Tensor
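
For context, here is a minimal, self-contained sketch of why the stream matters during capture, using the plain CUDA runtime API with a toy kernel (this is not vLLM or sglang code). Everything issued between cudaStreamBeginCapture and cudaStreamEndCapture must go to the capturing stream; a kernel launched on the legacy default stream at that point would typically invalidate the capture instead of being recorded.

```cuda
// Toy example: capture a kernel launch into a CUDA graph and replay it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale_kernel(float* x, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

int main() {
  const int n = 1 << 20;
  float* x = nullptr;
  cudaMalloc(&x, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;

  // All work between Begin/EndCapture must be issued to `stream`.
  // Launching on the legacy default stream here (omitting the stream
  // argument) would typically invalidate the capture.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(x, 2.0f, n);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(graph_exec, stream);  // replay the captured work
  cudaStreamSynchronize(stream);
  printf("graph replay done\n");

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(x);
  return 0;
}
```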

@vllm-bot merged commit e82ee40 into vllm-project:main on Apr 16, 2025
82 of 89 checks passed
lionelvillard pushed a commit to lionelvillard/vllm that referenced this pull request Apr 17, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@DefTruth deleted the fix-potential-cuda-graph-broken branch on July 2, 2025 05:31
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
3 participants