[Fix] fix FlashMLA cudagraph config #4591


Closed · wants to merge 1 commit

Conversation

yinfan98 (Collaborator)

Motivation

Fix the FlashMLA CUDA graph configuration.

Below are some benchmark results for FlashMLA decode, compared against FlashInfer:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --attention-backend flashinfer --page-size 64 --enable-flashmla
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 256 --output-len 256
  • FlashMLA

| batch size | latency (s) | output throughput (token/s) | (input + output) throughput (token/s) |
| --- | --- | --- | --- |
| 1 | 8.38 | 30.56 | 61.13 |
| 2 | 10.36 | 49.43 | 98.86 |
| 4 | 10.17 | 100.70 | 201.39 |
| 8 | 11.70 | 175.04 | 350.08 |
| 16 | 12.70 | 322.57 | 645.14 |
| 32 | 15.25 | 537.16 | 1074.32 |

  • FlashInfer

| batch size | latency (s) | output throughput (token/s) | (input + output) throughput (token/s) |
| --- | --- | --- | --- |
| 1 | 8.82 | 29.02 | 58.04 |
| 2 | 10.17 | 50.37 | 100.74 |
| 4 | 9.94 | 103.06 | 206.12 |
| 8 | 12.07 | 169.72 | 339.44 |
| 16 | 12.87 | 318.35 | 636.70 |
| 32 | 15.26 | 536.66 | 1073.32 |

With longer sequences, the performance advantages of FlashMLA become more significant.
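
For illustration, the same benchmark can be rerun with longer prompts to observe this; the `--input-len` below is an arbitrary example value, not a setting tested in this PR:

```bash
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 4096 --output-len 256
```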

Modifications

Checklist

Co-authored-by: sleepcoo <[email protected]>
yinfan98 (Collaborator, Author) commented Mar 19, 2025

Avoid the CPU-GPU synchronization introduced by PR #4514. cc @zhyncs @merrymercy
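
For context on why this matters: CUDA graph decode replays a fixed set of kernels on fixed device buffers, so per-step state must be updated on-device rather than through host round-trips. Below is a minimal, hedged sketch of that pattern in plain PyTorch; the function names and buffer layout are illustrative assumptions, not sglang's actual implementation.

```python
import torch

@torch.inference_mode()
def capture_decode_graph(model, max_bs, hidden):
    # Static input buffer: CUDA graph replay reuses fixed device addresses,
    # so inputs are overwritten in place, never re-allocated per step.
    static_in = torch.zeros(max_bs, hidden, device="cuda")

    # Warm up on a side stream before capture, as PyTorch requires.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)  # output buffer captured by the graph
    return graph, static_in, static_out

def replay_decode(graph, static_in, static_out, step_input):
    # Device-to-device copy only: no .item()/.cpu()/.tolist() on the hot
    # path, so the host never blocks waiting for the GPU. A real system
    # would pad step_input up to the captured batch size.
    static_in[: step_input.shape[0]].copy_(step_input)
    graph.replay()
    return static_out
```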

zhyncs (Member) commented Mar 20, 2025

ref #4577

tianchongchong commented:

Hi @yinfan98, which GPU were the above test results run on? And are there any results comparing against the Triton backend?
