Support async in DeepEP #4610

Merged
merged 35 commits into from
Mar 23, 2025
fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Mar 20, 2025

Motivation

#4068 requires DeepEP to run asynchronously. This PR enables that in a minimal way.

(This is a separate PR because #4068 may take more than a day to finish; extracting this part first should reduce merge conflicts.)

Related: #4232 (Initial DeepEP support)

When viewing the diff, please subtract the changes from #4608.

To demonstrate that this PR works, I temporarily set async_finish=True. In practice, async will probably be enabled only when doing two-batch overlap; that flag will be handled in #4068. (I can also set it to false in this PR if needed.)

Modifications

Checklist

@fzyzcjy fzyzcjy changed the title Feat/deepseek async Support async in DeepEP Mar 20, 2025
@fzyzcjy fzyzcjy marked this pull request as ready for review March 20, 2025 06:37
@fzyzcjy fzyzcjy requested a review from merrymercy as a code owner March 20, 2025 06:37
@ch-wan
Collaborator

ch-wan commented Mar 23, 2025

For future reference, I ran a simple benchmark of this PR on one 8xH200 machine. Here are the commands (server launch, then the serving benchmark):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code   --tp 8 --dp 8 --host 0.0.0.0 --port 30000   --enable-dp-attention --enable-deepep-moe   --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 512 --random-input 1000 --random-output 1000 --random-range-ratio 1 --host 127.0.0.1 --port 30000 --max-concurrency 128

Results:

| Version | Concurrency | Input | Output | Num Requests | Input Throughput (tok/s) | Output Throughput (tok/s) | Total Throughput (tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 127.98 | 1000 | 1000 | 512 | 555.48 | 555.48 | 1110.97 |
| Current | 127.96 | 1000 | 1000 | 512 | 612.16 | 612.16 | 1224.33 |
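
To put the two rows in perspective, here is a minimal sketch that computes the relative throughput change from the reported numbers. The helper name `relative_gain_pct` is hypothetical and is not part of sglang; only the throughput figures come from the results above.

```python
def relative_gain_pct(baseline: float, current: float) -> float:
    """Return the percentage change of `current` relative to `baseline`."""
    return (current - baseline) / baseline * 100.0


# Total throughput (tok/s) as reported in the benchmark results.
baseline_total = 1110.97  # before this PR
current_total = 1224.33   # with this PR

# Output throughput (tok/s) as reported in the benchmark results.
baseline_output = 555.48
current_output = 612.16

total_gain = relative_gain_pct(baseline_total, current_total)
output_gain = relative_gain_pct(baseline_output, current_output)

print(f"total throughput change:  {total_gain:.1f}%")
print(f"output throughput change: {output_gain:.1f}%")
```

Both figures come out to roughly a 10% improvement. This is a single run on one machine, so it should be read as an indication rather than a precise measurement.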

@fzyzcjy fzyzcjy mentioned this pull request Mar 23, 2025
6 tasks
@zhyncs zhyncs merged commit ca75741 into sgl-project:main Mar 23, 2025
0 of 18 checks passed
@ch-wan ch-wan mentioned this pull request Mar 24, 2025
18 tasks