
Improving Total Token Throughput by 1%: Reducing CPU Overhead in Zero-Overhead Scheduling #4790


Closed
wants to merge 5 commits

Conversation


@WANG-GH commented Mar 26, 2025

Background & Motivation

The goal of the Zero-Overhead Scheduling mechanism is to maximize the overlap between GPU forward computation and CPU scheduling, keeping compute resources fully utilized. In practice, however, the GPU kernel launches issued by TpModelWorkerClient.forward_thread_func_ and the completion synchronization performed in resolve_batch_result cannot be overlapped with GPU execution, which introduces a performance bottleneck.

Therefore, by shortening the time TpModelWorkerClient spends on this CPU-side work, the GPU can start earlier and remain busy longer. The freed-up time can then be used to process more inference batches, increasing throughput.

Optimization Approach

In the original implementation of TpModelWorkerClient, a launch_done event was created and used for synchronization both before and after launching the GPU kernels. A separate copy_done event was also used; because the device-to-host copy is enqueued after the kernel launches, waiting on copy_done already guarantees that the kernels have been launched and that the data has been copied back to the CPU.

Optimization Details

In this optimization, we removed the redundant launch_done event and kept only copy_done, since:

  • copy_done implicitly guarantees that launch_done has completed;
  • this eliminates the overhead of creating, setting, and waiting on launch_done;
  • and it allows the GPU to begin computation earlier and remain fully utilized, resulting in improved throughput (see the sketch after this list).
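
To make the change concrete, below is a minimal sketch of the synchronization pattern, assuming CUDA events on a single stream. The names forward_thread_func_, launch_done, copy_done, and resolve_batch_result follow the PR description; the class name, tensor math, and everything else is illustrative rather than the actual TpModelWorkerClient implementation.

import threading

import torch


class WorkerSketch:
    """Illustrative stand-in for the worker; not the real TpModelWorkerClient."""

    def __init__(self):
        self.stream = torch.cuda.Stream()
        self.launch_done = threading.Event()  # the event removed by this PR
        self.copy_done = torch.cuda.Event()

    def forward_thread_func_(self, batch):
        with torch.cuda.stream(self.stream):
            gpu_out = batch @ batch.T  # stand-in for the model forward pass
            # Before this PR: signal the CPU side that all kernels were launched.
            # self.launch_done.set()
            cpu_out = gpu_out.to("cpu", non_blocking=True)  # async D2H copy (pinned host memory needed for it to be truly async)
            self.copy_done.record(self.stream)  # recorded after the copy on the same stream
        return cpu_out

    def resolve_batch_result(self, cpu_out):
        # Before this PR:
        #     self.copy_done.synchronize()
        #     self.launch_done.wait()
        # After this PR: the copy is enqueued after every kernel launch on the same
        # stream, so copy_done completing already implies the launches have finished.
        self.copy_done.synchronize()
        return cpu_out

The property this relies on is that work submitted to a single CUDA stream executes in enqueue order, which is why copy_done can subsume launch_done.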

Experimental Results

We conducted 10 rounds of throughput testing on an A800 GPU server using the following command:

python3 -m sglang.bench_offline_throughput \
    --model-path /home/wyy/models/llama-8B \
    --dataset-path /home/wyy/models/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompt 100 \
    --attention-backend flashinfer

Before Optimization:

avg Request throughput (req/s):       4.20
avg Input token throughput (tok/s):   1410.32
avg Output token throughput (tok/s):  893.66
avg Total token throughput (tok/s):   2303.99

After Optimization:

avg Request throughput (req/s):       4.25
avg Input token throughput (tok/s):   1424.93
avg Output token throughput (tok/s):  902.92
avg Total token throughput (tok/s):   2327.86

The MMLU score remains stable at 71 before and after the optimization.

Summary of Gains:

  • Total token throughput improved by ~24 tokens/s (2303.99 → 2327.86), a ~1.04% increase;
  • The performance gain is expected to be larger on more powerful GPUs, since faster GPUs are more likely to be bottlenecked by CPU launch overhead;
  • This is a low-cost optimization that removes a single synchronization event, yet it brings a consistent and measurable improvement.


@WANG-GH (Author) commented Mar 28, 2025

It's quite strange. When I run the following command to test throughput:

python3 -m sglang.bench_offline_throughput \
    --model-path /home/wyy/models/llama-8B \
    --dataset-path /home/wyy/models/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompt 100 \
    --attention-backend flashinfer

I get 25 tokens/s higher throughput on H100 GPUs.

However, when I use the following script to test throughput:

python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_default

The overall throughput drops by about 100 tokens/s on H100.

The original resolve_batch_result function contains the following lines:

copy_done.synchronize()
self.launch_done.wait()

If launch_done represents GPU-side synchronized inference completion, then removing launch_done.wait() should theoretically reduce the CPU-side bubble time, and shouldn't cause a drop in overall throughput.

I've added several logs to measure the function execution time, but I still haven't been able to pinpoint why these two different throughput testing methods produce such different results.
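
For reference, the kind of timing log I mean is sketched below; this is illustrative rather than the exact instrumentation, and the placement inside resolve_batch_result is hypothetical:

import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timed(name):
    """Log the wall-clock time of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.3f ms", name, (time.perf_counter() - start) * 1e3)


# Hypothetical placement inside resolve_batch_result:
# with timed("copy_done.synchronize"):
#     copy_done.synchronize()
# with timed("launch_done.wait"):
#     launch_done.wait()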

@merrymercy closed this Apr 21, 2025
@merrymercy (Contributor) commented:

The diff seems small and within the error range.

@merrymercy (Contributor) commented:

Fixed by #5788 (review).

@merrymercy reopened this Apr 27, 2025
@hnyls2002 closed this Apr 28, 2025