
Improving Total Token Throughput by 1%: Reducing CPU Overhead in Zero-Overhead Scheduling #4790


Closed
wants to merge 5 commits

Conversation


@WANG-GH commented Mar 26, 2025

Background & Motivation

The goal of the Zero-Overhead Scheduling mechanism is to maximize the overlap between GPU forward computation and CPU scheduling, keeping compute resources fully utilized. In practice, however, the GPU kernel launches issued by TpModelWorkerClient.forward_thread_func_ and the completion synchronization performed in resolve_batch_result cannot be overlapped with GPU execution, which introduces a performance bottleneck.

Therefore, by shortening the time TpModelWorkerClient spends on this CPU-side work, the GPU can start earlier and remain busy longer. The freed-up time can then be used to process more inference batches, increasing throughput.

Optimization Approach

In the original implementation of TpModelWorkerClient, a launch_done event was created and used for synchronization both before and after launching the GPU kernels. A separate copy_done event was also used; because the device-to-host copy is enqueued after the kernel launches, waiting on copy_done already guarantees that the kernels have been launched and that the data has been copied back to the CPU.

Optimization Details

In this optimization, we removed the redundant launch_done event and kept only copy_done, since:

  • copy_done implicitly guarantees that launch_done has completed;
  • this eliminates the overhead of creating, setting, and waiting on launch_done;
  • and it allows the GPU to begin computation earlier and remain fully utilized, resulting in improved throughput (see the sketch after this list).
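
To make the change concrete, below is a minimal sketch of the synchronization pattern, assuming CUDA events on a single stream. The names forward_thread_func_, launch_done, copy_done, and resolve_batch_result follow the PR description; the class name, tensor math, and everything else is illustrative rather than the actual TpModelWorkerClient implementation.

import threading

import torch


class WorkerSketch:
    """Illustrative stand-in for the worker; not the real TpModelWorkerClient."""

    def __init__(self):
        self.stream = torch.cuda.Stream()
        self.launch_done = threading.Event()  # the event removed by this PR
        self.copy_done = torch.cuda.Event()

    def forward_thread_func_(self, batch):
        with torch.cuda.stream(self.stream):
            gpu_out = batch @ batch.T  # stand-in for the model forward pass
            # Before this PR: signal the CPU side that all kernels were launched.
            # self.launch_done.set()
            cpu_out = gpu_out.to("cpu", non_blocking=True)  # async D2H copy (pinned host memory needed for it to be truly async)
            self.copy_done.record(self.stream)  # recorded after the copy on the same stream
        return cpu_out

    def resolve_batch_result(self, cpu_out):
        # Before this PR:
        #     self.copy_done.synchronize()
        #     self.launch_done.wait()
        # After this PR: the copy is enqueued after every kernel launch on the same
        # stream, so copy_done completing already implies the launches have finished.
        self.copy_done.synchronize()
        return cpu_out

The property this relies on is that work submitted to a single CUDA stream executes in enqueue order, which is why copy_done can subsume launch_done.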

Experimental Results

We conducted 10 rounds of throughput testing on an A800 GPU server using the following command:

python3 -m sglang.bench_offline_throughput \
    --model-path /home/wyy/models/llama-8B \
    --dataset-path /home/wyy/models/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompt 100 \
    --attention-backend flashinfer

Before Optimization:

avg Request throughput (req/s):       4.20
avg Input token throughput (tok/s):   1410.32
avg Output token throughput (tok/s):  893.66
avg Total token throughput (tok/s):   2303.99

After Optimization:

avg Request throughput (req/s):       4.25
avg Input token throughput (tok/s):   1424.93
avg Output token throughput (tok/s):  902.92
avg Total token throughput (tok/s):   2327.86

The MMLU score remains stable at 71 before and after the optimization.

Summary of Gains:

  • Total token throughput improved by ~24 tokens/s (2303.99 → 2327.86), a ~1.04% increase;
  • The performance gain is expected to be larger on more powerful GPUs, since faster GPUs are more likely to be bottlenecked by CPU launch overhead;
  • This is a low-cost optimization that removes a single synchronization event, yet it brings a consistent and measurable improvement.


@WANG-GH (Author) commented Mar 28, 2025

It's quite strange. When I run the following command to test throughput:

python3 -m sglang.bench_offline_throughput \
    --model-path /home/wyy/models/llama-8B \
    --dataset-path /home/wyy/models/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompt 100 \
    --attention-backend flashinfer

I get 25 tokens/s higher throughput on H100 GPUs.

However, when I use the following script to test throughput:

python3 -m unittest test_bench_serving.TestBenchServing.test_offline_throughput_default

The overall throughput drops by about 100 tokens/s on H100.

The original resolve_batch_result function contains the following lines:

copy_done.synchronize()
self.launch_done.wait()

If launch_done represents GPU-side synchronized inference completion, then removing launch_done.wait() should theoretically reduce the CPU-side bubble time, and shouldn't cause a drop in overall throughput.

I've added several logs to measure the function execution time, but I still haven't been able to pinpoint why these two different throughput testing methods produce such different results.
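
For reference, the kind of timing log I mean is sketched below; this is illustrative rather than the exact instrumentation, and the placement inside resolve_batch_result is hypothetical:

import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timed(name):
    """Log the wall-clock time of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.3f ms", name, (time.perf_counter() - start) * 1e3)


# Hypothetical placement inside resolve_batch_result:
# with timed("copy_done.synchronize"):
#     copy_done.synchronize()
# with timed("launch_done.wait"):
#     launch_done.wait()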

@merrymercy closed this Apr 21, 2025
@merrymercy (Contributor) commented:

The diff seems small and within the error range.

@merrymercy (Contributor) commented:

Fixed by #5788 (review).

@merrymercy reopened this Apr 27, 2025
@hnyls2002 closed this Apr 28, 2025