Improving Total Token Throughput by 1%: Reducing CPU Overhead in Zero-Overhead Scheduling #4790
Background & Motivation
In the Zero-Overhead Scheduling mechanism, the goal is to maximize the overlap between GPU forward computation and CPU scheduling, keeping compute resources fully utilized. In practice, however, the CPU time spent launching GPU kernels in TpModelWorkerClient.forward_thread_func_ and performing the finish synchronization in resolve_batch_result cannot be hidden behind GPU execution, which introduces a performance bottleneck.
Shortening the time TpModelWorkerClient spends on this CPU work therefore lets the GPU start each forward pass earlier and stay busy longer. The freed-up time can then be used to process more inference batches, increasing throughput.
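As a rough illustration of where this CPU time sits on the critical path, the scheduler-side event loop looks something like the sketch below. This is a minimal sketch with simplified, assumed names (get_next_batch_to_run, forward_batch, process_batch_result), not the actual SGLang implementation.

```python
def event_loop_overlap(scheduler, worker):
    """Overlap CPU scheduling of batch N+1 with GPU execution of batch N."""
    last_batch = None
    while True:
        batch = scheduler.get_next_batch_to_run()   # CPU-side scheduling work
        if batch is not None:
            worker.forward_batch(batch)             # hand off to worker thread
        if last_batch is not None:
            # Any CPU time spent here delays the point at which the GPU can
            # begin the next batch, so trimming it grows GPU busy time.
            result = worker.resolve_batch_result(last_batch)
            scheduler.process_batch_result(last_batch, result)
        last_batch = batch
```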
Optimization Approach
In the original implementation of TpModelWorkerClient, a launch_done event was created and used for synchronization both before and after launching the GPU kernels. A separate copy_done event was also used; waiting on it implicitly guarantees both that the kernels have already been launched and that the output data has been copied back to the CPU.
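The pre-optimization pattern can be sketched roughly as follows. This is a simplified, assumed reconstruction rather than the verbatim SGLang code: launch_done is modeled as a threading.Event, copy_done as a torch.cuda.Event, and the helper names are illustrative.

```python
import threading

import torch

def forward_thread_step(model, batch, output_queue, launch_done: threading.Event):
    # Background thread: enqueue the forward kernels asynchronously.
    logits = model(batch)
    launch_done.set()                    # 1st sync point: kernels launched
    copy_done = torch.cuda.Event()
    next_tokens = logits.argmax(dim=-1).to("cpu", non_blocking=True)
    copy_done.record()                   # 2nd sync point: D2H copy enqueued
    output_queue.put((copy_done, next_tokens))

def resolve_batch_result(output_queue, launch_done: threading.Event):
    # Consumer side: two waits sit on the critical path.
    launch_done.wait()                   # wait for the launch ...
    copy_done, next_tokens = output_queue.get()
    copy_done.synchronize()              # ... then wait for the copy
    return next_tokens
```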
Optimization Details
In this optimization, we removed the redundant launch_done event and kept only copy_done: waiting for copy_done already implies that the kernels have been launched and that the results have been copied back to the CPU, so the extra event adds CPU overhead without providing any additional ordering guarantee.
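Under the same simplified sketch as above (again an assumed reconstruction, not the verbatim change), the consumer side reduces to a single wait:

```python
def resolve_batch_result(output_queue):
    # copy_done can only fire after the kernels were launched and the outputs
    # were copied back, so the separate launch_done event (and the per-step
    # CPU cost of creating, setting, and waiting on it) is unnecessary.
    copy_done, next_tokens = output_queue.get()
    copy_done.synchronize()
    return next_tokens
```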
Experimental Results
We conducted 10 rounds of throughput testing on an A800 GPU server using the following command:
Before Optimization:
After Optimization:
The MMLU score remained stable at 71 before and after the optimization, indicating the change does not affect accuracy.
Summary of Gains: total token throughput improved by roughly 1%, with the MMLU score unchanged at 71.