
Support fine-grained control of requests that are run together #4699

Open

fzyzcjy wants to merge 32 commits into main

Conversation

fzyzcjy (Collaborator) commented Mar 23, 2025

Motivation

Currently, when submitting requests (e.g. via engine.generate or an HTTP call), we have no control over which requests will be run together in a single batch and which will not, partly because of the inherent nondeterminism of IPC. However, in some scenarios it would be useful to have more control. For example:

  • Benchmarking and profiling (e.g. we want to know the behavior when there are exactly "1024 tokens x 8 requests per GPU"; this is the primary reason for this PR)
  • Testing (e.g. for two-batch overlap, we may want to test that it is disabled when one card has 2 requests while another card has 1)

Thus, this PR adds this feature. Since it is intended only for benchmarking and testing, the code is not efficient (e.g. it makes torch.distributed calls that could be reduced to some extent) and may have rough edges.
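To make the intended usage concrete, here is a minimal sketch against sglang's existing offline engine API. The engine launch and generate calls below are the library's real API; the `batch_group` knob in the comment is purely an assumption used to illustrate the feature's intent, and the actual mechanism added by this PR should be taken from the diff.

```python
# Minimal sketch, assuming sglang's offline engine API; the batching-control
# knob itself is hypothetical and only illustrates the feature's intent.
import sglang as sgl

# Launch an offline engine (model path is a placeholder).
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Benchmarking scenario from the motivation: exactly 8 requests, each
# generating 1024 tokens, submitted so that they land in a single batch.
prompts = ["benchmark prompt"] * 8
sampling_params = {"max_new_tokens": 1024, "temperature": 0.0}

outputs = llm.generate(
    prompts,
    sampling_params,
    # batch_group="bench-8x1024",  # hypothetical knob: force these 8
    #                              # requests into the same scheduling batch
)
for out in outputs:
    print(out["text"][:80])

llm.shutdown()
```

In such a setup, pinning the eight requests to one batch would give reproducible "1024 tokens x 8 requests per GPU" measurements instead of depending on IPC timing to group them.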

Modifications

Checklist

fzyzcjy requested a review from merrymercy as a code owner on March 23, 2025, 12:17
fzyzcjy (Collaborator, Author) commented Apr 1, 2025

Ping me when this PR is about to be merged. Currently I am only resolving conflicts in #4068, and I will port the conflict-resolution code back here when pinged.

fzyzcjy mentioned this pull request on Apr 11, 2025