[TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even #16726
Conversation
Signed-off-by: NickLucche <[email protected]>
# Bucketed padding with max_token_size not a power of two.
# (min_token_size and padding_gap are defined earlier in the test.)
max_token_size = 317
expected_paddings = [16, 32, 64, 128, 192, 256, 320]
actual_paddings = _get_token_paddings(min_token_size, max_token_size,
                                      padding_gap)
assert actual_paddings == expected_paddings
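For context, the bucketed schedule above can be reproduced with a simplified sketch (my own reconstruction from the expected values, not the vLLM source): sizes double from the minimum up to the gap, then grow linearly in steps of the gap until the maximum is covered.

```python
def bucketed_token_paddings(min_size: int, max_size: int, gap: int) -> list[int]:
    """Simplified sketch of bucketed padding: double up to `gap`, then grow
    linearly by `gap` until `max_size` is covered. Not the vLLM implementation."""
    paddings, num = [], min_size
    while num <= gap:
        paddings.append(num)
        num *= 2
    num //= 2
    while num < max_size:
        num += gap
        paddings.append(num)
    return paddings

# Reproduces the expected values from the test above.
assert bucketed_token_paddings(16, 317, 64) == [16, 32, 64, 128, 192, 256, 320]
```

Note how the last element (320) overshoots `max_size` (317), which is exactly the mismatch this PR addresses.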
I think we should also add a test when padding_gap == 0?
I did in a previous PR; it's below. That one is for exponential padding.
Thanks for reviewing @yaochengji.
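For reference, an exponential case (`padding_gap == 0`) with the same uneven maximum would look roughly like the snippet below; the expected values simply follow the doubling scheme and are illustrative, not copied from that other PR.

```python
# Hypothetical exponential-padding case (padding_gap == 0): sizes double from
# min_token_size (16, as above) until the first value covering max_token_size.
max_token_size = 317
expected_paddings = [16, 32, 64, 128, 256, 512]
actual_paddings = _get_token_paddings(min_token_size, max_token_size,
                                      padding_gap=0)
assert actual_paddings == expected_paddings
```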
LGTM, thanks for the fix!
`max-num-batched-tokens` values that are not powers of 2 (or simply not even when using bucketing) can silently cause recompilations. This is because, in such cases, both bucketed and exponential padding return a maximal padding value that is greater than `max-num-batched-tokens`.

Assume `max-num-batched-tokens` is set to an uneven value such as 317: the last padding computed by `_get_token_paddings` will be larger (e.g. 320 with bucketing). In turn, auxiliary data structures will still be instantiated with the original uneven value (317), causing the shapes seen at runtime to diverge from the precompiled padded shapes and silently triggering recompilation.

To verify:
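A quick check along these lines demonstrates the mismatch; the import path and keyword names are assumptions about the current module layout, not taken from the PR.

```python
# Assumed import path for the padding helper; adjust if the module has moved.
from vllm.v1.worker.tpu_model_runner import _get_token_paddings

max_num_batched_tokens = 317  # uneven, not a power of two
paddings = _get_token_paddings(min_token_size=16,
                               max_token_size=max_num_batched_tokens,
                               padding_gap=64)
print(paddings)  # e.g. [16, 32, 64, 128, 192, 256, 320]
# The last padding exceeds the configured value, so buffers sized with 317 no
# longer match the padded shapes the model was precompiled for.
assert paddings[-1] > max_num_batched_tokens
```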
We can either have the last padding element always be exactly equal to `max-num-batched-tokens`, or adjust it to the "actual" padded value computed by `_get_token_paddings` (a power of 2, or a "bucketed" power of 2). This PR implements the latter, because I ultimately think having these sizes aligned can help with memory accesses here and there.

I am open to discussing better approaches if you have suggestions.
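Concretely, the chosen approach amounts to something like the sketch below (variable names are illustrative, not the actual diff): the runner derives its effective maximum from the last element of the padding schedule and sizes its auxiliary buffers with that value instead of the raw configured one.

```python
import numpy as np

# Illustrative sketch of the chosen fix, not the actual diff.
paddings = [16, 32, 64, 128, 192, 256, 320]  # schedule for max 317, gap 64
max_num_tokens = paddings[-1]                # 320, the largest padded shape

# Buffers previously sized with the raw 317 now use 320, so every padded batch
# matches a precompiled shape and no recompilation is triggered at runtime.
input_ids = np.zeros(max_num_tokens, dtype=np.int64)
positions = np.zeros(max_num_tokens, dtype=np.int64)
```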