[V1][Spec Decode] Handle draft tokens beyond max_model_len #16087
Conversation
Overall LGTM. Approving first to unblock the PR. Meanwhile, it would be good to have a unit test for it. Also, do we know the overhead of the introduced ops (e.g., torch.where)?
vllm/v1/spec_decode/eagle.py (outdated)
attn_metadata.slot_mapping = torch.where(
    exceeds_max_model_len,
    PADDING_SLOT_ID,
    attn_metadata.slot_mapping,
)
My understanding is that torch.where will allocate an intermediate tensor and then assign it. Is it possible to use attn_metadata.slot_mapping[exceeds_max_model_len] = PADDING_SLOT_ID so that it's an in-place operation?
@ekagra-ranjan Thanks for the suggestion. I changed it to masked_fill_, which is also an in-place operation. Overall, I think the performance impact will be small since the tensors here are small (shape of [batch_size]).
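For readers skimming the thread, here is a minimal self-contained sketch contrasting the two approaches discussed above. PADDING_SLOT_ID, max_model_len, and the tensor values are illustrative assumptions, not the actual vLLM code:

import torch

PADDING_SLOT_ID = -1  # assumed sentinel value, for illustration only
max_model_len = 2048

positions = torch.tensor([2045, 2046, 2047, 2048, 2049])
slot_mapping = torch.tensor([500, 501, 502, 503, 504])
exceeds_max_model_len = positions >= max_model_len

# Out-of-place: torch.where materializes a new tensor, then reassigns it.
out_of_place = torch.where(
    exceeds_max_model_len,
    torch.full_like(slot_mapping, PADDING_SLOT_ID),
    slot_mapping,
)

# In-place: masked_fill_ overwrites the masked entries directly,
# avoiding the intermediate allocation.
slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)

assert torch.equal(out_of_place, slot_mapping)
print(slot_mapping)  # tensor([500, 501, 502,  -1,  -1])

Either way the cost is tiny here, since the tensors involved have shape [batch_size].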
# out-of-range access during the model execution. The draft tokens
# generated with this adjustment should be ignored.
Can you please point me to the logic of ignoring such draft tokens in this PR?
Good question. The scheduler handles it:

vllm/v1/core/sched/scheduler.py, lines 188 to 193 at af7462b:

# Make sure the input position does not exceed the max model len.
# This is necessary when using spec decoding.
num_new_tokens = min(
    num_new_tokens,
    self.max_model_len - request.num_computed_tokens)
assert num_new_tokens > 0
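To make the clamp concrete, here is a tiny worked example; the numbers are illustrative assumptions, not taken from the PR:

# Suppose max_model_len = 2048 and the request has already computed
# 2045 tokens, while spec decoding proposes 1 bonus + 5 draft tokens.
max_model_len = 2048
num_computed_tokens = 2045
num_new_tokens = 1 + 5

# The scheduler's clamp keeps only the tokens that still fit within
# max_model_len; the overflowing draft tokens are dropped unverified.
num_new_tokens = min(num_new_tokens, max_model_len - num_computed_tokens)
assert num_new_tokens > 0
print(num_new_tokens)  # 3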
This pull request has merge conflicts that must be resolved before it can be merged.
@comaniac Good point. Added a test. Also, I replaced torch.where with the in-place masked_fill_, as suggested above.
Implements item 4 ("Handle the edge cases like when the draft model generates beyond max_pos_embeddings") from #15901.
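Putting the two pieces together, here is a rough end-to-end sketch of the mechanism; names follow the snippets above, and clamping overflowing positions to 0 is an assumption for illustration, not necessarily what eagle.py does:

import torch

PADDING_SLOT_ID = -1  # assumed sentinel, as above
max_model_len = 2048

positions = torch.tensor([2046, 2047, 2048, 2049])
slot_mapping = torch.tensor([500, 501, 502, 503])

# 1) Find draft-token positions that would exceed max_model_len.
exceeds_max_model_len = positions >= max_model_len

# 2) Clamp those positions to a valid index so the draft model's forward
#    pass cannot read out of range (clamp target assumed to be 0 here).
clamped_positions = torch.where(
    exceeds_max_model_len, torch.zeros_like(positions), positions
)

# 3) Redirect their KV-cache writes to a padding slot so no real slot is
#    clobbered; the scheduler clamp shown earlier then drops these
#    draft tokens before verification.
slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)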