[V1][Spec Decode] Handle draft tokens beyond max_model_len #16087
Conversation
Overall LGTM. Approving first to unblock the PR. Meanwhile, it would be good to have a unit test for it. Also, do we know the overhead of the introduced ops (e.g., torch.where)?
vllm/v1/spec_decode/eagle.py (outdated)
attn_metadata.slot_mapping = torch.where(
    exceeds_max_model_len,
    PADDING_SLOT_ID,
    attn_metadata.slot_mapping,
)
My understanding is that torch.where will allocate an intermediate tensor and then assign it. Is it possible to use attn_metadata.slot_mapping[exceeds_max_model_len] = PADDING_SLOT_ID so that it's an in-place operation?
@ekagra-ranjan Thanks for the suggestion. I changed it to masked_fill_, which is also an in-place operation. Overall, I think the performance impact will be small since the tensors here are small (shape of [batch_size]).
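For readers skimming the thread, here is a minimal self-contained sketch contrasting the two approaches discussed above. PADDING_SLOT_ID, max_model_len, and the tensor values are illustrative assumptions, not the actual vLLM code:

import torch

PADDING_SLOT_ID = -1  # assumed sentinel value, for illustration only
max_model_len = 2048

positions = torch.tensor([2045, 2046, 2047, 2048, 2049])
slot_mapping = torch.tensor([500, 501, 502, 503, 504])
exceeds_max_model_len = positions >= max_model_len

# Out-of-place: torch.where materializes a new tensor, then reassigns it.
out_of_place = torch.where(
    exceeds_max_model_len,
    torch.full_like(slot_mapping, PADDING_SLOT_ID),
    slot_mapping,
)

# In-place: masked_fill_ overwrites the masked entries directly,
# avoiding the intermediate allocation.
slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)

assert torch.equal(out_of_place, slot_mapping)
print(slot_mapping)  # tensor([500, 501, 502,  -1,  -1])

Either way the cost is tiny here, since the tensors involved have shape [batch_size].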
# out-of-range access during the model execution. The draft tokens
# generated with this adjustment should be ignored.
Can you please point me to the logic of ignoring such draft tokens in this PR?
Good question. The scheduler handles it:

vllm/v1/core/sched/scheduler.py, lines 188 to 193 at af7462b:

# Make sure the input position does not exceed the max model len.
# This is necessary when using spec decoding.
num_new_tokens = min(
    num_new_tokens,
    self.max_model_len - request.num_computed_tokens)
assert num_new_tokens > 0
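To make the clamp concrete, here is a tiny worked example; the numbers are illustrative assumptions, not taken from the PR:

# Suppose max_model_len = 2048 and the request has already computed
# 2045 tokens, while spec decoding proposes 1 bonus + 5 draft tokens.
max_model_len = 2048
num_computed_tokens = 2045
num_new_tokens = 1 + 5

# The scheduler's clamp keeps only the tokens that still fit within
# max_model_len; the overflowing draft tokens are dropped unverified.
num_new_tokens = min(num_new_tokens, max_model_len - num_computed_tokens)
assert num_new_tokens > 0
print(num_new_tokens)  # 3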
This pull request has merge conflicts that must be resolved before it can be merged.
@comaniac Good point. Added a test. Also, I replaced torch.where with the in-place masked_fill_, as suggested above.
Implements item 4 ("Handle the edge cases like when the draft model generates beyond max_pos_embeddings") from #15901.
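Putting the two pieces together, here is a rough end-to-end sketch of the mechanism; names follow the snippets above, and clamping overflowing positions to 0 is an assumption for illustration, not necessarily what eagle.py does:

import torch

PADDING_SLOT_ID = -1  # assumed sentinel, as above
max_model_len = 2048

positions = torch.tensor([2046, 2047, 2048, 2049])
slot_mapping = torch.tensor([500, 501, 502, 503])

# 1) Find draft-token positions that would exceed max_model_len.
exceeds_max_model_len = positions >= max_model_len

# 2) Clamp those positions to a valid index so the draft model's forward
#    pass cannot read out of range (clamp target assumed to be 0 here).
clamped_positions = torch.where(
    exceeds_max_model_len, torch.zeros_like(positions), positions
)

# 3) Redirect their KV-cache writes to a padding slot so no real slot is
#    clobbered; the scheduler clamp shown earlier then drops these
#    draft tokens before verification.
slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)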