
[V1][Perf] Faster incremental detokenization #15137


Merged
merged 16 commits into vllm-project:main
Apr 17, 2025

Conversation

njhill
Member

@njhill njhill commented Mar 19, 2025

Exploit tokenizers library's new DecodeStream functionality for "fast" HF tokenizers.

This won't help with mistral tokenizers unfortunately.

cc @robertgshaw2-redhat

Signed-off-by: Nick Hill <[email protected]>
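
For readers unfamiliar with why incremental detokenization is tricky: a decoder cannot always emit text after every token, because a token's bytes may end mid-way through a multi-byte UTF-8 character. Below is a minimal pure-Python sketch of that buffering problem using a hypothetical toy vocabulary; it is an illustration only, not the tokenizers or vLLM implementation.

```python
# Hypothetical sketch of the problem a streaming decoder must solve:
# decode one token at a time, holding back bytes that are not yet
# valid UTF-8. TOY_VOCAB is an invented id -> bytes mapping.

TOY_VOCAB = {
    0: b"Hel",
    1: b"lo ",
    2: b"\xe4\xb8",  # first two bytes of a 3-byte UTF-8 character
    3: b"\x96",      # final byte of that character ("世")
}

def stream_decode(token_ids):
    """Yield printable text incrementally, buffering incomplete UTF-8."""
    pending = b""
    for tid in token_ids:
        pending += TOY_VOCAB[tid]
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            continue  # wait for more bytes before emitting anything
        pending = b""
        yield text

chunks = list(stream_decode([0, 1, 2, 3]))
# token 2 alone is invalid UTF-8, so its text is emitted with token 3:
# chunks == ["Hel", "lo ", "世"]
```

The DecodeStream API in recent tokenizers releases maintains this kind of per-request decoding state natively in Rust, which is what makes it faster than repeatedly re-decoding a growing token suffix from Python.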

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 19, 2025
@njhill njhill changed the title [V1] Faster incremental detokenization [V1][Perf] Faster incremental detokenization Mar 19, 2025
@WoosukKwon
Collaborator

@njhill Thanks for doing this! Do you have any performance numbers by any chance?

@njhill
Member Author

njhill commented Mar 19, 2025

@WoosukKwon looks like it's noticeable for higher QPS:

Before:

============ Serving Benchmark Result ============
Successful requests:                     6000      
Benchmark duration (s):                  81.47     
Total input tokens:                      1322102   
Total generated tokens:                  1205048   
Request throughput (req/s):              73.65     
Output token throughput (tok/s):         14791.01  
Total Token throughput (tok/s):          31018.76  
---------------Time to First Token----------------
Mean TTFT (ms):                          207.24    
Median TTFT (ms):                        179.43    
P99 TTFT (ms):                           1081.34   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.78     
Median TPOT (ms):                        22.36     
P99 TPOT (ms):                           34.43     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.03     
Median ITL (ms):                         25.08     
P99 ITL (ms):                            48.39     
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     6000      
Benchmark duration (s):                  80.39     
Total input tokens:                      1322102   
Total generated tokens:                  1205048   
Request throughput (req/s):              74.64     
Output token throughput (tok/s):         14990.72  
Total Token throughput (tok/s):          31437.59  
---------------Time to First Token----------------
Mean TTFT (ms):                          166.37    
Median TTFT (ms):                        136.98    
P99 TTFT (ms):                           1054.43   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.49     
Median TPOT (ms):                        22.21     
P99 TPOT (ms):                           34.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.49     
Median ITL (ms):                         21.41     
P99 ITL (ms):                            44.67     
==================================================

@WoosukKwon
Collaborator

@njhill The speedup looks promising! 🚀

Is this PR ready for review? If so, could you please add tests?

@njhill
Member Author

njhill commented Mar 21, 2025

@WoosukKwon yes it's ready apart from some additional tests.

@WoosukKwon
Collaborator

@njhill Got it. Could you please resolve the merge conflict?

@njhill
Member Author

njhill commented Mar 21, 2025

@WoosukKwon I'm just updating tests, will push those and resolve conflict at the same time.

@njhill
Member Author

njhill commented Mar 22, 2025

Ready now @WoosukKwon, sorry the tests turned out to be a bit of a can of worms 😅

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 22, 2025
@WoosukKwon
Collaborator

@robertgshaw2-redhat would you be interested in reviewing this PR? I'm not sure if any reviewer is assigned.

@njhill
Member Author

njhill commented Mar 24, 2025

I will push a fix for the remaining test failure asap.

prompt_suffix = suffix
break

# Prime the stream.
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat Mar 24, 2025

nice way to skip the priming.

Collaborator

@robertgshaw2-redhat robertgshaw2-redhat left a comment

Nice work.

@njhill
Member Author

njhill commented Mar 25, 2025

Looks like the remaining test failures are "real", need to dig into the behavior differences that are causing them.


mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
…rs-detok

# Conflicts:
#	requirements/common.txt
#	requirements/test.in
@mergify mergify bot removed the needs-rebase label Apr 16, 2025
@njhill
Member Author

njhill commented Apr 17, 2025

Addressed the remaining test failures (caused by incorrect behaviour in the existing incremental detokenization, which doesn't skip some special tokens when it should). This should now be good to merge.
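
The behaviour difference described above can be illustrated with a tiny sketch. The vocabulary, special-token set, and function below are invented for illustration and are not the actual vLLM code: the point is only that when skip_special_tokens is set, special ids must be dropped as tokens stream in, rather than leaking into the decoded text.

```python
# Hypothetical illustration of special-token skipping during
# incremental decoding. TOY_VOCAB and SPECIAL_IDS are invented.

TOY_VOCAB = {0: "Hi", 1: "!", 2: "</s>"}
SPECIAL_IDS = {2}  # e.g. an EOS token id in this toy vocabulary

def decode_incremental(token_ids, skip_special_tokens=True):
    """Decode token ids one at a time, optionally dropping special ids."""
    out = []
    for tid in token_ids:
        if skip_special_tokens and tid in SPECIAL_IDS:
            continue  # special tokens never reach the output text
        out.append(TOY_VOCAB[tid])
    return "".join(out)

# decode_incremental([0, 1, 2]) -> "Hi!"  (EOS is skipped)
# decode_incremental([0, 1, 2], skip_special_tokens=False) -> "Hi!</s>"
```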

@vllm-bot vllm-bot merged commit 05fcd1b into vllm-project:main Apr 17, 2025
64 of 69 checks passed
@njhill njhill deleted the tokenizers-detok branch April 17, 2025 14:51
tarukumar pushed a commit to tarukumar/vllm that referenced this pull request Apr 17, 2025
tarukumar pushed a commit to tarukumar/vllm that referenced this pull request Apr 17, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Labels
ci/build frontend ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants