
[Misc] Make cached tokenizer pickle-compatible #17048


Merged
merged 13 commits into vllm-project:main from pickle-cached-tokenizer on Apr 27, 2025

Conversation

DarkLight1337 (Member) commented Apr 23, 2025

This is required to support running the multi-modal processor in parallel using multiprocessing (which is WIP).

I also changed all_special_tokens and similar methods to return a list instead of a set, since that's how it is in the HF-defined classes. Not sure why they were converted into sets in the first place in #2879.

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Apr 23, 2025
@DarkLight1337 requested review from njhill and ywang96 on April 23, 2025 at 11:15

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

njhill (Member) commented Apr 23, 2025

I also changed all_special_tokens and similar methods to return a list instead of a set, since that's how it is in the HF-defined classes. Not sure why they were converted into sets in the first place in #2879.

@DarkLight1337 I think these are sets intentionally because we check whether they contain particular tokens and want that to be fast.

Do you have more context as to how/why this is needed for the MM processor multiprocessing? We are not going to be transferring tokenizers between procs at runtime right?
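A quick standalone illustration of the list-versus-set lookup cost being referred to here (made-up token names, not vLLM code):

import timeit

tokens = [f"<extra_{i}>" for i in range(1000)]
token_set = set(tokens)

# Checking the last element: the list scans all 1000 entries,
# while the set does a single hash lookup.
print(timeit.timeit(lambda: "<extra_999>" in tokens, number=10_000))     # O(n) per check
print(timeit.timeit(lambda: "<extra_999>" in token_set, number=10_000))  # O(1) per check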

DarkLight1337 (Member, Author) commented Apr 23, 2025

Do you have more context as to how/why this is needed for the MM processor multiprocessing? We are not going to be transferring tokenizers between procs at runtime right?

We need to transfer the tokenizers between processes because the multi-modal processor calls the tokenizer during processing.
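A toy sketch of why that requires pickle-compatibility (DummyTokenizer and process_input are hypothetical stand-ins, not vLLM code): arguments sent to worker processes go through pickle, so the cached tokenizer wrapper must survive the round trip.

import multiprocessing as mp

class DummyTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]

def process_input(args):
    tokenizer, prompt = args
    # The tokenizer arrives in the worker via pickle; a wrapper that
    # cannot be pickled would fail before reaching this line.
    return tokenizer.encode(prompt)

if __name__ == "__main__":
    tok = DummyTokenizer()
    with mp.Pool(2) as pool:
        print(pool.map(process_input, [(tok, "hi"), (tok, "vllm")]))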

DarkLight1337 (Member, Author) commented Apr 23, 2025

I think these are sets intentionally because we check whether they contain particular tokens and want that to be fast.

Based on a quick Ctrl-F, the only place where we actually check the attribute directly is sample_tokens inside benchmarks/benchmark_prefix_caching.py. Edit: I have updated that function to construct a new set before iterating through each token.

While _adapt_tokenizer in vllm/model_executor/guided_decoding/outlines_logits_processors.py and _convert_tokens_to_string_with_added_encoders in vllm/transformers_utils/detokenizer_utils.py both check whether a token is contained in tokenizer.all_special_tokens, they make a new set from tokenizer.all_special_tokens, so there is no need to create a new set inside the tokenizer itself.
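A sketch of the pattern being described (hypothetical names, not the exact benchmark code): build the set once at the call site, then do the per-token membership checks against it.

import random

def sample_tokens(tokenizer, length):
    # all_special_ids is a list (matching HF), so convert it to a set
    # once before the per-token filtering below.
    special_ids = set(tokenizer.all_special_ids)
    vocab_ids = tokenizer.get_vocab().values()
    candidates = [tid for tid in vocab_ids if tid not in special_ids]
    return random.choices(candidates, k=length)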

njhill (Member) left a comment

@DarkLight1337 sorry for the delay looking at this.

On the set thing, is there any harm in us continuing to return sets for these? It just seems "safer" to me in case lookups are done in future.


def get_cached_tokenizer(tokenizer: AnyTokenizer) -> AnyTokenizer:
    """Get a wrapped tokenizer with cached properties."""
    cached_tokenizer = copy.deepcopy(tokenizer)
njhill (Member) commented on the diff:
Why can't we modify in-place as was done before?

DarkLight1337 (Member, Author) replied Apr 25, 2025

I need the original tokenizer to be passed to __reduce__; otherwise there may be infinite recursion.
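A minimal self-contained sketch of that shape (DummyTokenizer is a stand-in; this is not the actual vLLM implementation): __reduce__ pickles the original, unwrapped tokenizer and re-wraps it on load, so pickle never has to serialize the dynamically created subclass, and reducing to the original object avoids the recursion.

import copy
import pickle

class DummyTokenizer:
    def __init__(self):
        self.all_special_tokens = ["<s>", "</s>"]

def get_cached_tokenizer(tokenizer):
    # Precompute the expensive property once, outside the subclass.
    cached_all_special_tokens = tokenizer.all_special_tokens

    class CachedTokenizer(tokenizer.__class__):
        @property
        def all_special_tokens(self):
            return cached_all_special_tokens

        def __reduce__(self):
            # Pickle the *original* tokenizer and re-wrap on unpickling.
            # Reducing to the wrapper itself would recurse forever.
            return get_cached_tokenizer, (tokenizer,)

    # Deepcopy so the caller's object keeps its class and the original
    # (unwrapped) tokenizer stays available for __reduce__ above.
    cached_tokenizer = copy.deepcopy(tokenizer)
    cached_tokenizer.__class__ = CachedTokenizer
    return cached_tokenizer

tok = get_cached_tokenizer(DummyTokenizer())
restored = pickle.loads(pickle.dumps(tok))
assert restored.all_special_tokens == ["<s>", "</s>"]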

DarkLight1337 (Member, Author) commented Apr 25, 2025

On the set thing, is there any harm in us continuing to return sets for these? It just seems "safer" to me in case lookups are done in future.

@hmellor do you know why those are lists in HF?

IMO given that we type hint this attribute as being a list in TokenizerBase, our code should take this into account and assume that the attribute is a list.

hmellor (Member) commented Apr 26, 2025

do you know why those are lists in HF?

I don't, unfortunately. My best guesses would be some combination of:

  • The tokenizer is stored as JSON, which has no concept of sets (see the snippet below)
  • We want them to be mutable (in vLLM we don't care if they're mutable, though)
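A quick standalone illustration of the JSON point (toy values, not HF code):

import json

# JSON has arrays but no set type, so anything persisted in a tokenizer
# config file has to round-trip as a list.
print(json.dumps(["<s>", "</s>"]))  # '["<s>", "</s>"]'

try:
    json.dumps({"<s>", "</s>"})
except TypeError as exc:
    print(exc)  # Object of type set is not JSON serializable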

njhill (Member) left a comment

Thanks @DarkLight1337, looks much better to me!

@DarkLight1337 DarkLight1337 merged commit 93a126f into vllm-project:main Apr 27, 2025
43 checks passed
@DarkLight1337 DarkLight1337 deleted the pickle-cached-tokenizer branch April 27, 2025 05:05
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025