
[Misc] Make cached tokenizer pickle-compatible #17048


Merged
merged 13 commits into vllm-project:main from pickle-cached-tokenizer on Apr 27, 2025

Conversation

DarkLight1337 (Member) commented Apr 23, 2025

This is required to support running the multi-modal processor in parallel using multiprocessing (which is WIP).

I also changed all_special_tokens and similar methods to return a list instead of a set, since that's how it is in the HF-defined classes. Not sure why they were converted into sets in the first place in #2879.

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Apr 23, 2025
@DarkLight1337 requested review from njhill and ywang96 on April 23, 2025 at 11:15

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

njhill (Member) commented Apr 23, 2025

I also changed all_special_tokens and similar methods to return a list instead of a set, since that's how it is in the HF-defined classes. Not sure why they were converted into sets in the first place in #2879.

@DarkLight1337 I think these are sets intentionally because we check whether they contain particular tokens and want that to be fast.

Do you have more context as to how/why this is needed for the MM processor multiprocessing? We are not going to be transferring tokenizers between procs at runtime right?
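A quick standalone illustration of the list-versus-set lookup cost being referred to here (made-up token names, not vLLM code):

import timeit

tokens = [f"<extra_{i}>" for i in range(1000)]
token_set = set(tokens)

# Checking the last element: the list scans all 1000 entries,
# while the set does a single hash lookup.
print(timeit.timeit(lambda: "<extra_999>" in tokens, number=10_000))     # O(n) per check
print(timeit.timeit(lambda: "<extra_999>" in token_set, number=10_000))  # O(1) per check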

DarkLight1337 (Member, Author) commented Apr 23, 2025

Do you have more context as to how/why this is needed for the MM processor multiprocessing? We are not going to be transferring tokenizers between procs at runtime right?

We need to transfer the tokenizers between processes because the multi-modal processor calls the tokenizer during processing.
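A toy sketch of why that requires pickle-compatibility (DummyTokenizer and process_input are hypothetical stand-ins, not vLLM code): arguments sent to worker processes go through pickle, so the cached tokenizer wrapper must survive the round trip.

import multiprocessing as mp

class DummyTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]

def process_input(args):
    tokenizer, prompt = args
    # The tokenizer arrives in the worker via pickle; a wrapper that
    # cannot be pickled would fail before reaching this line.
    return tokenizer.encode(prompt)

if __name__ == "__main__":
    tok = DummyTokenizer()
    with mp.Pool(2) as pool:
        print(pool.map(process_input, [(tok, "hi"), (tok, "vllm")]))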

DarkLight1337 (Member, Author) commented Apr 23, 2025

I think these are sets intentionally because we check whether they contain particular tokens and want that to be fast.

Based on a quick Ctrl-F, the only place where we actually check the attribute directly is sample_tokens inside benchmarks/benchmark_prefix_caching.py. Edit: I have updated that function to construct a new set before iterating through each token.

While _adapt_tokenizer in vllm/model_executor/guided_decoding/outlines_logits_processors.py and _convert_tokens_to_string_with_added_encoders in vllm/transformers_utils/detokenizer_utils.py both check whether a token is contained in tokenizer.all_special_tokens, they make a new set from tokenizer.all_special_tokens, so there is no need to create a new set inside the tokenizer itself.
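A sketch of the pattern being described (hypothetical names, not the exact benchmark code): build the set once at the call site, then do the per-token membership checks against it.

import random

def sample_tokens(tokenizer, length):
    # all_special_ids is a list (matching HF), so convert it to a set
    # once before the per-token filtering below.
    special_ids = set(tokenizer.all_special_ids)
    vocab_ids = tokenizer.get_vocab().values()
    candidates = [tid for tid in vocab_ids if tid not in special_ids]
    return random.choices(candidates, k=length)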

njhill (Member) left a comment

@DarkLight1337 sorry for the delay looking at this.

On the set thing, is there any harm in us continuing to return sets for these? It just seems "safer" to me in case lookups are done in future.


def get_cached_tokenizer(tokenizer: AnyTokenizer) -> AnyTokenizer:
    """Get a wrapped tokenizer with cached properties."""
    cached_tokenizer = copy.deepcopy(tokenizer)
njhill (Member) commented on the diff:
Why can't we modify in-place as was done before?

DarkLight1337 (Member, Author) replied Apr 25, 2025

I need the original tokenizer to be passed to __reduce__; otherwise there may be infinite recursion.
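A minimal self-contained sketch of that shape (DummyTokenizer is a stand-in; this is not the actual vLLM implementation): __reduce__ pickles the original, unwrapped tokenizer and re-wraps it on load, so pickle never has to serialize the dynamically created subclass, and reducing to the original object avoids the recursion.

import copy
import pickle

class DummyTokenizer:
    def __init__(self):
        self.all_special_tokens = ["<s>", "</s>"]

def get_cached_tokenizer(tokenizer):
    # Precompute the expensive property once, outside the subclass.
    cached_all_special_tokens = tokenizer.all_special_tokens

    class CachedTokenizer(tokenizer.__class__):
        @property
        def all_special_tokens(self):
            return cached_all_special_tokens

        def __reduce__(self):
            # Pickle the *original* tokenizer and re-wrap on unpickling.
            # Reducing to the wrapper itself would recurse forever.
            return get_cached_tokenizer, (tokenizer,)

    # Deepcopy so the caller's object keeps its class and the original
    # (unwrapped) tokenizer stays available for __reduce__ above.
    cached_tokenizer = copy.deepcopy(tokenizer)
    cached_tokenizer.__class__ = CachedTokenizer
    return cached_tokenizer

tok = get_cached_tokenizer(DummyTokenizer())
restored = pickle.loads(pickle.dumps(tok))
assert restored.all_special_tokens == ["<s>", "</s>"]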

DarkLight1337 (Member, Author) commented Apr 25, 2025

On the set thing, is there any harm in us continuing to return sets for these? It just seems "safer" to me in case lookups are done in future.

@hmellor do you know why those are lists in HF?

IMO given that we type hint this attribute as being a list in TokenizerBase, our code should take this into account and assume that the attribute is a list.

hmellor (Member) commented Apr 26, 2025

do you know why those are lists in HF?

I don't, unfortunately. My best guesses would be some combination of:

  • The tokenizer is stored as JSON, which has no concept of sets (see the snippet below)
  • We want them to be mutable (in vLLM we don't care if they're mutable, though)
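A quick standalone illustration of the JSON point (toy values, not HF code):

import json

# JSON has arrays but no set type, so anything persisted in a tokenizer
# config file has to round-trip as a list.
print(json.dumps(["<s>", "</s>"]))  # '["<s>", "</s>"]'

try:
    json.dumps({"<s>", "</s>"})
except TypeError as exc:
    print(exc)  # Object of type set is not JSON serializable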

njhill (Member) left a comment

Thanks @DarkLight1337, looks much better to me!

@DarkLight1337 DarkLight1337 merged commit 93a126f into vllm-project:main Apr 27, 2025
43 checks passed
@DarkLight1337 DarkLight1337 deleted the pickle-cached-tokenizer branch April 27, 2025 05:05
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025