[Misc] Make cached tokenizer pickle-compatible #17048
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 I think these are sets intentionally because we check whether they contain particular tokens and want that to be fast. Do you have more context as to how/why this is needed for the MM processor multiprocessing? We are not going to be transferring tokenizers between procs at runtime, right?

We need to transfer the tokenizers between processes because the multi-modal processor calls the tokenizer during processing.
Based on a quick Ctrl-F, the only place where we actually check the attribute directly is While
@DarkLight1337 sorry for the delay looking at this.

On the `set` thing, is there any harm in us continuing to return sets for these? It just seems "safer" to me in case lookups are done in future.
vllm/transformers_utils/tokenizer.py (outdated)

```python
def get_cached_tokenizer(tokenizer: AnyTokenizer) -> AnyTokenizer:
    """Get a wrapped tokenizer with cached properties."""
    cached_tokenizer = copy.deepcopy(tokenizer)
```
Why can't we modify in-place as was done before?
I need the original tokenizer to be passed to `__reduce__`. Otherwise there may be infinite recursion.
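To make the recursion point concrete, here is a minimal sketch under stated assumptions: `ToyTokenizer` and this simplified `get_cached_tokenizer` are illustrative, not the actual vLLM code. The key idea is that `__reduce__` must reference the original, unwrapped tokenizer, which is why the helper works on a deepcopy rather than modifying the tokenizer in place.

```python
import copy
import pickle


class ToyTokenizer:
    """Hypothetical minimal stand-in for an HF tokenizer."""

    @property
    def all_special_tokens(self) -> list[str]:
        return ["<s>", "</s>"]


def get_cached_tokenizer(tokenizer):
    """Sketch of a pickle-compatible caching wrapper (simplified)."""
    # Compute the expensive property once, up front.
    cached_tokens = tokenizer.all_special_tokens

    class CachedTokenizer(tokenizer.__class__):
        @property
        def all_special_tokens(self):
            # Cached: avoids recomputing the property on every access.
            return cached_tokens

        def __reduce__(self):
            # On unpickling, re-wrap the *original* tokenizer. If
            # `tokenizer` were itself the cached wrapper (i.e. if we had
            # modified it in place), pickling it would call __reduce__
            # again and recurse forever.
            return (get_cached_tokenizer, (tokenizer,))

    # Deepcopy so the original tokenizer captured above stays unwrapped.
    cached = copy.deepcopy(tokenizer)
    cached.__class__ = CachedTokenizer
    return cached
```

With this shape, `pickle.dumps` on the wrapper serializes the plain underlying tokenizer plus a reference to the wrapping function, so the dynamically created subclass never needs to be importable by pickle.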
@hmellor do you know why those are lists in HF? IMO given that we type hint this attribute as being a list in
I don't, unfortunately. My best guesses would be some combination of:
Thanks @DarkLight1337, looks much better to me!
This is required to support running the multi-modal processor in parallel using multi-processing (which is WIP).

I also changed `all_special_tokens` and similar methods to return a list instead of a set, since that's how it is in the HF-defined classes. Not sure why they were converted into sets in the first place in #2879.
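As a small sketch of what the list-vs-set change means for callers (the `ToyTokenizer` below is hypothetical; only the list return type mirrors the PR), code that needs fast membership checks can still build a set locally:

```python
class ToyTokenizer:
    """Hypothetical stand-in for an HF tokenizer."""

    @property
    def all_special_tokens(self) -> list[str]:
        # Matches the HF-defined classes: a list, not a set.
        return ["<s>", "</s>", "<pad>"]


tokenizer = ToyTokenizer()

# Callers that check membership frequently can build a set once,
# getting O(1) lookups without changing the property's return type.
special_token_set = set(tokenizer.all_special_tokens)
print("<s>" in special_token_set)  # → True
```

This keeps the tokenizer API consistent with Hugging Face while leaving the fast-lookup optimization to the call sites that actually need it.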