[BugFix][Frontend] Fix LLM.chat() tokenization #16081
Conversation
Chat templates generally include special tokens, and so `add_special_tokens=False` should be used when tokenizing the template outputs. In particular, this avoids the Llama double-BOS token issue. Also, LoRA adapter-specific tokenizers weren't being used by `LLM.chat()`.

Signed-off-by: Nick Hill <[email protected]>
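For context, a minimal sketch of the double-BOS behaviour using a Hugging Face tokenizer directly (the model name is only an illustrative placeholder, not taken from this PR):

```python
# Minimal sketch of the double-BOS issue, assuming a Llama-style tokenizer
# whose chat template already emits a BOS token. Model name is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]

# Render the chat template to text; the template itself inserts the BOS token.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Default tokenization prepends another BOS on top of the one in the template.
with_specials = tokenizer(prompt).input_ids
# add_special_tokens=False keeps only the BOS that the template produced.
without_specials = tokenizer(prompt, add_special_tokens=False).input_ids

print(with_specials[:2])     # starts with two BOS ids
print(without_specials[:2])  # starts with a single BOS id
```

This is why `LLM.chat()` should tokenize the rendered template with `add_special_tokens=False`, matching what the OpenAI-compatible chat completions endpoint already does.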
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
@DarkLight1337 I guess this as-is bypasses one of the length checks, but it is at least more consistent with the chat completions API behaviour.
As with the other PRs, can you add tests to avoid regressions in the future?
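As a rough illustration of the kind of regression test being asked for (the model name and assertion details are assumptions, not the test that was actually added), something along these lines would catch the double-BOS regression:

```python
# Hypothetical regression-test sketch: the prompt produced by LLM.chat()
# should contain exactly one BOS token. Model name is a placeholder.
from vllm import LLM


def test_chat_single_bos():
    llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
    tokenizer = llm.get_tokenizer()

    outputs = llm.chat([{"role": "user", "content": "Hello!"}])
    prompt_ids = outputs[0].prompt_token_ids

    # The chat template already supplies BOS, so it should appear exactly once.
    assert prompt_ids.count(tokenizer.bos_token_id) == 1
```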
Could we get this landed since several users have run into this issue?
This pull request has merge conflicts that must be resolved before it can be merged.
…chat-bos
# Conflicts:
#	vllm/entrypoints/llm.py
Chat templates generally include special tokens, and so `add_special_tokens=False` should be used when tokenizing the template outputs. This is already the default behaviour of the OpenAI chat completions endpoint. In particular, this avoids the Llama double-BOS token issue.
Also, LoRA adapter-specific tokenizers weren't being used by `LLM.chat()`.

Fixes #16028
Fixes #16853
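For reference, a hedged usage sketch of the LoRA path this touches (model name, adapter name, and path are placeholders): with the fix, `LLM.chat()` should tokenize the rendered template with the adapter-specific tokenizer when a `LoRARequest` is supplied.

```python
# Sketch only: placeholders for the model and adapter. With this change,
# LLM.chat() should use the LoRA adapter's tokenizer for the request below.
from vllm import LLM
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

outputs = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```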
Also just found related issues: