[Bugfix] Fix GLM4 model #16618
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀 |
The model works with |
Signed-off-by: intervitens <[email protected]>
Signed-off-by: intervitens <[email protected]>
cc @zRzRzRzRzRzRzR can you check? |
yes, I will do this |
This section should be retained. See here |
There seem to be some issues; I need to take a closer look. I found that the model cannot run normally now, although it could when I submitted my PR. I need to spend some time checking it out. |
This PR caused the model output to be garbled. @intervitens, have you encountered this problem? I am using GLM-4-9B-0414. |
The 9B and 32B are not architecturally identical; the 9B seems to have attention biases, unlike the 32B. |
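For reference, one quick way to compare the two checkpoints is to look at their HF configs; the repo ids and the attention_bias field below are my assumptions from the 0414 release, not something confirmed in this thread:

```python
# Hypothetical check (not from this thread): compare the attention_bias field of the two HF configs.
# Repo ids are assumed from the 0414 release; adjust to the checkpoints you actually use.
from transformers import AutoConfig

cfg_9b = AutoConfig.from_pretrained("THUDM/GLM-4-9B-0414")
cfg_32b = AutoConfig.from_pretrained("THUDM/GLM-4-32B-0414")

# Per the discussion above, the 9B is expected to report True and the 32B False.
print("9B attention_bias:", getattr(cfg_9b, "attention_bias", None))
print("32B attention_bias:", getattr(cfg_32b, "attention_bias", None))
```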
add |
I don’t think that’s the issue. The 9B and 32B models released by GLM do differ here: the 9B has attention bias while the 32B doesn’t. However, attention_bias is already configured for both. |
I wasn’t successful. The error you mentioned does indeed exist.
|is correct, but strangely, I got completely different outputs under the same model compared to the PR I submitted back then. |
I tried reinstalling vLLM from source, and the issue was resolved. Under the current circumstances, your PR works correctly.
is not necessary, and can you change
As we renamed the model, there is no more |
cc @DarkLight1337 @intervitens Thank you so much for your support again. Also:
without |
Signed-off-by: intervitens <[email protected]>
Removing
causes the model output to become significantly degraded and repetitive
still doesn't work for me, @zRzRzRzRzRzRzR did you figure out any changes to the PR that fixed it for you? |
In my scenario, both |
This might be related to the CUDA version. I tested it on H100 with CUDA 12.4, and I'm not sure if it's related to this. |
Has this issue been resolved? |
Signed-off-by: intervitens <[email protected]>
I fixed the error that made the model output garbage without eager mode or |
Can you verify again @zRzRzRzRzRzRzR ? |
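If it helps with re-verification, here is a minimal sketch of running a prompt with CUDA graphs disabled via enforce_eager; the model id is an assumption, and flipping the flag exercises both code paths:

```python
# Sketch for re-verification; model id is assumed, substitute the checkpoint under test.
# enforce_eager=True disables CUDA graphs; set it to False to exercise the graph path
# that the earlier garbage-output report referred to.
from vllm import LLM, SamplingParams

llm = LLM(model="THUDM/GLM-4-9B-0414", enforce_eager=True)
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello, introduce yourself briefly."], params)[0].outputs[0].text)
```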
GLM-Z1-9B-0414 is OK, but GLM-Z1-32B-0414 repeats with !!!!! |
That is probably expected behavior, since the T4 does not support it.
You can submit the prompts corresponding to the infinite "!!!!!" output issue to the THUDM/GLM-4 repository, and the staff will record them and try to reproduce and find the cause of the problem. |
It is working for me |
Thanks for fixing!
Tested locally, works as expected. +1 |
Signed-off-by: intervitens <[email protected]>
OK, thanks. I'll try; at least GLM-Z1-9B-0414 is correct now. |
Also, on a V100, if dtype float16 is configured, everything the model outputs is !!!!!!!!! |
Yes, I also tried setting dtype float16 on an A6000 and it only outputs !!!!!!!!! |
It's weird; my colleague said it also appeared with bf16, and only with GLM-Z1-32B-0414. |
Signed-off-by: intervitens <[email protected]> Signed-off-by: Yang Wang <[email protected]>
It seems there is still a problem. I am using multiple large models for standalone generation, including Qwen, Mistral-Large, Llama 4, Command-A, and Gemma 3 27B. All of the above models run normally, except for GLM4-32B.
One suspicious point is that I set the hyperparameter to enable the model to handle a longer context:
import os
from torch.cuda import device_count  # assumed import; the original snippet did not show where device_count comes from
from vllm import LLM
os.environ['VLLM_ALLOW_LONG_MAX_MODEL_LEN'] = '1'
llm = LLM(model=model_path, tensor_parallel_size=device_count(), enable_prefix_caching=True, task='generate', max_model_len=50000, dtype='bfloat16') |
@yangw-dev @DarkLight1337 the issue with !!!! output is happening to all models I am training with dtype=torch.float16. I usually train with Llama 3.1 8B. Can you please look at this problem holistically? I don't believe it is model-specific. |
If the model was originally trained in bfloat16, then there may be numerical stability issues when using float16 for inference, due to the narrower range float16 supports. |
@DarkLight1337 that could be true, thanks. Please consider adding better error outputs than !!! This would be super helpful |
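For anyone hitting the repeated "!" output, a small sketch of checking the checkpoint's native dtype and pinning vLLM to it; the model id is an assumption, and bfloat16 needs an Ampere-or-newer GPU:

```python
# Sketch: read the checkpoint's native dtype and pass it to vLLM explicitly,
# instead of letting float16 be used on a checkpoint trained in bfloat16.
# Model id is an assumption; use the checkpoint you are actually serving.
from transformers import AutoConfig
from vllm import LLM

cfg = AutoConfig.from_pretrained("THUDM/GLM-Z1-32B-0414")
print("native dtype:", cfg.torch_dtype)  # expected to be bfloat16 for these checkpoints

# On pre-Ampere GPUs (V100/T4) bfloat16 is unsupported; float32 is the safer
# fallback there, at the cost of memory and speed.
llm = LLM(model="THUDM/GLM-Z1-32B-0414", dtype="bfloat16")
```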
Signed-off-by: intervitens <[email protected]>
Signed-off-by: intervitens <[email protected]>
Signed-off-by: intervitens <[email protected]> Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: intervitens <[email protected]> Signed-off-by: Mu Huai <[email protected]>
FIX #16617
FIX #16655
FIX #16687
FIX #16740
Currently, the GLM4 model does not work and fails to load at all.
This PR enables the model to load and makes its outputs mostly identical to those from HF transformers.
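For reference, a rough way to spot-check that claim is to compare greedy decoding in vLLM and in HF transformers. The model id and prompt below are assumptions, outputs are expected to be close rather than byte-identical, and running the two halves in separate processes avoids holding both model copies in GPU memory:

```python
# Sketch: greedy-decode the same prompt with vLLM and with HF transformers and compare.
# Model id is an assumption; the 9B checkpoint is used so a single GPU is plausible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "THUDM/GLM-4-9B-0414"
prompt = "Explain the difference between a list and a tuple in Python."

# vLLM side
vllm_text = LLM(model=model_id).generate(
    [prompt], SamplingParams(temperature=0.0, max_tokens=64)
)[0].outputs[0].text

# HF transformers side
tok = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(hf_model.device)
hf_text = tok.decode(
    hf_model.generate(**inputs, max_new_tokens=64, do_sample=False)[0],
    skip_special_tokens=True,
)

print("vLLM:", vllm_text)
print("HF  :", hf_text)
```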