
Can run THUDM/GLM-Z1-32B-0414 with --model-impl transformers, but not when --tensor-parallel-size 8 is added #751

Open
@tunglinwood

Description

System Info

vllm==0.8.4
transformers==4.51.3
torch==2.6.0
cuda==12.4

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

If you run,

vllm serve THUDM/GLM-Z1-32B-0414 --model-impl transformers --tensor-parallel-size 8

You will get,

ERROR 04-16 12:59:58 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-16 12:59:58 [core.py:387]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-16 12:59:58 [core.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-16 12:59:58 [core.py:387]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-16 12:59:58 [core.py:387]     self._initialize_kv_caches(vllm_config)
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 04-16 12:59:58 [core.py:387]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-16 12:59:58 [core.py:387]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-16 12:59:58 [core.py:387]     output = self.collective_rpc("determine_available_memory")
ERROR 04-16 12:59:58 [core.py:387]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387]     raise e
ERROR 04-16 12:59:58 [core.py:387]   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387]     raise RuntimeError(
ERROR 04-16 12:59:58 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'Failed running call_method view(*(FakeTensor(..., device=\'cuda:0\', size=(1, s0, 32), dtype=torch.bfloat16), (1, s0, -1, 128)), **{}):\nshape \'[1, s0, -1, 128]\' is invalid for input of size 32*s0\n\nfrom user code:\n   File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 425, in forward\n    model_output = self.model(input_ids, positions, intermediate_tensors,\n  File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 330, in forward\n    hidden_states = self.model(\n  File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/utils/generic.py", line 965, in wrapper\n    output = func(self, *args, **kwargs)\n  File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 589, in forward\n    layer_outputs = decoder_layer(\n  File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 115, in forward\n    hidden_states, self_attn_weights = self.self_attn(\n  File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 268, in forward\n    key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)\n\nSet TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n    import torch._dynamo\n    torch._dynamo.config.suppress_errors = True\n')
ERROR 04-16 12:59:58 [core.py:387]
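
The failing call is the view() at modeling_glm4.py line 268. The shapes in the trace already show the mismatch: with --tensor-parallel-size 8 the per-rank k_proj output has only 32 features per token, which cannot be reshaped into heads of dim 128. Here is a minimal standalone repro of just that view() call (s0, the dynamic sequence length, is fixed to 10 purely for illustration):

```python
import torch

# Shapes taken from the FakeTensor in the error above.
s0 = 10
k = torch.randn(1, s0, 32, dtype=torch.bfloat16)  # per-rank k_proj output under TP=8

# modeling_glm4.py line 268 effectively does k.view(1, s0, -1, head_dim)
# with head_dim = 128. 32 features per token cannot be split into heads
# of dim 128, so view() raises the same error as in the log.
try:
    k.view(1, s0, -1, 128)
except RuntimeError as e:
    print(e)  # shape '[1, 10, -1, 128]' is invalid for input of size 320
```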

Expected behavior

vllm serve works when running on a single GPU, but fails on multiple GPUs AFTER the safetensors are loaded.

P.S. I tried --tensor-parallel-size 2 and it works, but a size of 4 fails as well. A quick sanity check of why that pattern would appear is sketched below.
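
For what it's worth, the pattern (TP=2 works, TP=4 and TP=8 fail) is consistent with the KV projection not splitting evenly across ranks. From the trace, the per-rank KV width under TP=8 is 32, i.e. 32 × 8 = 256 total = 2 KV heads × head_dim 128. Note the 2-KV-head count is inferred from those shapes, not read from the model config, so treat this as a hypothesis:

```python
# Hypothetical divisibility check, assuming head_dim = 128 and
# num_key_value_heads = 2 (inferred from the trace, not confirmed
# against the GLM-Z1-32B-0414 config).
head_dim = 128
num_kv_heads = 2

for tp in (1, 2, 4, 8):
    per_rank = num_kv_heads * head_dim // tp  # per-rank k_proj width
    ok = per_rank >= head_dim and per_rank % head_dim == 0
    print(f"tp={tp}: per-rank KV width={per_rank} -> "
          f"{'ok' if ok else 'view() fails'}")
# tp=1 and tp=2 leave whole heads per rank; tp=4 (64) and tp=8 (32)
# leave fractional heads, matching the observed failures.
```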
