System Info
vllm==0.8.4
transformers==4.51.3
torch==2.6.0
cuda==12.4
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Reproduction
If you run:
vllm serve THUDM/GLM-Z1-32B-0414 --model-impl transformers --tensor-parallel-size 8
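Equivalently, via the offline Python API (a sketch; I'm assuming LLM forwards model_impl to the engine arguments the same way the --model-impl CLI flag does):

from vllm import LLM

# Same configuration as the serve command above. With tensor_parallel_size=8
# the engine crashes during KV-cache memory profiling, before any request is made.
llm = LLM(
    model="THUDM/GLM-Z1-32B-0414",
    model_impl="transformers",
    tensor_parallel_size=8,
)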
Either way, you will get:
ERROR 04-16 12:59:58 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-16 12:59:58 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-16 12:59:58 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-16 12:59:58 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 04-16 12:59:58 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-16 12:59:58 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 04-16 12:59:58 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387] raise e
ERROR 04-16 12:59:58 [core.py:387] File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-16 12:59:58 [core.py:387] raise RuntimeError(
ERROR 04-16 12:59:58 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'Failed running call_method view(*(FakeTensor(..., device=\'cuda:0\', size=(1, s0, 32), dtype=torch.bfloat16), (1, s0, -1, 128)), **{}):\nshape \'[1, s0, -1, 128]\' is invalid for input of size 32*s0\n\nfrom user code:\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 425, in forward\n model_output = self.model(input_ids, positions, intermediate_tensors,\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/transformers.py", line 330, in forward\n hidden_states = self.model(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/utils/generic.py", line 965, in wrapper\n output = func(self, *args, **kwargs)\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 589, in forward\n layer_outputs = decoder_layer(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 115, in forward\n hidden_states, self_attn_weights = self.self_attn(\n File "/root/.miniconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/glm4/modeling_glm4.py", line 268, in forward\n key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)\n\nSet TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n import torch._dynamo\n torch._dynamo.config.suppress_errors = True\n')
ERROR 04-16 12:59:58 [core.py:387]
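The failing call is the view in the GLM4 attention forward: the per-rank k_proj output has only 32 features per token, which cannot be reshaped into heads of width 128. A minimal standalone reproduction of just that reshape (the numbers are taken from the FakeTensor in the traceback; s0 is an arbitrary sequence length):

import torch

s0 = 7  # any sequence length; symbolic s0 in the trace
k = torch.empty(1, s0, 32, dtype=torch.bfloat16)  # per-rank k_proj output at tp=8
k.view(1, s0, -1, 128)  # RuntimeError: shape '[1, 7, -1, 128]' is invalid for input of size 224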
Expected behavior
vllm serve works on a single GPU but fails on multiple GPUs AFTER the safetensors are loaded.

P.S. I also tried --tensor-parallel-size 2, which works, but it fails with a size of 4.
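For what it's worth, the works-with-2 / fails-with-4-or-8 pattern is consistent with the KV projection being sharded below a single head. Assuming head_dim=128 (from the traceback) and num_key_value_heads=2 (inferred from the per-rank width of 32 at tp=8; worth confirming against the model's config.json), the per-rank KV width stays a multiple of head_dim only for tp <= 2:

head_dim = 128
num_key_value_heads = 2  # inferred from 8 * 32 / 128; an assumption, not checked

for tp in (1, 2, 4, 8):
    per_rank_kv_width = num_key_value_heads * head_dim // tp
    ok = per_rank_kv_width % head_dim == 0
    print(f"tp={tp}: per-rank KV width = {per_rank_kv_width} -> view(..., -1, 128) {'ok' if ok else 'fails'}")
# tp=1 and tp=2 are fine; tp=4 (width 64) and tp=8 (width 32) cannot form a 128-wide head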