### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
Your output of `python collect_env.py` here
```

</details>
### 🐛 Describe the bug
```python
# SPDX-License-Identifier: Apache-2.0
import os

from vllm import LLM, SamplingParams

# os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
# os.environ["VLLM_USE_V1"] = "1"
# os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Sample prompt.
prompt = "The capital of France is"

# Create a sampling params object.
sampling_params = SamplingParams()


def main():
    llm = LLM(
        model="/root/paddlejob/ERNIE-4.5-Turbo/baidu/paddle_internal/ERNIE-4.5-21B-A3B-PT",
        # model="/root/paddlejob/vllm/Qwen3-30B-A3B",
        dtype="bfloat16",
        tensor_parallel_size=1,
        # enforce_eager=True,
        trust_remote_code=True,
    )
    # outputs = llm.generate(prompts, sampling_params)
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{prompt}"},
    ]
    outputs = llm.chat(
        [conversation],
        sampling_params,
        chat_template_kwargs={
            "enable_thinking": False,
            "add_generation_prompt": True,
        },
    )

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```
**1. Current `main` branch**

**Issue 1: garbled output**

With `baidu/ERNIE-4.5-21B-A3B-PT` (`tensor_parallel_size=1` or `tensor_parallel_size=2`):
```text
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:37<00:00, 1.79it/s]
INFO 07-02 21:24:09 [gpu_model_runner.py:2280] Graph capturing finished in 37 secs, took 3.17 GiB
INFO 07-02 21:24:09 [core.py:172] init engine (profile, create kv cache, warmup model) took 54.31 seconds
WARNING 07-02 21:24:10 [tokenizer.py:262] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-02 21:24:10 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1240.92it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00, 9.36it/s, est. speed input: 187.60 toks/s, output: 150.04 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: '))));..。</。.}}}) problemen称 oaserJ.}】。)].)],'
------------------------------------------------------------
```
With `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`, the output is correct:
```text
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1551.15it/s]
Processed prompts: 100%|█████| 1/1 [00:00<00:00, 1.89it/s, est. speed input: 37.91 toks/s, output: 30.33 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'The capital of France is Paris. It is a vibrant and historic city known'
------------------------------------------------------------
```
With `Qwen/Qwen3-30B-A3B` (garbled output only with `tensor_parallel_size=2`; `tensor_parallel_size=1` is normal):
```text
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00, 1.41it/s]
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00, 1.41it/s]
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
INFO 07-02 21:32:31 [core.py:172] init engine (profile, create kv cache, warmup model) took 76.86 seconds
INFO 07-02 21:32:32 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1380.61it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00, 8.00it/s, est. speed input: 240.36 toks/s, output: 128.17 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'Lon")). exem newValue deleting gap每一位peated.Fieldimest� ))\n\n⁅𝔱 Jwtfulness'
------------------------------------------------------------
```
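For reference, this failing Qwen run is the same script as above with only the model and parallelism changed (a sketch):

```python
from vllm import LLM

llm = LLM(
    model="/root/paddlejob/vllm/Qwen3-30B-A3B",
    dtype="bfloat16",
    tensor_parallel_size=2,  # garbled output with CUDA graphs; tp=1 is normal
    trust_remote_code=True,
)
```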
With `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`, the output is correct:
```text
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1392.07it/s]
Processed prompts: 100%|█████| 1/1 [00:01<00:00, 1.07s/it, est. speed input: 28.13 toks/s, output: 14.07 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'The capital of France is **Paris**. 🇫🇷✨'
------------------------------------------------------------
```
**Issue 2: CUDA error: out of memory**

- ERNIE-4.5-300B-A47B-PT + `quantization="fp8"` + `tensor_parallel_size=8`, with `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"` ==> runs successfully
- ERNIE-4.5-300B-A47B-PT + `quantization="fp8"` + `tensor_parallel_size=8` ==> fails with `CUDA error: out of memory`
- Qwen3-235B-A22B-FP8 + `tensor_parallel_size=4`, with `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"` ==> runs successfully
- Qwen3-235B-A22B-FP8 + `tensor_parallel_size=4` ==> fails with `CUDA error: out of memory`
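A minimal sketch of the failing FP8 case; the commented `enforce_eager=True` line is the only difference between the failing and succeeding runs (the Hugging Face model id here stands in for my local checkpoint path):

```python
from vllm import LLM

# Fails with "CUDA error: out of memory" on current main; since enforce_eager
# avoids it, the failure presumably happens around CUDA graph capture.
llm = LLM(
    model="baidu/ERNIE-4.5-300B-A47B-PT",  # stand-in for my local path
    quantization="fp8",
    tensor_parallel_size=8,
    # enforce_eager=True,  # uncommenting this makes the same load succeed
)
```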
**2. Older `main` branch (May 29, commit fd7bb88)**

None of the problems above occur; output is normal both with and without `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`.
### Before submitting a new issue...

- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.