
[Bug]: ERNIE-4.5-21B and Qwen3-30B produce garbled text when CUDA graphs are enabled (main branch) #20376

Open
@CSWYF3634076

Description


Your current environment

The output of python collect_env.py
(output of `python collect_env.py` not provided)

🐛 Describe the bug

# SPDX-License-Identifier: Apache-2.0

import os  # needed if the environment overrides below are uncommented

# These must be set before vLLM is imported to take effect.
# os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
# os.environ["VLLM_USE_V1"] = "1"
# os.environ["TORCHDYNAMO_DISABLE"] = "1"

from vllm import LLM, SamplingParams

# Sample prompt.
prompt = "The capital of France is"

# Create a sampling params object with default settings.
sampling_params = SamplingParams()


def main():
    llm = LLM(
        model="/root/paddlejob/ERNIE-4.5-Turbo/baidu/paddle_internal/ERNIE-4.5-21B-A3B-PT",
        # model="/root/paddlejob/vllm/Qwen3-30B-A3B",
        dtype="bfloat16",
        tensor_parallel_size=1,
        # enforce_eager=True,  # uncommenting this avoids the bug
        trust_remote_code=True,
    )

    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    outputs = llm.chat(
        [conversation],
        sampling_params,
        chat_template_kwargs={
            "enable_thinking": False,
            "add_generation_prompt": True,
        },
    )

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

1. Main branch

Issue 1: garbled text

baidu/ERNIE-4.5-21B-A3B-PT (garbled with tensor_parallel_size=1 or tensor_parallel_size=2)

Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:37<00:00,  1.79it/s]
INFO 07-02 21:24:09 [gpu_model_runner.py:2280] Graph capturing finished in 37 secs, took 3.17 GiB
INFO 07-02 21:24:09 [core.py:172] init engine (profile, create kv cache, warmup model) took 54.31 seconds
WARNING 07-02 21:24:10 [tokenizer.py:262] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-02 21:24:10 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1240.92it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00,  9.36it/s, est. speed input: 187.60 toks/s, output: 150.04 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    '))));..。</。.}}}) problemen称 oaserJ.}】。)].)],'
------------------------------------------------------------

With enforce_eager=True or os.environ["TORCHDYNAMO_DISABLE"] = "1", the output is correct:

Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1551.15it/s]
Processed prompts: 100%|█████| 1/1 [00:00<00:00,  1.89it/s, est. speed input: 37.91 toks/s, output: 30.33 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    'The capital of France is Paris. It is a vibrant and historic city known'
------------------------------------------------------------

Qwen/Qwen3-30B-A3B (garbled only with tensor_parallel_size=2; tensor_parallel_size=1 is normal)

Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00,  1.41it/s]
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00,  1.41it/s]
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
INFO 07-02 21:32:31 [core.py:172] init engine (profile, create kv cache, warmup model) took 76.86 seconds
INFO 07-02 21:32:32 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1380.61it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00,  8.00it/s, est. speed input: 240.36 toks/s, output: 128.17 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    'Lon")). exem newValue deleting gap每一位peated.Fieldimest� ))\n\n⁅𝔱 Jwtfulness'
------------------------------------------------------------

With enforce_eager=True or os.environ["TORCHDYNAMO_DISABLE"] = "1", the output is correct:

Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1392.07it/s]
Processed prompts: 100%|█████| 1/1 [00:01<00:00,  1.07s/it, est. speed input: 28.13 toks/s, output: 14.07 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    'The capital of France is **Paris**. 🇫🇷✨'
------------------------------------------------------------
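As a temporary workaround while this is investigated, the two mitigations used above can be applied as in the minimal sketch below (only one of the two is needed; it assumes TORCHDYNAMO_DISABLE must be exported before vLLM is imported):

import os

# Workaround A: disable torch.dynamo before importing vLLM, so the
# compiled / CUDA graph path is never taken.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

from vllm import LLM

# Workaround B: pass enforce_eager=True so vLLM skips CUDA graph capture.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",  # or the local ERNIE checkpoint above
    dtype="bfloat16",
    tensor_parallel_size=2,
    enforce_eager=True,
)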

Issue 2: CUDA error: out of memory

ERNIE-4.5-300B-A47B-PT + quantization="fp8" + tensor_parallel_size=8 + (enforce_eager=True or os.environ["TORCHDYNAMO_DISABLE"] = "1") ==> runs successfully
ERNIE-4.5-300B-A47B-PT + quantization="fp8" + tensor_parallel_size=8 ==> fails with CUDA error: out of memory

Qwen3-235B-A22B-FP8 + tensor_parallel_size=4 + (enforce_eager=True or os.environ["TORCHDYNAMO_DISABLE"] = "1") ==> runs successfully
Qwen3-235B-A22B-FP8 + tensor_parallel_size=4 ==> fails with CUDA error: out of memory
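If eager mode is too slow, a possible middle ground is to keep CUDA graphs but shrink their memory footprint. The sketch below is untested against this regression and assumes the current main branch still accepts compilation_config with a cudagraph_capture_sizes field:

from vllm import LLM

# Sketch only: capture fewer CUDA graph batch sizes and leave more
# memory headroom, reducing what is reserved at graph-capture time
# (field names are assumptions about the current main-branch API).
llm = LLM(
    model="Qwen/Qwen3-235B-A22B-FP8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8]},
)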

2. Older main branch code (May 29 commit fd7bb88)

None of the above problems occur: output is normal regardless of whether enforce_eager=True or os.environ["TORCHDYNAMO_DISABLE"] = "1" is set.

