### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
Your output of `python collect_env.py` here
```

</details>
### 🐛 Describe the bug
```python
# SPDX-License-Identifier: Apache-2.0
import os

from vllm import LLM, SamplingParams

# os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
# os.environ["VLLM_USE_V1"] = "1"
# os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Sample prompt.
prompt = "The capital of France is"

# Create a sampling params object.
sampling_params = SamplingParams()


def main():
    llm = LLM(
        model="/root/paddlejob/ERNIE-4.5-Turbo/baidu/paddle_internal/ERNIE-4.5-21B-A3B-PT",
        # model="/root/paddlejob/vllm/Qwen3-30B-A3B",
        dtype="bfloat16",
        tensor_parallel_size=1,
        # enforce_eager=True,
        trust_remote_code=True,
    )
    # outputs = llm.generate(prompts, sampling_params)
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{prompt}"},
    ]
    outputs = llm.chat(
        [conversation],
        sampling_params,
        chat_template_kwargs={
            "enable_thinking": False,
            "add_generation_prompt": True,
        },
    )

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```
**1. Current `main` branch**

**Issue 1: garbled output**

With `baidu/ERNIE-4.5-21B-A3B-PT` (`tensor_parallel_size=1` or `tensor_parallel_size=2`):
```text
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:37<00:00, 1.79it/s]
INFO 07-02 21:24:09 [gpu_model_runner.py:2280] Graph capturing finished in 37 secs, took 3.17 GiB
INFO 07-02 21:24:09 [core.py:172] init engine (profile, create kv cache, warmup model) took 54.31 seconds
WARNING 07-02 21:24:10 [tokenizer.py:262] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-02 21:24:10 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1240.92it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00, 9.36it/s, est. speed input: 187.60 toks/s, output: 150.04 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: '))));..。</。.}}}) problemen称 oaserJ.}】。)].)],'
------------------------------------------------------------
```
With `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`, the output is correct:
```text
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1551.15it/s]
Processed prompts: 100%|█████| 1/1 [00:00<00:00, 1.89it/s, est. speed input: 37.91 toks/s, output: 30.33 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'The capital of France is Paris. It is a vibrant and historic city known'
------------------------------------------------------------
```
With `Qwen/Qwen3-30B-A3B` (garbled output only with `tensor_parallel_size=2`; `tensor_parallel_size=1` is normal):
```text
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00, 1.41it/s]
Capturing CUDA graphs: 100%|█████████████████████████████████████████████████████| 67/67 [00:47<00:00, 1.41it/s]
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:30 [custom_all_reduce.py:196] Registering 6499 cuda graph addresses
(VllmWorker rank=1 pid=39029) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
(VllmWorker rank=0 pid=39026) INFO 07-02 21:32:31 [gpu_model_runner.py:2280] Graph capturing finished in 48 secs, took 5.55 GiB
INFO 07-02 21:32:31 [core.py:172] init engine (profile, create kv cache, warmup model) took 76.86 seconds
INFO 07-02 21:32:32 [chat_utils.py:421] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1380.61it/s]
Processed prompts: 100%|███| 1/1 [00:00<00:00, 8.00it/s, est. speed input: 240.36 toks/s, output: 128.17 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'Lon")). exem newValue deleting gap每一位peated.Fieldimest� ))\n\n⁅𝔱 Jwtfulness'
------------------------------------------------------------
```
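For reference, this failing Qwen run is the same script as above with only the model and parallelism changed (a sketch):

```python
from vllm import LLM

llm = LLM(
    model="/root/paddlejob/vllm/Qwen3-30B-A3B",
    dtype="bfloat16",
    tensor_parallel_size=2,  # garbled output with CUDA graphs; tp=1 is normal
    trust_remote_code=True,
)
```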
With `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`, the output is correct:
```text
Adding requests: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1392.07it/s]
Processed prompts: 100%|█████| 1/1 [00:01<00:00, 1.07s/it, est. speed input: 28.13 toks/s, output: 14.07 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt: 'The capital of France is'
Output: 'The capital of France is **Paris**. 🇫🇷✨'
------------------------------------------------------------
```
**Issue 2: CUDA error: out of memory**

- ERNIE-4.5-300B-A47B-PT + `quantization="fp8"` + `tensor_parallel_size=8`, with `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"` ==> runs successfully
- ERNIE-4.5-300B-A47B-PT + `quantization="fp8"` + `tensor_parallel_size=8` ==> fails with `CUDA error: out of memory`
- Qwen3-235B-A22B-FP8 + `tensor_parallel_size=4`, with `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"` ==> runs successfully
- Qwen3-235B-A22B-FP8 + `tensor_parallel_size=4` ==> fails with `CUDA error: out of memory`
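A minimal sketch of the failing FP8 case; the commented `enforce_eager=True` line is the only difference between the failing and succeeding runs (the Hugging Face model id here stands in for my local checkpoint path):

```python
from vllm import LLM

# Fails with "CUDA error: out of memory" on current main; since enforce_eager
# avoids it, the failure presumably happens around CUDA graph capture.
llm = LLM(
    model="baidu/ERNIE-4.5-300B-A47B-PT",  # stand-in for my local path
    quantization="fp8",
    tensor_parallel_size=8,
    # enforce_eager=True,  # uncommenting this makes the same load succeed
)
```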
**2. Older `main` branch (May 29, commit fd7bb88)**

None of the problems above occur; output is normal both with and without `enforce_eager=True` or `os.environ["TORCHDYNAMO_DISABLE"] = "1"`.
### Before submitting a new issue...

- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.