[Bug]: DSR1 with DEP OOM during initialization on 32xH100 #20441

Open
@ptarasiewiczNV

Your current environment

4 DGX nodes with 8xH100 each (32 GPUs total). vLLM is built from current main.

🐛 Describe the bug

I am trying to run DeepSeek-R1 (DSR1) with DEP32 using the DeepEP high-throughput (HT) kernels. My launch command on node 2 is:

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" \
  VLLM_USE_DEEP_GEMM=1 \
  VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
  vllm serve deepseek-ai/DeepSeek-R1 \
  --data_parallel_size 32 \
  --data-parallel-size-local 8 \
  --enable-expert-parallel \
  --max-model-len 10240 \
  --enforce-eager \
  --data-parallel-address eos0391 \
  --data-parallel-rpc-port 13345 \
  --data-parallel-start-rank 8 \
  --headless \
  | tee ./dsr1_dep32_node2.log
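
For reference, the `--data-parallel-start-rank` values implied by this layout can be sketched as below. This is only a sketch under assumptions: nodes are numbered from 1, each node hosts `--data-parallel-size-local` consecutive ranks, and node 1 starts at rank 0. The helper name `dp_start_rank` is hypothetical and not part of vLLM.

```python
# Sketch: derive the per-node start rank from the values in the command above.
# Assumption: nodes are 1-indexed and host consecutive blocks of DP ranks.

DP_SIZE = 32        # --data_parallel_size
DP_SIZE_LOCAL = 8   # --data-parallel-size-local


def dp_start_rank(node_index: int, dp_size_local: int) -> int:
    """Return the first global DP rank hosted on a 1-indexed node (hypothetical helper)."""
    if node_index < 1:
        raise ValueError("node_index is 1-based")
    return (node_index - 1) * dp_size_local


num_nodes = DP_SIZE // DP_SIZE_LOCAL
start_ranks = {
    node: dp_start_rank(node, DP_SIZE_LOCAL) for node in range(1, num_nodes + 1)
}

for node, rank in start_ranks.items():
    print(f"node {node}: --data-parallel-start-rank {rank}")
```

Under these assumptions, node 2 gets start rank 8, which matches the command shown above.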

The server starts with `--max-model-len 128`, but with `--max-model-len 10240` it runs out of memory (OOM) during initialization on 32 H100s, which doesn't seem right. The low-latency kernels work fine. Is this expected?

I saw that #19298 was addressing this issue, but I still get OOM errors. cc @varun-sundar-rabindranath

I have attached the logs from all four nodes. The OOM error appears in the node 4 log.

dsr1_dep32_node4.log
dsr1_dep32_node3.log
dsr1_dep32_node2.log
dsr1_dep32_node1.log


Labels: bug