
SpeechLM2: OOMptimizer #15578

@AudranBert

Description


Describe the bug

Hi @pzelasko,
following your comment in #15575, I ran the estimate_tokens_bins script and it gave me:

bucket_duration_bins=[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]

So I ran the oomptimizer and it gave me these results:

The 1st stage profile is:
Bucket=83 (input=83 output=83) => max_batch_size=241
Bucket=100 (input=100 output=100) => max_batch_size=203
Bucket=114 (input=114 output=114) => max_batch_size=171
Bucket=128 (input=128 output=128) => max_batch_size=152
Bucket=147 (input=147 output=147) => max_batch_size=135
Bucket=164 (input=164 output=164) => max_batch_size=120
Bucket=196 (input=196 output=196) => max_batch_size=101
Bucket=222 (input=222 output=222) => max_batch_size=90
Bucket=244 (input=244 output=244) => max_batch_size=80
Bucket=267 (input=267 output=267) => max_batch_size=74
Bucket=290 (input=290 output=290) => max_batch_size=68
Bucket=312 (input=312 output=312) => max_batch_size=62
Bucket=334 (input=334 output=334) => max_batch_size=58
Bucket=369 (input=369 output=369) => max_batch_size=54
Bucket=410 (input=410 output=410) => max_batch_size=48
Bucket=417 (input=417 output=417) => max_batch_size=48
Bucket=422 (input=422 output=422) => max_batch_size=46
Bucket=427 (input=427 output=427) => max_batch_size=46
Bucket=432 (input=432 output=432) => max_batch_size=46
Bucket=439 (input=439 output=439) => max_batch_size=44
Bucket=449 (input=449 output=449) => max_batch_size=44
Bucket=476 (input=476 output=476) => max_batch_size=42
Bucket=526 (input=526 output=526) => max_batch_size=38
Bucket=583 (input=583 output=583) => max_batch_size=34
Bucket=644 (input=644 output=644) => max_batch_size=31
Bucket=668 (input=668 output=668) => max_batch_size=30
Bucket=688 (input=688 output=688) => max_batch_size=29
Bucket=709 (input=709 output=709) => max_batch_size=28
Bucket=736 (input=736 output=736) => max_batch_size=27
Bucket merging stage...
Merging bucket 15 with bucket 14 due to identical batch sizes.
Merging bucket 17 with bucket 16 due to identical batch sizes.
Merging bucket 18 with bucket 17 due to identical batch sizes.
Merging bucket 20 with bucket 19 due to identical batch sizes.
The profile was created with the following settings:
* using 90.0% of available GPU RAM.
* simulating DDP memory overhead.
* using AMP with dtype=torch.bfloat16.
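For readers unfamiliar with the "Bucket merging stage" lines above: the log suggests that adjacent buckets which ended up with identical max batch sizes get collapsed into one. A minimal sketch of that idea (my own illustration, not the NeMo implementation):

```python
# Hypothetical sketch of the bucket-merging step the OOMptimizer log
# describes: adjacent buckets with the same max_batch_size are collapsed
# into one, keeping the larger duration bound.
def merge_buckets(profile):
    """profile: list of (bucket_upper_bound, max_batch_size) tuples,
    sorted by bucket_upper_bound."""
    merged = [profile[0]]
    for bound, bs in profile[1:]:
        if bs == merged[-1][1]:
            # Same batch size as the previous bucket: extend its bound.
            merged[-1] = (bound, bs)
        else:
            merged.append((bound, bs))
    return merged

# Buckets 410/417, 422/427/432, and 439/449 from the profile above share
# batch sizes, matching the "Merging bucket N with bucket N-1" lines.
profile = [(410, 48), (417, 48), (422, 46), (427, 46), (432, 46), (439, 44), (449, 44)]
print(merge_buckets(profile))  # → [(417, 48), (432, 46), (449, 44)]
```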

The batch size of 27 for the last bucket seems enormous to me, since we have been using batch_tokens: 4096 for a while, and looking at TensorBoard the batch size never reached 20 (see picture):

[screenshot: TensorBoard batch size over training steps]

And yet it crashed with an OOM, so I'm pretty sure 27 is too big.

[rank3]:     return torch._C._nn.cross_entropy_loss(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.20 GiB. GPU 3 has a total capacity of 79.18 GiB of which 2.41 GiB is free. Including non-PyTorch memory, this process has 76.76 GiB memory in use. Of the allocated memory 74.94 GiB is allocated by PyTorch, and 205.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
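A rough sanity check of why 27 looks too big, assuming batch_tokens caps the total tokens per batch (my reading of the setting, not confirmed from the NeMo source): at 4096 tokens per batch, items from the longest bucket (~736 tokens each) would fit only a handful per batch.

```python
# Implied per-bucket batch size under batch_tokens=4096, assuming the
# cap is total tokens per batch divided by the bucket's token length.
batch_tokens = 4096
for bucket in [83, 128, 410, 736]:
    print(bucket, batch_tokens // bucket)
# For the 736-token bucket this gives 4096 // 736 = 5, far below the
# max_batch_size=27 the OOMptimizer produced.
```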

Steps/Code to reproduce bug

torchrun ${DISTRIBUTED_ARGS} \
    scripts/speechlm2/oomptimizer.py \
    --module-name nemo.collections.speechlm2.SALM \
    --config-path config.yaml \
    --buckets "[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]" \
    --dtype bfloat16 \
    --start-batch-size 16

Expected behavior

Environment details

NeMo 2.8.0rc0

Additional context

H100
