
SpeechLM2: OOMptimizer #15578

@AudranBert

Description


Describe the bug

Hi @pzelasko,
following your comment in #15575, I ran the estimate_tokens_bins script and it gave me:

bucket_duration_bins=[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]

So I ran the oomptimizer and it gave me these results:

The 1st stage profile is:
Bucket=83 (input=83 output=83) => max_batch_size=241
Bucket=100 (input=100 output=100) => max_batch_size=203
Bucket=114 (input=114 output=114) => max_batch_size=171
Bucket=128 (input=128 output=128) => max_batch_size=152
Bucket=147 (input=147 output=147) => max_batch_size=135
Bucket=164 (input=164 output=164) => max_batch_size=120
Bucket=196 (input=196 output=196) => max_batch_size=101
Bucket=222 (input=222 output=222) => max_batch_size=90
Bucket=244 (input=244 output=244) => max_batch_size=80
Bucket=267 (input=267 output=267) => max_batch_size=74
Bucket=290 (input=290 output=290) => max_batch_size=68
Bucket=312 (input=312 output=312) => max_batch_size=62
Bucket=334 (input=334 output=334) => max_batch_size=58
Bucket=369 (input=369 output=369) => max_batch_size=54
Bucket=410 (input=410 output=410) => max_batch_size=48
Bucket=417 (input=417 output=417) => max_batch_size=48
Bucket=422 (input=422 output=422) => max_batch_size=46
Bucket=427 (input=427 output=427) => max_batch_size=46
Bucket=432 (input=432 output=432) => max_batch_size=46
Bucket=439 (input=439 output=439) => max_batch_size=44
Bucket=449 (input=449 output=449) => max_batch_size=44
Bucket=476 (input=476 output=476) => max_batch_size=42
Bucket=526 (input=526 output=526) => max_batch_size=38
Bucket=583 (input=583 output=583) => max_batch_size=34
Bucket=644 (input=644 output=644) => max_batch_size=31
Bucket=668 (input=668 output=668) => max_batch_size=30
Bucket=688 (input=688 output=688) => max_batch_size=29
Bucket=709 (input=709 output=709) => max_batch_size=28
Bucket=736 (input=736 output=736) => max_batch_size=27
Bucket merging stage...
Merging bucket 15 with bucket 14 due to identical batch sizes.
Merging bucket 17 with bucket 16 due to identical batch sizes.
Merging bucket 18 with bucket 17 due to identical batch sizes.
Merging bucket 20 with bucket 19 due to identical batch sizes.
The profile was created with the following settings:
* using 90.0% of available GPU RAM.
* simulating DDP memory overhead.
* using AMP with dtype=torch.bfloat16.
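For readers unfamiliar with the "Bucket merging stage" lines above: the log suggests that adjacent buckets which ended up with identical max batch sizes get collapsed into one. A minimal sketch of that idea (my own illustration, not the NeMo implementation):

```python
# Hypothetical sketch of the bucket-merging step the OOMptimizer log
# describes: adjacent buckets with the same max_batch_size are collapsed
# into one, keeping the larger duration bound.
def merge_buckets(profile):
    """profile: list of (bucket_upper_bound, max_batch_size) tuples,
    sorted by bucket_upper_bound."""
    merged = [profile[0]]
    for bound, bs in profile[1:]:
        if bs == merged[-1][1]:
            # Same batch size as the previous bucket: extend its bound.
            merged[-1] = (bound, bs)
        else:
            merged.append((bound, bs))
    return merged

# Buckets 410/417, 422/427/432, and 439/449 from the profile above share
# batch sizes, matching the "Merging bucket N with bucket N-1" lines.
profile = [(410, 48), (417, 48), (422, 46), (427, 46), (432, 46), (439, 44), (449, 44)]
print(merge_buckets(profile))  # → [(417, 48), (432, 46), (449, 44)]
```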

The batch size of 27 for the last bucket seems enormous to me, since we have been using batch_tokens: 4096 for a while, and looking at TensorBoard the batch size never reached 20 (see picture):

[screenshot: TensorBoard batch size over training steps]

And yet it crashed with an OOM, so I'm pretty sure 27 is too big.

[rank3]:     return torch._C._nn.cross_entropy_loss(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.20 GiB. GPU 3 has a total capacity of 79.18 GiB of which 2.41 GiB is free. Including non-PyTorch memory, this process has 76.76 GiB memory in use. Of the allocated memory 74.94 GiB is allocated by PyTorch, and 205.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
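A rough sanity check of why 27 looks too big, assuming batch_tokens caps the total tokens per batch (my reading of the setting, not confirmed from the NeMo source): at 4096 tokens per batch, items from the longest bucket (~736 tokens each) would fit only a handful per batch.

```python
# Implied per-bucket batch size under batch_tokens=4096, assuming the
# cap is total tokens per batch divided by the bucket's token length.
batch_tokens = 4096
for bucket in [83, 128, 410, 736]:
    print(bucket, batch_tokens // bucket)
# For the 736-token bucket this gives 4096 // 736 = 5, far below the
# max_batch_size=27 the OOMptimizer produced.
```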

Steps/Code to reproduce bug

torchrun ${DISTRIBUTED_ARGS} \
    scripts/speechlm2/oomptimizer.py \
    --module-name nemo.collections.speechlm2.SALM \
    --config-path config.yaml \
    --buckets "[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]" \
    --dtype bfloat16 \
    --start-batch-size 16

Expected behavior

Environment details

NeMo 2.8.0rc0

Additional context

H100
