SpeechLM2: OOMptimizer #15578
Open
Labels: bug (Something isn't working), community-request, needs-follow-up (Issue needs follow-up)
Description
Describe the bug
Hi @pzelasko,

Following your comment in #15575, I ran the estimate_tokens_bins script and it gave me:

```
bucket_duration_bins=[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]
```

Then I ran the oomptimizer, which gave me these results:
```
The 1st stage profile is:
Bucket=83 (input=83 output=83) => max_batch_size=241
Bucket=100 (input=100 output=100) => max_batch_size=203
Bucket=114 (input=114 output=114) => max_batch_size=171
Bucket=128 (input=128 output=128) => max_batch_size=152
Bucket=147 (input=147 output=147) => max_batch_size=135
Bucket=164 (input=164 output=164) => max_batch_size=120
Bucket=196 (input=196 output=196) => max_batch_size=101
Bucket=222 (input=222 output=222) => max_batch_size=90
Bucket=244 (input=244 output=244) => max_batch_size=80
Bucket=267 (input=267 output=267) => max_batch_size=74
Bucket=290 (input=290 output=290) => max_batch_size=68
Bucket=312 (input=312 output=312) => max_batch_size=62
Bucket=334 (input=334 output=334) => max_batch_size=58
Bucket=369 (input=369 output=369) => max_batch_size=54
Bucket=410 (input=410 output=410) => max_batch_size=48
Bucket=417 (input=417 output=417) => max_batch_size=48
Bucket=422 (input=422 output=422) => max_batch_size=46
Bucket=427 (input=427 output=427) => max_batch_size=46
Bucket=432 (input=432 output=432) => max_batch_size=46
Bucket=439 (input=439 output=439) => max_batch_size=44
Bucket=449 (input=449 output=449) => max_batch_size=44
Bucket=476 (input=476 output=476) => max_batch_size=42
Bucket=526 (input=526 output=526) => max_batch_size=38
Bucket=583 (input=583 output=583) => max_batch_size=34
Bucket=644 (input=644 output=644) => max_batch_size=31
Bucket=668 (input=668 output=668) => max_batch_size=30
Bucket=688 (input=688 output=688) => max_batch_size=29
Bucket=709 (input=709 output=709) => max_batch_size=28
Bucket=736 (input=736 output=736) => max_batch_size=27
Bucket merging stage...
Merging bucket 15 with bucket 14 due to identical batch sizes.
Merging bucket 17 with bucket 16 due to identical batch sizes.
Merging bucket 18 with bucket 17 due to identical batch sizes.
Merging bucket 20 with bucket 19 due to identical batch sizes.
The profile was created with the following settings:
* using 90.0% of available GPU RAM.
* simulating DDP memory overhead.
* using AMP with dtype=torch.bfloat16.
```
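For reference, the merging stage above appears to collapse adjacent buckets whose max batch sizes are identical. A minimal sketch of that logic (my own hypothetical re-implementation for illustration, not the actual NeMo code; `merge_buckets` is an invented name):

```python
def merge_buckets(bins, batch_sizes):
    """Collapse adjacent buckets with identical max batch sizes,
    keeping the larger token bound of each merged pair."""
    merged = [(bins[0], batch_sizes[0])]
    for upper, bs in zip(bins[1:], batch_sizes[1:]):
        if bs == merged[-1][1]:
            # Same batch size as the previous bucket: extend its upper bound.
            merged[-1] = (upper, bs)
        else:
            merged.append((upper, bs))
    return merged

# The 410/417 and 422/427/432 buckets from the profile above collapse,
# matching the "Merging bucket N with bucket N-1" messages.
print(merge_buckets([410, 417, 422, 427, 432], [48, 48, 46, 46, 46]))
# → [(417, 48), (432, 46)]
```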
The batch size of 27 for the last bucket seems enormous to me, since we have been using batch_tokens: 4096 for a while, and looking at TensorBoard the batch size never reached 20 (see picture). And yet it crashed with an OOM, so I'm pretty sure 27 is too big:
```
[rank3]: return torch._C._nn.cross_entropy_loss(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.20 GiB. GPU 3 has a total capacity of 79.18 GiB of which 2.41 GiB is free. Including non-PyTorch memory, this process has 76.76 GiB memory in use. Of the allocated memory 74.94 GiB is allocated by PyTorch, and 205.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
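Back-of-envelope arithmetic makes the mismatch concrete: the last bucket allows up to its token bound times the proposed batch size in tokens per batch, several times the batch_tokens: 4096 setting we had been running with (this ignores padding and the actual memory model, so it is only a rough comparison):

```python
# Last bucket from the profile: up to 736 tokens per example,
# proposed max_batch_size=27 -> worst-case tokens in a single batch.
tokens_per_batch = 736 * 27
print(tokens_per_batch)          # → 19872
print(tokens_per_batch / 4096)   # ratio vs batch_tokens=4096, ≈ 4.85x
```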
Steps/Code to reproduce bug
```shell
torchrun ${DISTRIBUTED_ARGS} \
  scripts/speechlm2/oomptimizer.py \
  --module-name nemo.collections.speechlm2.SALM \
  --config-path config.yaml \
  --buckets "[83,100,114,128,147,164,196,222,244,267,290,312,334,369,410,417,422,427,432,439,449,476,526,583,644,668,688,709,736]" \
  --dtype bfloat16 \
  --start-batch-size 16
```
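As a side note, the OOM message itself suggests trying the expandable-segments allocator option; if fragmentation contributes to the crash, the training run could be retried with it set before launch (just the traceback's own suggestion, not a fix for the oversized batch estimate; availability depends on the PyTorch version):

```shell
# Export the allocator option suggested in the OOM message
# before launching torchrun.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"
```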
Expected behavior
Environment details
NeMo 2.8.0rc0
Additional context
H100