update memory optimization tutorial #1948
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1948
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1e4a198 with merge base 9eced21.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##             main    #1948      +/-   ##
==========================================
- Coverage   68.40%   65.74%   -2.66%
==========================================
  Files         311      311
  Lines       16973    16973
==========================================
- Hits        11610    11159     -451
- Misses       5363     5814     +451
☔ View full report in Codecov by Sentry.
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed due to the overhead of moving tensors between GPU VRAM and CPU. We minimize it by using a different stream, so you may not experience any slow down. This **should** be used alongside activation checkpointing."
":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. However, it is not compatible with optimizer in backward. Use it when training adapters, e.g. LoRA, which don't benefit from optimizer in backward, or when you can already fit a batch with your memory, but not enough of them."
":ref:`glossary_low_precision_opt`", "When you have a large model (optimizer state is 2x model size) and need to further reduce memory. Note that lower precision optimizers may reduce training stability/accuracy."
I feel like we are making these so-called summaries longer rather than shorter, which I do not like. Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc). The individual items further down should be responsible for getting more into the weeds.
Also the offloading blurb is too handwavy. (Again would benefit from being moved down to the more detailed section to give it the proper treatment, as a first-time reader just trying to understand my options I definitely do not care that it's using a separate stream)
> Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc).
I'd be up for this, happy to do it in a follow up.
@SalmanMohammadi, I am gathering data to write a blog and add a table. After I have a draft, I would love to have your input. I should have it prob in a week.
I simplified and removed info.
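As context for the gradient accumulation entry above, here is a minimal sketch in plain PyTorch (not torchtune's recipe code) of what the technique does, and why it conflicts with optimizer-in-backward: the accumulated gradients must persist across micro-batches until the single optimizer step, while optimizer-in-backward steps and frees each gradient inside ``.backward()``.

```python
# Generic gradient accumulation sketch in plain PyTorch (not torchtune's recipe code).
# Gradients from several micro-batches are summed before a single optimizer step,
# simulating a larger batch size. Optimizer-in-backward instead steps and frees each
# gradient inside .backward(), which is why the two cannot be combined.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

for step in range(16):
    micro_batch = torch.randn(8, 1024, device=device)
    loss = model(micro_batch).pow(2).mean()
    (loss / accum_steps).backward()  # scale so the summed gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```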
":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy."
":ref:`glossary_qlora`", "When you are training a large model and quantizing it will save significant memory, at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
I know this was already here but I think we may wanna take it out. Like DoRA will increase memory and reduce training speed compared to LoRA, I think it does not fit in as well with the other techniques mentioned here.
Is this to say we should take DoRA from this whole doc?
Yeah the more I think about it I think we should. Since it's more of a modeling improvement on top of an existing memory saving technique. But lmk if you strongly disagree
In general I like to see features that we offer documented somewhere rather than not documented, so whether we put this here or e.g. move the section to the LoRA recipe docs to let users know they can use it, or as another option under the LoRA section, I'm easy.
I left DoRA as "a variant of LoRA" as a middle ground.
Eh I like this even less. Now it is both (a) there and (b) uninformative. Sorry I am being a PITA. Let's say "a variant of LoRA that can improve model performance at the cost of slightly more memory"?
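To make the LoRA vs. DoRA discussion concrete, here is a didactic LoRA-style adapter in plain PyTorch; it is only an illustration of where the memory savings come from, not torchtune's implementation. Only the small low-rank matrices are trainable, so gradient and optimizer state shrink accordingly; DoRA layers an extra learned magnitude on top, which is why it costs a bit more memory and speed than plain LoRA.

```python
# Didactic LoRA-style linear layer (illustration only, not torchtune's LoRALinear).
# The frozen base weight carries no gradient or optimizer state; only the small
# low-rank A/B matrices are trained, which is where the memory savings come from.
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained weight
        self.lora_a = nn.Linear(in_dim, rank, bias=False)     # trainable
        self.lora_b = nn.Linear(rank, out_dim, bias=False)    # trainable
        nn.init.zeros_(self.lora_b.weight)                    # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = TinyLoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} params")  # ~65K of ~16.8M
```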
* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.
* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size.
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` in any distributed recipe. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details.
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details.
Is :class:`torch.distributed.fsdp.FullyShardedDataParallel` still the best API ref here?
yes
I will add this link comparing FSDP1 and FSDP2: https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
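For reference, here is a sketch of the CPU-offload optimizer these constraints refer to, based on torchao's prototype ``low_bit_optim`` module; since it is a prototype API, the exact names and signature may shift.

```python
# Sketch of the CPU-offload optimizer discussed above, assuming torchao's prototype
# API at the time of this PR. Optimizer state lives in CPU RAM and the optimizer
# step runs on CPU; with offload_gradients=True, gradients are moved off the GPU
# every backward pass, which is why gradient accumulation must stay at 1.
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # wrapped optimizer class; the actual step runs on CPU
    offload_gradients=True,  # incompatible with gradient accumulation > 1
    lr=1e-5,
)

loss = model(torch.randn(2, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```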
When considering using QLoRA to reduce memory usage, it's worth noting that a) QLoRA is slower than LoRA and may not be worth it if
the model you are finetuning is small; b) QLoRA prevents accuracy degradation during quantization by up-casting quantized parameters
I think this inline enumeration will not look great in the live docs. Also you can probably be more explicit about the memory savings, right? Doesn't it save roughly a constant 1.5 bytes * (# of model parameters) compared to LoRA? In general would try to be less handwavy about memory/perf statements in this doc wherever possible
removed the a/b and added 1.5 bytes
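A quick back-of-the-envelope check of the 1.5 bytes/parameter figure, assuming bf16 frozen weights (2 bytes/param) versus NF4 (roughly 0.5 bytes/param, ignoring the small overhead of quantization scales):

```python
# Rough estimate of QLoRA's savings on the frozen base weights relative to LoRA:
# bf16 stores 2 bytes/param, NF4 roughly 0.5 bytes/param -> ~1.5 bytes/param saved.
num_params = 7e9  # e.g. a 7B parameter model
savings_gib = 1.5 * num_params / 2**30
print(f"~{savings_gib:.1f} GiB saved")  # ~9.8 GiB for a 7B model
```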
Looks like doc build job is failing
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter, half of fp32."
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing."
Is this right? You should always activation offload when using activation checkpointing?
The other way around: you always need checkpointing when using offloading. Yes, otherwise it's painfully slow.
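For readers who want the mechanics: activation checkpointing is plain ``torch.utils.checkpoint`` recomputation, and offloading parks the saved activations in CPU RAM instead of recomputing them. A generic sketch of the checkpointing half (illustrative only, not torchtune's recipe flags):

```python
# Generic activation checkpointing sketch with plain PyTorch (illustrative only,
# not torchtune's recipe code). Activations inside the checkpointed block are freed
# after the forward pass and recomputed during backward, trading speed for memory;
# offloading would instead keep the saved tensors in CPU RAM.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

x = torch.randn(8, 4096, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)  # recompute activations in backward
out.sum().backward()
```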
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, performing optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy."
":ref:`glossary_qlora`", "When you are training a large model, since quantization will save 1.5 bytes * (# of model parameters), at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "A variant of LoRA."
Which may use slightly more memory and reduce training speed compared to LoRA? idk
Maybe this is ok to keep in the section with the details?
thanks for your patience here : )
All of our recipes support lower-precision optimizers from the `torchao <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim>`_ library.
For single device recipes, we also support `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_ .

A good place to start might be the :class:`torchao.prototype.low_bit_optim.torchao.AdamW8bit` and :class:`bitsandbytes.optim.PagedAdamW8bit` optimizers.
Not a huge deal, but these :class: annotations don't actually render. If we wanna link out to something, prob better to just use a hyperlink with a Github URL pinned to a specific commit.
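For anyone reading along, wiring up one of these 8-bit optimizers looks roughly like the sketch below; the class paths follow the links in the hunk above (the torchao one lives under a prototype namespace and may move).

```python
# Rough sketch of swapping in an 8-bit optimizer. Class paths follow the links in
# the hunk above; the torchao one lives under a prototype namespace and may move.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = AdamW8bit(model.parameters(), lr=2e-5)

# bitsandbytes alternative for single-device recipes:
# from bitsandbytes.optim import PagedAdamW8bit
# optimizer = PagedAdamW8bit(model.parameters(), lr=2e-5)

loss = model(torch.randn(2, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a torchtune config this corresponds to pointing the ``optimizer._component_`` field at one of these classes.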
Left a few more small comments, otherwise looks great. Thanks for making these changes!
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
Context
What is the purpose of this PR? Is it to