update memory optimization tutorial #1948
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1948
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1e4a198 with merge base 9eced21.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##             main    #1948      +/-   ##
==========================================
- Coverage   68.40%   65.74%   -2.66%
==========================================
  Files         311      311
  Lines       16973    16973
==========================================
- Hits        11610    11159     -451
- Misses       5363     5814     +451
☔ View full report in Codecov by Sentry.
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed due to the overhead of moving tensors between GPU VRAM and CPU. We minimize it by using a different stream, so you may not experience any slow down. This **should** be used alongside activation checkpointing."
":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. However, it is not compatible with optimizer in backward. Use it when training adapters, e.g. LoRA, which don't benefit from optimizer in backward, or when you can already fit a batch with your memory, but not enough of them."
":ref:`glossary_low_precision_opt`", "When you have a large model (optimizer state is 2x model size) and need to further reduce memory. Note that lower precision optimizers may reduce training stability/accuracy."
I feel like we are making these so-called summaries longer rather than shorter, which I do not like. Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc). The individual items further down should be responsible for getting more into the weeds.
Also the offloading blurb is too handwavy. (Again would benefit from being moved down to the more detailed section to give it the proper treatment, as a first-time reader just trying to understand my options I definitely do not care that it's using a separate stream)
> Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc).
I'd be up for this, happy to do it in a follow up.
@SalmanMohammadi, I am gathering data to write a blog and add a table. After I have a draft, I would love to have your input. I should have it prob in a week.
I simplified and removed info.
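As context for the gradient accumulation entry above, here is a minimal sketch in plain PyTorch (not torchtune's recipe code) of what the technique does, and why it conflicts with optimizer-in-backward: the accumulated gradients must persist across micro-batches until the single optimizer step, while optimizer-in-backward steps and frees each gradient inside ``.backward()``.

```python
# Generic gradient accumulation sketch in plain PyTorch (not torchtune's recipe code).
# Gradients from several micro-batches are summed before a single optimizer step,
# simulating a larger batch size. Optimizer-in-backward instead steps and frees each
# gradient inside .backward(), which is why the two cannot be combined.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

for step in range(16):
    micro_batch = torch.randn(8, 1024, device=device)
    loss = model(micro_batch).pow(2).mean()
    (loss / accum_steps).backward()  # scale so the summed gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```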
":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy."
":ref:`glossary_qlora`", "When you are training a large model and quantizing it will save significant memory, at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
I know this was already here but I think we may wanna take it out. Like DoRA will increase memory and reduce training speed compared to LoRA, I think it does not fit in as well with the other techniques mentioned here.
Is this to say we should take DoRA from this whole doc?
Yeah the more I think about it I think we should. Since it's more of a modeling improvement on top of an existing memory saving technique. But lmk if you strongly disagree
In general I like to see features that we offer documented somewhere rather than not documented, so whether we put this here or e.g. move the section to the LoRA recipe docs to let users know they can use it, or as another option under the LoRA section, I'm easy.
I left DoRA as "a variant of LoRA" as a middle ground.
Eh I like this even less. Now it is both (a) there and (b) uninformative. Sorry I am being a PITA. Let's say "a variant of LoRA that can improve model performance at the cost of slightly more memory"?
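To make the LoRA vs. DoRA discussion concrete, here is a didactic LoRA-style adapter in plain PyTorch; it is only an illustration of where the memory savings come from, not torchtune's implementation. Only the small low-rank matrices are trainable, so gradient and optimizer state shrink accordingly; DoRA layers an extra learned magnitude on top, which is why it costs a bit more memory and speed than plain LoRA.

```python
# Didactic LoRA-style linear layer (illustration only, not torchtune's LoRALinear).
# The frozen base weight carries no gradient or optimizer state; only the small
# low-rank A/B matrices are trained, which is where the memory savings come from.
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained weight
        self.lora_a = nn.Linear(in_dim, rank, bias=False)     # trainable
        self.lora_b = nn.Linear(rank, out_dim, bias=False)    # trainable
        nn.init.zeros_(self.lora_b.weight)                    # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = TinyLoRALinear(4096, 4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} params")  # ~65K of ~16.8M
```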
* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.
* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size.
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` in any distributed recipe. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details.
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details.
Is :class:`torch.distributed.fsdp.FullyShardedDataParallel` still the best API ref here?
yes
I will add this link comparing FSDP1 and FSDP2: https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
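For reference, here is a sketch of the CPU-offload optimizer these constraints refer to, based on torchao's prototype ``low_bit_optim`` module; since it is a prototype API, the exact names and signature may shift.

```python
# Sketch of the CPU-offload optimizer discussed above, assuming torchao's prototype
# API at the time of this PR. Optimizer state lives in CPU RAM and the optimizer
# step runs on CPU; with offload_gradients=True, gradients are moved off the GPU
# every backward pass, which is why gradient accumulation must stay at 1.
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # wrapped optimizer class; the actual step runs on CPU
    offload_gradients=True,  # incompatible with gradient accumulation > 1
    lr=1e-5,
)

loss = model(torch.randn(2, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```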
When considering using QLoRA to reduce memory usage, it's worth noting that a) QLoRA is slower than LoRA and may not be worth it if
the model you are finetuning is small; b) QLoRA prevents accuracy degradation during quantization by up-casting quantized parameters
I think this inline enumeration will not look great in the live docs. Also you can probably be more explicit about the memory savings, right? Doesn't it save roughly a constant 1.5 bytes * (# of model parameters) compared to LoRA? In general would try to be less handwavy about memory/perf statements in this doc wherever possible
removed the a/b and added 1.5 bytes
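A quick back-of-the-envelope check of the 1.5 bytes/parameter figure, assuming bf16 frozen weights (2 bytes/param) versus NF4 (roughly 0.5 bytes/param, ignoring the small overhead of quantization scales):

```python
# Rough estimate of QLoRA's savings on the frozen base weights relative to LoRA:
# bf16 stores 2 bytes/param, NF4 roughly 0.5 bytes/param -> ~1.5 bytes/param saved.
num_params = 7e9  # e.g. a 7B parameter model
savings_gib = 1.5 * num_params / 2**30
print(f"~{savings_gib:.1f} GiB saved")  # ~9.8 GiB for a 7B model
```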
Looks like doc build job is failing
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter, half of fp32."
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing."
Is this right? You should always activation offload when using activation checkpointing?
The other way around: you always need checkpointing when using offloading. Yes, otherwise it's painfully slow.
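For readers who want the mechanics: activation checkpointing is plain ``torch.utils.checkpoint`` recomputation, and offloading parks the saved activations in CPU RAM instead of recomputing them. A generic sketch of the checkpointing half (illustrative only, not torchtune's recipe flags):

```python
# Generic activation checkpointing sketch with plain PyTorch (illustrative only,
# not torchtune's recipe code). Activations inside the checkpointed block are freed
# after the forward pass and recomputed during backward, trading speed for memory;
# offloading would instead keep the saved tensors in CPU RAM.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

x = torch.randn(8, 4096, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False)  # recompute activations in backward
out.sum().backward()
```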
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, performing optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy."
":ref:`glossary_qlora`", "When you are training a large model, since quantization will save 1.5 bytes * (# of model parameters), at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "A variant of LoRA."
Which may use slightly more memory and reduce training speed compared to LoRA? idk
Maybe this is ok to keep in the section with the details?
thanks for your patience here : )
All of our recipes support lower-precision optimizers from the `torchao <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim>`_ library.
For single device recipes, we also support `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_ .

A good place to start might be the :class:`torchao.prototype.low_bit_optim.torchao.AdamW8bit` and :class:`bitsandbytes.optim.PagedAdamW8bit` optimizers.
Not a huge deal, but these :class: annotations don't actually render. If we wanna link out to something, prob better to just use a hyperlink with a Github URL pinned to a specific commit.
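For anyone reading along, wiring up one of these 8-bit optimizers looks roughly like the sketch below; the class paths follow the links in the hunk above (the torchao one lives under a prototype namespace and may move).

```python
# Rough sketch of swapping in an 8-bit optimizer. Class paths follow the links in
# the hunk above; the torchao one lives under a prototype namespace and may move.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = AdamW8bit(model.parameters(), lr=2e-5)

# bitsandbytes alternative for single-device recipes:
# from bitsandbytes.optim import PagedAdamW8bit
# optimizer = PagedAdamW8bit(model.parameters(), lr=2e-5)

loss = model(torch.randn(2, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a torchtune config this corresponds to pointing the ``optimizer._component_`` field at one of these classes.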
Left a few more small comments, otherwise looks great. Thanks for making these changes!
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
Context
What is the purpose of this PR? Is it to