Replies: 1 comment
A few things to try:
- Use a smaller batch size (the easiest and most obvious option).
- Use a different activation checkpointing configuration (try setting the contiguous_memory_optimization parameter to false to see whether that helps); a config sketch follows this list.
- Use a different optimizer (some optimizers, such as AdamW, use more memory than others).
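A minimal sketch of where these settings live in a DeepSpeed config, written as the Python dict that would normally be serialized to ds_config.json. Every value here is an illustrative assumption except contiguous_memory_optimization, which is the specific flag this reply suggests flipping.

```python
import json

# Illustrative DeepSpeed config fragment for the suggestions above.
# All values are placeholders; only contiguous_memory_optimization=False is
# the specific change suggested in this reply.
ds_config_fragment = {
    # A smaller per-GPU micro batch size is the quickest way to cut activation memory.
    "train_micro_batch_size_per_gpu": 1,
    "activation_checkpointing": {
        "partition_activations": True,
        # When True, DeepSpeed copies checkpointed activations into a
        # pre-allocated contiguous buffer, which can itself cost memory;
        # the suggestion is to try False and see whether the OOM goes away.
        "contiguous_memory_optimization": False,
        # Optionally push checkpointed activations to CPU as well.
        "cpu_checkpointing": True,
    },
}

# Written out in the usual ds_config.json form.
with open("ds_config_fragment.json", "w") as f:
    json.dump(ds_config_fragment, f, indent=2)
```

The optimizer suggestion does not map to a single key in this fragment; the point is simply that AdamW keeps two extra state tensors per trainable parameter, so a lighter optimizer shrinks that part of the footprint.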
Here is my config.
I'm using LoRA (https://arxiv.org/abs/2106.09685), so the gradients and optimizer states shouldn't take much memory, since the number of trainable parameters is very small.
But I get OOM during the forward pass, even though this is already ZeRO-3 with offload. Are there ways to reduce the forward memory usage? Thanks.
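The attached config itself is not reproduced in the thread, so for context here is a generic sketch of a ZeRO-3-with-CPU-offload setup of the kind described, again as the Python dict form of ds_config.json. All values (batch sizes, dtype, stage-3 thresholds) are illustrative assumptions, not the poster's actual settings.

```python
# Generic ZeRO-3 + CPU offload sketch (illustrative values only; not the
# poster's actual config, which is not shown in the thread).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Offload full parameters and optimizer states to CPU. With LoRA only
        # the small adapter weights are trainable, so gradient and optimizer
        # memory stays modest, as noted above.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # These stage-3 knobs bound how many gathered parameters stay resident
        # on the GPU during the forward pass; per the DeepSpeed docs, smaller
        # values use less GPU memory at the cost of more communication.
        "stage3_max_live_parameters": 100_000_000,
        "stage3_prefetch_bucket_size": 50_000_000,
        "stage3_param_persistence_threshold": 100_000,
    },
}
```

If the OOM really happens in the forward pass, the usual levers in a config like this are the micro batch size and the stage3_* limits above, plus activation checkpointing as suggested in the reply.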