Skip to content

Add per-layer compile support to recipes #1419

Merged: 11 commits, Aug 28, 2024

Conversation

@yf225 (Contributor) commented Aug 27, 2024

This PR enables per-layer compile for our single-device LoRA, single-device full finetune, and FSDP2 LoRA recipes. FSDP2 full finetune will be handled in a follow-up.
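"Per-layer compile" here means calling torch.compile on each transformer layer in place rather than wrapping the whole model in a single torch.compile call, which keeps compile time and compile-time memory down. A rough sketch of the idea (the helper name and the layer-class argument are illustrative, not the exact recipe code):

import os
from typing import Type

import torch.nn as nn


def compile_per_layer(model: nn.Module, layer_cls: Type[nn.Module]) -> None:
    # Compile each transformer block in place instead of wrapping the whole
    # model in one torch.compile call; nn.Module.compile() swaps the module's
    # forward for a compiled version, so state_dict keys are unchanged.
    backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
    for m in model.modules():
        if isinstance(m, layer_cls):
            m.compile(backend=backend)

The recipes still drive this through the existing compile=True flag (see the repro commands below).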

Results

All recipes were run with three different configurations: (1) per-layer compile (this PR), (2) full-model compile (i.e. compile=True on main), (3) no compile.

QLoRA single-device

WandB results


Repro

TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CUDA_VISIBLE_DEVICES=0 tune run lora_finetune_single_device \
--config llama3/8B_qlora_single_device model.lora_rank=16 optimizer=bitsandbytes.optim.AdamW8bit \
 gradient_accumulation_steps=4 tokenizer.max_seq_len=2048 max_steps_per_epoch=100 \
 metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logger.project=pr-1419 compile=True \
 log_peak_memory_stats=True metric_logger.name=qlora-per-layer-compile

LoRA

WandB results


Repro

TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CUDA_VISIBLE_DEVICES=1 tune run \
lora_finetune_single_device --config llama3/8B_lora_single_device model.lora_rank=16 \
 optimizer=bitsandbytes.optim.AdamW8bit gradient_accumulation_steps=4 tokenizer.max_seq_len=2048 \
 max_steps_per_epoch=100 model.lora_attn_modules=['q_proj','k_proj','v_proj','output_proj'] \
 model.apply_lora_to_mlp=True metric_logger=torchtune.utils.metric_logging.WandBLogger \
metric_logger.project=pr-1419 compile=True log_peak_memory_stats=True metric_logger.name=lora-per-layer-compile

FFT

WandB results


Repro

TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CUDA_VISIBLE_DEVICES=2 \
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.alpaca_cleaned_dataset optimizer=bitsandbytes.optim.AdamW8bit \
gradient_accumulation_steps=4 tokenizer.max_seq_len=2048 max_steps_per_epoch=100 epochs=1 \
optimizer_in_bwd=False metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logger.project=pr-1419 \
metric_logger.name=fft-per-layer-compile compile=True log_peak_memory_stats=True

LoRA FSDP2

Repro only, to verify nothing is broken; perf work will follow in a separate PR.

Repro

First change dev/llama2/70B_qlora_fsdp2.yaml recipe to this, then run

tune run --nproc_per_node 2 lora_finetune_fsdp2 --config llama2/70B_qlora max_steps_per_epoch=10
1|10|Loss: 1.7545080184936523: 100%|█████████████| 10/10 [06:03<00:00, 36.29s/it]

We also consolidated E2E time over 100 steps and compile time for QLoRA, LoRA, and full finetunes of Llama3 8B on a single device. Note that compile on main currently OOMs due to what looks like a memory leak, so we don't have E2E times for models compiled on main.

[screenshot: consolidated E2E time and compile time table]

pytorch-bot bot commented Aug 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1419

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4678ce9 with merge base 77fbb4f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 27, 2024
@felipemello1 (Contributor) left a comment


Thanks, Will!

A couple of thoughts: I personally don't like that we have two flags for compiling, and would prefer to avoid cases like:

compile: False
per_layer_compile: True

IMO, we should either:

  1. If compile is true, enable per_layer_compile by default.
  2. Make compile a nested config, like:

     compile:
       per_layer: True
       loss: True

I prefer 1, as I don't think the user needs this level of control through configs (quick sketch below). Any thoughts?
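For concreteness, a minimal sketch of option 1, with a single compile flag driving both the per-layer model compile and the loss compile (function and argument names are illustrative, not the recipes' actual API):

import os
from typing import Type

import torch
import torch.nn as nn
from omegaconf import DictConfig


def setup_compile(cfg: DictConfig, model: nn.Module, loss_fn: nn.Module,
                  layer_cls: Type[nn.Module]) -> nn.Module:
    # Option 1: a single boolean `compile` flag implies per-layer model compile
    # plus loss compile; no separate `per_layer_compile` key in the config.
    if cfg.get("compile", False):
        backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
        for m in model.modules():
            if isinstance(m, layer_cls):
                m.compile(backend=backend)
        loss_fn = torch.compile(loss_fn, backend=backend)
    return loss_fn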

Second:
Do you think you could test it for our distributed recipe that already uses FSDP2? Or should it be another PR?

@SalmanMohammadi (Collaborator) commented Aug 27, 2024

Would this also enable optimizer in bwd 🤝 compile?

if self._model_compile:
    log.info("Compiling loss with torch.compile...")
    backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
    self._loss_fn = torch.compile(self._loss_fn, backend=backend)
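For reference, "optimizer in bwd" is the optimizer-step-in-backward setup; a minimal sketch of that pattern, independent of this PR (names are illustrative):

import torch
import torch.nn as nn


def setup_optimizer_in_backward(model: nn.Module, lr: float = 2e-5) -> None:
    # One optimizer per parameter, stepped from a post-accumulate-grad hook so
    # each gradient can be applied and freed during the backward pass.
    optim_dict = {
        p: torch.optim.AdamW([p], lr=lr)
        for p in model.parameters()
        if p.requires_grad
    }

    def optim_step(param: torch.Tensor) -> None:
        optim_dict[param].step()
        optim_dict[param].zero_grad()

    for p in optim_dict:
        p.register_post_accumulate_grad_hook(optim_step)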
@felipemello1 (Contributor) commented Aug 27, 2024


Chunked CE can only have compile applied to the CE + upcasting part. If the chunking is compiled with it, it loses the benefit :/

I think we can leave the loss compile outside of this PR if chunked CE will be the default.

Edit: scratch that. I'll add something like this to the Chunked CE PR:

loss_fn = config.instantiate(cfg.loss)
if isinstance(loss_fn, ChunkedCrossEntropy):
    # compile only the inner CE + upcast; leave the chunking in eager
    loss_fn._cross_entropy.compile()
else:
    loss_fn.compile()

@gau-nernst (Contributor) commented:

Came across this PR. @yf225 am I seeing a memory leak for "qlora-compile-main"? If you are using a torch nightly after 20240824, it might be the same as what I'm seeing in pytorch/pytorch#134642.

@ebsmothers (Contributor) commented:

> Came across this PR. @yf225 am I seeing a memory leak for "qlora-compile-main"? If you are using a torch nightly after 20240824, it might be the same as what I'm seeing in pytorch/pytorch#134642.

@gau-nernst yeah same here. I'm not sure what's going on because I've been compiling models with no problems for a while. But as of a couple days ago I also started seeing what looks like a memory leak. Attaching a memory viz I ran on a full finetune; seems like a bunch of memory is not getting freed after each step.

[memory snapshot visualization]

@@ -230,6 +231,10 @@ def setup(self, cfg: DictConfig) -> None:
)

self._loss_fn = config.instantiate(cfg.loss)
if self._model_compile:
A Contributor commented:

nit: should we rename "model_compile" to just "compile", since this flag will control loss compile and flex attention compile as well?

A Contributor replied:

Not a priority; I'm fine to keep it as is for now, and maybe add a TODO to rename.

A Contributor replied:

Yeah, I think our naming here is not great. Let's do it in a follow-up; rather than add a TODO in the code, I may just create an issue.

@SalmanMohammadi (Collaborator) commented Aug 28, 2024


@ebsmothers merged commit 9629a36 into pytorch:main on Aug 28, 2024. 20 checks passed.
@yf225 (Contributor, Author) commented Aug 28, 2024

@gau-nernst @ebsmothers For the full-model compile memory leak issue, this should now be resolved by reverting pytorch/pytorch#134272.

In this PR (#1419) we've also switched to per-layer compile, which doesn't hit the memory leak issue.

Labels: CLA Signed

6 participants