Add Selective Activation Checkpointing #785
Conversation
✅ No failures as of commit 1412378 with merge base a79554e. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/785.
Some noob questions, but starting to look good!
Thanks so much for patiently addressing all of the comments!
# tune download meta-llama/Meta-Llama-3-8B --output-dir /tmp/Meta-Llama-3-8B --hf-token <HF_TOKEN>
#
# To launch on 4 devices, run the following command from root:
# tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full
Launch command does not seem correct
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
same comment as (9)
# Memory management
enable_activation_checkpointing: False
ac_mode: 'selective'  # ['selective', 'full']
ac_option: 2  # [int] = ac every positive int layer
This is an interesting way to configure it, but IMO it seems a little brittle. Is there any guidance around choosing this setting for users, and what happens if a user selects an integer that exceeds the number of layers in the model?
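For illustration only, a minimal sketch of the kind of guard this question is getting at, using a hypothetical helper (not part of the PR) and assuming ac_option is the every-x-layer integer:

```python
def validate_ac_option(ac_option: int, num_layers: int) -> None:
    # Hypothetical guard: with the modulo-based "checkpoint every ac_option-th
    # layer" rule, a value larger than the layer count wraps no layers at all.
    if not isinstance(ac_option, int) or ac_option <= 0:
        raise ValueError(f"ac_option must be a positive int, got {ac_option!r}")
    if ac_option > num_layers:
        raise ValueError(
            f"ac_option={ac_option} exceeds the model's {num_layers} layers; "
            "no layer would be checkpointed"
        )
```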
# the older version of AC and this behavior is unchanged
# ac_mode and ac_option together control selective AC. This is only enabled
# when these are set AND ``enable_activation_checkpointing`` is set to False
# We'll clean this up as soon as testing of AC is complete
Is there an issue and an owner?
ac_mode = ac_mode
ac_option = ac_option

if (not enable_activation_checkpointing) and (ac_mode is not None):
I understand why it's this way, but it still reads a little weird: we have a check for not enable_activation_checkpointing, and underneath that check we apply AC (granted, it is selective).
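To make the point concrete, a condensed sketch of the branch shape being described (a toy stand-in with hypothetical print output, not the recipe code):

```python
def setup_ac(model, enable_activation_checkpointing: bool, ac_mode, ac_option):
    # Toy sketch of the control flow under discussion: selective AC is applied
    # inside the branch that says the legacy AC flag is *off*, which is the
    # part that reads oddly.
    if (not enable_activation_checkpointing) and (ac_mode is not None):
        print(f"applying {ac_mode} AC with option {ac_option}")  # new path
    elif enable_activation_checkpointing:
        print("applying legacy full AC")  # old path, behavior unchanged
```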
model (nn.Module): Model to setup activation checkpointing.
ac_mode (str): Activation checkpointing mode. ['none', 'full', 'selective']
ac_option (Optional[Union[int, str]]): Activation checkpointing option.
    - If ac_mode is 'selective', ac_option can be an integer or a string
Why do we need both int and string?
ac_mode (str): Activation checkpointing mode. ['none', 'full', 'selective']
ac_option (Optional[Union[int, str]]): Activation checkpointing option.
    - If ac_mode is 'selective', ac_option can be an integer or a string
      representing the number of layers to checkpoint.
Is it the number of layers to checkpoint, or the number of layers to skip between checkpointed layers?
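Going by the PR description and the modulo logic later in the diff, it behaves as a stride rather than a count. A small sketch of which layer indices end up wrapped, assuming the 1-based counter used in the diff:

```python
def wrapped_layer_indices(num_layers: int, every_x_layer: int) -> list:
    # Stride semantics: with every_x_layer=2 on an 8-layer model this returns
    # [1, 3, 5, 7] (every other layer), not "2 layers total".
    count = 0
    wrapped = []
    for layer_id in range(num_layers):
        count += 1
        if count % every_x_layer == 0:
            wrapped.append(layer_id)
    return wrapped

print(wrapped_layer_indices(8, 2))  # [1, 3, 5, 7]
print(wrapped_layer_indices(8, 3))  # [2, 5]
```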
- If ac_mode is 'none' or 'full', ac_option is ignored.
"""

for layer_id, transformer_block in enumerate(model.layers):
We should document whether we're assuming the passed-in module requires a layers attribute.
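If that assumption is intended, one way to make it explicit is a small guard like this (hypothetical, not part of the PR):

```python
from torch import nn


def assert_has_layers(model: nn.Module) -> None:
    # Hypothetical guard making the implicit contract explicit: the model must
    # expose an iterable `layers` container of transformer blocks.
    if not hasattr(model, "layers"):
        raise ValueError(
            f"{type(model).__name__} has no 'layers' attribute; activation "
            "checkpointing setup expects to iterate over model.layers"
        )
```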
for layer_id, transformer_block in enumerate(model.layers):
    if ac_mode in ("full", "selective"):

        transformer_block = checkpoint_wrapper(
This in-place modification is quite risky and brittle in general; for example, it will have to be used carefully with FSDP and other wrapping APIs. Ideally we should do this via hooks, but of course that would require a lot of refactoring around our activation checkpointing infra.
""" | ||
|
||
for layer_id, transformer_block in enumerate(model.layers): | ||
if ac_mode in ("full", "selective"): |
Silently not applying AC if ac_mode is not within this tuple seems bad, especially if it's undocumented?
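A minimal sketch of the kind of explicit failure being suggested, assuming the mode values described elsewhere in this PR (hypothetical helper, not the PR's code):

```python
def check_ac_mode(ac_mode: str) -> None:
    # Hypothetical validation: fail loudly instead of silently skipping AC
    # when ac_mode is not one of the documented values.
    valid_modes = ("none", "full", "selective")
    if ac_mode not in valid_modes:
        raise ValueError(f"Invalid ac_mode {ac_mode!r}; expected one of {valid_modes}")
```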
    checkpoint_wrapper as ptd_checkpoint_wrapper,
    CheckpointImpl,
)
from torch.utils.checkpoint import checkpoint
Unused, I think? Not sure why the linter did not pick this up.
""" | ||
every_x_layer = int(ac_style) | ||
|
||
if not (every_x_layer >= 0): |
every_x_layer < 0 is simpler?
checkpoint_wrapper.__dict__.setdefault("_count", 0)

checkpoint_wrapper._count += 1
Think the bump should be done after the check
checkpoint_wrapper.__dict__.setdefault("_count", 0)

checkpoint_wrapper._count += 1
if not every_x_layer or checkpoint_wrapper._count % every_x_layer == 0:
every_x_layer == 0 seems better than if not every_x_layer, to guard against every_x_layer being None, which is an unexpected state for this variable.
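Putting both of these review suggestions together, a sketch of what the check could look like with an explicit zero comparison and the counter bumped after the decision (note this shifts the selection to 0-based, wrapping layers 0, 2, 4, ... for a stride of 2, whereas the diff's bump-before-check wraps layers 1, 3, 5, ...):

```python
def should_checkpoint(count: int, every_x_layer: int) -> bool:
    # Explicit comparison rather than truthiness; every_x_layer == 0 here
    # means "checkpoint every layer" (i.e. full AC).
    return every_x_layer == 0 or count % every_x_layer == 0


# Wrapping loop with the counter bumped *after* the decision.
count = 0
for layer_id in range(8):  # stand-in for enumerate(model.layers)
    if should_checkpoint(count, every_x_layer=2):
        print(f"wrap layer {layer_id}")  # wraps layers 0, 2, 4, 6
    count += 1
```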
Context
This PR updates activation checkpointing (AC) to support selective-layer and selective-op activation checkpointing.
It preserves the previously available options of full AC or none.
This is controlled in the yaml file via:
enable_activation_checkpointing: bool
ac_mode: ['full', 'selective']
ac_option: [int, 'op']
If ac_mode is 'selective', the type of selective AC is determined by ac_option.
An integer means checkpoint every x'th layer (i.e. 2 = checkpoint every other layer, 3 = every third, etc.).
'op' means run selective op AC, where checkpointing is filtered by the op policy.
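Putting the description above into code form, a rough sketch of how the two knobs combine (the helper and its return strings are illustrative placeholders, not the PR's actual functions):

```python
def resolve_ac_strategy(ac_mode, ac_option):
    # Illustrative mapping of (ac_mode, ac_option) to the behavior described
    # in this PR; the real setup wraps transformer layers instead of returning text.
    if ac_mode in (None, "none"):
        return "no activation checkpointing"
    if ac_mode == "full":
        return "checkpoint every transformer layer"
    if ac_mode == "selective":
        if ac_option == "op":
            return "selective op-level AC, filtered by the op policy"
        if isinstance(ac_option, int) and ac_option > 0:
            return f"checkpoint every {ac_option}'th transformer layer"
        raise ValueError(f"Unsupported ac_option for selective AC: {ac_option!r}")
    raise ValueError(f"Unknown ac_mode: {ac_mode!r}")


print(resolve_ac_strategy("selective", 2))     # checkpoint every 2'th transformer layer
print(resolve_ac_strategy("selective", "op"))  # selective op-level AC, filtered by the op policy
```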
As a general data point, on llama-13B selective AC 2 (every other layer) improved throughput by +10% over no AC.
I updated the testing for llama3-8B, adjusting the batch size under each setting to hit around 91GB. I used 8 GPUs so that model params have less impact and the batch size (and thus activations) can be tuned more finely. This is not always perfectly achievable since activations are chunky, but the net result was that selective AC 3 had the highest throughput, followed by no AC. Selective AC 3 gave +9% better throughput vs. the original full-only option.
For A100s, 4090s, etc., the actual best combination will vary, but the point here is that selective AC generally provides better throughput options than the simple binary of full AC (True/False).
Changelog
Test plan
This code is largely a port from the original source in torchtitan, where it has already been tested. However, I ran all 4 styles (none, full, selective AC op, selective AC 2) as shown above.
I also verified that the new implementation of full AC matches the memory savings of the previous implementation.