Normalize CE loss by total number of (non-padding) tokens #1875
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1875
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit d5ff9ec with merge base e030626. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@@ -80,4 +80,4 @@ def forward(self, logits: List[torch.Tensor], labels: torch.Tensor) -> torch.Tensor:
         for logits_chunk, labels_chunk in zip(logits, labels):
             total_loss += self.compute_cross_entropy(logits_chunk, labels_chunk)

-        return total_loss / total_elements
+        return total_loss
Isn't this unnormalized?
Yes, we need to decide where to divide by the number of tokens. This version of the PR does it all in the recipe. Even if we continue normalizing it here, we would then need to do something like running_loss += self._loss_step(batch) * current_num_tokens in the recipe, which is also a bit awkward.
lgtm! thanks for fixing that
In honor of the day the ML community first discovered the fact that (x1 / n1) + (x2 / n2) != (x1 + x2) / (n1 + n2)
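(A toy illustration with made-up numbers: if batch 1 has a summed loss of 2 over 1 non-padding token and batch 2 has a summed loss of 3 over 3 non-padding tokens, averaging the per-batch means gives (2/1 + 3/3) / 2 = 1.5, while the true per-token mean is (2 + 3) / (1 + 3) = 1.25.)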
This PR changes how we calculate the loss when gradient accumulation is enabled. This way we'll get an exact match in loss curves with and without gradient accumulation.
The approach
Keep a running tally of the number of unmasked tokens in the recipe. Don't actually change our loss implementations (so they are still normalized by number of non-padding tokens in a batch), but when we get a batch's loss in the recipe we now just multiply by the number of non-padding tokens in that batch to get the unnormalized loss.
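A minimal sketch of that recipe-side step, with hypothetical names (loss_step, ignore_idx, the batch keys) rather than the exact torchtune recipe attributes:

```python
# Sketch only: loss_step, ignore_idx, and the batch keys are illustrative names,
# not the actual torchtune recipe attributes.
import torch
import torch.nn as nn

def loss_step(model: nn.Module, batch: dict, loss_fn: nn.Module, ignore_idx: int = -100):
    labels = batch["labels"]         # [batch_size, seq_len]
    logits = model(batch["tokens"])  # [batch_size, seq_len, vocab_size]
    # The loss implementation is unchanged: still a mean over this batch's non-padding tokens.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    # Multiply back by the batch's non-padding token count to get the unnormalized loss,
    # so losses from batches with different token counts can be summed correctly.
    current_num_tokens = (labels != ignore_idx).sum()
    return loss * current_num_tokens, current_num_tokens
```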
Previously we called .backward() after every batch (after dividing the loss by the number of gradient accumulation steps). Now we can't do that, because we need to accumulate all the losses to do proper normalization. So we instead wait until it's time to step, divide our accumulated loss by the total number of tokens seen across all batches in the step, then call loss.backward().
Note: as a side effect, our tokens/sec now logs only non-padding tokens. So yes, the tokens/sec we see in our logs will decrease, but it will also now be more representative of meaningful throughput (and you won't have to listen to me complaining about misleading tokens/sec anymore).
Test plan
Updated a bunch of recipe tests to explicitly test with gradient accumulation enabled. Note that previously these tests would fail as the loss values would not match (see e.g. this comment in a test that we added specifically to check parity of gradient accumulation when all samples have the same sequence length), but now we get the same loss values regardless of whether or not gradient accumulation is enabled.
E2E tests
For all E2E tests, we compare the following four cases:
We also change the logging of num_tokens_per_second on main to match what's in this PR for a fair comparison.
Llama 3 8B full finetune, single device
TLDR: we get the same loss curves for cases (1), (3), and (4). The updated gradient accumulation logic also increases tokens/second and reduces peak allocated memory. Full wandb workspace here
Loss curves:
Peak allocated memory:
Tokens/sec:
Qwen 2 1.5B with LoRA on two devices
TLDR: same loss curves for (1), (3) and (4). There is a slight increase in peak allocated memory, but also a pretty big jump in tokens/sec. Full wandb workspace
Loss curves:
Peak allocated memory:
Tokens/sec: