Fix the hang issue in some TBE GPU optimizers #2509

Closed

Conversation

sryap
Contributor

@sryap sryap commented Apr 17, 2024

Summary:
Previously, some TBE optimizer unit tests hung indefinitely, causing
the unit tests to time out. We were able to reproduce this problem
consistently by using the config in D50612178 (composed by ezyang).
The main characteristics of this config are (1) the optimizer is
PARTIAL_ROWWISE_ADAM, (2) the embedding dimension is less than 32, and
(3) it contains long segments (i.e., some indices are repeated with
extremely high counts).

Upon investigation, we identified that the value reduction in
PARTIAL_ROWWISE_ADAM was implemented incorrectly. The optimizer
intended to perform a value reduction within a sub-warp (i.e., a group
of threads in a warp) rather than across the entire warp. (Note that
sub-warp reduction is used when the embedding dimension is smaller
than the warp size.) However, it did not pass a correct `shfl_sync`
mask: the incorrect mask expected the entire warp to participate in
the reduction. When the segment length is long (> 32), only one
sub-warp reaches the reduction, so the other threads named in the mask
never arrive. This warp divergence caused the kernel execution to
freeze. (Note that the reduction is a collective operation.)
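
For illustration, here is a minimal, hypothetical CUDA sketch of the failure mode (not the actual FBGEMM kernel; `kGroupSize`, the function name, and the 1-D thread layout are assumptions for the example). Each sub-warp of `kGroupSize` consecutive lanes reduces its own value, but the shuffle is issued with a full-warp mask:

```cuda
// Hypothetical sketch of the buggy pattern: a sub-warp reduction issued
// with a full-warp shfl_sync mask.
template <int kGroupSize>
__device__ float subwarp_sum_buggy(float val) {
  for (int offset = kGroupSize / 2; offset > 0; offset >>= 1) {
    // BUG: 0xffffffff claims all 32 lanes of the warp participate, but
    // only the kGroupSize lanes of this sub-warp actually execute the
    // call. If the rest of the warp has diverged (e.g., it already
    // finished its segment), this collective never completes.
    val += __shfl_down_sync(0xffffffff, val, offset, kGroupSize);
  }
  return val;
}
```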

This diff fixes the issue by passing a correct mask when invoking the
reduction function.
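
A sketch of the corrected call, under the same assumptions as above: the mask now names only the lanes of the calling sub-warp, so the shuffle is collective over exactly the threads that execute it (the mask computation below is illustrative, not the exact FBGEMM code):

```cuda
// Hypothetical sketch of the fix: build a mask that covers only this
// sub-warp's lanes and pass it to the shuffle.
template <int kGroupSize>
__device__ float subwarp_sum_fixed(float val) {
  // Assumes a 1-D thread block and kGroupSize < 32 (the sub-warp case).
  const unsigned lane = threadIdx.x % 32;    // lane id within the warp
  const unsigned group = lane / kGroupSize;  // which sub-warp this lane belongs to
  const unsigned mask =
      ((1u << kGroupSize) - 1u) << (group * kGroupSize);
  for (int offset = kGroupSize / 2; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(mask, val, offset, kGroupSize);
  }
  return val;
}
```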

Reviewed By: shintaro-iwasaki

Differential Revision: D56223375

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56223375

netlify bot commented Apr 17, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 4d1d931
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6620089044c9a90008a6c713
😎 Deploy Preview: https://deploy-preview-2509--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

This pull request has been merged in 3f4f98f.
