Fix the hang issue in some TBE GPU optimizers #2509

Closed

Conversation

sryap
Contributor

@sryap sryap commented Apr 17, 2024

Summary:
Previously, some TBE optimizer unit tests hung indefinitely, causing
the unit tests to time out. We were able to reproduce this problem
consistently by using the config in D50612178 (composed by ezyang).
The main characteristics of this config are (1) the optimizer is
PARTIAL_ROWWISE_ADAM, (2) the embedding dimension is less than 32, and
(3) it contains long segments (i.e., some indices are repeated with
extremely high counts).

Upon investigation, we identified that the value reduction in
PARTIAL_ROWWISE_ADAM was implemented incorrectly. The optimizer
intended to perform a value reduction within a sub-warp (i.e., a group
of threads in a warp) rather than across the entire warp. (Note that
sub-warp reduction is used when the embedding dimension is smaller
than the warp size.) However, it did not pass a correct `shfl_sync`
mask: the incorrect mask expected the entire warp to participate in
the reduction. When the segment length is long (> 32), only one
sub-warp reaches the reduction, so the other threads named in the mask
never arrive. This warp divergence caused the kernel execution to
freeze. (Note that the reduction is a collective operation.)
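
For illustration, here is a minimal, hypothetical CUDA sketch of the failure mode (not the actual FBGEMM kernel; `kGroupSize`, the function name, and the 1-D thread layout are assumptions for the example). Each sub-warp of `kGroupSize` consecutive lanes reduces its own value, but the shuffle is issued with a full-warp mask:

```cuda
// Hypothetical sketch of the buggy pattern: a sub-warp reduction issued
// with a full-warp shfl_sync mask.
template <int kGroupSize>
__device__ float subwarp_sum_buggy(float val) {
  for (int offset = kGroupSize / 2; offset > 0; offset >>= 1) {
    // BUG: 0xffffffff claims all 32 lanes of the warp participate, but
    // only the kGroupSize lanes of this sub-warp actually execute the
    // call. If the rest of the warp has diverged (e.g., it already
    // finished its segment), this collective never completes.
    val += __shfl_down_sync(0xffffffff, val, offset, kGroupSize);
  }
  return val;
}
```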

This diff fixes the issue by passing a correct mask when invoking the
reduction function.
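
A sketch of the corrected call, under the same assumptions as above: the mask now names only the lanes of the calling sub-warp, so the shuffle is collective over exactly the threads that execute it (the mask computation below is illustrative, not the exact FBGEMM code):

```cuda
// Hypothetical sketch of the fix: build a mask that covers only this
// sub-warp's lanes and pass it to the shuffle.
template <int kGroupSize>
__device__ float subwarp_sum_fixed(float val) {
  // Assumes a 1-D thread block and kGroupSize < 32 (the sub-warp case).
  const unsigned lane = threadIdx.x % 32;    // lane id within the warp
  const unsigned group = lane / kGroupSize;  // which sub-warp this lane belongs to
  const unsigned mask =
      ((1u << kGroupSize) - 1u) << (group * kGroupSize);
  for (int offset = kGroupSize / 2; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(mask, val, offset, kGroupSize);
  }
  return val;
}
```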

Reviewed By: shintaro-iwasaki

Differential Revision: D56223375

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56223375

netlify bot commented Apr 17, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 4d1d931
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6620089044c9a90008a6c713
😎 Deploy Preview: https://deploy-preview-2509--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

This pull request has been merged in 3f4f98f.
