Fix IMA in TBE grad indices kernel for int32 indices #3877


Closed
sryap wants to merge 1 commit from the export-D71796826 branch

Conversation

@sryap (Contributor) commented Mar 25, 2025

Differential Revision: D71796826

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D71796826

@netlify bot commented Mar 25, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 4068c0e |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67e332d8c70aaf0008f8ac18 |
| 😎 Deploy Preview | https://deploy-preview-3877--pytorch-fbgemm-docs.netlify.app |

Summary:
This diff forces the weight tensor index to be cast to
`overflow_safe_int_t`, which is int64_t, to address the 32-bit integer
overflow issue that occurs when the TBE `indices` tensor is int32_t.

Before this diff, `idx_j` and `D_emb` are both int32_t when 32-bit int
`indices` are used.  When accessing the embedding `weights` tensor, the
index is computed as the product of `idx_j` and `D_emb`.  This product
can exceed the maximum value of int32_t, causing an integer overflow
and, in turn, an illegal memory access.

By forcing `idx_j` and `D_emb` to be int64_t, we prevent the 32-bit
integer overflow.  Using int64_t is safe because its maximum value far
exceeds the memory capacity of modern GPUs and CPUs.
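
Below is a minimal, self-contained sketch of the issue and the fix. It is not the actual FBGEMM kernel code: the values of `idx_j` and `D_emb` are made up, and only the widening cast mirrors the change described above.

```cpp
#include <cstdint>
#include <cstdio>

// Alias named after the type mentioned in the summary; illustrative only.
using overflow_safe_int_t = int64_t;

int main() {
  const int32_t idx_j = 20'000'000;  // hypothetical embedding row index
  const int32_t D_emb = 128;         // hypothetical embedding row stride

  // Before the fix: the product (2,560,000,000) exceeds INT32_MAX, so a
  // 32-bit multiply wraps around. The unsigned cast below just reproduces
  // the wrapped value without undefined behavior; in the kernel the bogus
  // offset causes the illegal memory access.
  const int32_t bad_offset = static_cast<int32_t>(
      static_cast<uint32_t>(idx_j) * static_cast<uint32_t>(D_emb));

  // After the fix: widen both operands first so the multiplication is
  // done in 64-bit arithmetic.
  const overflow_safe_int_t good_offset =
      static_cast<overflow_safe_int_t>(idx_j) *
      static_cast<overflow_safe_int_t>(D_emb);

  std::printf("32-bit offset: %d\n64-bit offset: %lld\n",
              bad_offset, static_cast<long long>(good_offset));
  return 0;
}
```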

We also add unit tests for this issue.


X-link: facebookresearch/FBGEMM#967

**Facebook:**

This is the fix for S498528.  The full root cause details are in
https://fburl.com/gdoc/lhmvenw3.

Reviewed By: brad-mengchi, spcyppt

Differential Revision: D71796826
@sryap force-pushed the export-D71796826 branch from 5ab4269 to 4068c0e on March 25, 2025 at 22:48
@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D71796826

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 6a6db7c.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
X-link: pytorch#3877

Pull Request resolved: facebookresearch/FBGEMM#967


Reviewed By: brad-mengchi, spcyppt

Differential Revision: D71796826

fbshipit-source-id: df7c8e06d36e2f06585a5812df8ae40863ea6253