Fix segfault in training unit tests #2929

Closed

Conversation

@sryap (Contributor) commented Aug 2, 2024

Summary:
Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.
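
For illustration only, here is a minimal sketch of the hazardous pattern described above. The `set_cuda` signature and the `FakeSSD` stub are assumptions made for the sketch, not the actual FBGEMM test code:

```python
import torch

class FakeSSD:
    """Stub standing in for the SSD-TBE handle under test; the real
    set_cuda is non-blocking and keeps reading its arguments after
    it returns."""
    def set_cuda(self, indices, weights, count, timestep):
        pass  # the real op enqueues asynchronous work that reads `weights`

ssd_db, N, D = FakeSSD(), 4, 8

for t in range(3):
    # BUG: `weights` is loop-local, so rebinding it on the next iteration
    # can free its storage while the asynchronous set_cuda is still
    # reading it: the use-after-free behind the segfault.
    weights = torch.randn(N, D)
    ssd_db.set_cuda(torch.arange(N), weights, torch.tensor([N]), t)
```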

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.
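
And a matching sketch of the fix as described, under the same assumed signature: the tensor is hoisted out of the for-loop, and a synchronization ensures each pending `set_cuda` call is done with the buffer before it is overwritten or freed:

```python
import torch

class FakeSSD:
    """Stub standing in for the SSD-TBE handle (signature assumed)."""
    def set_cuda(self, indices, weights, count, timestep):
        pass  # the real op enqueues asynchronous work that reads `weights`

ssd_db, N, D = FakeSSD(), 4, 8

# Fix: allocate once, outside the loop, so the storage outlives every
# pending set_cuda call.
weights = torch.empty(N, D)
for t in range(3):
    weights.copy_(torch.randn(N, D))  # refill the long-lived buffer
    ssd_db.set_cuda(torch.arange(N), weights, torch.tensor([N]), t)
    # Wait until the pending work is done with `weights` before the next
    # iteration overwrites it (and before it goes out of scope).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
```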

Differential Revision: D60627636

netlify bot commented Aug 2, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 7c4b276 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66ad6d9e1a208e00082cb34e |
| 😎 Deploy Preview | https://deploy-preview-2929--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot (Contributor) commented
This pull request was exported from Phabricator. Differential Revision: D60627636

@sryap force-pushed the export-D60627636 branch from 8c0f44e to 29f5b1c on August 2, 2024 at 23:30
sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 2, 2024
Summary:
X-link: facebookresearch/FBGEMM#30

Pull Request resolved: pytorch#2929

Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.

Reviewed By: duduyi2013

Differential Revision: D60627636

@facebook-github-bot (Contributor) commented
This pull request was exported from Phabricator. Differential Revision: D60627636

@facebook-github-bot (Contributor) commented
This pull request has been merged in 9cbf073.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#30

X-link: pytorch#2929

Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.

Reviewed By: duduyi2013

Differential Revision: D60627636

fbshipit-source-id: a2016b9b23a154513bf851c07f6bdce4e7da70a6