Fix segfault in training unit tests #2929

Closed

Conversation

@sryap (Contributor) commented Aug 2, 2024

Summary:
Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.
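
For illustration only, here is a minimal sketch of the hazardous pattern described above. The `set_cuda` signature and the `FakeSSD` stub are assumptions made for the sketch, not the actual FBGEMM test code:

```python
import torch

class FakeSSD:
    """Stub standing in for the SSD-TBE handle under test; the real
    set_cuda is non-blocking and keeps reading its arguments after
    it returns."""
    def set_cuda(self, indices, weights, count, timestep):
        pass  # the real op enqueues asynchronous work that reads `weights`

ssd_db, N, D = FakeSSD(), 4, 8

for t in range(3):
    # BUG: `weights` is loop-local, so rebinding it on the next iteration
    # can free its storage while the asynchronous set_cuda is still
    # reading it: the use-after-free behind the segfault.
    weights = torch.randn(N, D)
    ssd_db.set_cuda(torch.arange(N), weights, torch.tensor([N]), t)
```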

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.
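
And a matching sketch of the fix as described, under the same assumed signature: the tensor is hoisted out of the for-loop, and a synchronization ensures each pending `set_cuda` call is done with the buffer before it is overwritten or freed:

```python
import torch

class FakeSSD:
    """Stub standing in for the SSD-TBE handle (signature assumed)."""
    def set_cuda(self, indices, weights, count, timestep):
        pass  # the real op enqueues asynchronous work that reads `weights`

ssd_db, N, D = FakeSSD(), 4, 8

# Fix: allocate once, outside the loop, so the storage outlives every
# pending set_cuda call.
weights = torch.empty(N, D)
for t in range(3):
    weights.copy_(torch.randn(N, D))  # refill the long-lived buffer
    ssd_db.set_cuda(torch.arange(N), weights, torch.tensor([N]), t)
    # Wait until the pending work is done with `weights` before the next
    # iteration overwrites it (and before it goes out of scope).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
```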

Differential Revision: D60627636

netlify bot commented Aug 2, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 7c4b276 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66ad6d9e1a208e00082cb34e |
| 😎 Deploy Preview | https://deploy-preview-2929--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot (Contributor) commented
This pull request was exported from Phabricator. Differential Revision: D60627636

@sryap force-pushed the export-D60627636 branch from 8c0f44e to 29f5b1c on August 2, 2024 at 23:30
sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 2, 2024
Summary:
X-link: facebookresearch/FBGEMM#30

Pull Request resolved: pytorch#2929

Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.

Reviewed By: duduyi2013

Differential Revision: D60627636

@facebook-github-bot (Contributor) commented
This pull request was exported from Phabricator. Differential Revision: D60627636

@facebook-github-bot (Contributor) commented
This pull request has been merged in 9cbf073.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#30

X-link: pytorch#2929

Before this diff, there was a segmentation fault (P1507485454) when
running the SSD-TBE unit tests. It was caused by premature tensor
deallocation when the unit test invoked `set_cuda`. Since `set_cuda`
is asynchronous (non-blocking), the unit test must keep the input
tensors alive until `set_cuda` completes. However, the unit test
allocated an input tensor as a loop-local variable inside a for-loop,
so the tensor was deallocated as soon as each iteration finished,
causing the segmentation fault.

This diff fixes the problem by keeping the input tensor alive until
`set_cuda` completes: the scope of the tensor is moved outside of the
for-loop and a proper synchronization is added.

Reviewed By: duduyi2013

Differential Revision: D60627636

fbshipit-source-id: a2016b9b23a154513bf851c07f6bdce4e7da70a6