Summary:
Pull Request resolved: pytorch#1602
Illegal memory access is a common problem during GPU kernel execution.
FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error. However,
there are a few known issues with this approach.
(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched. This can cause the stack trace
to be inaccurate and make debugging more difficult. Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.
(2) Not all illegal memory accesses are caught by the runtime. This
means that the system may not always throw an error when an illegal
memory access occurs.
(3) Even when the runtime throws an error for an illegal memory access,
it is difficult to pinpoint the specific kernel and memory
buffer/address that caused the problem.
For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU. We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking at compile time. We copy PyTorch's
`TensorAccessor.h` into FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`. If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`. The error
message includes the name of the tensor and the kernel that caused the
problem.
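The checking idea can be illustrated with a minimal host-side sketch. This is
not the FBGEMM GPU implementation: `CheckedAccessor1D` and its constructor
parameters are hypothetical names, the accessor is one-dimensional only, and a
C++ exception stands in for `CUDA_KERNEL_ASSERT`, which is what the device code
described above would use instead.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical, host-only stand-in for a bounds-checked 1-D accessor.
// It records the tensor and function names so that a failure message
// can say which buffer was misused and where.
template <typename T>
class CheckedAccessor1D {
 public:
  CheckedAccessor1D(T* data, int64_t size,
                    const char* tensor_name, const char* func_name)
      : data_(data), size_(size),
        tensor_name_(tensor_name), func_name_(func_name) {
    assert(size >= 0);  // a negative extent is a programming error
  }

  // Every element access is validated before the pointer is dereferenced.
  T& operator[](int64_t idx) const {
    if (idx < 0 || idx >= size_) {
      std::ostringstream msg;
      msg << "index " << idx << " is out of bounds for tensor '"
          << tensor_name_ << "' of size " << size_
          << " in " << func_name_;
      // Assumption: device code would trap via an assert here instead.
      throw std::out_of_range(msg.str());
    }
    return data_[idx];
  }

  int64_t size() const { return size_; }

 private:
  T* data_;
  int64_t size_;
  const char* tensor_name_;
  const char* func_name_;
};
```

Constructing `CheckedAccessor1D<int> acc(buf, 3, "weights", "demo_kernel")`
and reading `acc[3]` fails at the access itself, with a message that names
both `weights` and `demo_kernel`.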
If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`. Otherwise, they will use
`at::PackedTensorAccessor`.
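One way such a compile-time switch could look is sketched below. It assumes
the flag is surfaced as a preprocessor macro; the `demo` namespace, both
accessor structs, the `Accessor` alias, and `sum` are hypothetical stand-ins
rather than the real `fbgemm::` or `at::` types.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

namespace demo {

// Unchecked view: indexing goes straight to the underlying pointer.
template <typename T>
struct PlainAccessor {
  T* data;
  int64_t size;
  T& operator[](int64_t i) const { return data[i]; }
};

// Checked view: same interface, but every index is validated first.
template <typename T>
struct CheckedAccessor {
  T* data;
  int64_t size;
  T& operator[](int64_t i) const {
    if (i < 0 || i >= size) {
      throw std::out_of_range("index out of bounds");
    }
    return data[i];
  }
};

// Assumption: the build defines FBGEMM_GPU_MEMCHECK as a macro.
#ifdef FBGEMM_GPU_MEMCHECK
template <typename T>
using Accessor = CheckedAccessor<T>;
constexpr bool kMemcheckEnabled = true;
#else
template <typename T>
using Accessor = PlainAccessor<T>;
constexpr bool kMemcheckEnabled = false;
#endif

// Call sites are written once against the alias, so they do not change
// when the flag is toggled.
inline int64_t sum(const Accessor<const int64_t>& a) {
  assert(a.size >= 0);
  int64_t total = 0;
  for (int64_t i = 0; i < a.size; ++i) {
    total += a[i];
  }
  return total;
}

}  // namespace demo
```

Because both accessor types expose the same members, building with or without
the macro changes only whether each access is validated, not the code that
performs it.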
`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs.
Differential Revision: D43421838
fbshipit-source-id: 3281ba97044e1f5fefcff3dc3b013fb19c265d38