forked from pytorch/FBGEMM
Ifu 2023 03 10 #37
Merged
Conversation
Summary: Pull Request resolved: pytorch#1586

Changes are as follows:
- Update variable names of `jagged_index_add_2d` for GPU
- Use a single autograd function to dispatch the op on CPU and GPU
- Add a CPU implementation using `at::parallel_for` to parallelize work across multiple threads. For `jagged_index_add_2d` (backward), locks are used instead of atomic adds (one lock per row) to manage add conflicts between threads.

Reviewed By: brad-mengchi
Differential Revision: D43111264
fbshipit-source-id: e319a4cd3376dd3e4dd8eaf8d6ffef9f7a390961
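The per-row locking scheme mentioned in this commit can be illustrated with a small Python sketch (hypothetical names; the real op is C++/CUDA inside FBGEMM): each output row gets its own lock, so threads adding slices into the same row serialize on that row only, rather than relying on atomic adds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def index_add_2d(num_rows, dim, indices, values, num_threads=4):
    """Toy jagged_index_add_2d backward: values[i] is added into row
    indices[i] of a (num_rows x dim) output; rows may repeat."""
    output = [[0.0] * dim for _ in range(num_rows)]
    row_locks = [threading.Lock() for _ in range(num_rows)]  # one lock per row

    def worker(chunk):
        for i in chunk:
            row = indices[i]
            with row_locks[row]:  # serialize only writers of the same row
                for j in range(dim):
                    output[row][j] += values[i][j]

    chunks = [range(t, len(indices), num_threads) for t in range(num_threads)]
    with ThreadPoolExecutor(num_threads) as ex:
        list(ex.map(worker, chunks))
    return output
```

Two threads hitting different rows never contend; two threads hitting the same row take turns, which is the add conflict the commit manages with one lock per row.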
…eighted in C++ op (pytorch#1599)

Summary: Pull Request resolved: pytorch#1599

`tbe_input_combine` returns `at::empty` for `per_sample_weights` in the unweighted case, but if this output is passed directly to ops like `int_nbit_split_embedding_codegen_lookup_function_cpu`, the bool check on the optional wrongly succeeds: the optional does hold a tensor, it is just that the tensor is empty. This ultimately ends in a SEGV as the weighted branch tries to read the empty `per_sample_weights` tensor.

Reviewed By: RoshanPAN
Differential Revision: D43373836
fbshipit-source-id: 5251e4e20d9096454fade2d56c0e5d2dc8d51f11
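The failure mode here is the classic "present but empty" optional. A minimal Python sketch of the check (hypothetical helper names; the real check is on a `c10::optional<Tensor>` in C++):

```python
def is_weighted_buggy(per_sample_weights):
    # Mirrors a bare truthiness check on the optional: at::empty(...)
    # is still a tensor, so the unweighted case looks weighted.
    return per_sample_weights is not None

def is_weighted_fixed(per_sample_weights):
    # Additionally require the tensor to be non-empty (numel() > 0 in C++).
    return per_sample_weights is not None and len(per_sample_weights) > 0

empty_weights = []  # stand-in for the at::empty output in the unweighted case
```

With `empty_weights`, the buggy check sends execution down the weighted branch, which then reads a zero-element tensor.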
Summary: Pull Request resolved: pytorch#1603

There are some error messages about the CUB namespace when building fbgemm with CUDA 12:

```
/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(103): error: namespace "fbgemm_gpu::cub" has no member "min"
/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(104): error: namespace "fbgemm_gpu::cub" has no member "max"
/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_onesweep.cuh(76): error: RegBoundScaling is not a template
/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_upsweep.cuh(60): error: RegBoundScaling is not a template
/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(325): error: namespace "fbgemm_gpu::cub" has no member "Debug"
/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(354): error: namespace "fbgemm_gpu::cub" has no member "Debug"
```

This diff fixes the issue.

Reviewed By: q10
Differential Revision: D43423185
fbshipit-source-id: 9545835a9847aabeec4bc2a8dda8511771910fb6
Summary:
- Upgrade the FBGEMM_GPU release workflows to use the same build scripts framework for running the builds
- Update the build scripts framework to properly support running PyTests
- Disable certain tests from running in the **CPU-only build** as they are known to be failing at the moment (`jagged_tensor_ops_test`, `uvm_test`)
- Update both the nightly and release jobs to actually run tests prior to publishing to PyPI
- Fix a bug with PyPI publishing for the FBGEMM_GPU-CPU nightly job
- Update the nightly build cron schedules to follow PST
- Remove some jobs from FBGEMM_GPU CI as they are now redundant with the FBGEMM_GPU nightly / release builds
- Update the FBGEMM_GPU-CPU CI job (Ubuntu) to use the build scripts framework for building and running tests

Pull Request resolved: pytorch#1598
Reviewed By: brad-mengchi
Differential Revision: D43448188
Pulled By: q10
fbshipit-source-id: 81ec5c0ee2d9c7d1246649a31f669b0ed585d23d
Summary: Pull Request resolved: pytorch#1605 Reviewed By: nanoax Differential Revision: D43487913 fbshipit-source-id: 357bcfe927c148071037f9df91e46a65e533f4f4
Summary: Pull Request resolved: pytorch#1587 As titled Reviewed By: jianyuh Differential Revision: D43132884 fbshipit-source-id: ab3b00b4f0e2cd79134e3e89822a73456a2fd97b
Summary: Pull Request resolved: pytorch#1593 Move `binary_search_range` to header to allow kernels outside of `jagged_tensor_ops.cu` to use it. Reviewed By: jianyuh Differential Revision: D43216577 fbshipit-source-id: 96a681ea0e0f3db929dc6e6d25d971107015aba1
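A plausible shape for such a helper, sketched in Python (the actual `binary_search_range` is a CUDA device function, and its exact semantics are an assumption here): a lower-bound search over a sorted offsets array, e.g. to find which jagged row a flat index belongs to.

```python
def binary_search_range(arr, target):
    """Return the index of the first element of sorted `arr` that is
    >= target (lower-bound semantics), or len(arr) if every element
    is smaller than target."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < target:
            lo = mid + 1  # everything up to mid is too small
        else:
            hi = mid      # arr[mid] is a candidate; keep searching left
    return lo
```

Hoisting a helper like this into a header is what lets kernels in other translation units reuse it without duplicating the search logic.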
Summary: Fix torchrec test modules/tests/test_fused_embedding_modules.py::FusedEmbeddingBagCollectionTest::test_optimizer_fusion The credit for this PR belongs to amathews-amd. Pull Request resolved: pytorch#1591 Reviewed By: shintaro-iwasaki Differential Revision: D43448076 Pulled By: q10 fbshipit-source-id: 64cc97a90962a6ca153a858f96d1bd14eb500251
…forward_unweighted_small_kernel (pytorch#1610) Summary: Pull Request resolved: pytorch#1610 `[split|dense]_embedding_nobag_codegen_forward_unweighted_small_kernel` always accesses the LXU cache even when the cache is not available. This diff moves cache accesses under a runtime conditional to ensure that the access is valid. Reviewed By: jspark1105, mjanderson09 Differential Revision: D43581917 fbshipit-source-id: ab302092ffeb27b9a7fd18205c3856abc1dd8713
…#1592) Summary: Pull Request resolved: pytorch#1592 Before this diff, `group_index_select` requires all input tensors to have the same shape. This diff allows input tensors to have different shapes. However, it still requires the input tensors to have the same number of dimensions and for the first dimensions to match. Reviewed By: jianyuh Differential Revision: D43216425 fbshipit-source-id: fc8c74a472616c0e46785eb71482fa9012155f7e
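The relaxed contract can be sketched in Python with nested lists standing in for tensors (hypothetical helper; the real op works on `at::Tensor`s): members of a group may differ in trailing shape, but every requested index must be valid for each member's first dimension.

```python
def group_index_select(inputs, indices):
    """Select rows `indices` from each tensor in the group.
    Trailing shapes may differ across inputs; the first dimension of
    every input must cover all requested indices."""
    for t in inputs:
        for i in indices:
            if not 0 <= i < len(t):
                raise IndexError("index out of range for a group member")
    return [[t[i] for i in indices] for t in inputs]
```

Grouping the selects lets one launch serve many tensors, which is the point of the op; the shape check above is the per-member precondition the commit describes.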
Summary: Pull Request resolved: pytorch#1581 Load inplace update dependencies after BC issue is resolved. Reviewed By: brad-mengchi Differential Revision: D43026352 fbshipit-source-id: b365535ca5896e88eaac95bab72afe6a7fb34733
Summary:
- Add support for retries in build steps that are known to fail due to occasional network connection failures
- Add support for installing ROCm tooling and testing ROCm builds in the build scripts framework
- Update the existing FBGEMM_GPU CI / build_amd_gpu job to use the build scripts framework
- Fix the annotations on tests in `jagged_tensor_ops_test.py` to run correctly in CPU-only mode
- Impose timeouts of 10 minutes for running the test suites (in practice, they generally complete within 3 minutes)
- Add the ability to conditionally disable tests depending on whether or not they are running inside the GitHub runner
- Disable the `test_jagged_index_select_2d` test on GitHub until we figure out the root cause of it hanging whenever it is run on GitHub (regardless of GPU or CPU variant)

Pull Request resolved: pytorch#1606
Reviewed By: brad-mengchi, shintaro-iwasaki
Differential Revision: D43601032
Pulled By: q10
fbshipit-source-id: 55203151f54bf010859b0e7e1568572e064fa7cf
…oming from different device (pytorch#1615) Summary: Pull Request resolved: pytorch#1615 Reviewed By: jianyuh, houseroad Differential Revision: D43564925 fbshipit-source-id: 9d8db49df76889e56f70ebb3fb4984c292186edc
Summary: Pull Request resolved: https://fburl.com/4d2jifwi Previously, we wanted to disable the cache stats report when the gflag `FLAGS_tbe_uvm_cache_stat_report` is non-positive, but we could still observe cache stats (cache misses when the UVM cache is enabled) when `FLAGS_tbe_uvm_cache_stat_report` is negative. This diff fixes the check so that the cache stats report can actually be disabled. Reviewed By: doehyun Differential Revision: D43630535 fbshipit-source-id: e267f881166721f804f04e4cdd3a35c6491452e7
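The guard change amounts to treating every non-positive flag value as "disabled", not just zero. A hypothetical sketch:

```python
def should_report_cache_stats(report_interval, iteration):
    """Hypothetical gate on the UVM cache stats report.
    Any non-positive interval disables reporting; a check like
    `report_interval == 0` would let negative values slip through
    and keep emitting stats."""
    if report_interval <= 0:
        return False
    return iteration % report_interval == 0
```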
…seDispatchMode (pytorch#1617) Summary: Pull Request resolved: pytorch#1617 When we use the meta device, the tensor is not treated as a CPU tensor. But the use case is that we want to process the weights on CPU first, so we set `use_cpu` to true when the device is meta in `IntNBitTableBatchedEmbeddingBagsCodegen`. Reviewed By: jianyuh, nanoax Differential Revision: D43633336 fbshipit-source-id: bfb5f52c17f6ecf8bedb4f47a85aabcf35b87416
Summary: Pull Request resolved: pytorch#1619 Reviewed By: shintaro-iwasaki Differential Revision: D43678509 Pulled By: q10 fbshipit-source-id: 1f4d868038890d677fdde6fbc29594e0c2019ab8
…rch#1613) Summary: Pull Request resolved: pytorch#1613 The TBE inference kernel (`*_split_embedding_codegen_forward*_kernel_small_L`) may illegally access memory if the number of indices is zero. This update ensures that the TBE inference operator returns a zero-tensor when the number of indices is zero, preventing the aforementioned problem. Reviewed By: jianyuh Differential Revision: D43588092 fbshipit-source-id: 95eb6b96c7b9cac8518072c03c6d382edfc47436
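The early-return pattern this commit adds can be sketched with a toy pooled-embedding forward in Python (names and layout are illustrative, not FBGEMM's actual kernel interface):

```python
def tbe_forward(weights, indices, offsets, dim):
    """Toy pooled-embedding forward: sum-pool rows of `weights` per bag.
    `offsets` marks bag boundaries into `indices` (CSR-style)."""
    if len(indices) == 0:
        # Nothing to gather: return an all-zero output instead of
        # launching a kernel that could read out of bounds.
        num_bags = max(len(offsets) - 1, 0)
        return [[0.0] * dim for _ in range(num_bags)]
    out = []
    for b in range(len(offsets) - 1):
        row = [0.0] * dim
        for i in indices[offsets[b]:offsets[b + 1]]:
            for j in range(dim):
                row[j] += weights[i][j]
        out.append(row)
    return out
```

The guard preserves the output shape contract (one row per bag) while skipping the kernel launch entirely when there is no work.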
…agsCodegen (pytorch#1622) Summary: Pull Request resolved: pytorch#1622 Need to init the weight during model loading with XL V2, so export it. Reviewed By: qxy11, jianyuh Differential Revision: D43709984 fbshipit-source-id: 1730ea6f108303fee0811eb9de3fa5bef2fa8267
…ytorch#1621)

Summary: Pull Request resolved: pytorch#1621
- Re-enable building of `merge_pooled_embeddings` operators in OSS
- Add `nm` checks for a few operators after OSS compilation to ensure that they are actually compiled into the shared library
- Downgrade the GCC version used for OSS builds to 9.3, since later versions of GCC will build libraries that have a dependency on `GLIBCXX_3.4.29`, which is not available on systems with older versions of `libstdc++.so.6`, such as CentOS Stream 8 and Ubuntu 20.04
- Print the SHA of the built wheels after build and before installation to verify that the packages are correctly uploaded to and downloaded from GHA

Reviewed By: brad-mengchi, shintaro-iwasaki
Differential Revision: D43764547
Pulled By: q10
fbshipit-source-id: 905d51feebe8d73c08689abc605f1160cc4b4600
…ops_utils.h (pytorch#1624) Summary: Pull Request resolved: pytorch#1624 Reviewed By: dracifer Differential Revision: D43781630 fbshipit-source-id: 1544ac95a50d8410fb614ddd3901758101a95202
Summary: Pull Request resolved: pytorch#1626 Reviewed By: sryap Differential Revision: D43787029 fbshipit-source-id: 87e07acf39010d489366d3e4ea10b9a33dec1fd5
Summary: Pull Request resolved: pytorch#1628 Previously the scale was calculated as max_int8 / (eps + max_val), with eps = 1e-8. This is acceptable in the forward pass because the activation tensors are large enough that eps makes no difference; with fp8, however, the gradients are much smaller, and eps = 1e-8 would prevent the tensor from scaling up to cover the entire fp8 range, so we change it to 1e-20. Reviewed By: brad-mengchi Differential Revision: D43665395 fbshipit-source-id: 0214e22af7739ef5cf7561b15e02830c35320402
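The effect of the epsilon is easy to see numerically. A sketch, using 448 (the finite maximum of the e4m3 fp8 format) purely for illustration:

```python
MAX_FP8 = 448.0  # max finite value of fp8 e4m3, an assumption for illustration

def quant_scale(max_val, eps):
    # scale chosen so that max_val maps near the top of the fp8 range
    return MAX_FP8 / (eps + max_val)

# Forward activations are large enough that eps is negligible, but fp8
# gradients can have max_val around 1e-12. With eps = 1e-8 the epsilon
# dominates the denominator, so the scaled tensor only reaches ~0.045
# and wastes almost the whole fp8 range; with eps = 1e-20 it reaches ~448.
grad_max = 1e-12
reach_old = grad_max * quant_scale(grad_max, 1e-8)
reach_new = grad_max * quant_scale(grad_max, 1e-20)
```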
Summary:
- Add support for CUDA 11.8 in the OSS builds
- Annotate the package wheels with Python 3.9 and 3.10 support tags
- Update `setup.py` to auto-derive the package version from the git information (namely tags) to allow for fast tag-and-release
- Check for the actual presence of NVIDIA drivers in OSS builds, and error out with a friendly message instead of cryptic `RuntimeError: No such operator fbgemm::jagged_2d_to_dense` errors when `fbgemm_gpu` is installed and loaded on a system with a GPU but without GPU drivers installed

Pull Request resolved: pytorch#1627
Reviewed By: brad-mengchi, shintaro-iwasaki
Differential Revision: D43868995
Pulled By: q10
fbshipit-source-id: 41da34bd5a82032d20daa4972cc9c848a6ed09e1
Summary: Pull Request resolved: pytorch#1623 The previous cache stats report (e.g. cache miss rate, unique rate, etc.) didn't take the number of indices requested by each TBE op into consideration, which could easily make the reported cache miss rate inaccurate, especially when requests are unbalanced across TBEs. We now compute the cache stats using request-weighted ratios, accumulating the cache data from all TBEs within the same round of lookup calls. Reviewed By: doehyun Differential Revision: D43729139 fbshipit-source-id: 491b8613e70cf17c3d0f89dbd5c808ee9fe33587
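The difference between a naive and a request-weighted miss rate can be sketched as follows (hypothetical helper; the real stats live inside the TBE runtime):

```python
def weighted_miss_rate(tbe_stats):
    """tbe_stats: (num_requested_indices, num_misses) per TBE op in one
    round of lookups. Accumulate counts first, then divide, so each
    table contributes in proportion to its request volume."""
    total_requested = sum(r for r, _ in tbe_stats)
    total_misses = sum(m for _, m in tbe_stats)
    return total_misses / total_requested if total_requested else 0.0

# A large table at 10% misses plus a tiny table at 100% misses:
# averaging the per-table rates would report 55%, but only ~10.9% of
# all requested indices actually missed.
stats = [(1000, 100), (10, 10)]
```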
Summary:
- Remove the local version identifier from build versioning
- Use CPU-only instances for CPU-only tests

Pull Request resolved: pytorch#1631
Reviewed By: shintaro-iwasaki
Differential Revision: D43915104
Pulled By: q10
fbshipit-source-id: 223845c1c38d911f650d5db926aa7f2f59561b85
amathews-amd approved these changes on Mar 10, 2023
LGTM
test_log.txt
`use_cpu` in `jagged_index_select_2d_ref` is disabled (pytorch#1636). May wait for pytorch#1635.