
Ifu 2023 03 10 #37


Merged
merged 28 commits into from
Mar 10, 2023

Conversation

liligwu
Collaborator

@liligwu liligwu commented Mar 10, 2023

test_log.txt
`use_cpu` in `jagged_index_select_2d_ref` is disabled (pytorch#1636). May wait for pytorch#1635.

sryap and others added 28 commits February 21, 2023 09:48
Summary:
Pull Request resolved: pytorch#1586

Changes are as follows:

- Update variable names of jagged_index_add_2d for GPU
- Use a single autograd function to dispatch the op on CPU and GPU
- Add an implementation on CPU using `at::parallel_for` to parallelize
  work on multiple threads.  For jagged_index_add_2d (backward), locks
  are used instead of atomic add (one lock per row) to manage add
  conflict between threads.
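A minimal Python sketch of the per-row-lock idea described above (the actual implementation is C++ using `at::parallel_for`; the function name and two-thread split here are hypothetical stand-ins):

```python
import threading

def jagged_index_add_2d_cpu(grad, indices, out_rows, num_cols):
    # Hypothetical sketch: accumulate grad[i] into out[indices[i]].
    # Each output row is guarded by its own lock, so two threads adding
    # into the same row never race, while adds into different rows
    # proceed in parallel (the per-row-lock alternative to atomic add).
    out = [[0.0] * num_cols for _ in range(out_rows)]
    locks = [threading.Lock() for _ in range(out_rows)]

    def worker(start, end):
        for i in range(start, end):
            row = indices[i]
            with locks[row]:  # one lock per output row, not one global lock
                for c in range(num_cols):
                    out[row][c] += grad[i][c]

    mid = len(indices) // 2
    threads = [threading.Thread(target=worker, args=(0, mid)),
               threading.Thread(target=worker, args=(mid, len(indices)))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```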

Reviewed By: brad-mengchi

Differential Revision: D43111264

fbshipit-source-id: e319a4cd3376dd3e4dd8eaf8d6ffef9f7a390961
…eighted in C++ op (pytorch#1599)

Summary:
Pull Request resolved: pytorch#1599

tbe_input_combine returns at::empty for per_sample_weights in the unweighted case, but if this output is passed directly to ops like `int_nbit_split_embedding_codegen_lookup_function_cpu`, the bool check is fooled: the optional does hold a tensor, it is just that the tensor is empty. This ultimately ends in a SEGV as the weighted branch tries to read the empty per_sample_weights tensor.
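A hedged sketch of the guard this fix implies (lists stand in for tensors here; the function name is hypothetical, and real code would test `tensor.numel() > 0`):

```python
def resolve_per_sample_weights(per_sample_weights):
    # Hypothetical sketch: an optional tensor can be *present* yet
    # *empty* (at::empty with zero elements).  A presence-only check
    # wrongly selects the weighted branch, which then reads past the
    # empty buffer.  Check emptiness as well before treating the
    # lookup as weighted.
    if per_sample_weights is not None and len(per_sample_weights) > 0:
        return per_sample_weights  # genuinely weighted lookup
    return None  # fall back to the unweighted branch
```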

Reviewed By: RoshanPAN

Differential Revision: D43373836

fbshipit-source-id: 5251e4e20d9096454fade2d56c0e5d2dc8d51f11
Summary:
Pull Request resolved: pytorch#1603

There are some error messages on CUB namespace when building fbgemm with CUDA 12:

```
/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(103): error: namespace "fbgemm_gpu::cub" has no member "min"

/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(104): error: namespace "fbgemm_gpu::cub" has no member "max"

/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_onesweep.cuh(76): error: RegBoundScaling is not a template

/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_upsweep.cuh(60): error: RegBoundScaling is not a template

/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(325): error: namespace "fbgemm_gpu::cub" has no member "Debug"

/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(354): error: namespace "fbgemm_gpu::cub" has no member "Debug"
```

This diff fixes the issue.

Reviewed By: q10

Differential Revision: D43423185

fbshipit-source-id: 9545835a9847aabeec4bc2a8dda8511771910fb6
Summary:
- Upgrade the FBGEMM_GPU release workflows to use the same build scripts framework for running the builds
- Update the build scripts framework to properly support running PyTests
- Disable certain tests from running in the **CPU-only build** as they are known to be failing at the moment (`jagged_tensor_ops_test`, `uvm_test`)
- Update both the nightly and release jobs to actually run tests prior to publishing to PyPI
- Fix bug with PyPI publishing for the FBGEMM_GPU-CPU nightly job
- Update the nightly build cron schedules to follow PST
- Remove some jobs from FBGEMM_GPU CI as they are now redundant with the FBGEMM_GPU nightly / release builds
- Update the FBGEMM_GPU-CPU CI job (Ubuntu) to use the build scripts framework for building and running tests

Pull Request resolved: pytorch#1598

Reviewed By: brad-mengchi

Differential Revision: D43448188

Pulled By: q10

fbshipit-source-id: 81ec5c0ee2d9c7d1246649a31f669b0ed585d23d
Summary: Pull Request resolved: pytorch#1605

Reviewed By: nanoax

Differential Revision: D43487913

fbshipit-source-id: 357bcfe927c148071037f9df91e46a65e533f4f4
Summary:
Pull Request resolved: pytorch#1587

As titled

Reviewed By: jianyuh

Differential Revision: D43132884

fbshipit-source-id: ab3b00b4f0e2cd79134e3e89822a73456a2fd97b
Summary:
Pull Request resolved: pytorch#1593

Move `binary_search_range` to header to allow kernels outside of
`jagged_tensor_ops.cu` to use it.
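The exact semantics of the shared helper are not shown in this PR; a plausible sketch, assuming it is a lower-bound search over a sorted range (commonly used to map a flat element index to its jagged row via offsets):

```python
def binary_search_range(arr, target, lo, hi):
    # Hypothetical sketch of a binary search over the sorted slice
    # arr[lo:hi]: returns the index of the first element >= target,
    # or hi if every element is smaller.  Moving such a helper into a
    # header lets kernels outside jagged_tensor_ops.cu reuse it.
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```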

Reviewed By: jianyuh

Differential Revision: D43216577

fbshipit-source-id: 96a681ea0e0f3db929dc6e6d25d971107015aba1
Summary:
Fix torchrec test modules/tests/test_fused_embedding_modules.py::FusedEmbeddingBagCollectionTest::test_optimizer_fusion
The credit for this PR belongs to amathews-amd.

Pull Request resolved: pytorch#1591

Reviewed By: shintaro-iwasaki

Differential Revision: D43448076

Pulled By: q10

fbshipit-source-id: 64cc97a90962a6ca153a858f96d1bd14eb500251
…forward_unweighted_small_kernel (pytorch#1610)

Summary:
Pull Request resolved: pytorch#1610

`[split|dense]_embedding_nobag_codegen_forward_unweighted_small_kernel`
always accesses the LXU cache even when the cache is not available.
This diff moves cache accesses under a runtime conditional to ensure
that the access is valid.

Reviewed By: jspark1105, mjanderson09

Differential Revision: D43581917

fbshipit-source-id: ab302092ffeb27b9a7fd18205c3856abc1dd8713
…#1592)

Summary:
Pull Request resolved: pytorch#1592

Before this diff, `group_index_select` requires all input tensors to
have the same shape.  This diff allows input tensors to have different
shapes.  However, it still requires the input tensors to have the same
number of dimensions and for the first dimensions to match.
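The relaxed constraint can be sketched as a shape validator (hypothetical helper name; real validation happens inside the C++ op):

```python
def check_group_index_select_inputs(shapes):
    # Hypothetical sketch of the constraint described above: input
    # tensors may now differ in their trailing dimensions, but must
    # still share the same rank and the same first (indexed) dimension.
    ndim = len(shapes[0])
    first = shapes[0][0]
    for s in shapes:
        if len(s) != ndim:
            raise ValueError("inputs must have the same number of dimensions")
        if s[0] != first:
            raise ValueError("inputs must have matching first dimensions")
    return True
```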

Reviewed By: jianyuh

Differential Revision: D43216425

fbshipit-source-id: fc8c74a472616c0e46785eb71482fa9012155f7e
Summary:
Pull Request resolved: pytorch#1581

Load inplace update dependencies after BC issue is resolved.

Reviewed By: brad-mengchi

Differential Revision: D43026352

fbshipit-source-id: b365535ca5896e88eaac95bab72afe6a7fb34733
Summary:
- Add support for retries in build steps that are known to fail due to the occasional network connection failures
- Add support for installing ROCm tooling and testing ROCm builds in the build scripts framework
- Update the existing FBGEMM_GPU CI / build_amd_gpu job to use the build scripts framework
- Fix the annotations to tests in `jagged_tensor_ops_test.py` to run correctly on CPU-only mode
- Impose timeouts of 10 minutes for running the test suites (in practice, they generally complete within 3 minutes)
- Add ability to conditionally disable tests depending on whether or not they are running inside the GitHub runner
- Disable the `test_jagged_index_select_2d` test on GitHub until we figure out the root cause of it hanging whenever it is run on GitHub (regardless of GPU or CPU variant)

Pull Request resolved: pytorch#1606

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43601032

Pulled By: q10

fbshipit-source-id: 55203151f54bf010859b0e7e1568572e064fa7cf
…oming from different device (pytorch#1615)

Summary: Pull Request resolved: pytorch#1615

Reviewed By: jianyuh, houseroad

Differential Revision: D43564925

fbshipit-source-id: 9d8db49df76889e56f70ebb3fb4984c292186edc
Summary:
Pull Request resolved: https://fburl.com/4d2jifwi

Previously, we wanted to disable the cache stats report when the gflag `FLAGS_tbe_uvm_cache_stat_report` is non-positive, but we could still observe cache stats (cache misses when the UVM cache is enabled) when `FLAGS_tbe_uvm_cache_stat_report` is negative. This diff fixes that error so the cache stats report can actually be disabled.

Reviewed By: doehyun

Differential Revision: D43630535

fbshipit-source-id: e267f881166721f804f04e4cdd3a35c6491452e7
…seDispatchMode (pytorch#1617)

Summary:
Pull Request resolved: pytorch#1617

When we use a meta device, the tensor is not treated as a CPU tensor. But the use case is that we want to process the weights on CPU first.

So we set `use_cpu` to true when the device is meta in `IntNBitTableBatchedEmbeddingBagsCodegen`.

Reviewed By: jianyuh, nanoax

Differential Revision: D43633336

fbshipit-source-id: bfb5f52c17f6ecf8bedb4f47a85aabcf35b87416
Summary: Pull Request resolved: pytorch#1619

Reviewed By: shintaro-iwasaki

Differential Revision: D43678509

Pulled By: q10

fbshipit-source-id: 1f4d868038890d677fdde6fbc29594e0c2019ab8
…rch#1613)

Summary:
Pull Request resolved: pytorch#1613

The TBE inference kernel
(`*_split_embedding_codegen_forward*_kernel_small_L`) may illegally access
memory if the number of indices is zero.  This update ensures that the
TBE inference operator returns a zero-tensor when the number of
indices is zero, preventing the aforementioned problem.
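A minimal sketch of the early-return guard described above (lists stand in for tensors; the function name and shape parameters are hypothetical, and the real kernel path is elided):

```python
def tbe_inference_forward(indices, embedding_dim, output_rows):
    # Hypothetical sketch: with zero indices, skip the kernel entirely
    # and return a zero tensor of the expected output shape, instead of
    # launching a kernel that may read out of bounds on an empty
    # index list.
    if len(indices) == 0:
        return [[0.0] * embedding_dim for _ in range(output_rows)]
    raise NotImplementedError("kernel lookup path elided in this sketch")
```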

Reviewed By: jianyuh

Differential Revision: D43588092

fbshipit-source-id: 95eb6b96c7b9cac8518072c03c6d382edfc47436
…agsCodegen (pytorch#1622)

Summary:
Pull Request resolved: pytorch#1622

Need to init the weight during model loading with XL V2, so export it.

Reviewed By: qxy11, jianyuh

Differential Revision: D43709984

fbshipit-source-id: 1730ea6f108303fee0811eb9de3fa5bef2fa8267
…ytorch#1621)

Summary:
Pull Request resolved: pytorch#1621

- Re-enable building of `merge_pooled_embeddings` operators in OSS
- Add `nm` checks for a few operators after OSS compilation to ensure that they are actually compiled into the shared library
- Downgrade the GCC version used for OSS builds to 9.3, since later versions of GCC will build libraries that have a dependency on `GLIBCXX_3.4.29`, which is not available on systems with older versions of `libstdc++.so.6`, such as CentOS Stream 8 and Ubuntu 20.04
- Print SHA of the built wheels after build and before installation to verify that the packages are correctly uploaded to and downloaded from GHA

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43764547

Pulled By: q10

fbshipit-source-id: 905d51feebe8d73c08689abc605f1160cc4b4600
…ops_utils.h (pytorch#1624)

Summary: Pull Request resolved: pytorch#1624

Reviewed By: dracifer

Differential Revision: D43781630

fbshipit-source-id: 1544ac95a50d8410fb614ddd3901758101a95202
Summary: Pull Request resolved: pytorch#1626

Reviewed By: sryap

Differential Revision: D43787029

fbshipit-source-id: 87e07acf39010d489366d3e4ea10b9a33dec1fd5
Summary:
Pull Request resolved: pytorch#1628

Previously the scale was calculated as max_int8 / (eps + max_val), with eps = 1e-8. This is acceptable in the forward pass because forward tensors need to be larger than eps to make a difference. With fp8, the gradients are much smaller; eps = 1e-8 would keep the tensor from scaling up to cover the entire fp8 range, so we change eps to 1e-20.
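A small numeric illustration of why eps matters here (the formula follows the description above; 127 as max_int8 is an assumption for illustration):

```python
def quant_scale(max_val, eps):
    # scale = max_int8 / (eps + max_val), as described above.
    # When max_val is tiny (fp8-sized gradients, e.g. ~1e-12),
    # eps = 1e-8 dominates the denominator and caps the scale near
    # max_int8 / eps, so the values never scale up to fill the
    # quantized range; eps = 1e-20 lets the scale track 1 / max_val.
    max_int8 = 127.0
    return max_int8 / (eps + max_val)
```

For max_val = 1e-12, eps = 1e-8 yields a scale around 1.27e10, whereas eps = 1e-20 yields around 1.27e14: four orders of magnitude more headroom for tiny gradients.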

Reviewed By: brad-mengchi

Differential Revision: D43665395

fbshipit-source-id: 0214e22af7739ef5cf7561b15e02830c35320402
Summary:
- Add support for CUDA 11.8 in the OSS builds
- Annotate the package wheels with Python 3.9 and 3.10 support tags
- Update `setup.py` to auto-derive the package version from the git information (namely tags) to allow us for fast tag-and-release
- Check for the actual presence of NVIDIA drivers in OSS builds, and error out with friendly message instead of cryptic `RuntimeError: No such operator fbgemm::jagged_2d_to_dense` errors when `fbgemm_gpu` is installed and loaded on a system with a GPU but without GPU drivers installed

Pull Request resolved: pytorch#1627

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43868995

Pulled By: q10

fbshipit-source-id: 41da34bd5a82032d20daa4972cc9c848a6ed09e1
Summary:
Pull Request resolved: pytorch#1623

The previous cache stats report (e.g., cache miss rate, unique rate) did not take the number of requested indices of each TBE op into consideration, which could make the reported cache miss rate inaccurate relative to the real situation, especially when request volumes across TBEs are unbalanced. This diff updates the cache stats report to use weighted caching ratios, accumulating the cache data of all TBE ops within the same round of lookup calls.
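The weighted aggregation can be sketched as follows (hypothetical helper; `stats` is a list of per-TBE `(requested, missed)` counts from one lookup round):

```python
def weighted_cache_miss_rate(stats):
    # Hypothetical sketch of the weighted report described above:
    # rather than averaging each TBE op's miss rate with equal weight,
    # accumulate raw (requested, missed) counts across all TBE ops in
    # the round, so heavily-requested tables dominate the overall rate.
    total_requested = sum(requested for requested, _ in stats)
    total_missed = sum(missed for _, missed in stats)
    return total_missed / total_requested if total_requested else 0.0
```

For example, with one TBE at 1000 requests / 100 misses and another at 10 requests / 9 misses, a naive mean of per-op rates reports 50%, while the weighted rate is about 10.8%, reflecting the actual traffic.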

Reviewed By: doehyun

Differential Revision: D43729139

fbshipit-source-id: 491b8613e70cf17c3d0f89dbd5c808ee9fe33587
Summary:
- Remove local version identifier from build versioning
- Use CPU-only instances for CPU-only tests

Pull Request resolved: pytorch#1631

Reviewed By: shintaro-iwasaki

Differential Revision: D43915104

Pulled By: q10

fbshipit-source-id: 223845c1c38d911f650d5db926aa7f2f59561b85
@liligwu liligwu requested a review from amathews-amd March 10, 2023 16:41
@amathews-amd amathews-amd left a comment

LGTM

@amathews-amd amathews-amd merged commit 91f17f9 into main Mar 10, 2023
10 participants