
Ifu 2023 03 10 #37


Merged
merged 28 commits into from
Mar 10, 2023

Conversation

liligwu
Collaborator

@liligwu liligwu commented Mar 10, 2023

test_log.txt
`use_cpu` in `jagged_index_select_2d_ref` is disabled (pytorch#1636). May wait for pytorch#1635.

sryap and others added 28 commits February 21, 2023 09:48
Summary:
Pull Request resolved: pytorch#1586

Changes are as follows:

- Update variable names of jagged_index_add_2d for GPU
- Use a single autograd function to dispatch the op on CPU and GPU
- Add an implementation on CPU using `at::parallel_for` to parallelize
  work on multiple threads.  For jagged_index_add_2d (backward), locks
  are used instead of atomic add (one lock per row) to manage add
  conflict between threads.
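A minimal Python sketch of the per-row-lock idea described above (the actual implementation is C++ using `at::parallel_for`; the function name and two-thread split here are hypothetical stand-ins):

```python
import threading

def jagged_index_add_2d_cpu(grad, indices, out_rows, num_cols):
    # Hypothetical sketch: accumulate grad[i] into out[indices[i]].
    # Each output row is guarded by its own lock, so two threads adding
    # into the same row never race, while adds into different rows
    # proceed in parallel (the per-row-lock alternative to atomic add).
    out = [[0.0] * num_cols for _ in range(out_rows)]
    locks = [threading.Lock() for _ in range(out_rows)]

    def worker(start, end):
        for i in range(start, end):
            row = indices[i]
            with locks[row]:  # one lock per output row, not one global lock
                for c in range(num_cols):
                    out[row][c] += grad[i][c]

    mid = len(indices) // 2
    threads = [threading.Thread(target=worker, args=(0, mid)),
               threading.Thread(target=worker, args=(mid, len(indices)))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```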

Reviewed By: brad-mengchi

Differential Revision: D43111264

fbshipit-source-id: e319a4cd3376dd3e4dd8eaf8d6ffef9f7a390961
…eighted in C++ op (pytorch#1599)

Summary:
Pull Request resolved: pytorch#1599

tbe_input_combine returns at::empty for per_sample_weights in the unweighted case, but if this output is passed directly to ops like `int_nbit_split_embedding_codegen_lookup_function_cpu`, the bool check is fooled: the optional does hold a tensor, it is just that the tensor is empty. This ultimately ends in a SEGV as the weighted branch tries to read the empty per_sample_weights tensor.
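A hedged sketch of the guard this fix implies (lists stand in for tensors here; the function name is hypothetical, and real code would test `tensor.numel() > 0`):

```python
def resolve_per_sample_weights(per_sample_weights):
    # Hypothetical sketch: an optional tensor can be *present* yet
    # *empty* (at::empty with zero elements).  A presence-only check
    # wrongly selects the weighted branch, which then reads past the
    # empty buffer.  Check emptiness as well before treating the
    # lookup as weighted.
    if per_sample_weights is not None and len(per_sample_weights) > 0:
        return per_sample_weights  # genuinely weighted lookup
    return None  # fall back to the unweighted branch
```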

Reviewed By: RoshanPAN

Differential Revision: D43373836

fbshipit-source-id: 5251e4e20d9096454fade2d56c0e5d2dc8d51f11
Summary:
Pull Request resolved: pytorch#1603

There are some error messages on CUB namespace when building fbgemm with CUDA 12:

```
/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(103): error: namespace "fbgemm_gpu::cub" has no member "min"

/usr/local/cuda-12.0/include/cub/agent/../util_math.cuh(104): error: namespace "fbgemm_gpu::cub" has no member "max"

/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_onesweep.cuh(76): error: RegBoundScaling is not a template

/usr/local/cuda-12.0/include/cub/agent/agent_radix_sort_upsweep.cuh(60): error: RegBoundScaling is not a template

/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(325): error: namespace "fbgemm_gpu::cub" has no member "Debug"

/usr/local/cuda-12.0/include/cub/agent/single_pass_scan_operators.cuh(354): error: namespace "fbgemm_gpu::cub" has no member "Debug"
```

This diff fixes the issue.

Reviewed By: q10

Differential Revision: D43423185

fbshipit-source-id: 9545835a9847aabeec4bc2a8dda8511771910fb6
Summary:
- Upgrade the FBGEMM_GPU release workflows to use the same build scripts framework for running the builds
- Update the build scripts framework to properly support running PyTests
- Disable certain tests from running in the **CPU-only build** as they are known to be failing at the moment (`jagged_tensor_ops_test`, `uvm_test`)
- Update both the nightly and release jobs to actually run tests prior to publishing to PyPI
- Fix bug with PyPI publishing for the FBGEMM_GPU-CPU nightly job
- Update the nightly build cron schedules to follow PST
- Remove some jobs from FBGEMM_GPU CI as they are now redundant with the FBGEMM_GPU nightly / release builds
- Update the FBGEMM_GPU-CPU CI job (Ubuntu) to use the build scripts framework for building and running tests

Pull Request resolved: pytorch#1598

Reviewed By: brad-mengchi

Differential Revision: D43448188

Pulled By: q10

fbshipit-source-id: 81ec5c0ee2d9c7d1246649a31f669b0ed585d23d
Summary: Pull Request resolved: pytorch#1605

Reviewed By: nanoax

Differential Revision: D43487913

fbshipit-source-id: 357bcfe927c148071037f9df91e46a65e533f4f4
Summary:
Pull Request resolved: pytorch#1587

As titled

Reviewed By: jianyuh

Differential Revision: D43132884

fbshipit-source-id: ab3b00b4f0e2cd79134e3e89822a73456a2fd97b
Summary:
Pull Request resolved: pytorch#1593

Move `binary_search_range` to header to allow kernels outside of
`jagged_tensor_ops.cu` to use it.
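The exact semantics of the shared helper are not shown in this PR; a plausible sketch, assuming it is a lower-bound search over a sorted range (commonly used to map a flat element index to its jagged row via offsets):

```python
def binary_search_range(arr, target, lo, hi):
    # Hypothetical sketch of a binary search over the sorted slice
    # arr[lo:hi]: returns the index of the first element >= target,
    # or hi if every element is smaller.  Moving such a helper into a
    # header lets kernels outside jagged_tensor_ops.cu reuse it.
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```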

Reviewed By: jianyuh

Differential Revision: D43216577

fbshipit-source-id: 96a681ea0e0f3db929dc6e6d25d971107015aba1
Summary:
Fix torchrec test modules/tests/test_fused_embedding_modules.py::FusedEmbeddingBagCollectionTest::test_optimizer_fusion
The credit for this PR belongs to amathews-amd.

Pull Request resolved: pytorch#1591

Reviewed By: shintaro-iwasaki

Differential Revision: D43448076

Pulled By: q10

fbshipit-source-id: 64cc97a90962a6ca153a858f96d1bd14eb500251
…forward_unweighted_small_kernel (pytorch#1610)

Summary:
Pull Request resolved: pytorch#1610

`[split|dense]_embedding_nobag_codegen_forward_unweighted_small_kernel`
always accesses the LXU cache even when the cache is not available.
This diff moves cache accesses under a runtime conditional to ensure
that the access is valid.

Reviewed By: jspark1105, mjanderson09

Differential Revision: D43581917

fbshipit-source-id: ab302092ffeb27b9a7fd18205c3856abc1dd8713
…#1592)

Summary:
Pull Request resolved: pytorch#1592

Before this diff, `group_index_select` requires all input tensors to
have the same shape.  This diff allows input tensors to have different
shapes.  However, it still requires the input tensors to have the same
number of dimensions and for the first dimensions to match.
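The relaxed constraint can be sketched as a shape validator (hypothetical helper name; real validation happens inside the C++ op):

```python
def check_group_index_select_inputs(shapes):
    # Hypothetical sketch of the constraint described above: input
    # tensors may now differ in their trailing dimensions, but must
    # still share the same rank and the same first (indexed) dimension.
    ndim = len(shapes[0])
    first = shapes[0][0]
    for s in shapes:
        if len(s) != ndim:
            raise ValueError("inputs must have the same number of dimensions")
        if s[0] != first:
            raise ValueError("inputs must have matching first dimensions")
    return True
```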

Reviewed By: jianyuh

Differential Revision: D43216425

fbshipit-source-id: fc8c74a472616c0e46785eb71482fa9012155f7e
Summary:
Pull Request resolved: pytorch#1581

Load inplace update dependencies after BC issue is resolved.

Reviewed By: brad-mengchi

Differential Revision: D43026352

fbshipit-source-id: b365535ca5896e88eaac95bab72afe6a7fb34733
Summary:
- Add support for retries in build steps that are known to fail due to the occasional network connection failures
- Add support for installing ROCm tooling and testing ROCm builds in the build scripts framework
- Update the existing FBGEMM_GPU CI / build_amd_gpu job to use the build scripts framework
- Fix the annotations to tests in `jagged_tensor_ops_test.py` to run correctly on CPU-only mode
- Impose timeouts of 10 minutes for running the test suites (in practice, they generally complete within 3 minutes)
- Add ability to conditionally disable tests depending on whether or not they are running inside the GitHub runner
- Disable the `test_jagged_index_select_2d` test on GitHub until we figure out the root cause of it hanging whenever it is run on GitHub (regardless of GPU or CPU variant)

Pull Request resolved: pytorch#1606

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43601032

Pulled By: q10

fbshipit-source-id: 55203151f54bf010859b0e7e1568572e064fa7cf
…oming from different device (pytorch#1615)

Summary: Pull Request resolved: pytorch#1615

Reviewed By: jianyuh, houseroad

Differential Revision: D43564925

fbshipit-source-id: 9d8db49df76889e56f70ebb3fb4984c292186edc
Summary:
Pull Request resolved: https://fburl.com/4d2jifwi

Previously, we wanted to disable the cache stats report when the gflag `FLAGS_tbe_uvm_cache_stat_report` is non-positive, but we could still observe cache stats (cache misses when the UVM cache is enabled) when `FLAGS_tbe_uvm_cache_stat_report` is negative. This diff fixes that error so the cache stats report can actually be disabled.

Reviewed By: doehyun

Differential Revision: D43630535

fbshipit-source-id: e267f881166721f804f04e4cdd3a35c6491452e7
…seDispatchMode (pytorch#1617)

Summary:
Pull Request resolved: pytorch#1617

When we use a meta device, the tensor is not treated as a CPU tensor. But the use case is that we want to process the weights on CPU first.

So we set `use_cpu` to true when the device is meta in `IntNBitTableBatchedEmbeddingBagsCodegen`.

Reviewed By: jianyuh, nanoax

Differential Revision: D43633336

fbshipit-source-id: bfb5f52c17f6ecf8bedb4f47a85aabcf35b87416
Summary: Pull Request resolved: pytorch#1619

Reviewed By: shintaro-iwasaki

Differential Revision: D43678509

Pulled By: q10

fbshipit-source-id: 1f4d868038890d677fdde6fbc29594e0c2019ab8
…rch#1613)

Summary:
Pull Request resolved: pytorch#1613

The TBE inference kernel
(`*_split_embedding_codegen_forward*_kernel_small_L`) may illegally access
memory if the number of indices is zero.  This update ensures that the
TBE inference operator returns a zero-tensor when the number of
indices is zero, preventing the aforementioned problem.
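A minimal sketch of the early-return guard described above (lists stand in for tensors; the function name and shape parameters are hypothetical, and the real kernel path is elided):

```python
def tbe_inference_forward(indices, embedding_dim, output_rows):
    # Hypothetical sketch: with zero indices, skip the kernel entirely
    # and return a zero tensor of the expected output shape, instead of
    # launching a kernel that may read out of bounds on an empty
    # index list.
    if len(indices) == 0:
        return [[0.0] * embedding_dim for _ in range(output_rows)]
    raise NotImplementedError("kernel lookup path elided in this sketch")
```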

Reviewed By: jianyuh

Differential Revision: D43588092

fbshipit-source-id: 95eb6b96c7b9cac8518072c03c6d382edfc47436
…agsCodegen (pytorch#1622)

Summary:
Pull Request resolved: pytorch#1622

Need to init the weight during model loading with XL V2, so export it.

Reviewed By: qxy11, jianyuh

Differential Revision: D43709984

fbshipit-source-id: 1730ea6f108303fee0811eb9de3fa5bef2fa8267
…ytorch#1621)

Summary:
Pull Request resolved: pytorch#1621

- Re-enable building of `merge_pooled_embeddings` operators in OSS
- Add `nm` checks for a few operators after OSS compilation to ensure that they are actually compiled into the shared library
- Downgrade the GCC version used for OSS builds to 9.3, since later versions of GCC will build libraries that have a dependency on `GLIBCXX_3.4.29`, which is not available on systems with older versions of `libstdc++.so.6`, such as CentOS Stream 8 and Ubuntu 20.04
- Print SHA of the built wheels after build and before installation to verify that the packages are correctly uploaded to and downloaded from GHA

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43764547

Pulled By: q10

fbshipit-source-id: 905d51feebe8d73c08689abc605f1160cc4b4600
…ops_utils.h (pytorch#1624)

Summary: Pull Request resolved: pytorch#1624

Reviewed By: dracifer

Differential Revision: D43781630

fbshipit-source-id: 1544ac95a50d8410fb614ddd3901758101a95202
Summary: Pull Request resolved: pytorch#1626

Reviewed By: sryap

Differential Revision: D43787029

fbshipit-source-id: 87e07acf39010d489366d3e4ea10b9a33dec1fd5
Summary:
Pull Request resolved: pytorch#1628

Previously the scale was calculated as max_int8 / (eps + max_val), with eps = 1e-8. This is acceptable in the forward pass because forward tensors need to be larger than eps to make a difference. With fp8, the gradients are much smaller; eps = 1e-8 would keep the tensor from scaling up to cover the entire fp8 range, so we change eps to 1e-20.
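A small numeric illustration of why eps matters here (the formula follows the description above; 127 as max_int8 is an assumption for illustration):

```python
def quant_scale(max_val, eps):
    # scale = max_int8 / (eps + max_val), as described above.
    # When max_val is tiny (fp8-sized gradients, e.g. ~1e-12),
    # eps = 1e-8 dominates the denominator and caps the scale near
    # max_int8 / eps, so the values never scale up to fill the
    # quantized range; eps = 1e-20 lets the scale track 1 / max_val.
    max_int8 = 127.0
    return max_int8 / (eps + max_val)
```

For max_val = 1e-12, eps = 1e-8 yields a scale around 1.27e10, whereas eps = 1e-20 yields around 1.27e14: four orders of magnitude more headroom for tiny gradients.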

Reviewed By: brad-mengchi

Differential Revision: D43665395

fbshipit-source-id: 0214e22af7739ef5cf7561b15e02830c35320402
Summary:
- Add support for CUDA 11.8 in the OSS builds
- Annotate the package wheels with Python 3.9 and 3.10 support tags
- Update `setup.py` to auto-derive the package version from the git information (namely tags) to allow us for fast tag-and-release
- Check for the actual presence of NVIDIA drivers in OSS builds, and error out with friendly message instead of cryptic `RuntimeError: No such operator fbgemm::jagged_2d_to_dense` errors when `fbgemm_gpu` is installed and loaded on a system with a GPU but without GPU drivers installed

Pull Request resolved: pytorch#1627

Reviewed By: brad-mengchi, shintaro-iwasaki

Differential Revision: D43868995

Pulled By: q10

fbshipit-source-id: 41da34bd5a82032d20daa4972cc9c848a6ed09e1
Summary:
Pull Request resolved: pytorch#1623

The previous cache stats report (e.g., cache miss rate, unique rate) did not take the number of requested indices of each TBE op into consideration, which could make the reported cache miss rate inaccurate relative to the real situation, especially when request volumes across TBEs are unbalanced. This diff updates the cache stats report to use weighted caching ratios, accumulating the cache data of all TBE ops within the same round of lookup calls.
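The weighted aggregation can be sketched as follows (hypothetical helper; `stats` is a list of per-TBE `(requested, missed)` counts from one lookup round):

```python
def weighted_cache_miss_rate(stats):
    # Hypothetical sketch of the weighted report described above:
    # rather than averaging each TBE op's miss rate with equal weight,
    # accumulate raw (requested, missed) counts across all TBE ops in
    # the round, so heavily-requested tables dominate the overall rate.
    total_requested = sum(requested for requested, _ in stats)
    total_missed = sum(missed for _, missed in stats)
    return total_missed / total_requested if total_requested else 0.0
```

For example, with one TBE at 1000 requests / 100 misses and another at 10 requests / 9 misses, a naive mean of per-op rates reports 50%, while the weighted rate is about 10.8%, reflecting the actual traffic.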

Reviewed By: doehyun

Differential Revision: D43729139

fbshipit-source-id: 491b8613e70cf17c3d0f89dbd5c808ee9fe33587
Summary:
- Remove local version identifier from build versioning
- Use CPU-only instances for CPU-only tests

Pull Request resolved: pytorch#1631

Reviewed By: shintaro-iwasaki

Differential Revision: D43915104

Pulled By: q10

fbshipit-source-id: 223845c1c38d911f650d5db926aa7f2f59561b85
@liligwu liligwu requested a review from amathews-amd March 10, 2023 16:41
@amathews-amd amathews-amd left a comment

LGTM

@amathews-amd amathews-amd merged commit 91f17f9 into main Mar 10, 2023
10 participants