forked from pytorch/FBGEMM
Ifu 2023 04 20 #39
Merged
Conversation
Summary: Pull Request resolved: pytorch#1638 This diff adds another mechanism for allocating the host-mapped pinned memory, to reduce the adverse effect on other processes running on the same host when one process is doing some large allocations. Reviewed By: zyan0, jianyuh Differential Revision: D43950253 fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686
Summary: This PR addresses the issue pytorch#1636 akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009 Pull Request resolved: pytorch#1635 Reviewed By: shintaro-iwasaki Differential Revision: D44033725 Pulled By: q10 fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d
… when using freq based methods (pytorch#1352) Summary: Pull Request resolved: pytorch#1352 1. Update the interface to accommodate rowwise_adagrad_with_counter. 2. Route the backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq based methods (e.g. freq sgd, counter adjusted regularization) are used. Reviewed By: csmiler Differential Revision: D36788395 fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443
…ard (pytorch#1642) Summary: Pull Request resolved: pytorch#1642 Remove sync point in jagged_dense_elementwise_add_jagged_output backward Reviewed By: brad-mengchi Differential Revision: D44039901 fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d
…ytorch#1639) Summary: - Remove `.post0` suffix from the autogenerated package version - Document the full FBGEMM_GPU OSS build process in a separate Markdown file - Remove installation of packages not needed for ROCm builds - Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances - Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR Pull Request resolved: pytorch#1639 Reviewed By: shintaro-iwasaki Differential Revision: D44076312 Pulled By: q10 fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283
Summary: fix build with gcc-13 Pull Request resolved: pytorch#1640 Reviewed By: shintaro-iwasaki Differential Revision: D44044422 Pulled By: q10 fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033
Summary: Pull Request resolved: pytorch#1611 If group size is larger than 54, internally breaks the group down into smaller groups (each subgroup size is less than or equal to 54). Reviewed By: jianyuh Differential Revision: D43585937 fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326
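The group decomposition described above can be sketched as follows; the function name and chunking policy are illustrative, not the actual FBGEMM internals, and only the 54-element threshold comes from the commit message.

```python
MAX_GROUP_SIZE = 54  # threshold stated in the commit message

def split_into_subgroups(group, max_size=MAX_GROUP_SIZE):
    """Break a large group into subgroups, each of size <= max_size."""
    return [group[i:i + max_size] for i in range(0, len(group), max_size)]
```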
Summary: Pull Request resolved: pytorch#1637 Enforce cache misses (even if trace-driven testing doesn't experience cache misses due to limited trace size) so that we can evaluate performance under cache misses. Note that these are not exactly cache misses; we enforce access to UVM by overriding lxu_cache_locations for N / 256 requests. Reviewed By: YuzeDaiMeta Differential Revision: D42194019 fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725
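A minimal sketch of the override idea, assuming a sentinel value of -1 marks "not cached, go to UVM" and that every 256th request is redirected; both details are assumptions for illustration, not the actual kernel logic.

```python
CACHE_MISS_SENTINEL = -1  # assumed sentinel meaning "access UVM"

def enforce_cache_misses(lxu_cache_locations, stride=256):
    """Override every `stride`-th cache location so that roughly
    N / stride requests are forced to bypass the cache and hit UVM."""
    out = list(lxu_cache_locations)
    for i in range(0, len(out), stride):
        out[i] = CACHE_MISS_SENTINEL
    return out
```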
Summary: Pull Request resolved: pytorch#1602 Illegal memory access is a common problem during GPU kernel execution. The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and the CUDA runtime to detect such problems and throw an error. However, there are a few known issues with this approach. (1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host. However, due to the non-blocking, asynchronous nature of GPU kernel execution, the error is caught on the host at a later point than where the problematic kernel was launched. This can cause the stack trace to be inaccurate and make debugging more difficult. Although the issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`, this can change the state of the execution and cause Heisenbugs. (2) Not all illegal memory accesses are caught by the runtime. This means that the system may not always throw an error when illegal memory access occurs. (3) Although the runtime throws an error for illegal memory access, it is difficult to pinpoint the specific kernel and memory buffer/address that is causing the problem. For all the aforementioned reasons, we attempt to catch and throw an error as soon as possible in the kernel when illegal memory accesses occur in FBGEMM GPU. We introduce the `FBGEMM_GPU_MEMCHECK` flag to enable memory checking during compile time. We copy PyTorch's `TensorAccessor.h` into the FBGEMM GPU and extend it to check every memory access through the `PackedTensorAccessor`. If an invalid memory access occurs, we throw an error using `CUDA_KERNEL_ASSERT`. The error message includes the name of the tensor and the kernel that caused the problem. If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use `fbgemm::PackedTensorAccessor`. 
Otherwise, they will use `at::PackedTensorAccessor`. `FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in subsequent diffs. Reviewed By: r-barnes Differential Revision: D43421838 fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167
Summary: Pull Request resolved: pytorch#1648 This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2. Reviewed By: MatzeB Differential Revision: D44130421 fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd
…ch#1646) Summary: - Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial - Move the FBGEMM CI builds to run inside Docker containers - Add support for building FBGEMM_GPU against Python 3.11 in OSS - Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container - Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes. Pull Request resolved: pytorch#1646 Reviewed By: shintaro-iwasaki Differential Revision: D44157228 Pulled By: q10 fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53
Summary: Pull Request resolved: pytorch#1629 Replaces magic numbers with constexpr variables Reviewed By: sryap Differential Revision: D43776442 fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa
Summary: Pull Request resolved: pytorch#1645 as in title Reviewed By: jianyuh Differential Revision: D44096435 fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672
Summary: Pull Request resolved: pytorch#1643 This diff optimizes the jagged_dense_bmm operator with the following optimizations: * tiling across thread blocks, and use GPU shared memory for thread block * tiling across threads within a thread block, and use registers for each thread Reviewed By: brad-mengchi Differential Revision: D43674845 fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2
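The two-level tiling described above can be illustrated with a plain (dense) block-tiled matrix multiply; the tile loops below mirror how a CUDA kernel would stage tiles in shared memory and registers. This is a sketch of the general technique, not the jagged_dense_bmm kernel itself.

```python
def tiled_matmul(A, B, tile=2):
    """Block-tiled matrix multiply C = A @ B on lists of lists.
    The (i0, j0, k0) loops walk tiles, analogous to thread-block tiling;
    the inner loops walk elements within a tile, analogous to per-thread
    register tiling."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```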
Summary: This patch fixes test failures on AMD GPUs. 1. Remove `__restrict__`. I don't think it is needed even for CUDA, but it confuses HIPCC. 2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have a `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: ROCm/hipamd@86a1634 Pull Request resolved: pytorch#1655 Test Plan: GitHub Actions' AMD CI Reviewed By: q10, brad-mengchi Differential Revision: D44242622 Pulled By: shintaro-iwasaki fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174
Summary: Pull Request resolved: pytorch#1656 wushirong reported the failure on https://fburl.com/code/hae91ra7 .
- The embedding config is from f418615450 .
- `max_int8_128b_rows` is 10 --> D = 1280. Our embedding dim has grown to 1024 + ?

Note that static shared memory can only go up to 48 KB:
> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

(https://docs.nvidia.com/cuda/cuda-c-programming-guide/)

The ptxas shared memory error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4ELm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDivisorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```
Currently we reduce `InputRowsInFlight` to bypass the issue; the static shared memory used in the kernel is:
```
typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
__shared__ AllBuffers buffers;
```
Long term, we can change the static shared memory to dynamic shared memory and increase the shared memory size to 64 KB+. Reviewed By: wushirong Differential Revision: D44270081 fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a
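The budget arithmetic behind that ptxas error can be checked directly. The parameter values below are illustrative (not the kernel's actual configuration); only the 48 KB (0xc000) limit and sizeof(uint4) = 16 bytes are fixed facts.

```python
def static_smem_bytes(warps_per_block, output_rows_per_thread,
                      input_rows_in_flight, num_uint4_loads_per_row):
    """Bytes used by the AllBuffers array: product of the four template
    dimensions times sizeof(uint4)."""
    SIZEOF_UINT4 = 16  # bytes
    return (warps_per_block * output_rows_per_thread *
            input_rows_in_flight * num_uint4_loads_per_row * SIZEOF_UINT4)

STATIC_SMEM_LIMIT = 48 * 1024  # 0xc000 bytes per block
```

Halving `InputRowsInFlight` halves the buffer size, which is why reducing it brings the kernel back under the limit.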
Summary: - Containerize the remaining FBGEMM_GPU CI jobs - Add Conda cleanups to make PyTorch and CUDA installs more reliable - Update post-install checks for PyTorch to work with ROCm - Update the CI to continue running on jobs that fail on just a few variants - Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda Pull Request resolved: pytorch#1658 Reviewed By: shintaro-iwasaki Differential Revision: D44306708 Pulled By: q10 fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d
Summary: Pull Request resolved: pytorch#1647 Implement `tbe_input_combine_with_length` for GPU. The operator takes 3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`) and concatenates each one into a single tensor. Implicit type casting is also performed if the input types are different from the output types. `indices` and `lengths` tensors can be of type `int32_t` or `int64_t`. The outputs for `indices` concatenation and `lengths` concatenation are fixed to `int32_t`. `per_sample_weights` must be `float`. Reviewed By: bangshengtang Differential Revision: D44076452 fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f
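The combine semantics can be sketched in plain Python; real inputs are lists of tensors, modeled here as lists of lists, and the int/float casts stand in for the fixed `int32_t`/`float` output types. This is an illustrative model, not the GPU operator.

```python
def tbe_input_combine_with_length(indices_list, lengths_list, weights_list):
    """Concatenate each list of per-table inputs into one flat sequence,
    casting as it goes: indices and lengths to int (modeling the fixed
    int32_t outputs), per-sample weights to float."""
    indices = [int(i) for xs in indices_list for i in xs]
    lengths = [int(n) for xs in lengths_list for n in xs]
    weights = [float(w) for xs in weights_list for w in xs]
    return indices, lengths, weights
```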
Summary: Pull Request resolved: pytorch#1644 This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory. Reviewed By: brad-mengchi Differential Revision: D44029528 fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d
Summary: Pull Request resolved: pytorch#1660 When emulate cache miss was enabled, it caused illegal memory access if more than one GPU was in use. It turns out that the previous diff didn't specify the device within the emulate_cache_miss kernel. This diff fixes it. In addition, cleaned up a bit (e.g., no need to use an index_t-based kernel launch for the emulate_cache_miss kernel, as lxu_cache_locations is always int32_t). Reviewed By: sryap, YuzeDaiMeta Differential Revision: D44340131 fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1
Summary: - Add C++17 support for the entire FBGEMM_GPU build - Add C++17 support for the entire FBGEMM build - Update FBGEMM tests and benchmarks to be C++17-compatible - Make FBGEMM builds output more logging - Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported Pull Request resolved: pytorch#1652 Reviewed By: shintaro-iwasaki Differential Revision: D44287321 Pulled By: q10 fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17
Summary: Pull Request resolved: pytorch#1659 This diff aims to reduce the build time and library size of `//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`. The diff modifies the build target to generate and compile only the necessary files. This is based on the fact that CPU and GPU do not support all optimizers in `SplitTBE`. (Before this diff, all optimizers were generated and compiled for both CPU and GPU.) The following is the list of supported optimizers (x = supported):

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|---|---|---|---|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||

Reviewed By: jianyuh Differential Revision: D44326540 fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405
Summary: - Rewrite the documentation builds job to use the build infrastructure tooling - Rename workflow files for consistency Pull Request resolved: pytorch#1673 Reviewed By: shintaro-iwasaki Differential Revision: D44472660 Pulled By: q10 fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc
Summary: Pull Request resolved: pytorch#1675 Original commit changeset: 02413256b4a6 Original Phabricator Diff: D44326540 Reviewed By: q10, jianyuh Differential Revision: D44475251 fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c
Summary: Pull Request resolved: pytorch#1633 Prepare `bounds_check_indices` for variable batch size TBE (VBE). - Update the frontend API to accept VBE args - Update the backend logic to process VBE data Reviewed By: jianyuh Differential Revision: D43253703 fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b
pytorch#1667) Summary: Pull Request resolved: pytorch#1667 As title. This diff moves pruning/index_remapping support to embedding inplace update files. Reviewed By: jianyuh Differential Revision: D44409419 fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078
Summary: Pull Request resolved: pytorch#1661 This diff optimizes jagged_softmax forward with more efficient reduction from cub library. Reviewed By: brad-mengchi Differential Revision: D44161021 fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff
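For reference, jagged softmax applies a numerically stable softmax independently to each segment `values[offsets[i]:offsets[i+1]]`. The sketch below uses plain Python loops where the kernel would use cub block/warp reductions for the max and sum; names are illustrative.

```python
import math

def jagged_softmax(values, offsets):
    """Softmax over each jagged segment, with the usual max-subtraction
    for numerical stability. offsets has len(segments) + 1 entries."""
    out = []
    for i in range(len(offsets) - 1):
        seg = values[offsets[i]:offsets[i + 1]]
        if not seg:
            continue
        m = max(seg)                      # reduction 1: segment max
        exps = [math.exp(v - m) for v in seg]
        s = sum(exps)                     # reduction 2: segment sum
        out.extend(e / s for e in exps)
    return out
```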
Summary: Pull Request resolved: pytorch#1662 This diff optimizes jagged_softmax backward with a more efficient reduction from the cub library. Reviewed By: brad-mengchi Differential Revision: D44205819 fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b
Summary: Pull Request resolved: pytorch#1674 improved multi-gpu all_to_one with: 1. new intermediate hop selection taking advantage of distinct NVLinks 2. overlapping of intermediate hop transfers with each-other and with direct-peer transfers Reviewed By: doehyun Differential Revision: D44285941 fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4
pytorch#1669) Summary: Pull Request resolved: pytorch#1669 Extract portions initializing the weights_placements/offsets tensors into separate functions and jit.export them. SplitState is converted to a NamedTuple since we can't jit.script a dataclass that also holds an enum. Reviewed By: houseroad Differential Revision: D44338256 fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d
Summary: - Clean up the ROCm test job and re-enable ROCm testing on the rocm instances. - Update the build scripts framework to build FBGEMM_GPU against the correct hardware target that it is intended to be tested on. One thing that was discovered was that if FBGEMM_GPU was built with `PYTORCH_ROCM_ARCH=gfx90a` but run on `gfx908` target, the tests will fail with a segfault. While the failure is expected, the segfault can be unfriendly and confusing for users. - Enable correct compilation of `merge_pooled_embeddings` operator under ROCm - Fix existing code in `jagged_tensor_ops` from PR pytorch#1661 and pytorch#1662 that break its compilation under ROCm 5.3 Pull Request resolved: pytorch#1668 Reviewed By: shintaro-iwasaki Differential Revision: D44453594 Pulled By: q10 fbshipit-source-id: 2030cd0e00c6ff9694c2783dfd62c31cf5543da2
…ts loading (pytorch#1676) Summary: Pull Request resolved: pytorch#1676 Export a function to reset the embedding specs by target location Reviewed By: RoshanPAN, houseroad Differential Revision: D44338258 fbshipit-source-id: 502733e9f3a164450a02656d2822492fbf69f994
…rch#1670) Summary: Pull Request resolved: pytorch#1670 ATT Reviewed By: RoshanPAN, houseroad Differential Revision: D44338257 fbshipit-source-id: c091666c7a4d294c283f5e3774d0494089fc3478
Summary: Pull Request resolved: pytorch#1683 Disable FBGEMM test on COUNTER mode temporarily. Reviewed By: sryap Differential Revision: D44589052 fbshipit-source-id: f2af6f9e3cce75d4c599c4708055e5f52ac705e2
…1682) Summary: Pull Request resolved: pytorch#1682 Reviewed By: shintaro-iwasaki Differential Revision: D44599348 Pulled By: q10 fbshipit-source-id: 8f968a7c21b09358eac070a35ee15d5b767ea94c
Summary: - Use the pytorch/test-infra action to install NVIDIA drivers properly if the instance is missing the drivers Pull Request resolved: pytorch#1684 Reviewed By: shintaro-iwasaki Differential Revision: D44603925 Pulled By: q10 fbshipit-source-id: 712bdf5c2af67c5a6f540567abcc47ed892912c1
Summary: - Clean up the linting job to use the build scripts infrastructure - Delete the Conda prefix directory before creating a new environment, if it exists Pull Request resolved: pytorch#1686 Reviewed By: shintaro-iwasaki Differential Revision: D44646234 Pulled By: q10 fbshipit-source-id: d754efeadffb265c9e55bc302606fc1e60ef8b51
Summary: Pull Request resolved: pytorch#1571 reduce_to_one for row-wise sharding in inference. A similar approach to all_to_one, but without having the source wait for the target to be ready (to guard against potential WAR and WAW dependency violations), because in this reduce_to_one implementation we create a new destination tensor. Reviewed By: xing-liu, jianyuh Differential Revision: D34263436 fbshipit-source-id: 7b1630b395311cfd6fef124113436f87f51a6fba
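The key design point is the freshly allocated destination: since no source buffer is also the reduction target, sources never overwrite data another transfer still needs. A minimal sketch, with tensors modeled as Python lists:

```python
def reduce_to_one(tensors_by_source):
    """Sum per-source copies into a newly allocated destination.
    Writing into a fresh buffer (rather than reusing an input) is what
    removes the WAR/WAW hazards, so sources need not wait on the target."""
    n = len(tensors_by_source[0])
    dst = [0.0] * n  # new destination tensor
    for src in tensors_by_source:
        for i, v in enumerate(src):
            dst[i] += v
    return dst
```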
Summary: Pull Request resolved: pytorch#1685 Reviewed By: r-barnes, shintaro-iwasaki Differential Revision: D44654808 Pulled By: q10 fbshipit-source-id: a58987b4a3970139bba72db8cecc89c0256fba76
Summary: Pull Request resolved: pytorch#1678 This diff aims to reduce the build time and library size of `//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`. [1/2] Update `lookup_invoker` to enable the function invoker based on `has_cpu_support` and `has_gpu_support` [2/2] Update the code generation part. The diff modifies the build target to generate and compile only the necessary files. This is based on the fact that CPU and GPU do not support all optimizers in `SplitTBE`. (Before this diff, all optimizers were generated and compiled for both CPU and GPU.) The following is the list of supported optimizers (x = supported):

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|---|---|---|---|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||

Reviewed By: q10 Differential Revision: D44484764 fbshipit-source-id: f04710e66498bdcbdad619d48411c2403316901c
Summary: Pull Request resolved: pytorch#1691 This diff adds thread tiling optimization in jagged_jagged_bmm operator, where each thread will process a tile of elements instead of one. The implementation is similar to the one applied to jagged_dense_bmm: D43674845. Reviewed By: brad-mengchi Differential Revision: D44764339 fbshipit-source-id: ca4cf257bac755ab97754fdc6605072cfbfb1c4d
Summary: Pull Request resolved: pytorch#1692 Tune the tile sizes based on the input tensor size. If M > N, then use larger tile size in M dimension, otherwise use larger tile size in N dimension. Reviewed By: brad-mengchi Differential Revision: D44791699 fbshipit-source-id: 348a66089d781e9fef141b63d7a56e6dfa5da905
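The selection rule is simple: put the larger tile along the longer dimension. A sketch with illustrative tile sizes (the actual values used by the kernel are not stated here):

```python
def choose_tile_sizes(M, N, large=32, small=8):
    """Return (tile_m, tile_n): the larger tile goes on the longer
    dimension of the output. Tile values are illustrative."""
    return (large, small) if M > N else (small, large)
```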
Summary: Pull Request resolved: pytorch#1632 ATT. Reviewed By: jianyuh Differential Revision: D43887969 fbshipit-source-id: 048ff61a925113b29c547abf20d7acdc4a50b8d7
Summary: - Update build scripts to print out cc, c++, and nvcc preprocessor defines - Print out all undefined symbols in the output library after build to inspect whether or not templates have been un-instantiated - Handle the case where `TORCH_CUDA_ARCH_LIST` is pre-defined in the environment - Clean up the FBGEMM_GPU READMEs to consolidate all FBGEMM_GPU build instructions into `docs/BuildInstructions.md` - Fix the build badges for FBGEMM and FBGEMM_GPU - Add Slack contact information to the READMEs - Remove deprecated GitHub workflows and build scripts in favor of the new scripts, which cover all the functionality of the old scripts Pull Request resolved: pytorch#1695 Reviewed By: shintaro-iwasaki Differential Revision: D44901368 Pulled By: q10 fbshipit-source-id: bef6045347c905a051970e4e5f8630175e0f5ef6
Summary: Pull Request resolved: pytorch#1697 Reviewed By: shintaro-iwasaki Differential Revision: D44935915 Pulled By: q10 fbshipit-source-id: e1bdd4ebff18bd9708208a5b659ef9a93ebc866a
Summary: This change fixes a missing step (cd) in the build instructions. Pull Request resolved: pytorch#1701 Reviewed By: sryap Differential Revision: D45011147 Pulled By: q10 fbshipit-source-id: 704ce5bd3cfbd62c31f434c830a7300e5d645024
Summary: This project is compiled with -Wall and -Werror (see pytorch#868) and is throwing an error for the unused variable here. This code appears to be debugging code that was used to verify that the function it's contained in was originally implemented properly so the most straightforward solution is to just remove it. Pull Request resolved: pytorch#1702 Reviewed By: sryap Differential Revision: D45011174 Pulled By: q10 fbshipit-source-id: 2c252cfa6063789371f5fba5f642c2f4fb72455f
Summary: This test mistakenly calls reserve() to set a vector's length instead of resize(). reserve() allocates memory for the specified number of elements, but does not actually increase the number of elements that can legally be stored in the vector. This test runs with ASAN enabled which is catching this illegal access and causing the test to fail. This change fixes the code to instead call resize(); the test now passes. Pull Request resolved: pytorch#1703 Reviewed By: sryap Differential Revision: D45011317 Pulled By: q10 fbshipit-source-id: 2840d7bfcfb46ca1523f55e77a3834a1d561c045
Summary: Pull Request resolved: pytorch#1700 This diff supports `get_optimizer_state` for exact_adagrad. Exact_adagrad is not currently supported in `get_optimizer_state`; however, this is needed for creating a fused optimizer in torchrec. Reviewed By: r-barnes Differential Revision: D44963975 fbshipit-source-id: e2f523dfc1e1d17a4925e7ce4a9e65829f1cf1b0
…` into Smaller Files (pytorch#1694) Summary: `embedding_forward_quantized_split_template.cu` is a very large jinja template that renders 30+ C++ templates, which are then instantiated into 600+ kernel functions. There are three sets of jinja templates in `embedding_forward_quantized_split_template.cu`: those related to `int_nbit_split_embedding_*`, `pruned_hashmap_lookup_*`, and `pruned_array_lookup_*`. Currently, the rendering produces a single file, which takes a large amount of time to compile. This PR does two things at a high level. First, it breaks up the jinja template into multiple jinja templates. Then, it forces each of these smaller jinja templates to render multiple source files instead of a single source file. This change will enable build parallelization and overall build time savings. Details: - Port improvements to `embedding_forward_quantized_split_template.cu` from D44707812 - Move the non-jinja-template code inside `embedding_forward_quantized_split_template.cu` over to `embedding_forward_template_helpers.cuh` - Move the `pruned_hashmap_lookup_*` and `pruned_array_lookup_*` sets of jinja templates out to the non-jinja-template `embedding_forward_quantized_split_lookup.cu`, since the template-generated functions are redundant. - Break the `int_nbit_split_embedding_*` set of jinja templates into two files, one for rendering kernel-side code (`embedding_forward_quantized_split_nbit_kernel_template.cu`) and the other for rendering host-side code (`embedding_forward_quantized_split_nbit_host_template.cu`) - For the `int_nbit_split_embedding_*` host-side jinja template, make it render the `weighted`, `unweighted`, and `unweighted nobag` variants into separate source files - For the `int_nbit_split_embedding_*` kernel-side jinja template, make it render into N = [`weighted`, `unweighted`, `unweighted nobag` variants] x [6 embedding types] separate source files, each containing a single C++ template kernel function.
Also generate the code to explicitly instantiate the kernel templates. For each of the C++ templates being generated, there will be 2 {device-only bool} x [3-4] (output types) x [3-5] (cases) = 18-40 actual template instantiations - To help with debugging missing template instantiations, print out all undefined symbols in the output library after build to inspect whether or not templates have been un-instantiated - Update build scripts to print out `cc`, `c++`, and `nvcc` preprocessor defines - Handle the case where `TORCH_CUDA_ARCH_LIST` is pre-defined in the environment Pull Request resolved: pytorch#1694 Reviewed By: sryap, r-barnes Differential Revision: D44842524 Pulled By: q10 fbshipit-source-id: 96f92e40ab2fec598aeb8c483e94997ac050aae7
Summary: Pull Request resolved: pytorch#1706 Original commit changeset: f04710e66498 Original Phabricator Diff: D44484764 Reviewed By: q10, brad-mengchi, jianyuh, shintaro-iwasaki Differential Revision: D45054051 fbshipit-source-id: 9d14504c76eb93b2f1b14f4c2ec4c5b807c7fc4a
Summary: Pull Request resolved: pytorch#1707 Temporarily use the CUB kernel instead of the custom kernel for 2D `asynchronous_complete_cumsum` Reviewed By: q10, brad-mengchi, jianyuh Differential Revision: D45062784 fbshipit-source-id: cebe3992ff8ebec9c0f554e729b8d79a1eced1de
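For context, a "complete" cumulative sum produces len(x) + 1 entries, starting at 0 and ending with the total, which is the offsets layout jagged tensors use. A minimal 1-D reference (the operator in question handles the 2-D batched case on GPU):

```python
def complete_cumsum(xs):
    """Exclusive prefix sum plus the grand total: output length is
    len(xs) + 1, out[0] == 0, out[-1] == sum(xs)."""
    out = [0]
    for v in xs:
        out.append(out[-1] + v)
    return out
```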
…into Smaller Files (pytorch#1705) Summary: `embedding_backward_split_template.cu` contains both jinja-template and non-jinja-template code, and some of the templating is unnecessary. Furthermore, the template generates both the vanilla and `nobag` variants of unweighted into the same source file. This PR moves the non-jinja-template code out of the template, de-duplicates code that is unnecessarily templated, and splits the generation of the code into three files per optimizer, one each for `weighted`, `unweighted nobag`, and `unweighted`. Details: - Migrate non-jinja-templated code out of `embedding_backward_split_template.cu` and into `embedding_backward_template_helpers.cuh` - De-templatize `split_embedding_backward_codegen_{{ optimizer }}_{{ wdesc }}_find_long_segments` into `split_embedding_backward_codegen_find_long_segments`, since there is no implementation difference between the optimizers or between weighted and unweighted - Migrate `grad_mean_kernel` and `split_embedding_backward_codegen_find_long_segments` into a separate non-template source file to de-duplicate code generation and compilation - Split the code generation of `embedding_backward_split_template.cu` into 3 files per optimizer, according to weighted, unweighted_nobag, and unweighted Pull Request resolved: pytorch#1705 Reviewed By: sryap Differential Revision: D45073273 Pulled By: q10 fbshipit-source-id: e82ea643f8e67ad5aa0b3de03562532c5735453d
Summary: Pull Request resolved: pytorch#1690 The context why this is needed is as follows: 1) For really long sparse features we want to split them into multiple chunks that can be fed into the model. 2) Slicing requires users to provide a per-row start point and a maximum L. Based on these requirements, a custom op mimicking the slice semantics of a normal tensor works best. An example usage using pseudo code:
```
input_jagged_tensor = [[1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4, 5, 6], [1], [1, 2]]
start = [0, 0, 0, 0, 0]
slice_length = 3
>> jagged_slice(input_jagged_tensor, start, slice_length)
output_jagged_tensor = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1], [1, 2]]
```
A corresponding operation for a dense tensor would look like:
```
dense_tensor = torch.randn((8, 10))
slice_dense_tensor = dense_tensor[:, 1:3]
```
Reviewed By: sryap Differential Revision: D44299744 fbshipit-source-id: 44996f2f2ec5fc5f31dda4cb3bd8f0241497df66
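The pseudo-code above has a direct reference implementation in Python; this models the jagged tensor as a list of lists rather than the values/offsets layout the real op uses.

```python
def jagged_slice(rows, start, slice_length):
    """Slice each jagged row from its per-row start point, taking at
    most slice_length elements (shorter rows yield shorter slices)."""
    return [row[s:s + slice_length] for row, s in zip(rows, start)]
```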
… negative integers (pytorch#1672) Summary: Move the `radix_sort` implementation to common utilities, so it can be used in PyTorch in case it was not built with FBGEMM GPU. Add the possibility to handle negative integers, which is crucial for reusing `radix_sort` in PyTorch's `sort` operation. Details: This PR addresses two issues: 1. `radix_sort` is currently used in [scatter_reduce](https://github.com/dszwicht/pytorch/blob/master/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp#L630) (please view this [comment](https://github.com/pytorch/pytorch/pull/82703/files#r1045360609) for more information). Until now, `radix_sort` was under the `fbgemm_gpu` subproject. That means the implementation was not available in PyTorch when PyTorch was built for CPU, which is why `radix_sort` was copy-pasted under the aten directory in PyTorch. This PR moves the `radix_sort` implementation to common utilities. 2. In GNN workloads we often sort 1D integer data with non-negative values, for example, when converting CSR to CSC format. Unfortunately, `torch.sort` for 1D data works sequentially. `radix_sort` seems to be a perfect match to accelerate the described case. However, suppose we want to do that on the PyTorch side. In that case, we have to either fall back to a regular path after detecting negative numbers in the tensor, or perform post-processing by swapping the positive and negative blocks of data (data like `[2, -1, -2, 1]` after sorting will be in the form `[1, 2, -2, -1]`, due to how the numbers are stored). Neither of these solutions is elegant. As an alternative, I propose an extension of the `radix_sort` algorithm, giving it the capability to work with negative numbers. This can be enabled by passing an optional parameter, `maybe_with_neg_vals`. If set to `true`, we will perform all passes (up to the most significant sign bit) and apply a special prefix sum combination in the last pass.
An example of how we can reuse fbgemm in PyTorch can be found in my private fork, [here](DamianSzwichtenberg/pytorch#2) (I also provide speedup data). The above changes have several consequences: 1. `TORCH_CHECK` was replaced with `assert` as fbgemm CPU does not have PyTorch in its dependencies. 2. `__builtin_clz` was replaced with manual implementation as `__builtin_clz` is not portable. Additional information for reviewers: I did perform benchmarks of `radix_sort` before and after my code modification. I didn't observe any performance drop. Pull Request resolved: pytorch#1672 Reviewed By: sryap Differential Revision: D44616959 Pulled By: q10 fbshipit-source-id: f34594478c94ec6610c05545feb2044b58d79d66
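To make the signed-sorting problem concrete, here is a small LSD radix sort that handles negatives by flipping the sign bit of the two's-complement key, so unsigned key order matches signed value order. Note this is a different (and common) technique from the special prefix-sum combination the PR describes; it is a sketch of the problem being solved, not the PR's implementation.

```python
def radix_sort_signed(arr, bits=32):
    """LSD radix sort (8-bit digits) over signed integers. Flipping the
    sign bit maps two's-complement values to unsigned keys whose order
    matches signed order, so negatives land before positives."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    pairs = [(((x & mask) ^ sign), x) for x in arr]
    for shift in range(0, bits, 8):
        buckets = [[] for _ in range(256)]
        for k, v in pairs:                      # stable bucket pass
            buckets[(k >> shift) & 0xFF].append((k, v))
        pairs = [p for b in buckets for p in b]
    return [v for _, v in pairs]
```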
Reviewed By: bigfootjon Differential Revision: D45141964 fbshipit-source-id: 58308a31522a3b1446835e358a93483b611c4b15