
Commit 642da31

liligwu, banitag1, Sabin Devkota, Junjie Yang, and q10 authored
Ifu 2023 03 31 (#38)
* using different mechanism for host mapped pinned memory (pytorch#1638)
  Summary: Pull Request resolved: pytorch#1638. This diff adds another mechanism for allocating host-mapped pinned memory, to reduce the adverse effect on other processes running on the same host when one process is doing some large allocations. (A hedged sketch of the general mechanism appears after this commit list.)
  Reviewed By: zyan0, jianyuh. Differential Revision: D43950253. fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686

* disable use_cpu test (pytorch#1635)
  Summary: This PR addresses issue pytorch#1636, akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009. Pull Request resolved: pytorch#1635.
  Reviewed By: shintaro-iwasaki. Differential Revision: D44033725. Pulled By: q10. fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d

* Update API interface and reroute backend for exact_rowwise_adagrad FE when using freq based methods (pytorch#1352)
  Summary: Pull Request resolved: pytorch#1352. 1. Update the interface to accommodate rowwise_adagrad_with_counter. 2. Route the rowwise_adagrad backend to the new rowwise_adagrad_with_counter when freq-based methods (e.g. freq SGD, counter-adjusted regularization) are used.
  Reviewed By: csmiler. Differential Revision: D36788395. fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443

* Remove sync point in jagged_dense_elementwise_add_jagged_output backward (pytorch#1642)
  Summary: Pull Request resolved: pytorch#1642. Remove the sync point in the jagged_dense_elementwise_add_jagged_output backward pass.
  Reviewed By: brad-mengchi. Differential Revision: D44039901. fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d

* Add Comprehensive Build Instructions and Isolate CPU and ROCm Builds (pytorch#1639)
  Summary:
  - Remove the `.post0` suffix from the autogenerated package version
  - Document the full FBGEMM_GPU OSS build process in a separate Markdown file
  - Remove installation of packages not needed for ROCm builds
  - Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare-metal instances
  - Update the GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR
  Pull Request resolved: pytorch#1639. Reviewed By: shintaro-iwasaki. Differential Revision: D44076312. Pulled By: q10. fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283

* include cstdint (pytorch#1640)
  Summary: Fix the build with gcc-13. Pull Request resolved: pytorch#1640.
  Reviewed By: shintaro-iwasaki. Differential Revision: D44044422. Pulled By: q10. fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033

* Add support for group size > 54 in group_index_select (pytorch#1611)
  Summary: Pull Request resolved: pytorch#1611. If the group size is larger than 54, internally break the group down into smaller groups (each subgroup size is less than or equal to 54).
  Reviewed By: jianyuh. Differential Revision: D43585937. fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326

* Implement cache miss emulation in UVM_CACHING (pytorch#1637)
  Summary: Pull Request resolved: pytorch#1637. Enforce cache misses (even if trace-driven testing doesn't experience cache misses due to limited trace size) so that we can evaluate performance under cache misses. Note that these are not exactly cache misses; the change enforces access to UVM by overriding lxu_cache_locations for N / 256 requests. (A hedged sketch of this override appears after this commit list.)
  Reviewed By: YuzeDaiMeta. Differential Revision: D42194019. fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725

* Add TensorAccessor with memcheck (pytorch#1602)
  Summary: Pull Request resolved: pytorch#1602. Illegal memory access is a common problem during GPU kernel execution. FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and the CUDA runtime to detect such problems and throw an error. However, there are a few known issues with this approach. (1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host. Due to the non-blocking, asynchronous nature of GPU kernel execution, the error is caught on the host at a later point than where the problematic kernel was launched; this can make the stack trace inaccurate and debugging more difficult. Although the issue can be avoided by running with `CUDA_LAUNCH_BLOCKING=1`, this can change the state of the execution and cause Heisenbugs. (2) Not all illegal memory accesses are caught by the runtime, so the system may not always throw an error when one occurs. (3) Even when the runtime throws an error for an illegal memory access, it is difficult to pinpoint the specific kernel and memory buffer/address causing the problem. For all these reasons, we attempt to catch and throw an error as soon as possible in the kernel when illegal memory accesses occur in FBGEMM GPU. We introduce the `FBGEMM_GPU_MEMCHECK` flag to enable memory checking at compile time. We copy PyTorch's `TensorAccessor.h` into FBGEMM GPU and extend it to check every memory access through the `PackedTensorAccessor`. If an invalid memory access occurs, we throw an error using `CUDA_KERNEL_ASSERT`; the error message includes the name of the tensor and the kernel that caused the problem. If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use `fbgemm::PackedTensorAccessor`; otherwise, they will use `at::PackedTensorAccessor`. `FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in subsequent diffs. (A hedged sketch of a bounds-checked accessor appears after this commit list.)
  Reviewed By: r-barnes. Differential Revision: D43421838. fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167

* Fix compiling with Xcode 14.3 (pytorch#1648)
  Summary: Pull Request resolved: pytorch#1648. This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3, so change the workaround to only apply up to clang 14.0.2.
  Reviewed By: MatzeB. Differential Revision: D44130421. fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd

* Add support for building FBGEMM_GPU against Python 3.11 in OSS (pytorch#1646)
  Summary:
  - Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
  - Move the FBGEMM CI builds to run inside Docker containers
  - Add support for building FBGEMM_GPU against Python 3.11 in OSS
  - Move all FBGEMM_GPU nightly and release build jobs to run inside the `amazonlinux:2023` Docker container
  - Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes
  Pull Request resolved: pytorch#1646. Reviewed By: shintaro-iwasaki. Differential Revision: D44157228. Pulled By: q10. fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53

* Remove magic numbers from fbgemm/Types.h (pytorch#1629)
  Summary: Pull Request resolved: pytorch#1629. Replaces magic numbers with constexpr variables.
  Reviewed By: sryap. Differential Revision: D43776442. fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa

* added check to avoid div 0 errors in cache report (pytorch#1645)
  Summary: Pull Request resolved: pytorch#1645. As in title.
  Reviewed By: jianyuh. Differential Revision: D44096435. fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672

* jagged_dense_bmm operator optimization (pytorch#1643)
  Summary: Pull Request resolved: pytorch#1643. This diff optimizes the jagged_dense_bmm operator with the following optimizations (see the tiling sketch after this commit list):
  - tiling across thread blocks, using GPU shared memory for each thread block
  - tiling across threads within a thread block, using registers for each thread
  Reviewed By: brad-mengchi. Differential Revision: D43674845. fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2

* jagged_dense_bmm: fix ROCm test failures (pytorch#1655)
  Summary: This patch fixes test failures on AMD GPUs. 1. Remove `__restrict__`; it is probably not needed even for CUDA, but it confuses HIPCC. 2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have a `+=` operator for the type of `blockIdx.z`, causing a compilation error. This issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now and revisit `auto` later. See ROCm/hipamd@86a1634 for details. Pull Request resolved: pytorch#1655. Test Plan: GitHub Actions' AMD CI.
  Reviewed By: q10, brad-mengchi. Differential Revision: D44242622. Pulled By: shintaro-iwasaki. fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174

* Support embedding dim 1024 ~ 2048 (pytorch#1656)
  Summary: Pull Request resolved: pytorch#1656. wushirong reported the failure on https://fburl.com/code/hae91ra7. The embedding config is from f418615450, and `max_int8_128b_rows` is 10, i.e. D = 1280; our embedding dim has grown to 1024+. Note that static shared memory can only go up to 48 KB:
  > Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)
  per https://docs.nvidia.com/cuda/cuda-c-programming-guide/. The ptxas shared-memory error:
  ```
  [2023-03-21T22:04:33.899-07:00] ptxas error : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4ELm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDivisorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
  ```
  Currently we reduce `InputRowsInFlight` to bypass the issue; the static shared memory used in the kernel is:
  ```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
  ```
  Long term, we can switch the static shared memory to dynamic shared memory and increase the shared memory size to 64 KB or more. (A hedged sketch of the dynamic-shared-memory approach appears after this commit list.)
  Reviewed By: wushirong. Differential Revision: D44270081. fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a

* Containerize the remaining FBGEMM_GPU CI jobs (pytorch#1658)
  Summary:
  - Containerize the remaining FBGEMM_GPU CI jobs
  - Add Conda cleanups to make PyTorch and CUDA installs more reliable
  - Update post-install checks for PyTorch to work with ROCm
  - Update the CI to continue running on jobs that fail on just a few variants
  - Use pip to install the PyTorch GPU nightly, as the nightly packages show up in pip more reliably than in Conda
  Pull Request resolved: pytorch#1658. Reviewed By: shintaro-iwasaki. Differential Revision: D44306708. Pulled By: q10. fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d

* Add tbe_input_combine_with_length for GPU (pytorch#1647)
  Summary: Pull Request resolved: pytorch#1647. Implement `tbe_input_combine_with_length` for GPU. The operator takes 3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`) and concatenates each one into a single tensor. Implicit type casting is also performed if the input types are different from the output types. `indices` and `lengths` tensors can be of type `int32_t` or `int64_t`. The outputs for the `indices` concatenation and the `lengths` concatenation are fixed to `int32_t`. `per_sample_weights` must be `float`.
  Reviewed By: bangshengtang. Differential Revision: D44076452. fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f

* jagged_jagged_bmm operator optimization (pytorch#1644)
  Summary: Pull Request resolved: pytorch#1644. This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.
  Reviewed By: brad-mengchi. Differential Revision: D44029528. fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d

* Specify device to emulate_cache_miss kernel (pytorch#1660)
  Summary: Pull Request resolved: pytorch#1660. When cache miss emulation was enabled, it caused illegal memory access if more than one GPU was used. It turns out that the previous diff didn't specify the device within the emulate_cache_miss kernel; this diff fixes it. In addition, it cleans up a bit (e.g., there is no need to use an index_t-based kernel launch for the emulate_cache_miss kernel, as lxu_cache_locations is always int32_t).
  Reviewed By: sryap, YuzeDaiMeta. Differential Revision: D44340131. fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1

* Add C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (pytorch#1652)
  Summary:
  - Add C++17 support for the entire FBGEMM_GPU build
  - Add C++17 support for the entire FBGEMM build
  - Update FBGEMM tests and benchmarks to be C++17-compatible
  - Make FBGEMM builds output more logging
  - Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported
  Pull Request resolved: pytorch#1652. Reviewed By: shintaro-iwasaki. Differential Revision: D44287321. Pulled By: q10. fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17

* Prune CPU/GPU TBE optimizer codegen (pytorch#1659)
  Summary: Pull Request resolved: pytorch#1659. This diff aims to reduce the build time and library size of `//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`. It modifies the build target to generate and compile only the necessary files, based on the fact that CPU and GPU do not support all optimizers in `SplitTBE`. (Before this diff, all optimizers were generated and compiled for both CPU and GPU.) The following is the list of supported optimizers:

  | OptimType | Generated optimizer | Supported on CPU | Supported on GPU |
  |---|---|---|---|
  | EXACT_ADAGRAD | adagrad | x | x |
  | EXACT_ROWWISE_ADAGRAD | rowwise_adagrad_with_counter | x | x |
  | | rowwise_adagrad | x | x |
  | EXACT_ROWWISE_WEIGHTED_ADAGRAD | rowwise_weighted_adagrad | x | x |
  | EXACT_SGD | sgd | x | x |
  | SGD | approx_sgd | x | x |
  | ROWWISE_ADAGRAD | approx_rowwise_adagrad_with_counter | x | |
  | | approx_rowwise_adagrad | x | |
  | ADAM | adam | | x |
  | LAMB | lamb | | x |
  | LARS_SGD | lars_sgd | | x |
  | PARTIAL_ROWWISE_ADAM | partial_rowwise_adam | | x |
  | PARTIAL_ROWWISE_LAMB | partial_rowwise_lamb | | x |
  | - | rowwise_adagrad_with_weight_decay | | |
  | - | approx_rowwise_adagrad_with_weight_decay | | |

  Note: x = supported.
  Reviewed By: jianyuh. Differential Revision: D44326540. fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405

* Fix the Documentation Build Job (pytorch#1673)
  Summary:
  - Rewrite the documentation build job to use the build infrastructure tooling
  - Rename workflow files for consistency
  Pull Request resolved: pytorch#1673. Reviewed By: shintaro-iwasaki. Differential Revision: D44472660. Pulled By: q10. fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc

* Back out "Prune CPU/GPU TBE optimizer codegen" (pytorch#1675)
  Summary: Pull Request resolved: pytorch#1675. Original commit changeset: 02413256b4a6. Original Phabricator Diff: D44326540.
  Reviewed By: q10, jianyuh. Differential Revision: D44475251. fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c

* Prepare bounds_check_indices for VBE (pytorch#1633)
  Summary: Pull Request resolved: pytorch#1633. Prepare `bounds_check_indices` for variable batch size TBE (VBE):
  - Update the frontend API to accept VBE args
  - Update the backend logic to process VBE data
  Reviewed By: jianyuh. Differential Revision: D43253703. fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b

* Move pruning/index_remapping support to embedding inplace update files (pytorch#1667)
  Summary: Pull Request resolved: pytorch#1667. As title: this diff moves pruning/index_remapping support to the embedding inplace update files.
  Reviewed By: jianyuh. Differential Revision: D44409419. fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078

* jagged_softmax forward optimization (pytorch#1661)
  Summary: Pull Request resolved: pytorch#1661. This diff optimizes the jagged_softmax forward pass with a more efficient reduction from the cub library.
  Reviewed By: brad-mengchi. Differential Revision: D44161021. fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff

* jagged_softmax backward optimization (pytorch#1662)
  Summary: Pull Request resolved: pytorch#1662. This diff optimizes the jagged_softmax backward pass with a more efficient reduction from the cub library.
  Reviewed By: brad-mengchi. Differential Revision: D44205819. fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b

* multi-gpu all_to_one improvements (pytorch#1674)
  Summary: Pull Request resolved: pytorch#1674. Improved multi-GPU all_to_one with: 1. new intermediate-hop selection that takes advantage of distinct NVLinks; 2. overlapping of intermediate-hop transfers with each other and with direct-peer transfers.
  Reviewed By: doehyun. Differential Revision: D44285941. fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4

* Extract and export weights offsets/placements initialization functions (pytorch#1669)
  Summary: Pull Request resolved: pytorch#1669. Extract the portions initializing the weights_placements/offsets tensors into separate functions and jit.export them. SplitState is converted to a NamedTuple, since we can't jit.script a dataclass that also holds an enum.
  Reviewed By: houseroad. Differential Revision: D44338256. fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d

* Fix the ROCm Test Job (pytorch#1668)
  Summary:
  - Clean up the ROCm test job and re-enable ROCm testing on the rocm instances
  - Update the build scripts framework to build FBGEMM_GPU against the correct hardware target that it is intended to be tested on. One discovery was that if FBGEMM_GPU is built with `PYTORCH_ROCM_ARCH=gfx90a` but run on a `gfx908` target, the tests fail with a segfault. While the failure is expected, the segfault can be unfriendly and confusing for users.
  - Enable correct compilation of the `merge_pooled_embeddings` operator under ROCm
  - Fix existing code in `jagged_tensor_ops` from PRs pytorch#1661 and pytorch#1662 that breaks compilation under ROCm 5.3
  Pull Request resolved: pytorch#1668. Reviewed By: shintaro-iwasaki. Differential Revision: D44453594. Pulled By: q10. fbshipit-source-id: 2030cd0e00c6ff9694c2783dfd62c31cf5543da2

* Use exported functions instead of calling initialize_weights in weights loading (pytorch#1676)
  Summary: Pull Request resolved: pytorch#1676. Export a function to reset the embedding specs by target location.
  Reviewed By: RoshanPAN, houseroad. Differential Revision: D44338258. fbshipit-source-id: 502733e9f3a164450a02656d2822492fbf69f994

* Extract index remappings array initialization and jit.export it (pytorch#1670)
  Summary: Pull Request resolved: pytorch#1670. ATT.
  Reviewed By: RoshanPAN, houseroad. Differential Revision: D44338257. fbshipit-source-id: c091666c7a4d294c283f5e3774d0494089fc3478

* update hipify_torch and remove the manual mapping of the C10 macros

---------

Co-authored-by: Banit Agrawal <[email protected]>
Co-authored-by: Sabin Devkota <[email protected]>
Co-authored-by: Junjie Yang <[email protected]>
Co-authored-by: Benson Ma <[email protected]>
Co-authored-by: Alfredo Tupone <[email protected]>
Co-authored-by: Sarunya Pumma <[email protected]>
Co-authored-by: Doe Hyun Yoon <[email protected]>
Co-authored-by: Matt Galloway <[email protected]>
Co-authored-by: Richard Barnes <[email protected]>
Co-authored-by: Xiao Sun <[email protected]>
Co-authored-by: Rengan Xu <[email protected]>
Co-authored-by: siwasaki <[email protected]>
Co-authored-by: Jianyu Huang <[email protected]>
Co-authored-by: Yue Dong <[email protected]>
Co-authored-by: Geet Sethi <[email protected]>
Co-authored-by: Janet Yang <[email protected]>
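The host-mapped pinned memory change (pytorch#1638) is only summarized above; its allocator code is not part of this diff view. As a point of reference, here is a minimal CUDA sketch of the general host-mapped (zero-copy) pinned-memory mechanism the summary refers to, using plain CUDA runtime calls rather than FBGEMM's actual allocator:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: allocate host-mapped (zero-copy) pinned memory and obtain a
// device-visible pointer to it. FBGEMM's allocator wraps this idea with its own
// bookkeeping; everything below is illustrative only.
int main() {
  const size_t bytes = 64 * 1024 * 1024;  // arbitrary example size (64 MiB)

  void* host_ptr = nullptr;
  // cudaHostAllocMapped pins the pages and maps them into the device address space.
  if (cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped) != cudaSuccess) {
    std::fprintf(stderr, "cudaHostAlloc failed\n");
    return 1;
  }

  void* device_ptr = nullptr;
  // Kernels can dereference device_ptr directly; accesses go over PCIe/NVLink.
  if (cudaHostGetDevicePointer(&device_ptr, host_ptr, /*flags=*/0) != cudaSuccess) {
    std::fprintf(stderr, "cudaHostGetDevicePointer failed\n");
    cudaFreeHost(host_ptr);
    return 1;
  }

  // ... launch kernels that consume device_ptr here ...

  cudaFreeHost(host_ptr);
  return 0;
}
```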
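For the UVM_CACHING cache-miss emulation entries (pytorch#1637 and pytorch#1660), the summaries describe overriding lxu_cache_locations so that N out of every 256 requests bypass the cache and hit UVM. Below is a sketch of that override; it assumes a location value of -1 means "not cached", and the kernel name and launch shape are illustrative rather than the actual emulate_cache_miss implementation:

```cuda
#include <cstdint>

// Sketch: force the first `enforced_misses_per_256` of every 256 lookup requests
// to be treated as cache misses by overwriting their cache location with a miss
// sentinel (-1 here), so downstream kernels read those rows from UVM instead.
__global__ void emulate_cache_miss_sketch(
    int32_t* lxu_cache_locations,   // one entry per lookup request
    int64_t num_requests,
    int32_t enforced_misses_per_256) {
  const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < num_requests && (i % 256) < enforced_misses_per_256) {
    lxu_cache_locations[i] = -1;
  }
}
```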
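The FBGEMM_GPU_MEMCHECK entry (pytorch#1602) describes routing every access through a bounds-checking PackedTensorAccessor. The sketch below shows the underlying idea only; it uses a plain device-side assert and an invented CheckedAccessor1D type instead of FBGEMM's actual fbgemm::PackedTensorAccessor and CUDA_KERNEL_ASSERT plumbing:

```cuda
#include <cassert>
#include <cstdint>

// Sketch of a bounds-checked accessor: every operator[] asserts that the index
// is in range, so an illegal access traps inside the offending kernel instead of
// surfacing later (and less precisely) on the host.
template <typename T>
struct CheckedAccessor1D {
  T* data;
  int64_t size;

  __device__ T& operator[](int64_t i) const {
    // FBGEMM would use CUDA_KERNEL_ASSERT and include the tensor/kernel names in
    // the message; a plain device assert keeps this sketch self-contained.
    assert(i >= 0 && i < size);
    return data[i];
  }
};

__global__ void copy_kernel(CheckedAccessor1D<float> dst,
                            CheckedAccessor1D<float> src,
                            int64_t n) {
  const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    dst[i] = src[i];  // both the read and the write are range-checked
  }
}
```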
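The jagged_dense_bmm and jagged_jagged_bmm optimizations (pytorch#1643, pytorch#1644) are described as tiling across thread blocks with shared memory, plus register tiling within a block. The jagged indexing is not shown in this view, so the sketch below demonstrates only the shared-memory block-tiling pattern on a plain dense matmul C = A * B; it is not the actual FBGEMM kernel:

```cuda
// Sketch of block tiling with shared memory for C (M x N) = A (M x K) * B (K x N).
// Launch with dim3 block(kTile, kTile) and a grid covering the output tiles.
constexpr int kTile = 32;

__global__ void tiled_matmul_sketch(const float* A, const float* B, float* C,
                                    int M, int K, int N) {
  __shared__ float a_tile[kTile][kTile];
  __shared__ float b_tile[kTile][kTile];

  const int row = blockIdx.y * kTile + threadIdx.y;
  const int col = blockIdx.x * kTile + threadIdx.x;
  float acc = 0.0f;

  for (int k0 = 0; k0 < K; k0 += kTile) {
    // Stage one tile of A and one tile of B into shared memory (edge-guarded).
    a_tile[threadIdx.y][threadIdx.x] =
        (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
    b_tile[threadIdx.y][threadIdx.x] =
        (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
    __syncthreads();

    // Each thread accumulates its output element from the staged tiles.
    for (int k = 0; k < kTile; ++k) {
      acc += a_tile[threadIdx.y][k] * b_tile[k][threadIdx.x];
    }
    __syncthreads();
  }

  if (row < M && col < N) {
    C[row * N + col] = acc;
  }
}
```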
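Finally, the embedding dim 1024 ~ 2048 entry (pytorch#1656) notes that statically sized __shared__ arrays are capped at 48 KB and that the long-term fix is dynamic shared memory. The sketch below shows how a kernel opts into more than 48 KB of dynamic shared memory on devices that support it (compute capability 7.0+); the kernel body and sizes are placeholders, not the actual INT8 forward kernel:

```cuda
#include <cuda_runtime.h>

// Sketch: replace a static __shared__ array with dynamically sized shared memory
// and raise the per-kernel opt-in limit above the 48 KB static cap.
constexpr int kUint4PerThread = 16;

__global__ void dynamic_smem_sketch(const uint4* in, uint4* out, int n) {
  extern __shared__ uint4 buffers[];  // blockDim.x * kUint4PerThread entries
  const int base = (blockIdx.x * blockDim.x + threadIdx.x) * kUint4PerThread;
  for (int j = 0; j < kUint4PerThread; ++j) {
    const int idx = base + j;
    if (idx < n) {
      // Stage the load through shared memory, as the static AllBuffers array does.
      buffers[threadIdx.x * kUint4PerThread + j] = in[idx];
      out[idx] = buffers[threadIdx.x * kUint4PerThread + j];
    }
  }
}

void launch_dynamic_smem_sketch(const uint4* in, uint4* out, int n,
                                cudaStream_t stream) {
  const int block = 256;
  const size_t smem_bytes = block * kUint4PerThread * sizeof(uint4);  // 64 KB > 48 KB
  // Kernels requesting more than 48 KB of shared memory must opt in explicitly.
  cudaFuncSetAttribute(dynamic_smem_sketch,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(smem_bytes));
  const int grid = (n + block * kUint4PerThread - 1) / (block * kUint4PerThread);
  dynamic_smem_sketch<<<grid, block, smem_bytes, stream>>>(in, out, n);
}
```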
1 parent 91f17f9 commit 642da31

File tree

64 files changed: +4690 −1325 lines changed

Some content is hidden: large commits have some of their content hidden by default, and only the .bazelrc diff is reproduced below.


.bazelrc

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+################################################################################
+# FBGEMM Bazel configuration file
+#
+# Based on MozoLM build options:
+# https://github.com/google-research/mozolm/blob/main/.bazelrc
+#
+# Documentation for Bazel configuration options can be found in:
+# https://bazel.build/reference/command-line-reference
+################################################################################
+
+# Automatically picks up host-OS-specific config lines from bazelrc files
+# Enabling this is equivalent to auto-calling --config=linux on Linux, --config=windows, etc
+build --enable_platform_specific_config
+
+# Print logs for all tests
+test --test_output=all
+
+# Build with verbose logging
+build --verbose_explanations --verbose_failures
+test --verbose_explanations --verbose_failures
+
+# Build with optimization mode turned on
+build --compilation_mode opt
+test --compilation_mode opt
+
+# Build FBGEMM with C17 and C++17
+build:linux --cxxopt=-std=c++17
+build:linux --host_cxxopt=-std=c++17
+build:linux --conlyopt=-std=c17
+build:linux --host_conlyopt=-std=c17
+build:macos --cxxopt=-std=c++17
+build:macos --host_cxxopt=-std=c++17
+build:macos --conlyopt=-std=c17
+build:macos --host_conlyopt=-std=c17
+build:windows --cxxopt=/std:c++17
+build:windows --host_cxxopt=/std:c++17
+build:windows --conlyopt=/std:c17
+build:windows --host_conlyopt=/std:c17
+
+# Generation of `runfiles` directories on Windows has to be explicitly enabled.
+# See https://github.com/bazelbuild/bazel/issues/8843.
+build:windows --enable_runfiles
+test:windows --enable_runfiles
