Optimize MX4 padding to minimize need for tuning #3040
Conversation
This pull request was exported from Phabricator. Differential Revision: D61816830
Summary: X-link: facebookresearch/FBGEMM#137 Pull Request resolved: pytorch#3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization; however, it is fairly reliant on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there is a much simpler indexing scheme we can use, similar to the 1D accesses we used previously but with shifts added for padding. With this approach we should get the best of both worlds: support for padding rows not divisible by group size, and minimal tuning while maintaining good performance.

Differential Revision: D61816830
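For intuition, here is a minimal NumPy sketch (not the actual FBGEMM Triton kernel; `padded_flat_index`, `row_len`, and the other names are purely illustrative) of how a flat 1D element index can be shifted so that each row lands in an output laid out with rows padded up to a multiple of the group size:

```python
import numpy as np

def padded_flat_index(flat_idx, row_len, group_size):
    """Map a flat index into the unpadded input to a flat index in an
    output where each row is padded up to a multiple of group_size.
    Illustrative only: the real kernel performs equivalent arithmetic
    on blocks of indices inside Triton."""
    # Row length after padding up to the next multiple of group_size.
    padded_row_len = ((row_len + group_size - 1) // group_size) * group_size
    row = flat_idx // row_len
    col = flat_idx % row_len
    # The "shift" is simply row * (padded_row_len - row_len).
    return row * padded_row_len + col

# Example: rows of length 10 with group_size 8 are padded to length 16.
rows, row_len, group_size = 4, 10, 8
padded_row_len = ((row_len + group_size - 1) // group_size) * group_size
x = np.arange(rows * row_len, dtype=np.float32)
out = np.zeros(rows * padded_row_len, dtype=np.float32)
for i in range(x.size):
    out[padded_flat_index(i, row_len, group_size)] = x[i]
```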
Summary: X-link: facebookresearch/FBGEMM#137 Pull Request resolved: pytorch#3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization; however, it is fairly reliant on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there is a much simpler indexing scheme we can use, similar to the 1D accesses we used previously but with shifts added for padding. With this approach we should get the best of both worlds: support for padding rows not divisible by group size, and minimal tuning while maintaining good performance.

After further experimentation, we can actually remove tuning entirely and just use a reasonably large `GROUP_LOAD`. This gives good performance across all shapes and removes any chance of tuning overhead. Empirically, `GROUP_LOAD=64` seems to be the sweet spot.

Differential Revision: D61816830
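A rough host-side sketch of what a fixed `GROUP_LOAD` buys (hypothetical names and values; the `GROUP_SIZE=32` group size and the per-program element count are assumptions, not the actual FBGEMM launch code): with a single constant there is nothing left to autotune per shape, and the launch grid depends only on the input size.

```python
# Hypothetical launch-side logic, for illustration only.
GROUP_LOAD = 64   # groups processed per program per iteration (fixed constant)
GROUP_SIZE = 32   # assumed MX4 quantization group size

def launch_grid(num_elements: int) -> int:
    """Number of kernel programs needed to cover num_elements when each
    program handles GROUP_LOAD * GROUP_SIZE input elements."""
    per_program = GROUP_LOAD * GROUP_SIZE
    return (num_elements + per_program - 1) // per_program

# Previously the equivalent of GROUP_LOAD was chosen per shape by an
# autotuner; with highly dynamic shapes that tuning cost dominated.
print(launch_grid(1_000_000))  # -> 489
```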
This pull request has been merged in c818b87.
Summary: Pull Request resolved: facebookresearch/FBGEMM#137 X-link: pytorch#3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization; however, it is fairly reliant on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there is a much simpler indexing scheme we can use, similar to the 1D accesses we used previously but with shifts added for padding. With this approach we should get the best of both worlds: support for padding rows not divisible by group size, and minimal tuning while maintaining good performance.

After further experimentation, we can actually remove tuning entirely and just use a reasonably large `GROUP_LOAD`. This gives good performance across all shapes and removes any chance of tuning overhead. Empirically, `GROUP_LOAD=64` seems to be the sweet spot.

Reviewed By: jianyuh

Differential Revision: D61816830

fbshipit-source-id: d35460ae39d374a699f57e43b2ee5b9675f00d69