Optimize MX4 padding to minimize need for tuning #3040


Closed

jwfromm wants to merge 1 commit into pytorch:main from jwfromm:export-D61816830

Contributor

jwfromm commented Aug 26, 2024

Summary:
D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization. However, it relies heavily on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there's a much simpler indexing scheme we can use: it is similar to the 1D accesses we used previously but adds shifts to account for padding.

With this approach we should get the best of both worlds: support for padding rows whose length is not divisible by the group size, and minimal tuning while maintaining good performance.
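To make the "1D accesses plus shifts for padding" idea concrete, here is a minimal NumPy sketch. It is not the kernel from this PR: the function names (`padded_to_input_offsets`, `load_padded`) are hypothetical, and it assumes each row is padded on the right up to the next multiple of the group size. The point is only that a flat offset into the padded layout can be mapped back to the unpadded input by subtracting `row * pad`, with a mask for the padding positions.

```python
import numpy as np


def padded_to_input_offsets(num_rows, row_len, group_size):
    """Map flat offsets in the padded layout to flat offsets in the unpadded input.

    Hypothetical helper for illustration; assumes right-padding of every row.
    """
    # Number of padding columns needed so each row splits evenly into groups.
    pad = (-row_len) % group_size
    padded_len = row_len + pad
    # Plain 1D offsets over the (virtual) padded tensor.
    padded_offsets = np.arange(num_rows * padded_len)
    row = padded_offsets // padded_len
    col = padded_offsets % padded_len
    # Every earlier row contributed `pad` phantom elements, so shift back by row * pad.
    input_offsets = padded_offsets - row * pad
    # Padding positions have no source element and must be masked on load.
    valid = col < row_len
    return input_offsets, valid


def load_padded(x, group_size):
    """Gather a 2D array into its zero-padded layout using only 1D offsets."""
    num_rows, row_len = x.shape
    offsets, valid = padded_to_input_offsets(num_rows, row_len, group_size)
    flat = x.reshape(-1)
    # Clamp masked offsets so the gather stays in bounds, then zero-fill padding.
    safe = np.where(valid, offsets, 0)
    values = np.where(valid, flat[safe], 0)
    return values.reshape(num_rows, -1)
```

When the row length is already a multiple of the group size, `pad` is zero and the mapping reduces to the plain 1D access pattern.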

Differential Revision: D61816830

facebook-github-bot added the cla signed label

netlify bot commented Aug 26, 2024

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`f4f7779`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66cfbe77c77b7d000886ff37
😎 Deploy Preview	https://deploy-preview-3040--pytorch-fbgemm-docs.netlify.app

Contributor

facebook-github-bot commented Aug 26, 2024

This pull request was exported from Phabricator. Differential Revision: D61816830

facebook-github-bot added the fb-exported label

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

3381c55


jwfromm force-pushed the export-D61816830 branch from c8b08e2 to 3381c55 Compare

August 26, 2024 21:07


jwfromm force-pushed the export-D61816830 branch from 3381c55 to d1002ce Compare

August 26, 2024 21:13

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

d1002ce


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

1a2251d


jwfromm force-pushed the export-D61816830 branch from d1002ce to 1a2251d Compare

August 26, 2024 21:18

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

a3926f7

Summary:
X-link: facebookresearch/FBGEMM#137

Pull Request resolved: pytorch#3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization. However, it relies heavily on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there's a much simpler indexing scheme we can use: it is similar to the 1D accesses we used previously but adds shifts to account for padding.

With this approach we should get the best of both worlds: support for padding rows whose length is not divisible by the group size, and minimal tuning while maintaining good performance.

After further experimentation, we can actually remove tuning entirely and just use a reasonably large `GROUP_LOAD`. This gives good performance across all shapes and removes the tuning overhead altogether. Empirically, `GROUP_LOAD=64` seems to be the sweet spot.
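As a rough illustration of why a fixed `GROUP_LOAD` can replace tuning, the sketch below processes groups in chunks of `group_load` at a time. This is an assumption about what the parameter means (the number of groups loaded per step), not the actual kernel; `group_max_abs` and the group size of 32 used in the usage example are hypothetical. It computes a per-group maximum magnitude, the kind of reduction a shared-scale format needs, and the result does not depend on the chunk size, so the value only affects performance.

```python
import numpy as np


def group_max_abs(flat, group_size, group_load=64):
    """Per-group max |x|, computed `group_load` groups at a time.

    Hypothetical helper; `group_load` is assumed to be the number of groups
    handled per step, mirroring a fixed GROUP_LOAD=64.
    """
    assert flat.size % group_size == 0, "input must already be padded to whole groups"
    groups = flat.reshape(-1, group_size)
    num_groups = groups.shape[0]
    out = np.empty(num_groups, dtype=flat.dtype)
    # Walk over the groups in fixed-size chunks; the last chunk may be shorter.
    for start in range(0, num_groups, group_load):
        stop = min(start + group_load, num_groups)
        out[start:stop] = np.abs(groups[start:stop]).max(axis=1)
    return out
```

For example, `group_max_abs(x.reshape(-1), 32)` returns one value per 32-element group, and passing a different `group_load` changes only how many groups each loop iteration touches.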

Differential Revision: D61816830

jwfromm force-pushed the export-D61816830 branch from 1a2251d to a3926f7 Compare

August 26, 2024 21:27


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

acd2a7c


jwfromm force-pushed the export-D61816830 branch from a3926f7 to acd2a7c Compare

August 26, 2024 21:31


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

6b4d3a7


jwfromm force-pushed the export-D61816830 branch from acd2a7c to 6b4d3a7 Compare

August 26, 2024 21:35


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)


jwfromm force-pushed the export-D61816830 branch from 6b4d3a7 to 5412811 Compare

August 26, 2024 21:40


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

ee72991


jwfromm force-pushed the export-D61816830 branch from 5412811 to ee72991 Compare

August 26, 2024 21:44


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

f5f7d55


jwfromm force-pushed the export-D61816830 branch from dde4c7f to bec909d Compare

August 27, 2024 20:45


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

de07d2f


jwfromm force-pushed the export-D61816830 branch from bec909d to de07d2f Compare

August 27, 2024 20:49


jwfromm force-pushed the export-D61816830 branch from de07d2f to 8de5079 Compare

August 28, 2024 17:36

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

8de5079


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

60e8578


jwfromm force-pushed the export-D61816830 branch from 8de5079 to 60e8578 Compare

August 28, 2024 18:15


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

a9e8892


jwfromm force-pushed the export-D61816830 branch from 60e8578 to a9e8892 Compare

August 28, 2024 18:21


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

cc93627


jwfromm force-pushed the export-D61816830 branch from a9e8892 to cc93627 Compare

August 28, 2024 18:26


jwfromm force-pushed the export-D61816830 branch from cc93627 to 9dfc24b Compare

August 29, 2024 00:09

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

9dfc24b


jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

d6bed7e


jwfromm force-pushed the export-D61816830 branch from 9dfc24b to d6bed7e Compare

August 29, 2024 00:13


          Optimize MX4 padding to minimize need for tuning (pytorch#3040)

f4f7779


jwfromm force-pushed the export-D61816830 branch from d6bed7e to f4f7779 Compare

August 29, 2024 00:19

facebook-github-bot closed this in

c818b87

Contributor

facebook-github-bot commented Aug 31, 2024

This pull request has been merged in c818b87.

facebook-github-bot added the Merged label

q10 pushed a commit to q10/FBGEMM that referenced this pull request


          Optimize MX4 padding to minimize need for tuning (pytorch#137)

51ee279

Summary:
Pull Request resolved: facebookresearch/FBGEMM#137

X-link: pytorch#3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization. However, it relies heavily on tuning configurations to get good performance. It turns out the MX4 use case has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there's a much simpler indexing scheme we can use: it is similar to the 1D accesses we used previously but adds shifts to account for padding.

With this approach we should get the best of both worlds: support for padding rows whose length is not divisible by the group size, and minimal tuning while maintaining good performance.

After further experimentation, we can actually remove tuning entirely and just use a reasonably large `GROUP_LOAD`. This gives good performance across all shapes and removes the tuning overhead altogether. Empirically, `GROUP_LOAD=64` seems to be the sweet spot.

Reviewed By: jianyuh

Differential Revision: D61816830

fbshipit-source-id: d35460ae39d374a699f57e43b2ee5b9675f00d69


Labels

cla signed fb-exported Merged