FP8 Grouped Gemm Optimization #3655

Closed
wants to merge 1 commit into from

Conversation

@jwfromm (Contributor) commented Feb 4, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/731

While optimizing MoE, we found that small overheads were a major bottleneck for grouped GEMM performance. This diff tackles a few of them, specifically the overhead from torch.dynamo wrapping `quantize_fp8_row` and from having to slice input tensors before calling `f8f8bf16_rowwise_grouped`.

To fix the former, we allow `triton_quantize_fp8_row` to be called directly, skipping the dynamo compatibility wrapper. In cases where AOTI isn't needed, this removes a bit of overhead.
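
A minimal sketch of the two call paths, assuming the experimental FBGEMM import path and the `(fp8_tensor, row_scales)` return convention shown below; the exact module layout and signatures may differ:

```python
# Illustrative only: assumes quantize_fp8_row and triton_quantize_fp8_row live in
# fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm and both return (fp8_tensor, row_scales).
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    quantize_fp8_row,         # custom-op wrapper, safe for dynamo/AOTI tracing
    triton_quantize_fp8_row,  # underlying Triton kernel launcher
)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Wrapped path: goes through the registered custom op, which adds a small
# fixed host-side cost per call.
xq, x_scale = quantize_fp8_row(x)

# Direct path: launches the Triton kernel without the wrapper. Only suitable
# when the call does not need to be captured for AOTI export.
xq, x_scale = triton_quantize_fp8_row(x)
```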

To fix the latter, we templatize `f8f8bf16_rowwise_grouped_dynamic` to accept `at::Tensor` arguments instead of lists. We introduce a new wrapper, `f8f8bf16_rowwise_grouped_stacked`, to preserve the behavior where `zero_start_index_M` isn't provided but the user wants a single contiguous output tensor.
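
A hypothetical sketch of the call-pattern difference; the operator names come from this diff, but the argument shapes and signatures below are assumptions and may not match the actual kernels:

```python
# Hypothetical call pattern, for illustration only; actual operator signatures
# and expected tensor layouts in FBGEMM may differ.
import torch

G, M, N, K = 8, 64, 4096, 4096  # small per-group problem sizes
xq = torch.randn(G, M, K, device="cuda").to(torch.float8_e4m3fn)
wq = torch.randn(G, N, K, device="cuda").to(torch.float8_e4m3fn)
x_scale = torch.rand(G, M, device="cuda", dtype=torch.float32)
w_scale = torch.rand(G, N, device="cuda", dtype=torch.float32)

# List-based path: every group is sliced out on the host before the call,
# which adds per-group Python/ATen overhead that dominates small workloads.
out_list = torch.ops.fbgemm.f8f8bf16_rowwise_grouped(
    [xq[g] for g in range(G)],
    [wq[g] for g in range(G)],
    [x_scale[g] for g in range(G)],
    [w_scale[g] for g in range(G)],
)

# Stacked path: whole tensors are handed to the kernel with no slicing, and a
# single contiguous output is returned even though zero_start_index_M is not
# provided.
out_stacked = torch.ops.fbgemm.f8f8bf16_rowwise_grouped_stacked(
    xq, wq, x_scale, w_scale
)
```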

In microbenchmarks, we've found that these seemingly small changes can improve achieved TFLOPS by 2x for small workloads.
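
One way to see the effect is a simple per-call timing harness like the sketch below (not the benchmark used for this diff); the function under test is a stand-in for whichever quantize-plus-grouped-GEMM path is being compared:

```python
# Generic CUDA timing helper for comparing the wrapped vs. direct code paths.
import torch

def bench_ms(fn, iters: int = 100, warmup: int = 10) -> float:
    # Warm up, then time with CUDA events so kernel launch overheads are included.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call
```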

Reviewed By: jiawenliu64

Differential Revision: D69072529

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D69072529

netlify bot commented Feb 4, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: e2dd52b
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67a55edff7eb6c0008ac8453
Deploy Preview: https://deploy-preview-3655--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 6, 2025

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 6, 2025

jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Feb 6, 2025

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 6, 2025
jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Feb 6, 2025
jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Feb 7, 2025

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 7, 2025

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 7, 2025

@facebook-github-bot (Contributor) commented:

This pull request has been merged in d564c8c.

@q10 added the feature:fp8 label Feb 8, 2025
q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025