Enable fast FP8 GEMM for memory bound #3577

jiawenliu64 · 2025-01-16T01:34:03Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/659

This diff enables a fast FP8 GEMM for memory-bound shapes by adding the TRT-LLM FP8 CUDA GEMM kernel to FBGEMM. In addition to the original kernel, this diff extends the kernel to:

- Support PyTorch operations
- Support CUDA graphs by handling the scale as a tensor
- Support larger values of the M dimension
- Support benchmarks and unit tests

For decode attention linear shapes:

- When BS=1, the TRT-LLM FP8 GEMM brings a 2x speedup compared to BF16, while the FP8 CUTLASS GEMM’s performance is similar to BF16
- When BS>4, the TRT-LLM FP8 GEMM does not bring a performance gain
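
The BS=1 result is consistent with a memory-bound GEMM, where runtime tracks the bytes read from memory (mostly weights) rather than the FLOPs performed. Below is a rough sketch of that arithmetic. The N and K values are hypothetical (the PR does not list the benchmarked shapes), the output is assumed to be BF16, and caches and scale tensors are ignored:

```python
def gemm_bytes(m, n, k, in_bytes, out_bytes=2):
    """Bytes moved by an (m x k) @ (k x n) GEMM: activations + weights + output."""
    return m * k * in_bytes + k * n * in_bytes + m * n * out_bytes


def arithmetic_intensity(m, n, k, in_bytes, out_bytes=2):
    """FLOPs per byte moved; low values indicate a memory-bound GEMM."""
    flops = 2 * m * n * k
    return flops / gemm_bytes(m, n, k, in_bytes, out_bytes)


# Hypothetical decode shape: N and K are illustrative, not taken from the PR.
N, K = 8192, 8192

for m in (1, 4, 64):
    bf16 = gemm_bytes(m, N, K, in_bytes=2)  # BF16 inputs: 2 bytes per element
    fp8 = gemm_bytes(m, N, K, in_bytes=1)   # FP8 inputs: 1 byte per element
    print(
        f"M={m}: BF16/FP8 byte ratio = {bf16 / fp8:.2f}, "
        f"FP8 intensity = {arithmetic_intensity(m, N, K, 1):.1f} FLOPs/byte"
    )
```

At M=1 the FP8 GEMM moves roughly half the bytes of the BF16 one, which bounds the achievable speedup at about 2x. This estimate does not model why the gain disappears at larger batch sizes.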

This TRT-LLM kernel is based on tensorwise quantization, not rowwise quantization.
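
To illustrate the distinction, here is a minimal pure-Python sketch of how the two schemes derive their scales. The function names are hypothetical and this is not the FBGEMM quantization API. It assumes the OCP E4M3 format (maximum finite value 448) and the convention that dequantization multiplies by the scale. It computes scales only and does not perform FP8 rounding:

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the OCP FP8 E4M3 format


def tensorwise_scale(matrix):
    """One scale for the whole tensor, from its global absolute maximum."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / FP8_E4M3_MAX


def rowwise_scales(matrix):
    """One scale per row, from each row's absolute maximum."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in matrix]


# Two rows with very different magnitudes.
x = [
    [0.5, -1.0, 0.25],
    [100.0, -448.0, 3.0],
]

print(tensorwise_scale(x))  # a single float shared by every row
print(rowwise_scales(x))    # one float per row
```

With a single tensorwise scale, the small-magnitude first row is quantized using the range set by the large second row. Rowwise scales give each row its own range, at the cost of a scale vector instead of one scalar.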

Reviewed By: jwfromm

Differential Revision: D68193920

facebook-github-bot · 2025-01-16T01:34:19Z

This pull request was exported from Phabricator. Differential Revision: D68193920

netlify · 2025-01-16T01:34:22Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`b76d104`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67886a787d4fc30008b16464
😎 Deploy Preview	https://deploy-preview-3577--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2025-01-16T02:10:08Z

This pull request was exported from Phabricator. Differential Revision: D68193920

facebook-github-bot · 2025-01-16T17:47:58Z

This pull request has been merged in 497bad6.

facebook-github-bot · 2025-01-17T23:02:19Z

This pull request has been reverted by 2d025dc.

X-link: pytorch#3577
Pull Request resolved: facebookresearch/FBGEMM#659
Reviewed By: jwfromm
Differential Revision: D68193920
fbshipit-source-id: fbf34e283e9430a8fed63ddb91781ade321012e3

facebook-github-bot added the cla signed label Jan 16, 2025

facebook-github-bot added the fb-exported label Jan 16, 2025

jiawenliu64 force-pushed the export-D68193920 branch from 17573a1 to b76d104 Compare January 16, 2025 02:09

facebook-github-bot closed this in 497bad6 Jan 16, 2025

facebook-github-bot added the Merged label Jan 16, 2025

facebook-github-bot added the Reverted label Jan 17, 2025

q10 added feature:fp8 category:improvement labels Apr 26, 2025
