Enable fast FP8 GEMM for memory bound (resubmit) #3608

jiawenliu64 · 2025-01-23T18:34:57Z

Summary:
This diff (a resubmit of D68193920) enables a fast FP8 GEMM for memory-bound workloads by adding the TRT-LLM FP8 CUDA GEMM kernel to FBGEMM. In addition to the original kernel, this diff extends it to:

- Support PyTorch operations
- Support CUDA graphs by handling the scale as a tensor
- Support smaller dim M (fewer template instances) for much faster compilation
- Support benchmarks and unit tests

For decode attention linear shapes:

- When BS=1, the TRT-LLM FP8 GEMM brings a 2x speedup over BF16, while the FP8 CUTLASS GEMM performs similarly to BF16.
- When BS>4, the TRT-LLM FP8 GEMM does not bring a perf gain.
- This TRT-LLM kernel is based on tensorwise quantization, not rowwise.
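
To illustrate the tensorwise vs. rowwise distinction, here is a minimal pure-Python sketch. It is not the FBGEMM or TRT-LLM implementation: the function names are hypothetical, the `scale = amax / FP8_max` convention is an assumption (some code stores the reciprocal instead), and rounding to the FP8 grid is omitted. The only hard fact used is that the largest finite value of the FP8 e4m3fn format is 448.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3fn
EPS = 1e-12       # guard against an all-zero input (assumed convention)


def tensorwise_scale(matrix):
    """One scale for the whole tensor: global abs-max divided by the FP8 max."""
    amax = max(abs(v) for row in matrix for v in row)
    return max(amax, EPS) / E4M3_MAX


def rowwise_scales(matrix):
    """One scale per row: each row's own abs-max divided by the FP8 max."""
    return [max(max(abs(v) for v in row), EPS) / E4M3_MAX for row in matrix]


def quantize(matrix, scales):
    """Divide each row by its scale and clamp to the FP8 range.

    `scales` holds one entry per row; for tensorwise quantization every
    entry is the same value. Rounding to the FP8 grid is not modeled.
    """
    return [
        [max(-E4M3_MAX, min(E4M3_MAX, v / s)) for v in row]
        for row, s in zip(matrix, scales)
    ]


# Hypothetical activations: a small-magnitude row and a large-magnitude row.
x = [[0.5, -1.0, 0.25], [224.0, -448.0, 112.0]]

t = tensorwise_scale(x)                   # single scalar shared by all rows
q_tensorwise = quantize(x, [t] * len(x))  # small row keeps its small values
q_rowwise = quantize(x, rowwise_scales(x))  # each row is stretched to the full range
```

With a single tensorwise scale, the kernel only needs one scalar per operand, which is also what makes passing the scale as a one-element tensor straightforward. The tradeoff is that rows with small magnitudes use only a small part of the FP8 range, whereas rowwise scaling stretches every row to the full range.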

Note: Since M>4 does not bring a perf gain in our use cases, we instantiate only 4 template instances to reduce compilation time (10 mins -> 2.5 mins). If we want to add instances for larger M in the future, we could either accept a longer compilation time or dedicate a separate CUDA file to each instance so that they compile in parallel.
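
The instantiation tradeoff in the note can be sketched as a small dispatch table. This is an illustration only, not the code in this diff: all names are hypothetical, the four instances are assumed to cover M = 1..4, and the fallback to a generic path for larger M is an assumption (the description does not say what happens outside the instantiated range).

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def generic_gemm(a: Matrix, b: Matrix) -> Matrix:
    """Plain reference GEMM, (M x K) @ (K x N); stands in for the generic path."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_small_m_kernel(m: int) -> Callable[[Matrix, Matrix], Matrix]:
    """Stand-in for one compiled template instance specialized on a fixed M."""

    def kernel(a: Matrix, b: Matrix) -> Matrix:
        assert len(a) == m, "instance was specialized for a different M"
        return generic_gemm(a, b)

    kernel.__name__ = f"small_m_gemm_m{m}"
    return kernel


# Assumption: the four instances cover M = 1, 2, 3, 4. Each extra entry here
# corresponds to one more template instantiation, i.e. more compile time.
SMALL_M_KERNELS: Dict[int, Callable[[Matrix, Matrix], Matrix]] = {
    m: make_small_m_kernel(m) for m in range(1, 5)
}


def select_kernel(m: int) -> Callable[[Matrix, Matrix], Matrix]:
    """Pick the specialized instance when one exists, else the generic path."""
    return SMALL_M_KERNELS.get(m, generic_gemm)


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Dispatch on M, the number of rows of the left operand."""
    return select_kernel(len(a))(a, b)
```

Supporting a larger M in this model means adding one entry to the table, which mirrors the choice described in the note: either pay the extra compile time, or split the instances across separate source files so they build in parallel.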

Differential Revision: D68568596

facebook-github-bot · 2025-01-23T18:35:21Z

This pull request was exported from Phabricator. Differential Revision: D68568596

netlify · 2025-01-23T18:35:41Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`4ed19bf`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67928bd4cfc09a00088dcaf2
😎 Deploy Preview	https://deploy-preview-3608--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2025-01-23T22:21:35Z

This pull request has been merged in 5754ce7.

X-link: pytorch#3608
Pull Request resolved: facebookresearch/FBGEMM#686
Reviewed By: q10, jwfromm
Differential Revision: D68568596
fbshipit-source-id: ba8b565a564533717deb29f9d701550d99a8c759

facebook-github-bot added the cla signed label Jan 23, 2025

facebook-github-bot added the fb-exported label Jan 23, 2025

facebook-github-bot closed this in 5754ce7 Jan 23, 2025

facebook-github-bot added the Merged label Jan 23, 2025

q10 added feature:fp8 category:new labels Apr 23, 2025
