
Commit ec57cc6

Performance Optimization: Improved TileShape Configuration for Large Llama Shapes (pytorch#3790)
Summary:

## Issue: Suboptimal TileShape Configuration in FBGEMM for Large Llama Shapes

The current FBGEMM F8 kernel uses a TileShape configuration of 128x128x128, which is suboptimal for dense F8 tensor core operations on NVIDIA H100 GPUs. The optimal configuration for maximizing tensor core throughput and memory bandwidth usage on H100 is m64n256k32. The current setting leads to inefficiencies, particularly for large GEMM operations in Llama 70B and 405B models where K >= 4096.

## Proposed Optimization: 128x256x128 TileShape for Large GEMM Operations

This PR changes the TileShape configuration from 128x128x128 to 128x256x128 for large GEMM workloads. The new configuration is applied via a cooperative kernel, which improves tensor core utilization and memory bandwidth efficiency. Notably, the same tile shape is used in FlashAttention V3 for F8 precision.

## Benchmark Results on H100 GPU

### Benchmark Setup

- PyTorch 2.6, CUDA 12.4
- CPU: AMD EPYC
- GPU: NVIDIA H100
- Each benchmark runs 30 kernel launch iterations and is averaged over 25 benchmark measurements.
- Benchmarks cover Llama 70B and 405B shapes with M = 16,384.

### Benchmark

#### f8f8bf16_rowwise (M = 16,384)

| Llama Shape          | Old TFLOPS | New TFLOPS | Improvement |
|----------------------|------------|------------|-------------|
| N = 1280, K = 8192   | 1252       | 1492       | +17.4%      |
| N = 8192, K = 1024   | 1258       | 1258       | —           |
| N = 7168, K = 8192   | 1324       | 1463       | +10.5%      |
| N = 8192, K = 3584   | 1401       | 1401       | —           |
| N = 13312, K = 6656  | 1259       | 1360       | +8.0%       |
| N = 13312, K = 16384 | 1170       | 1388       | +18.6%      |
| N = 16384, K = 6656  | 1238       | 1266       | +2.3%       |
| N = 16384, K = 16384 | 1166       | 1316       | +12.9%      |

The cooperative 128x256x128 TileShape consistently outperforms the 128x128x128 Ping-Pong kernel for all large GEMM sizes where K >= 4096. For a small subset of cases, a 128x192x128 Ping-Pong kernel achieves a 2-3% performance advantage, notably for M = 16,384, N = 16,384, K = 16,384; a more detailed heuristic rule could be explored for these specific cases.

## Technical Implementation

Introduced TileShape 128x256x128 with a cooperative kernel for f8f8bf16_rowwise. The new configuration is selectively applied for large matrices where (see the sketch after this summary for the combined predicate):

- **M > 128 && N > 128**
- **AND (M > 2048 || N > 2048)**
- **AND K >= 4096**

Performance validation:

- The changes do not introduce performance regressions for existing configurations that do not match the above conditions.
- The code modifications preserve existing configurations outside of the large GEMM cases.

These changes were made by modifying the minimum necessary code while respecting existing coding practices in FBGEMM.

## Test Coverage

### Unit Test Results

The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize have been verified for the modified kernels.

jiawenliu64 jwfromm Thank you!

Differential Revision: D72617756

Pulled By: jiawenliu64
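The dispatch conditions above can be summarized as a single predicate. The sketch below is illustrative only and is not code from this diff: the helper name `use_cooperative_128x256x128` is hypothetical, and in the actual kernel the shape check lives in `get_kernel_mode` while the K threshold is applied inside `dispatch_fp8_rowwise_kernel` (the `K < 4096` branch in the diff below).

```cpp
#include <cstdint>

// Hypothetical consolidation of the dispatch heuristic described above.
// In the real code the shape check is split between get_kernel_mode and
// dispatch_fp8_rowwise_kernel; this sketch only illustrates the conditions.
bool use_cooperative_128x256x128(int64_t M, int64_t N, int64_t K) {
  const bool large_mn = (M > 128 && N > 128) && (M > 2048 || N > 2048);
  // The diff keeps the previous 128x128x128 Ping-Pong kernel when K < 4096.
  return large_mn && K >= 4096;
}
```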
1 parent 23fe369 commit ec57cc6


fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu

Lines changed: 51 additions & 25 deletions

```diff
@@ -34,6 +34,7 @@ template <
     int TBS_N,
     int TBS_K,
     bool PONG,
+    bool COOP,
     bool FAST_ACCUM,
     bool USE_BIAS,
     typename INPUT_DTYPE,
@@ -170,6 +171,23 @@ at::Tensor f8f8bf16_rowwise_impl(
   using EpilogueEVT =
       cute::conditional_t<USE_BIAS, EVTComputeBias, EVTCompute1>;
 
+  using DefaultSchedule = cutlass::gemm::KernelTmaWarpSpecialized;
+  using PongSchedule = cutlass::gemm::KernelTmaWarpSpecializedPingpong;
+  using SlowAccum = cute::conditional_t<PONG, PongSchedule, DefaultSchedule>;
+  using FastAccum = cute::conditional_t<
+      COOP,
+      cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum,
+      cute::conditional_t<
+          PONG,
+          cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum,
+          cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum>>;
+  using MainLoopSchedule =
+      cute::conditional_t<FAST_ACCUM, FastAccum, SlowAccum>;
+  using EpilogueSchedule = cute::conditional_t<
+      COOP,
+      cutlass::epilogue::TmaWarpSpecializedCooperative,
+      cutlass::epilogue::TmaWarpSpecialized>;
+
   using CollectiveEpilogue =
       typename cutlass::epilogue::collective::CollectiveBuilder<
           cutlass::arch::Sm90,
@@ -185,21 +203,9 @@ at::Tensor f8f8bf16_rowwise_impl(
           ElementOutput,
           LayoutOutput,
           AlignmentOutput,
-          cutlass::epilogue::TmaWarpSpecialized,
+          EpilogueSchedule,
           EpilogueEVT>::CollectiveOp;
 
-  using DefaultSchedule = cutlass::gemm::KernelTmaWarpSpecialized;
-  using PongSchedule = cutlass::gemm::KernelTmaWarpSpecializedPingpong;
-  using FastDefaultSchedule =
-      cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
-  using FastPongSchedule =
-      cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum;
-  using SlowAccum = cute::conditional_t<PONG, PongSchedule, DefaultSchedule>;
-  using FastAccum =
-      cute::conditional_t<PONG, FastPongSchedule, FastDefaultSchedule>;
-  using MainLoopSchedule =
-      cute::conditional_t<FAST_ACCUM, FastAccum, SlowAccum>;
-
   using CollectiveMainloop =
       typename cutlass::gemm::collective::CollectiveBuilder<
           ArchTag,
@@ -322,6 +328,7 @@ at::Tensor dispatch_fp8_rowwise_kernel(
     at::Tensor w_scale,
     std::optional<at::Tensor> bias,
     std::optional<at::Tensor> output) {
+  auto K = XQ.size(1);
   KernelMode kernel = get_kernel_mode(XQ, WQ);
   if (kernel == KernelMode::Small) {
     return f8f8bf16_rowwise_impl<
@@ -332,23 +339,41 @@ at::Tensor dispatch_fp8_rowwise_kernel(
         1,
         1,
         false,
+        false,
         FastAccum,
         UseBias,
         InputDType,
         BiasDType>(XQ, WQ, x_scale, w_scale, bias, output);
   } else if (kernel == KernelMode::Large) {
-    return f8f8bf16_rowwise_impl<
-        128,
-        128,
-        128,
-        2,
-        1,
-        1,
-        true,
-        FastAccum,
-        UseBias,
-        InputDType,
-        BiasDType>(XQ, WQ, x_scale, w_scale, bias, output);
+    if (K < 4096) {
+      return f8f8bf16_rowwise_impl<
+          128,
+          128,
+          128,
+          2,
+          1,
+          1,
+          true,
+          false,
+          FastAccum,
+          UseBias,
+          InputDType,
+          BiasDType>(XQ, WQ, x_scale, w_scale, bias, output);
+    } else {
+      return f8f8bf16_rowwise_impl<
+          128,
+          256,
+          128,
+          2,
+          1,
+          1,
+          false,
+          true,
+          FastAccum,
+          UseBias,
+          InputDType,
+          BiasDType>(XQ, WQ, x_scale, w_scale, bias, output);
+    }
   } else {
     return f8f8bf16_rowwise_impl<
         128,
@@ -358,6 +383,7 @@ at::Tensor dispatch_fp8_rowwise_kernel(
         2,
         1,
         false,
+        false,
         FastAccum,
         UseBias,
         InputDType,
```

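For reference, the TFLOPS figures in the benchmark table above follow the conventional 2·M·N·K FLOP count for a GEMM divided by the measured kernel time. The snippet below is a sketch of that arithmetic only; it makes no claim about the actual benchmark harness, and the ~5.15 ms example time is simply back-computed from the reported 1388 TFLOPS entry.

```cpp
#include <cstdint>
#include <cstdio>

// Converts a measured GEMM runtime into TFLOPS using the conventional
// 2 * M * N * K floating-point operation count.
double gemm_tflops(int64_t M, int64_t N, int64_t K, double seconds) {
  const double flops = 2.0 * static_cast<double>(M) * static_cast<double>(N) *
      static_cast<double>(K);
  return flops / seconds / 1e12;
}

int main() {
  // M = 16384, N = 13312, K = 16384 at roughly 5.15 ms per GEMM corresponds
  // to about the 1388 TFLOPS reported for the 128x256x128 cooperative kernel.
  std::printf("%.0f TFLOPS\n", gemm_tflops(16384, 13312, 16384, 5.15e-3));
  return 0;
}
```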