implement packed quantize row / dequantize row API #3915
Conversation
This pull request was exported from Phabricator. Differential Revision: D72121939
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Summary:

X-link: facebookresearch/FBGEMM#1004

API for a packed version of quantize/dequantize row. This version returns a single, contiguous tensor in memory instead of two tensors, and operates on that contiguous tensor.

Example usage:

```
a = torch.randn(shape, dtype=torch.bfloat16, device="cuda")

packed_values = quantize_fp8_packed_row_raw(
    a,
    use_triton=True,
)

# Undo scaling.
a_bf16 = dequantize_fp8_packed_row(packed_values)
torch.testing.assert_close(a_bf16, a, atol=2e-1, rtol=1e-1)
```

A third API, "quantize_fp8_packed_row", mimics the API of quantize_fp8_row (mainly for testing).

Reviewed By: jiawenliu64

Differential Revision: D72121939
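For intuition only, below is a minimal pure-PyTorch sketch of what a packed row-wise FP8 layout could look like, assuming a hypothetical per-row layout of [4-byte fp32 scale | fp8 values]. The helper names `packed_quantize_row_reference` and `packed_dequantize_row_reference` are illustrative, not part of FBGEMM, and the actual byte layout produced by the Triton kernel behind `quantize_fp8_packed_row_raw` may differ.

```
import torch

# Illustrative only: pack each row as [4-byte fp32 scale | K fp8 values].
# Requires a PyTorch build with float8 support (torch.float8_e4m3fn).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def packed_quantize_row_reference(a: torch.Tensor) -> torch.Tensor:
    """Row-wise FP8 quantization that returns one contiguous uint8 buffer."""
    a_fp32 = a.to(torch.float32)
    # Per-row scale so each row's max magnitude maps to the FP8 max value.
    row_max = a_fp32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_max / FP8_MAX
    q = (a_fp32 / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Reinterpret the scale and the FP8 values as raw bytes and
    # concatenate them row by row into a single contiguous tensor.
    scale_bytes = scale.view(torch.uint8)   # (N, 4)
    value_bytes = q.view(torch.uint8)       # (N, K)
    return torch.cat([scale_bytes, value_bytes], dim=-1).contiguous()


def packed_dequantize_row_reference(
    packed: torch.Tensor, out_dtype: torch.dtype = torch.bfloat16
) -> torch.Tensor:
    """Invert the packing above: split out the per-row scale and rescale."""
    scale = packed[..., :4].contiguous().view(torch.float32)    # (N, 1)
    q = packed[..., 4:].contiguous().view(torch.float8_e4m3fn)  # (N, K)
    return (q.to(torch.float32) * scale).to(out_dtype)


a = torch.randn(4, 128, dtype=torch.bfloat16)
packed = packed_quantize_row_reference(a)
a_bf16 = packed_dequantize_row_reference(packed)
torch.testing.assert_close(a_bf16, a, atol=2e-1, rtol=1e-1)
```

The point of the packed form is that a single allocation carries both the scales and the quantized values, so callers pass one tensor around instead of a (values, scales) pair.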
This pull request has been merged in def7bbe.