Add `iter` singular value into TBE optimizer state #3228

csmiler · 2024-10-07T15:59:48Z

Summary:
When TorchRec tracks the optimizer states for sharded embedding tables, it assumes each state is either point-wise (same shape as the embedding table, for example, Adam's exp_avg) or row-wise (same length as the embedding hash size, for example, rowwise_adagrad's momentum/sum). However, other formats exist, such as a single value per table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the iter number is a single-value tensor, which cannot be tracked and checkpointed properly (this also means that there is a bug in Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!).
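To illustrate why a lost iter counter matters, here is a minimal sketch based on the textbook Adam bias correction, not on the TBE kernels themselves. The function name `adam_step_scale` and the hyperparameter values are assumptions made for illustration only. It compares the scale applied to the update when training resumes with the correct counter versus a counter that was reset because it was never checkpointed.

```python
import math


def adam_step_scale(lr: float, beta1: float, beta2: float, step: int) -> float:
    """Scale that textbook Adam applies to m / sqrt(v) at `step` (epsilon ignored).

    Hypothetical helper: it folds both bias-correction terms into one factor.
    """
    return lr * math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)


# Illustrative hyperparameters (assumed, not taken from the PR).
lr, beta1, beta2 = 0.01, 0.9, 0.999

# Resuming with the counter restored from the checkpoint.
scale_restored = adam_step_scale(lr, beta1, beta2, step=10_001)

# Resuming with the counter silently reset, while exp_avg and exp_avg_sq
# were restored: the bias correction is applied as if training just started.
scale_reset = adam_step_scale(lr, beta1, beta2, step=1)

print(f"restored counter: {scale_restored:.6f}")
print(f"reset counter:    {scale_reset:.6f}")
```

With these assumed values, the reset counter yields a noticeably smaller scale than the restored one, so the first updates after reloading differ from what uninterrupted training would have produced.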

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise state for column-wise sharded tables is concatenated along the row dimension).

By doing so, the single-value iter can be checkpointed just like other states, so that states reload correctly and training can continue where it left off.
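The replication idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the TorchRec implementation: `ShardMeta` is a hypothetical stand-in for a shard descriptor with offsets, sizes, and a placement, and the helper names and placement strings are invented for this example. The single value is treated as a 1-D state with one row per shard, so each shard saves and reloads its own replica.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ShardMeta:
    """Hypothetical shard descriptor: where one shard lives in the global state."""
    shard_offsets: List[int]
    shard_sizes: List[int]
    placement: str


def single_value_shard_metadata(placements: List[str]) -> List[ShardMeta]:
    """Describe a single-value state as a 1-D state with one row per shard."""
    return [
        ShardMeta(shard_offsets=[i], shard_sizes=[1], placement=p)
        for i, p in enumerate(placements)
    ]


def save_single_value(local: Dict[str, int], metas: List[ShardMeta]) -> List[int]:
    """Write each shard's replica of the value into its row of the global state."""
    global_state = [0] * len(metas)
    for meta in metas:
        global_state[meta.shard_offsets[0]] = local[meta.placement]
    return global_state


def load_single_value(global_state: List[int], metas: List[ShardMeta]) -> Dict[str, int]:
    """Read each shard's row back into a per-placement value."""
    return {meta.placement: global_state[meta.shard_offsets[0]] for meta in metas}


# Two shards of one table, each holding the same iteration counter (assumed values).
placements = ["rank:0/cuda:0", "rank:1/cuda:1"]
metas = single_value_shard_metadata(placements)
local_iter = {p: 1234 for p in placements}

checkpoint = save_single_value(local_iter, metas)
restored = load_single_value(checkpoint, metas)
print(checkpoint, restored)
```

Replicating the value costs one extra element per shard, and in exchange the state can go through the same row-wise sharded path as the other optimizer states.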

Differential Revision: D63909559

netlify · 2024-10-07T16:00:03Z

❌ Deploy Preview for pytorch-fbgemm-docs failed.

| Name | Link |
|---|---|
| 🔨 Latest commit | `4c23038` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6708036ccbfc730008d25c18 |

facebook-github-bot · 2024-10-07T16:00:04Z

This pull request was exported from Phabricator. Differential Revision: D63909559

Summary:
X-link: pytorch/torchrec#2474
X-link: facebookresearch/FBGEMM#326

When TorchRec tracks the optimizer states for sharded embedding tables, it assumes each state is either point-wise (same shape as the embedding table, for example, Adam's exp_avg) or row-wise (same length as the embedding hash size, for example, rowwise_adagrad's momentum/sum). However, other formats exist, such as a single value per table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the `iter` number is a single-value tensor, which **cannot be tracked and checkpointed properly** (this also means that there is a bug in Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!).

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise state for column-wise sharded tables is concatenated along the row dimension).

By doing so, the single-value `iter` can be checkpointed just like other states, so that states reload correctly and training can continue where it left off.

This diff checkpoints `iter` for rowwise_adagrad with GWD. The next diff will checkpoint `iter` for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb.

Reviewed By: spcyppt

Differential Revision: D63909559


facebook-github-bot · 2024-10-09T23:51:19Z

This pull request was exported from Phabricator. Differential Revision: D63909559


facebook-github-bot · 2024-10-10T12:51:34Z

This pull request was exported from Phabricator. Differential Revision: D63909559


facebook-github-bot · 2024-10-10T16:40:32Z

This pull request was exported from Phabricator. Differential Revision: D63909559

facebook-github-bot · 2024-10-11T05:09:37Z

This pull request has been merged in f9f0600.

Summary:
Pull Request resolved: #2474
X-link: pytorch/FBGEMM#3228
X-link: facebookresearch/FBGEMM#326

When TorchRec tracks the optimizer states for sharded embedding tables, it assumes each state is either point-wise (same shape as the embedding table, for example, Adam's exp_avg) or row-wise (same length as the embedding hash size, for example, rowwise_adagrad's momentum/sum). However, other formats exist, such as a single value per table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the `iter` number is a single-value tensor, which **cannot be tracked and checkpointed properly** (this also means that there is a bug in Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!).

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise state for column-wise sharded tables is concatenated along the row dimension).

By doing so, the single-value `iter` can be checkpointed just like other states, so that states reload correctly and training can continue where it left off.

This diff checkpoints `iter` for rowwise_adagrad with GWD. The next diff will checkpoint `iter` for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb.

Reviewed By: iamzainhuda, spcyppt

Differential Revision: D63909559

fbshipit-source-id: e14c1dc3e8f87bfc4cc95f2321b358526719d88f


facebook-github-bot added the cla signed label Oct 7, 2024

facebook-github-bot added the fb-exported label Oct 7, 2024

csmiler force-pushed the export-D63909559 branch from d7fa453 to 2dae0fe on October 9, 2024 23:50

csmiler force-pushed the export-D63909559 branch from 2dae0fe to 07ab246 on October 10, 2024 12:51

csmiler force-pushed the export-D63909559 branch from 07ab246 to 4c23038 on October 10, 2024 16:40

facebook-github-bot closed this in f9f0600 Oct 11, 2024

facebook-github-bot added the Merged label Oct 11, 2024
