pass in kernel tbe id into rocksdb wrapper #2930

Closed
wants to merge 1 commit into from

Conversation

duduyi2013
Contributor

Summary:
We need this because we constantly see port conflict errors during RocksDB initialization. Before this diff, we called getFreePort to get an available port. For each SSD TBE we create 32 RocksDB shards, so 256 ports are needed per host in total.
This worked fine with 4 hosts, but it breaks down when running a 16-host training job, because we need to make sure that none of the 16 hosts hits the corner case where multiple DB shards are assigned the same free port.

Differential Revision: D60635718
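
For illustration only, here is a minimal sketch of how a TBE id could be turned into conflict-free ports. The summary does not show the formula the diff actually uses, so everything below is an assumption: the `shard_port` and `host_ports` helpers, the `BASE_PORT` value, and the layout of 8 TBEs per host (inferred from 256 ports / 32 shards). The idea is that each (TBE id, shard id) pair maps to a fixed port, so no two shards on a host can be handed the same one.

```python
# Hypothetical sketch: derive a port from the TBE id and shard id instead of
# asking the OS for a free port. None of these names come from the diff.

SHARDS_PER_TBE = 32   # from the PR summary: 32 RocksDB shards per SSD TBE
BASE_PORT = 20000     # assumption: any base that leaves room below 65535


def shard_port(tbe_id: int, shard_id: int,
               base_port: int = BASE_PORT,
               shards_per_tbe: int = SHARDS_PER_TBE) -> int:
    """Return a port that is unique for each (tbe_id, shard_id) pair."""
    if tbe_id < 0:
        raise ValueError("tbe_id must be non-negative")
    if not 0 <= shard_id < shards_per_tbe:
        raise ValueError("shard_id out of range")
    # Each TBE owns a contiguous block of shards_per_tbe ports.
    port = base_port + tbe_id * shards_per_tbe + shard_id
    if port > 65535:
        raise ValueError("port exceeds the valid TCP/UDP range")
    return port


def host_ports(num_tbes: int) -> list[int]:
    """List every port needed on one host with num_tbes SSD TBEs."""
    return [shard_port(t, s)
            for t in range(num_tbes)
            for s in range(SHARDS_PER_TBE)]


# Assumed layout: 8 TBEs per host, giving the 256 ports from the summary.
ports = host_ports(8)
assert len(ports) == 256
assert len(set(ports)) == 256  # no two shards share a port
```

With a mapping like this, the ports used on a host depend only on the ids, not on timing, which removes the race between shards that each ask for a free port at the same moment. It does not protect against an unrelated process already holding one of the ports.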


netlify bot commented Aug 2, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 639a2f7
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66b0128ebcb31a0008ce60a6
😎 Deploy Preview https://deploy-preview-2930--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D60635718

duduyi2013 added a commit to duduyi2013/FBGEMM that referenced this pull request Aug 3, 2024
Summary:
X-link: facebookresearch/FBGEMM#32

Pull Request resolved: pytorch#2930

Reviewed By: sryap

Differential Revision: D60635718

duduyi2013 added a commit to duduyi2013/FBGEMM that referenced this pull request Aug 4, 2024
Summary:
X-link: facebookresearch/FBGEMM#32

Pull Request resolved: pytorch#2930

Reviewed By: sryap

Differential Revision: D60635718

@facebook-github-bot
Contributor

This pull request has been merged in 6607072.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#32

X-link: pytorch#2930

Reviewed By: sryap

Differential Revision: D60635718

fbshipit-source-id: 606216a4a2d5a43f82f7bd681477537413bd372a