change saving logic #2182
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2182
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 446e8cc with merge base cdf5ea2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -330,6 +329,7 @@ def copy_files(
    output_dir: Union[str, Path],
    *,
    ignore_suffixes: Optional[List[str]] = None,
    max_file_size_mb: int = 100,
Is there actually a way for someone to modify these values from the CLI/config? Seems like no, right?
No, but I don't see it being an issue. It shouldn't happen, and if users really want to copy a 100MB file to every epoch, they can do it manually.
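For readers following the thread, here is a minimal sketch of how a size cap like `max_file_size_mb` might behave. It is not torchtune's actual `copy_files`: the name `copy_files_sketch`, the `input_dir` parameter, the recursive walk, and the choice to skip oversized files silently are all assumptions made for illustration. Only `output_dir`, `ignore_suffixes`, and `max_file_size_mb` come from the diff above.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Union


def copy_files_sketch(
    input_dir: Union[str, Path],  # assumed parameter, not shown in the diff
    output_dir: Union[str, Path],
    *,
    ignore_suffixes: Optional[List[str]] = None,
    max_file_size_mb: int = 100,
) -> List[Path]:
    """Hypothetical helper: copy files, skipping ignored suffixes and large files."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    max_bytes = max_file_size_mb * 1024 * 1024
    copied: List[Path] = []

    # sorted() materializes the listing before any file is copied.
    for src in sorted(input_dir.rglob("*")):
        if not src.is_file():
            continue
        # Skip files whose name ends with an ignored suffix.
        if ignore_suffixes and any(src.name.endswith(s) for s in ignore_suffixes):
            continue
        # Skip files above the size cap (assumed behavior: skip silently).
        if src.stat().st_size > max_bytes:
            continue
        dst = output_dir / src.relative_to(input_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

With a hard-coded default like this, a user who does want a larger file in every epoch folder would copy it by hand, which matches the reply above.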
Stamping to unblock. Please make sure to test that after the changes to the directory structure both resume from checkpoint and HF from_pretrained functionality are unaffected.
Co-authored-by: Felipe Mello <[email protected]>
Context
What is the purpose of this PR? Is it to
Changelog
Test plan
tree -a /tmp/torchtune/llama3_2_3B/full_single_device

testing with HF

resuming from ckpt
