Added filter_fn to text_completion_dataset v1.0 #1429
Signed-off-by: Anurav Modak <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1429
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 6122730 with merge base 7e084d9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @AnuravModak! Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Signed-off-by: Anurav Modak <[email protected]>
Also, in the second commit I have added one more feature: if the user does not want to use any filter function (default or custom), setting skip_filter to True disables filtering. The same result can already be achieved with the first commit by passing None as the filter function. Let me know which one works best!
Thanks for putting this together. Left a few comments.
Hi, your suggestions are noted!
I will update this PR per your comments in the next 8-12 hours, as it's midnight in India.
Signed-off-by: Anurav Modak <[email protected]>
Hi @RdoubleA, I have made the changes as per your comments; they are in the third commit.
Looks great! Do you mind signing the CLA?
Heyy, I have done it, kindly check and advise!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Looks like there's still a lint error, you'll just need to run pre-commit run --all-files
Signed-off-by: Anurav Modak <[email protected]>
Hey @RdoubleA, I have run pre-commit on this! Let me know if anything else is required from my side!
@AnuravModak you'll need to add a docstring for the new parameter in
Please make sure the pre-commit passes after you add it |
Signed-off-by: Anurav Modak <[email protected]>
Signed-off-by: Anurav Modak <[email protected]>
Hey @RdoubleA, I have added the changes to the docstring, and all the cases are passing except the one below. Let me know if I need to push the changes from a branch other than main/master; if so, I will create a new branch and push it!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1429       +/-   ##
===========================================
- Coverage   70.14%   26.92%   -43.23%
===========================================
  Files         272      272
  Lines       12919    13055     +136
===========================================
- Hits         9062     3515    -5547
- Misses       3857     9540    +5683
☔ View full report in Codecov by Sentry.
Co-authored-by: Joe Cummings <[email protected]>
Thanks for all your help @AnuravModak ! |
No worries! Let me know if I can help with anything else; just tag me or assign me directly and I will look into it!
Towards #1396
I have made changes to include a filter function to remove empty lines from raw text before tokenization, similar to the implementation in _sft.py.
The updates involve adding two new parameters to the TextCompletionDataset class:
custom_filter: bool = False
filter_fn: Optional[Callable] = None
These additions address the need for a filter function to remove empty lines, thereby saving memory before tokenization.
Usage:
If custom_filter is set to True, users can provide their own filter_fn.
If custom_filter is False, the default filter function (default_filter) will automatically remove empty lines from raw text.
The filter_fn is applied in the _prepare_sample method before tokenization.
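To make the usage above concrete, here is a minimal standalone sketch of the selection logic, not the actual torchtune code. It assumes a filter is a function from raw text to cleaned text; `resolve_filter`, `prepare_sample`, and `whitespace_tokenize` are hypothetical names introduced only for illustration, and the real `TextCompletionDataset` may differ.

```python
from typing import Callable, List, Optional


def default_filter(text: str) -> str:
    """Assumed default behavior: drop lines that are empty or whitespace-only."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def resolve_filter(
    custom_filter: bool = False,
    filter_fn: Optional[Callable[[str], str]] = None,
) -> Callable[[str], str]:
    """Pick the user's filter_fn when custom_filter is True, else the default."""
    if custom_filter:
        if filter_fn is None:
            raise ValueError("custom_filter=True requires a filter_fn")
        return filter_fn
    return default_filter


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer."""
    return text.split()


def prepare_sample(
    text: str,
    tokenize: Callable[[str], List[str]],
    custom_filter: bool = False,
    filter_fn: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Apply the selected filter to the raw text, then tokenize the result."""
    cleaned = resolve_filter(custom_filter, filter_fn)(text)
    return tokenize(cleaned)


# Default path: blank lines are removed before tokenization.
tokens = prepare_sample("hello\n\n   \nworld", whitespace_tokenize)

# Custom path: the user-supplied function replaces the default filter.
lowered = prepare_sample(
    "Hello\n\nWorld", whitespace_tokenize, custom_filter=True, filter_fn=str.lower
)
```

In this sketch the flag only decides which callable is used, so the same effect could be had with a single optional filter_fn argument, which is the trade-off raised in the paragraphs that follow.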
Adding this flag gives much more flexibility: personally, I've often encountered situations where internal filtration was necessary, but there were also cases where I needed to adjust the raw corpus according to specific requirements.
If the additional custom_filter flag adds unnecessary complexity and you feel it's unwanted, let me know and I will remove it in a new commit on this PR!
@RdoubleA kindly review and advise!