
Added filter_fn to text_completion_dataset v1.0 #1429


Merged
merged 7 commits into from
Aug 30, 2024

Conversation

AnuravModak
Contributor

@AnuravModak AnuravModak commented Aug 28, 2024

Towards #1396

I have made changes to include a filter function to remove empty lines from raw text before tokenization, similar to the implementation in _sft.py.

The updates involve adding two new parameters to the TextCompletionDataset class:

custom_filter: bool = False
filter_fn: Optional[Callable] = None
These additions address the need for a filter function to remove empty lines, thereby saving memory before tokenization.

Usage:

If custom_filter is set to True, users can provide their own filter_fn.
If custom_filter is False, the default filter function (default_filter) will automatically remove empty lines from raw text.
The filter_fn is applied in the _prepare_sample method before tokenization.
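The behavior described above could be sketched roughly as follows. This is a minimal stand-in, not torchtune's actual TextCompletionDataset; the class body and the `"text"` column name are assumptions, while the parameter names follow the PR description:

```python
from typing import Callable, List, Optional


def default_filter(sample: dict) -> bool:
    """Keep only samples whose raw text is non-empty after stripping."""
    return bool(sample.get("text", "").strip())


class TextCompletionDatasetSketch:
    """Simplified, hypothetical stand-in for the proposed changes."""

    def __init__(
        self,
        data: List[dict],
        custom_filter: bool = False,
        filter_fn: Optional[Callable] = None,
    ):
        # If custom_filter is True, use the user-provided filter_fn;
        # otherwise fall back to default_filter, which drops empty lines.
        self._filter = filter_fn if custom_filter else default_filter
        self._data = data

    def _prepare_sample(self, sample: dict) -> Optional[dict]:
        # The filter is applied here, before tokenization would happen.
        if self._filter is not None and not self._filter(sample):
            return None
        return sample  # tokenization would follow in the real dataset

    def samples(self) -> List[dict]:
        return [s for s in map(self._prepare_sample, self._data) if s is not None]


data = [{"text": "hello"}, {"text": "   "}, {"text": "world"}]
ds = TextCompletionDatasetSketch(data)
print([s["text"] for s in ds.samples()])  # -> ['hello', 'world']
```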

Adding this flag provides a lot more flexibility: I've often encountered situations where internal filtering was necessary, but also cases where I needed to adjust the raw corpus to specific requirements.

If the additional custom_filter flag adds unnecessary complexity and you feel it's unwanted, let me know and I will remove it and update the PR with a new commit!

@RdoubleA kindly review and advise!



pytorch-bot bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1429

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6122730 with merge base 7e084d9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @AnuravModak!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@AnuravModak
Contributor Author

AnuravModak commented Aug 28, 2024

Also, in the second commit I have added one more feature: if the user does not want to use any filter function (default or custom), they can achieve this simply by setting skip_filter to True!

But this can also be achieved with the first commit itself, by passing None as the filter function!

Let me know which one works best!
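The equivalence being discussed could be illustrated like this. Neither API is torchtune's confirmed interface; the function and parameter names are hypothetical and only mirror the two options in the comment:

```python
from typing import Callable, List, Optional


def apply_filter(
    data: List[dict],
    filter_fn: Optional[Callable] = None,
    skip_filter: bool = False,
) -> List[dict]:
    # Option from the second commit: an explicit skip_filter flag.
    # Option from the first commit: simply pass filter_fn=None.
    # Both paths leave the data untouched.
    if skip_filter or filter_fn is None:
        return list(data)
    return [s for s in data if filter_fn(s)]


data = [{"text": "a"}, {"text": ""}]
non_empty = lambda s: bool(s["text"].strip())

# Both calls keep every sample, i.e. skip_filter=True behaves like filter_fn=None:
assert apply_filter(data, non_empty, skip_filter=True) == data
assert apply_filter(data, filter_fn=None) == data
# With an actual filter, empty samples are dropped:
assert apply_filter(data, non_empty) == [{"text": "a"}]
```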

Collaborator

@RdoubleA RdoubleA left a comment


Thanks for putting this together. Left a few comments.

@AnuravModak
Contributor Author

AnuravModak commented Aug 28, 2024 via email

@AnuravModak
Contributor Author

Hi @RdoubleA, I have made the changes as per your comments, and the third commit follows torchtune/datasets/_sft.py. I have also removed the code inside _prepare_sample as you requested, to keep it consistent.

@RdoubleA
Collaborator

Looks great! Do you mind signing the CLA?

@AnuravModak
Contributor Author

Looks great! Do you mind signing the CLA?

Hey, I have done it; kindly check and advise!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 29, 2024
@RdoubleA
Collaborator

Looks like there's still a lint error, you'll just need to run pre-commit run --all-files

@AnuravModak
Contributor Author

AnuravModak commented Aug 29, 2024 via email

@AnuravModak
Contributor Author

Hey @RdoubleA, I have run the pre-commit script on this! Let me know if anything else is required from my side!

@RdoubleA
Collaborator

@AnuravModak you'll need to add a docstring for the new parameter in TextCompletionDataset and text_completion_dataset. You can just use the same docstring from SFTDataset for filter_fn:

filter_fn (Optional[Callable]): callable used to filter the dataset prior to any pre-processing. See
      the Hugging Face `docs <https://huggingface.co/docs/datasets/v2.20.0/process#select-and-filter>`_ for more
      details.

Please make sure the pre-commit passes after you add it
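A filter_fn of the kind the docstring describes is just a predicate over one sample; with Hugging Face `datasets` installed, the same callable would be passed to `Dataset.filter(...)` as in the linked docs. A dependency-free sketch (the `"text"` column name is an assumption for illustration):

```python
def drop_empty(sample: dict) -> bool:
    """Keep samples whose "text" field is non-empty after stripping."""
    return bool(sample.get("text", "").strip())


# With Hugging Face datasets this predicate would be used as:
#   filtered = ds.filter(drop_empty)
# Here we apply it to plain dicts to show the behavior:
rows = [{"text": "hello"}, {"text": ""}, {"text": "  "}, {"text": "world"}]
kept = [r for r in rows if drop_empty(r)]
print([r["text"] for r in kept])  # -> ['hello', 'world']
```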

@AnuravModak
Contributor Author

Hey @RdoubleA, I have added the docstring changes and all the checks are passing except the one below. Let me know if I need to push the changes from a branch other than main/master; I will create a new branch and push it!

don't commit to branch...................................................Failed
- hook id: no-commit-to-branch

@codecov-commenter

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 26.92%. Comparing base (7e084d9) to head (5ea9a86).
Report is 13 commits behind head on main.

Files with missing lines                | Patch %  | Lines
torchtune/datasets/_text_completion.py  | 33.33%   | 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (7e084d9) and HEAD (5ea9a86). Click for more details.

HEAD has 2 uploads fewer than BASE:
Flag | BASE (7e084d9) | HEAD (5ea9a86)
     | 4              | 2
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1429       +/-   ##
===========================================
- Coverage   70.14%   26.92%   -43.23%     
===========================================
  Files         272      272               
  Lines       12919    13055      +136     
===========================================
- Hits         9062     3515     -5547     
- Misses       3857     9540     +5683     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@RdoubleA RdoubleA merged commit 576fafe into pytorch:main Aug 30, 2024
20 checks passed
@RdoubleA
Collaborator

Thanks for all your help @AnuravModak !

@AnuravModak
Contributor Author

Thanks for all your help @AnuravModak !

No worries! Let me know if I can help with anything else; just tag me or assign me directly and I will look into it!
