
Added filter_fn to text_completion_dataset v1.0 #1429


Merged
merged 7 commits into from
Aug 30, 2024

Conversation

AnuravModak
Contributor

@AnuravModak AnuravModak commented Aug 28, 2024

Towards #1396

I have made changes to include a filter function to remove empty lines from raw text before tokenization, similar to the implementation in _sft.py.

The updates involve adding two new parameters to the TextCompletionDataset class:

custom_filter: bool = False
filter_fn: Optional[Callable] = None
These additions address the need for a filter function to remove empty lines, thereby saving memory before tokenization.

Usage:

If custom_filter is set to True, users can provide their own filter_fn.
If custom_filter is False, the default filter function (default_filter) will automatically remove empty lines from raw text.
The filter_fn is applied in the _prepare_sample method before tokenization.
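The behavior described above could be sketched roughly as follows. This is a minimal stand-in, not torchtune's actual TextCompletionDataset; the class body and the `"text"` column name are assumptions, while the parameter names follow the PR description:

```python
from typing import Callable, List, Optional


def default_filter(sample: dict) -> bool:
    """Keep only samples whose raw text is non-empty after stripping."""
    return bool(sample.get("text", "").strip())


class TextCompletionDatasetSketch:
    """Simplified, hypothetical stand-in for the proposed changes."""

    def __init__(
        self,
        data: List[dict],
        custom_filter: bool = False,
        filter_fn: Optional[Callable] = None,
    ):
        # If custom_filter is True, use the user-provided filter_fn;
        # otherwise fall back to default_filter, which drops empty lines.
        self._filter = filter_fn if custom_filter else default_filter
        self._data = data

    def _prepare_sample(self, sample: dict) -> Optional[dict]:
        # The filter is applied here, before tokenization would happen.
        if self._filter is not None and not self._filter(sample):
            return None
        return sample  # tokenization would follow in the real dataset

    def samples(self) -> List[dict]:
        return [s for s in map(self._prepare_sample, self._data) if s is not None]


data = [{"text": "hello"}, {"text": "   "}, {"text": "world"}]
ds = TextCompletionDatasetSketch(data)
print([s["text"] for s in ds.samples()])  # -> ['hello', 'world']
```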

Adding this flag provides a lot more flexibility: I've often encountered situations where internal filtering was necessary, but also cases where I needed to adjust the raw corpus to specific requirements.

If the additional custom_filter flag adds unnecessary complexity and you feel it's unwanted, let me know and I will remove it and update the PR with a new commit!

@RdoubleA kindly review and advise!



pytorch-bot bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1429

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6122730 with merge base 7e084d9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @AnuravModak!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@AnuravModak
Contributor Author

AnuravModak commented Aug 28, 2024

Also, in the second commit I have added one more feature: if the user does not want to use any filter function (default or custom), they can achieve this simply by setting skip_filter to True!

But this can also be achieved with the first commit itself, by passing None as the filter function!

Let me know which one works best!
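The equivalence being discussed could be illustrated like this. Neither API is torchtune's confirmed interface; the function and parameter names are hypothetical and only mirror the two options in the comment:

```python
from typing import Callable, List, Optional


def apply_filter(
    data: List[dict],
    filter_fn: Optional[Callable] = None,
    skip_filter: bool = False,
) -> List[dict]:
    # Option from the second commit: an explicit skip_filter flag.
    # Option from the first commit: simply pass filter_fn=None.
    # Both paths leave the data untouched.
    if skip_filter or filter_fn is None:
        return list(data)
    return [s for s in data if filter_fn(s)]


data = [{"text": "a"}, {"text": ""}]
non_empty = lambda s: bool(s["text"].strip())

# Both calls keep every sample, i.e. skip_filter=True behaves like filter_fn=None:
assert apply_filter(data, non_empty, skip_filter=True) == data
assert apply_filter(data, filter_fn=None) == data
# With an actual filter, empty samples are dropped:
assert apply_filter(data, non_empty) == [{"text": "a"}]
```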

Collaborator

@RdoubleA RdoubleA left a comment


Thanks for putting this together. Left a few comments.

@AnuravModak
Contributor Author

AnuravModak commented Aug 28, 2024 via email

@AnuravModak
Contributor Author

Hi @RdoubleA, I have made the changes as per your comments, and the third commit follows torchtune/datasets/_sft.py. I have also removed the code inside _prepare_sample as you requested, to keep it consistent.

@RdoubleA
Collaborator

Looks great! Do you mind signing the CLA?

@AnuravModak
Contributor Author

Looks great! Do you mind signing the CLA?

Hey, I have done it; kindly check and advise!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 29, 2024
@RdoubleA
Collaborator

Looks like there's still a lint error, you'll just need to run pre-commit run --all-files

@AnuravModak
Contributor Author

AnuravModak commented Aug 29, 2024 via email

@AnuravModak
Contributor Author

Hey @RdoubleA, I have run the pre-commit script on this! Let me know if anything else is required from my side!

@RdoubleA
Collaborator

@AnuravModak you'll need to add a docstring for the new parameter in TextCompletionDataset and text_completion_dataset. You can just use the same docstring from SFTDataset for filter_fn:

filter_fn (Optional[Callable]): callable used to filter the dataset prior to any pre-processing. See
      the Hugging Face `docs <https://huggingface.co/docs/datasets/v2.20.0/process#select-and-filter>`_ for more
      details.

Please make sure the pre-commit passes after you add it
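A filter_fn of the kind the docstring describes is just a predicate over one sample; with Hugging Face `datasets` installed, the same callable would be passed to `Dataset.filter(...)` as in the linked docs. A dependency-free sketch (the `"text"` column name is an assumption for illustration):

```python
def drop_empty(sample: dict) -> bool:
    """Keep samples whose "text" field is non-empty after stripping."""
    return bool(sample.get("text", "").strip())


# With Hugging Face datasets this predicate would be used as:
#   filtered = ds.filter(drop_empty)
# Here we apply it to plain dicts to show the behavior:
rows = [{"text": "hello"}, {"text": ""}, {"text": "  "}, {"text": "world"}]
kept = [r for r in rows if drop_empty(r)]
print([r["text"] for r in kept])  # -> ['hello', 'world']
```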

@AnuravModak
Contributor Author

Hey @RdoubleA, I have added the docstring changes and all the checks are passing except the one below. Let me know if I need to push the changes from a branch other than main/master; I will create a new branch and push it!

don't commit to branch...................................................Failed
- hook id: no-commit-to-branch

@codecov-commenter

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 26.92%. Comparing base (7e084d9) to head (5ea9a86).
Report is 13 commits behind head on main.

Files with missing lines                | Patch %  | Lines
torchtune/datasets/_text_completion.py  | 33.33%   | 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (7e084d9) and HEAD (5ea9a86). Click for more details.

HEAD has 2 uploads fewer than BASE:
Flag | BASE (7e084d9) | HEAD (5ea9a86)
     | 4              | 2
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1429       +/-   ##
===========================================
- Coverage   70.14%   26.92%   -43.23%     
===========================================
  Files         272      272               
  Lines       12919    13055      +136     
===========================================
- Hits         9062     3515     -5547     
- Misses       3857     9540     +5683     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@RdoubleA RdoubleA merged commit 576fafe into pytorch:main Aug 30, 2024
20 checks passed
@RdoubleA
Collaborator

Thanks for all your help @AnuravModak !

@AnuravModak
Contributor Author

Thanks for all your help @AnuravModak !

No worries! Let me know if I can help with anything else; just tag me or assign me directly and I will look into it!
