
Sample packing for ConcatDataset #2278


Merged: 3 commits into pytorch:main on Jan 18, 2025

Conversation

@ebsmothers (Contributor) commented Jan 17, 2025

Currently we error when any individual dataset in ConcatDataset has packed=True. This goes back to #1708: because packed and unpacked datasets require different collators, and packing is done on the individual datasets rather than on the ConcatDataset, we can't guarantee that a single collator for ConcatDataset is well-defined in all cases.

Fortunately, someone who enables packing on one dataset almost certainly wants it on all of them, and supporting this case is trivial. So this PR relaxes the check in ConcatDataset to also allow the case where every dataset is packed.
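Conceptually, the relaxed check looks something like the sketch below (a minimal illustration, not the exact torchtune source; detecting packed datasets via isinstance(ds, PackedDataset) is an assumption):

from typing import List

from torch.utils.data import Dataset
from torchtune.datasets import PackedDataset

def _validate_packing(datasets: List[Dataset]) -> bool:
    # Hypothetical sketch: previously, any packed dataset raised an error;
    # now only a mix of packed and unpacked datasets is rejected.
    packed_flags = [isinstance(ds, PackedDataset) for ds in datasets]
    if any(packed_flags) and not all(packed_flags):
        raise ValueError(
            "ConcatDataset requires either all datasets to be packed or "
            "none of them; got a mix of packed and unpacked datasets."
        )
    return all(packed_flags)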

A couple of things that can be revisited here:

  1. Given that dataset packing in ConcatDataset is all-or-nothing, it's probably more natural to define the packed attribute there instead. However, then we get into questions of whether we pack before or after merging. Relatedly,
  2. this implementation packs each individual dataset and then merges, which means packs cannot combine samples from different datasets.

Test plan:

Added a unit test:

pytest tests/torchtune/datasets/test_concat_dataset.py
...
======== 5 passed in 0.10s ==========

Also ran one of the updated recipes with the following config updates:

tokenizer:
  max_seq_len: 512

dataset:
  - _component_: torchtune.datasets.alpaca_dataset
    packed: True
  - _component_: torchtune.datasets.alpaca_cleaned_dataset
    packed: True

(Per the Hugging Face datasets pages, the Alpaca dataset has ~52k samples and Alpaca-cleaned has ~51.8k samples.)
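For reference, a roughly equivalent construction in Python (a sketch under assumptions: the tokenizer setup is illustrative, and the exact builder signatures should be checked against the torchtune docs):

from torchtune.datasets import ConcatDataset, alpaca_dataset, alpaca_cleaned_dataset
from torchtune.models.llama3 import llama3_tokenizer

# Any model tokenizer works here; max_seq_len caps the packed sequence length
tokenizer = llama3_tokenizer("/path/to/tokenizer.model", max_seq_len=512)

# With packed=True each builder returns a PackedDataset, so ConcatDataset
# sees all-packed inputs and the relaxed check passes
ds = ConcatDataset(
    [
        alpaca_dataset(tokenizer, packed=True),
        alpaca_cleaned_dataset(tokenizer, packed=True),
    ]
)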

Printing dataset lengths at different points:

# Inside PackedDataset, prior to packing
>>> len(self.ds) # alpaca dataset
52002
>>> len(self.ds) # alpaca-cleaned dataset
51760

# Inside PackedDataset, after packing
>>> len(self) # alpaca dataset
13348
>>> len(self) # alpaca-cleaned dataset
25891

# Inside ConcatDataset, after construction
>>> len(self)
39239
>>> self._indexes
[(0, 13348, 0), (13348, 39239, 1)]
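The _indexes entries are consistent with simple cumulative offsets over the packed lengths. A quick sanity check in plain Python, using the numbers printed above (the (start, stop, dataset_index) layout is inferred from the output):

packed_lengths = [13348, 25891]  # alpaca, alpaca-cleaned after packing

indexes, start = [], 0
for i, n in enumerate(packed_lengths):
    indexes.append((start, start + n, i))
    start += n

assert indexes == [(0, 13348, 0), (13348, 39239, 1)]
assert start == 39239  # total length of the ConcatDataset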


pytorch-bot bot commented Jan 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2278

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 50b72a1 with merge base b68cddd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jan 17, 2025
@@ -90,3 +90,33 @@ def test_packed_dataset(self, torch_datasets):

with pytest.raises(ValueError):
concated_dataset = ConcatDataset(torch_datasets)

def test_all_packed_datasets(self, torch_datasets):
Collaborator

Should we also test that the error is raised when some datasets are unpacked?

Contributor Author

Oh yeah, it already exists in the previous unit test.

codecov-commenter commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 8.33333% with 22 lines in your changes missing coverage. Please review.

Project coverage is 23.94%. Comparing base (baae232) to head (50b72a1).
Report is 245 commits behind head on main.

Files with missing lines                          Patch %   Lines
tests/torchtune/datasets/test_concat_dataset.py   15.38%    11 Missing ⚠️
torchtune/datasets/_concat.py                      0.00%     4 Missing ⚠️
recipes/full_finetune_distributed.py               0.00%     1 Missing ⚠️
recipes/full_finetune_single_device.py             0.00%     1 Missing ⚠️
recipes/knowledge_distillation_distributed.py      0.00%     1 Missing ⚠️
recipes/lora_finetune_distributed.py               0.00%     1 Missing ⚠️
recipes/lora_finetune_single_device.py             0.00%     1 Missing ⚠️
recipes/qat_distributed.py                         0.00%     1 Missing ⚠️
recipes/qat_lora_finetune_distributed.py           0.00%     1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (baae232) and HEAD (50b72a1): HEAD has 2 fewer uploads than BASE.

           BASE (baae232)   HEAD (50b72a1)
Uploads    3                1
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2278       +/-   ##
===========================================
- Coverage   64.30%   23.94%   -40.37%     
===========================================
  Files         352      357        +5     
  Lines       20566    21151      +585     
===========================================
- Hits        13225     5064     -8161     
- Misses       7341    16087     +8746     

☔ View full report in Codecov by Sentry.

@ebsmothers merged commit 1036095 into pytorch:main Jan 18, 2025
17 checks passed
@RdoubleA mentioned this pull request Jan 21, 2025