Add vqa_dataset, update docs #1820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1820. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 1737744 with merge base 7744608. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@RdoubleA Requesting review.
I see where the problem is. I committed and then switched branches...
@RdoubleA Should be fine now.
Yeah, some stuff failed :( Let me fix the other PRs first and then this one.
Could you test this builder with an example VQA dataset once comments are addressed? Something like https://huggingface.co/datasets/derek-thomas/ScienceQA using the question and solution columns.
If you're able to easily configure this dataset using the builder you added, you can successfully decode the tokens of a random sample, and the image is encoded correctly, then I would declare this builder to be well-tested (in addition to the unit test you put up).
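A rough sketch of that smoke test (the transform builder, its arguments, and the decode call are assumptions based on this thread, not a confirmed API):

    from torchtune.datasets.multimodal import vqa_dataset
    from torchtune.models.llama3_2_vision import llama3_2_vision_transform

    # Assumed local tokenizer path; point this at a real checkpoint.
    model_transform = llama3_2_vision_transform(
        path="/tmp/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model",
        max_seq_len=8192,
    )
    ds = vqa_dataset(
        model_transform,
        source="derek-thomas/ScienceQA",
        column_map={"input": "question", "output": "solution", "image": "image"},
        split="train",
    )
    sample = ds[0]
    # Decode a sample's tokens to verify the question/solution text round-trips.
    print(model_transform.decode(sample["tokens"], skip_special_tokens=True))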
[
    {
        "input": "What is presented on image?",
        "output": "PyTorch logo.",
🫡
Yeah, that was an example.
*,
source: str,
column_map: Optional[Dict[str, str]] = None,
train_on_input: bool = False,
We are removing train_on_input for multimodal datasets, as it is not really used. You'll notice that multimodal_chat_dataset also has this removed.
column_map: Optional[Dict[str, str]] = None,
train_on_input: bool = False,
new_system_prompt: Optional[str] = None,
packed: bool = False,
we don't currently support packing for multimodal (yet... we are planning to add this soon)
@RdoubleA Will there be a problem adding packing for multimodal? As far as I can see, there is no issue about it. I might open a follow-up PR for it.
Yes, there is non-trivial work to ensure the cross-attention masks are also packed and collated correctly, so it would involve 1) adding logic to pack the cross-attention masks and 2) creating a new collator for packed multimodal.
We are actively discussing an overhaul to our dataloader / collate / packing logic in the near term, so I would hold off on multimodal packing support for now.
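To illustrate point 1 with a toy sketch (shapes and the helper name are illustrative, not torchtune's actual mask layout): when two samples share one packed sequence, their text-to-image cross-attention masks must be combined block-diagonally so one sample's tokens cannot attend to the other's images.

    import torch

    def pack_cross_attention_masks(mask_a: torch.Tensor, mask_b: torch.Tensor) -> torch.Tensor:
        """Combine two [text_len, num_image_chunks] masks block-diagonally."""
        text_len = mask_a.shape[0] + mask_b.shape[0]
        num_chunks = mask_a.shape[1] + mask_b.shape[1]
        packed = torch.zeros(text_len, num_chunks, dtype=mask_a.dtype)
        packed[: mask_a.shape[0], : mask_a.shape[1]] = mask_a  # sample A's block
        packed[mask_a.shape[0] :, mask_a.shape[1] :] = mask_b  # sample B's block
        return packed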
new_system_prompt: Optional[str] = None,
packed: bool = False,
filter_fn: Optional[Callable] = None,
split: str = "train",
It would also be good to include the image_dir functionality similar to multimodal chat: https://github.com/pytorch/torchtune/blob/main/torchtune/datasets/multimodal/_multimodal.py#L22. This would enable users to configure datasets that have either image paths or the images directly. For example, take a look at MathVista (https://huggingface.co/datasets/AI4Math/MathVista), a math image QA benchmark that has both the image paths and the raw images; some datasets may have one or the other.
This may require some non-trivial changes to InputOutputToMessages, however, to make it support loading images. I'm happy to provide more guidance on that, but it would be similar to the changes to ShareGPTToMessages in this PR: #1667
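For context, the gist of that change is a helper along these lines (a sketch only, with an invented function name; the real implementation is in the linked PR):

    import os
    from PIL import Image

    def load_image_column(value, image_dir=None):
        """Return a PIL image whether the column holds a path or a decoded image."""
        if isinstance(value, str):
            # The dataset stores a (possibly relative) path; resolve it against image_dir.
            path = os.path.join(image_dir, value) if image_dir else value
            return Image.open(path)
        # The dataset already stores the raw image object.
        return value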
nvm, I see you already made those changes :)
column_map = {"input": "question", "output": "answer", "image": "picture"}

Masking of the prompt during training is controlled by the ``train_on_input`` flag, which is
remove train_on_input
def multimodal_instruct_dataset(
    tokenizer: ModelTokenizer,
For multimodal datasets, this needs to be:

    - tokenizer: ModelTokenizer,
    + model_transform: Transform,

I would follow multimodal_chat_dataset more closely instead of instruct_dataset for this one.
::

    >>> from torchtune.datasets.multimodal import multimodal_instruct_dataset
    >>> dataset = multimodal_instruct_dataset(
Maybe structure it similar to this example:

    >>> model_transform = FlamingoTransform(

and FlamingoTransform should be replaced everywhere in the example with Llama3VisionTransform: https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama3_2_vision/_transform.py#L17
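A sketch of what that restructured example might look like (the Llama3VisionTransform constructor arguments here are assumptions and should be checked against the linked file):

    >>> from torchtune.models.llama3_2_vision import Llama3VisionTransform
    >>> from torchtune.datasets.multimodal import multimodal_instruct_dataset
    >>> model_transform = Llama3VisionTransform(
    ...     path="/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model",
    ...     tile_size=224,
    ...     patch_size=14,
    ... )
    >>> dataset = multimodal_instruct_dataset(
    ...     model_transform=model_transform,
    ...     source="json",
    ...     data_files="my_dataset.json",
    ...     split="train",
    ... )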
torchtune/data/_messages.py
Outdated
""" | ||
|
||
def __init__( | ||
self, | ||
train_on_input: bool = False, | ||
column_map: Optional[Dict[str, str]] = None, | ||
new_system_prompt: Optional[str] = None, | ||
is_multimodal: bool = False, |
Can we infer this, similar to ShareGPTToMessages?

    is_multimodal = "image" in sample or (
        "image" in self._column_map and self._column_map["image"] in sample
    )

so users don't need to worry about another flag.
Sure!
…modal_instruct.py Co-authored-by: Rafi Ayub <[email protected]>
…instruct.py Co-authored-by: Rafi Ayub <[email protected]>
]

expected_labels = [
    [np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(-100), np.int64(7), np.int64(5), np.int64(-1)]
Nit, but is the np.int64 strictly necessary? I know labels are long dtype, but if we are just comparing raw lists via == and not using torch tensor asserts or something, I think you shouldn't need it (and removing it makes the expected values easier to read).
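For reference, the equality does hold with plain ints, since np.int64 compares equal to int element-wise; a minimal check (not from the PR's test):

    import numpy as np

    # Element-wise == means np.int64 values compare equal to plain ints,
    # so the expected lists can drop the wrappers entirely.
    assert [np.int64(-100), np.int64(7), np.int64(5)] == [-100, 7, 5]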
Yeah, will be fixed.
Done with all the small fixes + CI fixes. Only the extra test is still needed.
@RdoubleA @ebsmothers
Output:
I'm pretty confident that it is actually working.
@ebsmothers @RdoubleA Added another fix; locally, tests pass.
Yeah, finally. Can someone approve and merge? Or should we do some extra checks?
@pytest.mark.parametrize("train_on_input", [True, False])
def test_get_item(self, tokenizer, train_on_input):
Can we remove train_on_input parametrization? Otherwise I think this is just gonna run twice identically
Yeah, good point
torchtune/data/_messages.py
Outdated
  self.column_map = column_map

  if self.column_map:
      if "input" not in self.column_map:
          raise ValueError(
-             f"Expected a key of 'input' in column_map but found {column_map.keys()}."
+             f"Expected a key of 'input' in column_map but found {self.column_map.keys()}."
          )
-     if "output" not in column_map:
+     if "output" not in self.column_map:
          raise ValueError(
-             f"Expected a key of 'output' in column_map but found {column_map.keys()}."
+             f"Expected a key of 'output' in column_map but found {self.column_map.keys()}."
          )
-     self._column_map = column_map
  else:
-     self._column_map = {"input": "input", "output": "output"}
+     self.column_map = {"input": "input", "output": "output"}

  self._column_map = column_map
This feels duplicative... why do we need to define both self.column_map and self._column_map?
Hmm, I'll think about this. The problem is that we can't fully define column_map in __init__ because of the is_multimodal check in __call__.
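One possible shape (a sketch, not this PR's code): keep only the user-supplied mapping in __init__ and resolve defaults lazily in __call__, where the sample is available for the multimodal check.

    class InputOutputToMessages:
        def __init__(self, column_map=None):
            # Store only what the user passed; defaults are resolved per sample.
            self._column_map = column_map

        def __call__(self, sample):
            column_map = self._column_map or {"input": "input", "output": "output"}
            is_multimodal = "image" in sample or (
                "image" in column_map and column_map["image"] in sample
            )
            ...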
filter_fn: Optional[Callable] = None,
split: str = "train",
**load_dataset_kwargs: Dict[str, Any],
) -> Union[SFTDataset, PackedDataset]:
Strictly speaking, if we don't support packing yet, shouldn't the return type just be SFTDataset?
Sure, the type hint needs to be fixed.
)
assert prompt == expected_tokens[i]
assert label == expected_labels[i]
assert isinstance(image[0], PngImageFile) is True
nit: do we need is True here?
Yes, the is True can probably be removed.
Pushed some fixes.
I have a feeling that we will have to separate out two versions of InputOutputToMessages, for example, so as not to break SRP.
@RdoubleA @joecummings Can we merge?
@@ -175,30 +175,54 @@ def __init__(
):
    self.train_on_input = train_on_input
I have this slight nagging feeling that this transform is over-generalized, and it wouldn't be harmful to define another transform that properly documents that the column map can include image, includes validation logic, and doesn't need to check for multimodal.
Fine to leave this as a followup, but if there's consensus here I'd like to see an issue.
I see this is the same for ShareGPTToMessages, so maybe it's fine. We need to properly document MM usage either way.
Let's consider this in future PRs.
Yeah, I see what you mean. Especially as we add more modalities, this may get bloated quickly. I was the one who initiated this design in ShareGPTToMessages, but I think it's worth reconsidering. I'll put up an issue.
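For the issue, one hypothetical direction (the class name and fields are invented here, not from the PR): a dedicated transform that owns the image column outright.

    class ImageInputOutputToMessages:
        """Variant of InputOutputToMessages that always expects an image column."""

        def __init__(self, column_map=None, image_dir=None):
            self.column_map = column_map or {
                "input": "input",
                "output": "output",
                "image": "image",
            }
            # Validation lives here instead of behind an is_multimodal branch.
            if "image" not in self.column_map:
                raise ValueError("column_map must include an 'image' key")
            self.image_dir = image_dir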
column_map = {"input": "question", "output": "answer", "image": "picture"}

Args:
    model_transform (Transform): callable that applies model-specific pre-processing to the sample.
Can we be more concrete here? For example, the cauldron dataset builder describes this arg as:

    model_transform (Transform): model-specific transform class that takes in a sample dict and applies custom
        transforms on the keys. It should consist of at minimum two components: text tokenization (called
        on the "messages" field) and image transform (called on the "images" field). The keys returned by
        the model transform should be aligned with the expected inputs into the model.
the expected column names. For example, if your dataset has columns ``"question"``,
``"answer"`` and ``"picture"`` you can use:

column_map = {"input": "question", "output": "answer", "image": "picture"}
I'm not sure if we need a concrete example of using column_map here, it is better described below.
I'm not sure it is. It looks fine to me, but we can remove this description if you want.
This looks great to me. Coming at this from an outside perspective, I'd really love to see a couple of follow-up issues to make sure we properly document the new features we add to our datasets. I've mentioned above some things I immediately found confusing when combing through the current docs in this PR.
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Probably the second idea is better. I'll run the full procedure then. Should I make these docs fixes in this PR or open a separate one?
Oh, lint failed (I accepted the changes right here). Let me fix that.
We'd be incredibly grateful if you're up for helping out here! I was speaking slightly generally just to make sure some of those tasks get done at some point, so myself (and the other maintainers) are happy to help out, particularly with a training run, and of course, reviews! In any case, separate PRs are fine.
Fixed.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #1820       +/-   ##
===========================================
- Coverage   67.05%   25.75%   -41.30%
===========================================
  Files         305      307        +2
  Lines       15937    16068      +131
===========================================
- Hits        10687     4139     -6548
- Misses       5250    11929     +6679

☔ View full report in Codecov by Sentry.
Is there anything else to add in this PR?
Appreciate you pushing this all the way through. I think this highlighted some rough parts on our multimodal pipeline which we should follow up on and created some productive discussions. Just one comment, otherwise it's good to go once CI is green.
in the filepath in ``data_files``, and set ``split="train"``. See `Hugging Face's
<https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path>`_
``load_dataset`` for more details.
image_dir (str): path to the directory containing the images as you are expected to download the COCO dataset
This should be generalized and not specific to COCO. Maybe use the same docstring from the image_dir argument in InputOutputToMessages?
Fixed
vqa_dataset, update docs
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.
#1704
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.