-
Notifications
You must be signed in to change notification settings - Fork 647
Instruct and chat datasets docs #1571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1571
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 90cb020 with merge base e3718e8 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
docs/source/basics/chat_datasets.rst
Outdated
] | ||
} | ||
|
||
If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to link somewhere else for this at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, after the messages docs lands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Rafi! It is well written, but i will be a parrot and repeat myself:
IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:
Dataset Type one
<2 line summary>
<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>
Dataset Type two
<2 line summary>
<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Rafi! It is well written, but i will be a parrot and repeat myself:
IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:
docs/source/basics/chat_datasets.rst
Outdated
Renaming columns | ||
---------------- | ||
|
||
You can remap column names similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`column_map` for more info. | ||
|
||
Training on user input | ||
---------------------- | ||
|
||
You can train on user input similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`train_on_input` for more info. | ||
|
||
Adding system prompts | ||
--------------------- | ||
|
||
You can set a system prompt for your dataset similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`system_prompt` for more info. | ||
|
||
Chat templates | ||
-------------- | ||
|
||
Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these might merit a tiny bit more explanation beyond just the link to the instruct docs. E.g. for system prompts: is it the first message of the chat or does it precede each user message? I guess chat templates is kind of a 1:1 mapping with our implementation, but I can imagine that being confusing for folks who are used to more of a separation between how chat and instruct data is prepared
@RdoubleA are you mostly done with the changes for me to emulate with the preference/text completion docs? |
Yes, this should be the bulk of it |
docs/source/basics/chat_datasets.rst
Outdated
conversation_column: conversations | ||
conversation_style: sharegpt | ||
train_on_input: True | ||
new_system_prompt: You are an AI assistant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is missing from the above example? or no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, let me remove this. Mistral doesn't use system prompts
split: train | ||
|
||
|
||
Loading local and remote chat datasets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not already covered above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I guess. was thinking it's worth having a direct section on it, vs just implicit in the example.
] | ||
} | ||
|
||
If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you point to a custom message transform for chat data in the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I don't think we have one...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Context
What is the purpose of this PR? Is it to
Details around how to set up custom instruct and chat datasets and the expected format for each of these. As per #1529.
Test plan
doc build