Skip to content

Instruct and chat datasets docs #1571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 20, 2024
Merged

Conversation

RdoubleA
Copy link
Collaborator

@RdoubleA RdoubleA commented Sep 12, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Details around how to set up custom instruct and chat datasets and the expected format for each of these. As per #1529.

Test plan

doc build

Copy link

pytorch-bot bot commented Sep 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1571

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 90cb020 with merge base e3718e8 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 12, 2024
]
}

If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to link somewhere else for this at some point?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, after the messages docs lands

Copy link
Contributor

@felipemello1 felipemello1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rafi! It is well written, but i will be a parrot and repeat myself:

IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:

Dataset Type one
<2 line summary>

<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>

Dataset Type two
<2 line summary>

<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>

Copy link
Contributor

@felipemello1 felipemello1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rafi! It is well written, but i will be a parrot and repeat myself:

IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:

Comment on lines 170 to 189
Renaming columns
----------------

You can remap column names similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`column_map` for more info.

Training on user input
----------------------

You can train on user input similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`train_on_input` for more info.

Adding system prompts
---------------------

You can set a system prompt for your dataset similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`system_prompt` for more info.

Chat templates
--------------

Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these might merit a tiny bit more explanation beyond just the link to the instruct docs. E.g. for system prompts: is it the first message of the chat or does it precede each user message? I guess chat templates is kind of a 1:1 mapping with our implementation, but I can imagine that being confusing for folks who are used to more of a separation between how chat and instruct data is prepared

@SalmanMohammadi
Copy link
Collaborator

@RdoubleA are you mostly done with the changes for me to emulate with the preference/text completion docs?

@RdoubleA
Copy link
Collaborator Author

are you mostly done with the changes for me to emulate with the preference/text completion docs?

Yes, this should be the bulk of it

conversation_column: conversations
conversation_style: sharegpt
train_on_input: True
new_system_prompt: You are an AI assistant.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is missing from the above example? or no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, let me remove this. Mistral doesn't use system prompts

split: train


Loading local and remote chat datasets
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not already covered above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I guess. was thinking it's worth having a direct section on it, vs just implicit in the example.

]
}

If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
Copy link
Collaborator

@SalmanMohammadi SalmanMohammadi Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you point to a custom message transform for chat data in the codebase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I don't think we have one...

Copy link
Contributor

@pbontrager pbontrager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@pbontrager pbontrager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@RdoubleA RdoubleA merged commit 9a863c8 into pytorch:main Sep 20, 2024
17 checks passed
@RdoubleA RdoubleA deleted the instruct_chat_docs branch September 20, 2024 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants