Instruct and chat datasets docs #1571

RdoubleA · 2024-09-12T23:38:55Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Details around how to set up custom instruct and chat datasets and the expected format for each of these. As per #1529.

Test plan

doc build

pytorch-bot · 2024-09-12T23:39:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1571

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 90cb020 with merge base e3718e8 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

docs/source/basics/instruct_datasets.rst

docs/source/basics/chat_datasets.rst

docs/source/basics/instruct_datasets.rst

docs/source/basics/chat_datasets.rst

docs/source/basics/instruct_datasets.rst

docs/source/basics/chat_datasets.rst

SalmanMohammadi · 2024-09-13T20:35:30Z

docs/source/basics/chat_datasets.rst

+        ]
+    }
+
+If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.


going to link somewhere else for this at some point?

yes, after the messages docs lands

docs/source/basics/chat_datasets.rst

docs/source/basics/instruct_datasets.rst

docs/source/basics/chat_datasets.rst

felipemello1

Thanks Rafi! It is well written, but i will be a parrot and repeat myself:

IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:

Dataset Type one
<2 line summary>

<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>

Dataset Type two
<2 line summary>

<create dummy data>
<instantiate dataset with a bunch of args, e.g. train_on_input, new_prompt, replace_columns, so the user can understand everything in one go>
<call dataset>
<show outputs / tags / tokens>

docs/source/basics/chat_datasets.rst

docs/source/basics/instruct_datasets.rst

felipemello1

Thanks Rafi! It is well written, but i will be a parrot and repeat myself:

IMO, a measurement of how good a tutorial is is how short it is. I think that there is an opportunity to do end2end with a monolithic example and almost no words paragraphs:

docs/source/basics/chat_datasets.rst

ebsmothers · 2024-09-13T21:03:50Z

docs/source/basics/chat_datasets.rst

+Renaming columns
+----------------
+
+You can remap column names similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`column_map` for more info.
+
+Training on user input
+----------------------
+
+You can train on user input similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`train_on_input` for more info.
+
+Adding system prompts
+---------------------
+
+You can set a system prompt for your dataset similarly to :func:`~torchtune.datasets.instruct_dataset`. See :ref:`system_prompt` for more info.
+
+Chat templates
+--------------
+
+Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info.
+


Some of these might merit a tiny bit more explanation beyond just the link to the instruct docs. E.g. for system prompts: is it the first message of the chat or does it precede each user message? I guess chat templates is kind of a 1:1 mapping with our implementation, but I can imagine that being confusing for folks who are used to more of a separation between how chat and instruct data is prepared

SalmanMohammadi · 2024-09-18T12:06:20Z

@RdoubleA are you mostly done with the changes for me to emulate with the preference/text completion docs?

RdoubleA · 2024-09-18T18:29:02Z

are you mostly done with the changes for me to emulate with the preference/text completion docs?

Yes, this should be the bulk of it

docs/source/basics/chat_datasets.rst

SalmanMohammadi · 2024-09-20T11:29:24Z

docs/source/basics/chat_datasets.rst

+      conversation_column: conversations
+      conversation_style: sharegpt
+      train_on_input: True
+      new_system_prompt: You are an AI assistant.


this is missing from the above example? or no?

ah, let me remove this. Mistral doesn't use system prompts

SalmanMohammadi · 2024-09-20T11:31:11Z

docs/source/basics/chat_datasets.rst

+      split: train
+
+
+Loading local and remote chat datasets


is this not already covered above?

yeah I guess. was thinking it's worth having a direct section on it, vs just implicit in the example.

docs/source/basics/chat_datasets.rst

SalmanMohammadi · 2024-09-20T11:33:10Z

docs/source/basics/chat_datasets.rst

+        ]
+    }
+
+If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.


can you point to a custom message transform for chat data in the codebase?

hmm I don't think we have one...

docs/source/basics/chat_datasets.rst

pbontrager

Thanks!

pbontrager

Thanks!

RdoubleA added 2 commits September 12, 2024 15:10

first

5c63779

docs

945fc53

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 12, 2024