Description
If I'm not mistaken, the conversation style that applies during a fine-tune is defined by the dataset defaults rather than by the tokenizer being used (docs here).
What happens if the tokenizer/model does not have the tokens required by a given conversation style? Are those special tokens created automatically? I assume not.
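For concreteness, this is the kind of check I mean — a minimal sketch using the standard transformers tokenizer API; the ChatML-style token list and the model name are just illustrative assumptions:

```python
from transformers import AutoTokenizer

# Illustrative only: special tokens a ChatML-style conversation format would need.
required_tokens = ["<|im_start|>", "<|im_end|>"]

# Example model; substitute whichever base model is being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

vocab = tokenizer.get_vocab()
missing = [tok for tok in required_tokens if tok not in vocab]
print("missing special tokens:", missing)

# If these were to be created, I'd expect something like the following,
# plus model.resize_token_embeddings(len(tokenizer)) on the model side:
# tokenizer.add_special_tokens({"additional_special_tokens": missing})
```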
Is there an option whereby one can:
- default to using tokenizer.chat_template for the conversation style? (most models on the Hugging Face Hub have this defined)
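What I have in mind is roughly the following — a sketch assuming the standard transformers API; the model name is just an example:

```python
from transformers import AutoTokenizer

# Example model; anything that ships a chat_template would do.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# apply_chat_template formats the conversation with tokenizer.chat_template,
# so no dataset-level conversation style has to be specified at all.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```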
I'm guessing one issue here is that, since tokenizer.chat_template is not known in advance, it's hard to control the loss mask on the prompt vs. the completion?
So maybe that's the dilemma? Either one can:
a) load a default conversation style from the model/tokenizer, but then it's hard to implement loss masks, or
b) load the default conversation style based on the dataset choice, but then there's a risk of token incompatibilities with the model/tokenizer being trained.
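For what it's worth, the workaround I can imagine for (a) — a rough sketch, not something I know the library supports — is to derive the loss mask by applying the template twice: once without the final assistant turn (with the generation prompt appended) and once with it, then masking the common prefix. This assumes the template renders the prompt as a clean prefix of the full conversation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example model

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Tokenise the prompt-only portion (user turns + generation prompt),
# then the full conversation; assume the former is a prefix of the latter.
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], add_generation_prompt=True, tokenize=True
)
full_ids = tokenizer.apply_chat_template(messages, tokenize=True)

labels = list(full_ids)
labels[: len(prompt_ids)] = [-100] * len(prompt_ids)  # -100 is ignored by the loss

print(len(full_ids), "tokens total,", sum(l != -100 for l in labels), "in the loss")
```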
The practical task I'm interested in is fine-tuning Llama 3 and Qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing of their existing chat behaviour).