
Conversation style appears tied to the dataset rather than the model #2096

Closed
@RonanKMcGovern

Description


If I'm not mistaken, the conversation style that applies during a fine-tune is defined by the dataset defaults, rather than by the tokenizer being used (docs here).

What happens if the tokenizer+model do not have the tokens required for a given conversation style? Are those special tokens created? I assume not.
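To illustrate the concern, here is a minimal sketch of checking whether a template's special tokens already exist in a tokenizer's vocabulary. The `vocab` dict stands in for a real `tokenizer.get_vocab()`, and the ChatML-style token names are just examples:

```python
# Hypothetical illustration: which of a chat template's special tokens
# are present in the tokenizer's vocabulary? `vocab` stands in for
# tokenizer.get_vocab(); the token names are ChatML-style examples.
vocab = {"hello": 0, "world": 1, "<|im_start|>": 2}

template_tokens = ["<|im_start|>", "<|im_end|>"]

# Tokens not in the vocab would be split into sub-word pieces rather
# than kept atomic, silently changing what the model sees.
missing = [t for t in template_tokens if t not in vocab]
print(missing)

# With a real tokenizer one would then call
#   tokenizer.add_special_tokens({"additional_special_tokens": missing})
#   model.resize_token_embeddings(len(tokenizer))
# so the embeddings exist -- but they start untrained, which is why
# "just create the tokens" is not free.
```

So even if the tokens were created automatically, their embeddings would be randomly initialized and would need training.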

Is there an option whereby one can:

  • default to using tokenizer.chat_template for the conversation style? (most models on huggingface have this defined)

I'm guessing one issue here is that, since tokenizer.chat_template is not known in advance, it's hard to control the loss mask on the prompt vs the completion?

So maybe that's the dilemma? Either one can:
a) load a default conversation style from the model/tokenizer, but then it's hard to implement loss masks, or
b) load the default conversation style based on the dataset choice, but then there's a risk of token incompatibilities with the model/tokenizer being trained.
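For what it's worth, option (a)'s loss-masking is straightforward once the template's "assistant turn begins" boundary token is known. A minimal sketch, with made-up token ids; `-100` is the ignore_index convention used by PyTorch's cross-entropy loss:

```python
# Minimal prompt-masking sketch, assuming the template's boundary token
# id is known in advance. All ids here are hypothetical.
ASSISTANT_START = 99  # made-up id of the "assistant turn begins" token

input_ids = [1, 5, 7, 99, 42, 43, 2]  # prompt ... boundary ... completion

# Everything up to and including the boundary gets -100 (ignored by the
# loss); only completion tokens keep their ids as labels.
boundary = input_ids.index(ASSISTANT_START)
labels = [-100] * (boundary + 1) + input_ids[boundary + 1:]
print(labels)  # [-100, -100, -100, -100, 42, 43, 2]
```

The hard part with an arbitrary tokenizer.chat_template is knowing which token plays that boundary role, since the template is free-form Jinja rather than a declared structure.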

The practical task I'm interested in is fine-tuning Llama 3 and Qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing that I'm doing).
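To make the mismatch concrete, here is a rough sketch of one turn rendered in each model's style: Llama 3 uses header/eot tokens, while Qwen 2.5 uses ChatML. The helper functions are hypothetical hand-rolled approximations, not the models' actual Jinja templates:

```python
# Hedged sketch: one turn in Llama 3 style vs Qwen 2.5 (ChatML) style.
# These helpers are hand-written approximations for illustration only.
def llama3_turn(role: str, content: str) -> str:
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def chatml_turn(role: str, content: str) -> str:
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

print(llama3_turn("user", "hi"))
print(chatml_turn("user", "hi"))

# In practice, tokenizer.apply_chat_template(messages, tokenize=False)
# renders the exact template the model shipped with, which avoids
# hand-maintaining formats like the above.
```

A dataset default that emits ChatML tokens into a Llama 3 fine-tune (or vice versa) would be exactly the incompatibility described above.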
