Adding Dataset sampling weights #1508

Open
@worldveil

Description

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

When instruction tuning, the trickiest part is often finding good data mixture ratios, since we may be balancing several evaluations against several datasets.

One thing missing from axolotl is the ability to sample from different datasets according to predetermined weights.

✔️ Solution

An example of a configuration with this feature:

datasets:
  - path: vicgalle/alpaca-gpt4
    weight: 0.8
  - path: /path/to/dataset/B
    weight: 0.1
  - path: /path/to/dataset/C
    weight: 0.1
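
To make the intended semantics concrete, here is a minimal sketch of what per-dataset `weight` values could mean at sampling time. It is not axolotl code: the `sample_mixture` helper and the toy in-memory datasets are hypothetical, and it assumes each draw first picks a dataset with probability proportional to its weight and then picks an example uniformly from that dataset.

```python
import random
from collections import Counter


def sample_mixture(datasets, weights, num_samples, seed=0):
    """Draw num_samples examples; each draw picks a source dataset with
    probability proportional to its weight, then a uniform example from it."""
    if set(datasets) != set(weights):
        raise ValueError("datasets and weights must have the same keys")
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative and sum to a positive value")

    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] / total for name in names]

    mixture = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        mixture.append((name, rng.choice(datasets[name])))
    return mixture


# Hypothetical toy data standing in for the three dataset paths in the config.
datasets = {
    "vicgalle/alpaca-gpt4": [f"a{i}" for i in range(100)],
    "/path/to/dataset/B": [f"b{i}" for i in range(100)],
    "/path/to/dataset/C": [f"c{i}" for i in range(100)],
}
weights = {
    "vicgalle/alpaca-gpt4": 0.8,
    "/path/to/dataset/B": 0.1,
    "/path/to/dataset/C": 0.1,
}

mixture = sample_mixture(datasets, weights, num_samples=10_000)
counts = Counter(name for name, _ in mixture)
```

Sampling with replacement keeps the sketch short. A real implementation would also need to decide what happens when a small dataset is exhausted before a large one.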

Alternatively, to automatically sample by token count, there could be a single setting for that:

sampling_method: token_weighted

Or something similar. This would let users sample evenly by token count, so that each dataset is equally represented.
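
One possible reading of "sample evenly by token count" is that every dataset should contribute the same expected number of tokens per draw. Under that assumption, the probability of picking a dataset is proportional to the inverse of its mean example length. The helper name `token_balanced_probabilities` and the toy token counts below are hypothetical.

```python
def token_balanced_probabilities(token_counts):
    """token_counts maps dataset name -> list of per-example token counts.

    Returns per-dataset sampling probabilities chosen so that
    probability * mean_tokens is the same for every dataset.
    """
    inverse_mean = {}
    for name, counts in token_counts.items():
        if not counts or sum(counts) <= 0:
            raise ValueError(f"dataset {name!r} has no tokens")
        mean_tokens = sum(counts) / len(counts)
        inverse_mean[name] = 1.0 / mean_tokens

    total = sum(inverse_mean.values())
    return {name: value / total for name, value in inverse_mean.items()}


# Hypothetical token counts: one dataset of short examples, one of long ones.
token_counts = {
    "short_examples": [100, 100, 100, 100],  # mean length 100 tokens
    "long_examples": [400, 400],             # mean length 400 tokens
}
probs = token_balanced_probabilities(token_counts)
```

Here the short-example dataset is drawn four times as often as the long-example one, so both contribute the same number of tokens in expectation.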

A stretch goal or something truly ideal might be more complex, like:

datasets:
  - path: vicgalle/alpaca-gpt4
    weight_func: 5 * num_tokens
  - path: /path/to/dataset/B
    weight_func: 1 * num_tokens
  - path: /path/to/dataset
    weight_func: 2 * num_tokens

This would indicate that we want to weight by tokens, and then additionally overweight the vicgalle/alpaca-gpt4 dataset by 5x, in other words, make sampling from that dataset 5x more likely (after normalizing for token count). This would be very useful in the data mixture optimization stage, and more variables like num_tokens could be introduced later.
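
A rough sketch of how a `weight_func` string might be evaluated follows. Everything here is an assumption: the issue does not define what `num_tokens` refers to (this sketch treats it as a per-dataset number supplied by the caller), and `eval_weight_func` and `normalize` are hypothetical helpers. The expression is parsed with Python's `ast` module and restricted to basic arithmetic, so arbitrary code in a config file is rejected rather than executed.

```python
import ast
import operator

# Only these binary operators are accepted inside a weight expression.
_ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_weight_func(expression, variables):
    """Evaluate an arithmetic expression over named variables without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ValueError(f"unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_BINOPS:
            return _ALLOWED_BINOPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported syntax in weight_func")

    return _eval(ast.parse(expression, mode="eval"))


def normalize(raw_weights):
    """Turn raw non-negative weights into probabilities that sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in raw_weights.items()}


# Hypothetical per-dataset stats; equal token counts make the multipliers visible.
stats = {
    "A": {"weight_func": "5 * num_tokens", "num_tokens": 1000},
    "B": {"weight_func": "1 * num_tokens", "num_tokens": 1000},
    "C": {"weight_func": "2 * num_tokens", "num_tokens": 1000},
}
raw = {
    name: eval_weight_func(s["weight_func"], {"num_tokens": s["num_tokens"]})
    for name, s in stats.items()
}
probs = normalize(raw)
```

With equal token counts the multipliers 5, 1, and 2 normalize to 0.625, 0.125, and 0.25. Further variables could be exposed by adding keys to the `variables` dictionary.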

❓ Alternatives

The alternative I use is to create a fresh dataset mixture myself using HF datasets every time I want to run an experiment. This is less than ideal, and it doesn't let me keep the data mixture breakdown in the axolotl configuration files, which is a missed opportunity for good experiment tracking (and sanity).

📝 Additional Context

This direction is also a first step towards automating data mixture composition, which will become more valuable as time goes on and instruction tuning becomes more common.

I would be curious if the maintainers are open to this feature, and if so, how I might get started implementing it. Thanks!

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.

Metadata

Labels: enhancement (New feature or request)
