Description
⚠️ Please check that this feature request hasn't been suggested before.
- I searched previous Ideas in Discussions and didn't find any similar feature requests.
- I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
Often when instruction tuning, we find the trickiest part is finding good data mixture ratios: we might be balancing a number of evaluations against a number of datasets.
One thing missing from axolotl is the ability to sample from different datasets with predetermined weights.
✔️ Solution
An example of a configuration with this feature:
```yaml
datasets:
  - path: vicgalle/alpaca-gpt4
    weight: 0.8
  - path: /path/to/dataset/B
    weight: 0.1
  - path: /path/to/dataset/C
    weight: 0.1
```
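For illustration, here is a minimal sketch of how those weights could map onto the existing Hugging Face `datasets.interleave_datasets` API, which already supports probability-weighted sampling. The config handling shown is hypothetical, not existing axolotl code:

```python
# Minimal sketch: map per-dataset `weight` values onto probability-weighted
# interleaving. `configured` stands in for the parsed axolotl config above.
from datasets import load_dataset, interleave_datasets

configured = [
    {"path": "vicgalle/alpaca-gpt4", "weight": 0.8},
    {"path": "/path/to/dataset/B", "weight": 0.1},
    {"path": "/path/to/dataset/C", "weight": 0.1},
]

loaded = [load_dataset(cfg["path"], split="train") for cfg in configured]
weights = [cfg["weight"] for cfg in configured]
probabilities = [w / sum(weights) for w in weights]  # normalize, in case weights don't sum to 1

# Draw examples from each dataset with the given probabilities.
mixed = interleave_datasets(loaded, probabilities=probabilities, seed=42)
```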
Alternatively, to sample automatically by token count, there could be a single setting for that:
```yaml
sampling_method: token_weighted
```
or something along those lines. This might allow users to sample evenly by number of tokens, ensuring each dataset is equally represented.
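As a rough illustration of what `token_weighted` could compute under the hood, one reading of "equally represented by tokens" is to sample each dataset inversely to its average example length, so every dataset contributes about the same number of tokens. This is a hypothetical sketch, assuming the datasets are already tokenized into an `input_ids` column:

```python
# Hypothetical: derive sampling probabilities so each dataset contributes
# roughly equal tokens. If dataset i is sampled with probability p_i, the
# expected tokens per draw is p_i * avg_len_i, so an equal token share means
# p_i proportional to 1 / avg_len_i.
def token_balanced_probabilities(tokenized_datasets):
    avg_lens = [
        sum(len(ids) for ids in ds["input_ids"]) / len(ds)
        for ds in tokenized_datasets
    ]
    inverse = [1.0 / length for length in avg_lens]
    total = sum(inverse)
    return [x / total for x in inverse]
```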
A stretch goal, or something truly ideal, would be more complex, like:
```yaml
datasets:
  - path: vicgalle/alpaca-gpt4
    weight_func: 5 * num_tokens
  - path: /path/to/dataset/B
    weight_func: 1 * num_tokens
  - path: /path/to/dataset/C
    weight_func: 2 * num_tokens
```
This would indicate that we want to weight by tokens, but additionally overweight the vicgalle/alpaca-gpt4 dataset 5x; in other words, increase the likelihood that we sample from that dataset by 5x (after normalizing for token count). This would be very useful in the data mixture optimization stage, and more variables like `num_tokens` could be introduced.
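To make the `weight_func` semantics concrete, here is a hypothetical sketch of how such expressions could be resolved into sampling probabilities. A real implementation would want a proper expression parser rather than `eval`; this only illustrates the intent:

```python
# Hypothetical: evaluate each `weight_func` with a whitelist of variables
# (currently just `num_tokens`), then normalize into sampling probabilities.
def resolve_weights(dataset_configs, tokenized_datasets):
    raw = []
    for cfg, ds in zip(dataset_configs, tokenized_datasets):
        num_tokens = sum(len(ids) for ids in ds["input_ids"])
        raw.append(
            eval(cfg["weight_func"], {"__builtins__": {}}, {"num_tokens": num_tokens})
        )
    total = sum(raw)
    return [w / total for w in raw]
```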
❓ Alternatives
The alternative I use is to create a fresh dataset mixture myself with HF datasets every time I want to run an experiment. This is less than ideal, and it doesn't let me keep the data mixture breakdown in axolotl configuration files, which is a missed opportunity for good experiment tracking (and sanity).
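For context, that manual workflow looks roughly like this (my own approach, shown only as an illustration of what the proposal would replace; the paths and sizes are placeholders):

```python
# Manually build a fixed 80/20 mixture and save it for one experiment run.
from datasets import load_dataset, concatenate_datasets

ds_a = load_dataset("vicgalle/alpaca-gpt4", split="train")
ds_b = load_dataset("json", data_files="/path/to/dataset/B", split="train")

n = 10_000  # target mixture size for this experiment
mix = concatenate_datasets([
    ds_a.shuffle(seed=42).select(range(int(n * 0.8))),
    ds_b.shuffle(seed=42).select(range(int(n * 0.2))),
]).shuffle(seed=42)
mix.save_to_disk("/path/to/mixtures/experiment-001")
```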
📝 Additional Context
This direction is also a first step toward automating data mixture composition, which will become more valuable over time as instruction tuning becomes more common.
I would be curious if the maintainers are open to this feature, and if so, how I might get started implementing it. Thanks!
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this feature has not been requested yet.
- I have provided enough information for the maintainers to understand and evaluate this request.