Adding Dataset sampling weights #1508

### ⚠️ Please check that this feature request hasn't been suggested before.

- [X] I searched previous [Ideas in Discussions](https://github.com/OpenAccess-AI-Collective/axolotl/discussions/categories/ideas) and didn't find any similar feature requests.
- [X] I searched previous [Issues](https://github.com/OpenAccess-AI-Collective/axolotl/labels/enhancement) and didn't find any similar feature requests.

### 🔖 Feature description

Often when instruction tuning, we find the trickiest part is finding good data mixture ratios. We might be balancing a number of evaluations and a number of datasets. 

One thing missing from axolotl is the ability to sample from different datasets according to predetermined weights.

### ✔️ Solution

An example of a configuration with this feature:

```yaml
datasets:
   - path: vicgalle/alpaca-gpt4
     weight: 0.8
   - path: /path/to/dataset/B
     weight: 0.1
   - path: /path/to/dataset/C
     weight: 0.1
```
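As a rough sketch of the behavior this config is asking for (not axolotl code; `sample_mixture` and the toy in-memory datasets are hypothetical, and sampling with replacement is an assumption), each draw first picks a source dataset with probability equal to its weight, then picks an example from it:

```python
import random
from collections import Counter


def sample_mixture(datasets, weights, num_samples, seed=0):
    """Draw examples, choosing the source dataset by weight on every draw.

    datasets: dict mapping dataset name -> list of examples
    weights:  dict mapping dataset name -> non-negative weight
    Returns a list of (dataset_name, example) pairs, sampled with replacement.
    """
    names = list(datasets)
    w = [weights[name] for name in names]
    if any(x < 0 for x in w) or sum(w) <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    picks = rng.choices(names, weights=w, k=num_samples)
    return [(name, rng.choice(datasets[name])) for name in picks]


# Toy stand-ins for the three datasets in the config above.
toy_datasets = {
    "vicgalle/alpaca-gpt4": [f"a{i}" for i in range(50)],
    "/path/to/dataset/B": [f"b{i}" for i in range(50)],
    "/path/to/dataset/C": [f"c{i}" for i in range(50)],
}
toy_weights = {
    "vicgalle/alpaca-gpt4": 0.8,
    "/path/to/dataset/B": 0.1,
    "/path/to/dataset/C": 0.1,
}

mixture = sample_mixture(toy_datasets, toy_weights, num_samples=10_000)
counts = Counter(name for name, _ in mixture)
```

With enough draws, the share of examples from each dataset in `counts` should approach its configured weight.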

Alternatively, to sample automatically by token count, there could be a single setting:

```yaml
sampling_method: token_weighted
```

Something along these lines would let users sample evenly by number of tokens, so that each dataset is equally represented.
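One possible reading of `token_weighted`, sketched below, is that every dataset should contribute the same expected number of tokens to the mixture. Under that assumption, the probability of drawing an example from a dataset is proportional to the inverse of its mean example length. The function name and the toy token counts are hypothetical, and the setting could reasonably be defined differently.

```python
def token_balanced_probs(token_counts):
    """Per-dataset sampling probabilities that equalize expected token share.

    token_counts: dict mapping dataset name -> list of per-example token counts
    If dataset i is drawn with probability p_i and has mean length m_i, its
    expected tokens per draw are p_i * m_i, so p_i is set proportional to 1 / m_i.
    """
    inverse_mean = {}
    for name, counts in token_counts.items():
        if not counts or sum(counts) <= 0:
            raise ValueError(f"dataset {name!r} has no tokens")
        mean_len = sum(counts) / len(counts)
        inverse_mean[name] = 1.0 / mean_len
    total = sum(inverse_mean.values())
    return {name: value / total for name, value in inverse_mean.items()}


# Toy example: short examples in A, long examples in B.
toy_token_counts = {"A": [100, 100], "B": [400, 400]}
probs = token_balanced_probs(toy_token_counts)
# A is drawn four times as often as B, so both supply the same expected tokens.
```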

A stretch goal or something truly ideal might be more complex, like:

```yaml
datasets:
   - path: vicgalle/alpaca-gpt4
     weight_func: 5 * num_tokens
   - path: /path/to/dataset/B
     weight_func: 1 * num_tokens
   - path: /path/to/dataset
     weight_func: 2 * num_tokens
```

This would indicate that we want to weight by tokens and then additionally overweight the `vicgalle/alpaca-gpt4` dataset, making it 5x more likely to be sampled (after normalizing for token count). This would be very useful in the data mixture optimization stage, and more variables like `num_tokens` could be introduced later.
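To make the idea concrete, here is a minimal sketch of how a `weight_func` string could be evaluated without calling `eval` on user input. It takes the expressions literally: `num_tokens` is bound to each dataset's total token count and the results are normalized into probabilities. That interpretation, the helper names, and the token counts in the example are all assumptions for illustration; nothing here is existing axolotl behavior.

```python
import ast
import operator

# Only plain arithmetic is allowed inside a weight_func expression.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def eval_weight_func(expr, variables):
    """Evaluate an arithmetic expression over named variables such as num_tokens."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Calls, attribute access, unknown names, etc. are all rejected.
        raise ValueError(f"unsupported weight_func expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


def mixture_probs(specs):
    """specs: dict mapping dataset name -> (weight_func string, total token count)."""
    raw = {
        name: eval_weight_func(expr, {"num_tokens": num_tokens})
        for name, (expr, num_tokens) in specs.items()
    }
    total = sum(raw.values())
    if any(w < 0 for w in raw.values()) or total <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")
    return {name: w / total for name, w in raw.items()}


# Token counts below are made up for illustration.
specs = {
    "vicgalle/alpaca-gpt4": ("5 * num_tokens", 1000),
    "/path/to/dataset/B": ("1 * num_tokens", 2000),
    "/path/to/dataset": ("2 * num_tokens", 1500),
}
weight_probs = mixture_probs(specs)
```

Restricting the grammar to arithmetic keeps config files from executing arbitrary code, and new variables beyond `num_tokens` would only need to be added to the `variables` dict.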

### ❓ Alternatives

The alternative I use is to build a fresh dataset mixture myself with HF datasets every time I want to run an experiment. This is less than ideal, and it means the data mixture breakdown can't live in the axolotl configuration file, which is a missed opportunity for good experiment tracking (and sanity).

### 📝 Additional Context

This direction is also a first step toward automating data mixture composition, which will become more valuable as time goes on and instruction tuning becomes more common.

I would be curious if the maintainers are open to this feature, and if so, how I might get started implementing it. Thanks!

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this feature has not been requested yet.
- [X] I have provided enough information for the maintainers to understand and evaluate this request.
