Closed as not planned
Description
🔧 Proposed code refactoring
Check whether our default hyperparameters (e.g. kl_target) are correct; see huggingface/trl@b56e8b3 and huggingface/trl#462.
Also, RLHF training is quite sensitive to hyperparameter choices (see e.g. the issues reported in trl). Try to find good defaults that work for one (or more) of our fine-tuned models.
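
For context on why kl_target matters, here is a minimal sketch of an adaptive KL coefficient update in the style of Ziegler et al. (2019), which is the scheme this kind of parameter usually feeds into. The class name, field names, and the numbers in the usage example are illustrative assumptions, not our actual config or trl's API.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKLController:
    """Hypothetical helper: adapts the KL penalty coefficient toward a target KL."""

    value: float    # current KL penalty coefficient
    target: float   # desired KL between policy and reference (the kl_target idea)
    horizon: int    # number of samples over which the coefficient adapts

    def update(self, current_kl: float, n_steps: int) -> float:
        # Relative deviation from the target, clipped to +/-20% so that a
        # single noisy batch cannot swing the coefficient too far.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Multiplicative update, scaled by how many samples were just seen.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


# Illustrative values only, not recommended defaults.
controller = AdaptiveKLController(value=0.2, target=6.0, horizon=10_000)
new_coef = controller.update(current_kl=12.0, n_steps=256)
```

If the target is set too low relative to the KL values the policy actually reaches, the coefficient keeps growing and the penalty dominates the reward. This is one way a wrong default can destabilize training.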