Evaluation of BLOOM on code generation using the HumanEval benchmark.
This generates code for the 164 prompts in the benchmark, with 200 generations per problem. The experiment is run three times, once for each of three temperatures: 0.2, 0.6, and 0.8.
`transformers` and `accelerate` are installed from source along with `datasets`; we also clone the HumanEval benchmark so it can be used offline.

```bash
bash setup.sh
```
The following commands generate code for each experiment/temperature; you can increase the batch size if you have enough memory. Each run outputs two files, `generations.json` and `references.json`, placed in the corresponding `output_file`. You can also change `MODEL_CKPT` to a local repository to load the model offline.
```bash
export HF_DATASETS_OFFLINE=1
OUTPUT_file1=code_generations_exp1
OUTPUT_file2=code_generations_exp2
OUTPUT_file3=code_generations_exp3
MODEL_CKPT=bigscience/bloom
echo "Using $MODEL_CKPT as the model checkpoint; if needed, change it to a local repository"

python code_eval.py --model_ckpt $MODEL_CKPT \
    --batch_size 1 \
    --do_sample True \
    --temperature 0.2 \
    --top_p 0.95 \
    --n_samples 200 \
    --output_file $OUTPUT_file1

python code_eval.py --model_ckpt $MODEL_CKPT \
    --batch_size 1 \
    --do_sample True \
    --temperature 0.6 \
    --top_p 0.95 \
    --n_samples 200 \
    --output_file $OUTPUT_file2

python code_eval.py --model_ckpt $MODEL_CKPT \
    --batch_size 1 \
    --do_sample True \
    --temperature 0.8 \
    --top_p 0.95 \
    --n_samples 200 \
    --output_file $OUTPUT_file3
```
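The sampling flags above map directly onto `transformers` generation arguments. The following is an illustrative sketch of that mapping only, not the repository's `code_eval.py`; the prompt and the `max_new_tokens` value are placeholder assumptions.

```python
# Illustrative sketch: how the CLI flags (--do_sample, --temperature, --top_p,
# --n_samples) correspond to transformers' generate() arguments.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_ckpt = "bigscience/bloom"  # same value as $MODEL_CKPT above
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(model_ckpt)  # the full BLOOM checkpoint needs accelerate / multi-GPU memory

# Placeholder HumanEval-style prompt (the real prompts come from the benchmark).
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # --do_sample True
    temperature=0.2,         # --temperature (0.2 / 0.6 / 0.8 depending on the experiment)
    top_p=0.95,              # --top_p 0.95
    num_return_sequences=4,  # the real runs draw 200 samples per problem (--n_samples 200), in batches
    max_new_tokens=128,      # assumed value, not taken from the commands above
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```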
All experiments must be placed in the `output_file` folder. Setting `HF_ALLOW_CODE_EVAL=1` allows executing the code generated by the model. This prints the `pass@k` scores for each experiment and saves them as JSON files in `output_file`.
```bash
pip install datasets transformers
python run_evaluation.py --HF_ALLOW_CODE_EVAL 1 --output_file bloom --num_tasks 164
```
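For context, `pass@k` is the unbiased estimator from the Codex/HumanEval paper: with `n` generations per problem of which `c` pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over the 164 problems. A minimal sketch of that estimator is shown below for reference; the actual scoring here is done by `run_evaluation.py`, not by this snippet.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total generations sampled for the problem (200 in these experiments)
    c: number of those generations that pass the unit tests
    k: the k in pass@k (1, 10, or 100)
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: out of 200 samples for a problem, 17 pass the tests.
print(pass_at_k(200, 17, 1), pass_at_k(200, 17, 10), pass_at_k(200, 17, 100))
```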
As the final score, we take the best result across the three experiments for each of the `pass@1`, `pass@10`, and `pass@100` scores.
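The best-of-three selection can be scripted; the sketch below is hypothetical and assumes each experiment's scores were saved as a JSON dictionary keyed by `pass@1`, `pass@10`, and `pass@100` (adjust the file paths to match your `output_file` folders).

```python
import json

# Hypothetical paths: point these at the score files produced for each experiment.
score_files = [
    "code_generations_exp1/scores.json",
    "code_generations_exp2/scores.json",
    "code_generations_exp3/scores.json",
]
runs = [json.load(open(path)) for path in score_files]

# Best result across the three temperatures, taken per metric.
final = {metric: max(run[metric] for run in runs) for metric in ("pass@1", "pass@10", "pass@100")}
print(final)
```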
Note: If you are evaluating the existing generations from bloom in this repo, please set `replace_eos=True` in `run_evaluation.py`.