bloom-code-evaluation

Evaluation of BLOOM on the task of code generation using the HumanEval benchmark.

On JZ

This generates code for the 164 prompts in the benchmark, with 200 generations per problem. The experiment is run three times, once for each of the temperatures 0.2, 0.6 and 0.8:

Setup

transformers and accelerate are installed from source along with datasets; we also clone the HumanEval benchmark so it can be used offline.

bash setup.sh

Code generation

The following commands generate code for each experiment/temperature; you can increase the batch size if you have enough memory. Each run outputs two files, generations.json and references.json, placed in the corresponding output_file folder. You can also change MODEL_CKPT to a local path to load the model offline. A short sketch for inspecting these outputs follows the commands below.

export HF_DATASETS_OFFLINE=1

OUTPUT_file1=code_generations_exp1
OUTPUT_file2=code_generations_exp2
OUTPUT_file3=code_generations_exp3

MODEL_CKPT=bigscience/bloom
echo "Using $MODEL_CKPT as model checkpoint; change it to a local path to load the model offline"

python  code_eval.py --model_ckpt $MODEL_CKPT \
--batch_size 1 \
--do_sample True \
--temperature 0.2 \
--top_p 0.95 \
--n_samples 200 \
--output_file $OUTPUT_file1

python  code_eval.py --model_ckpt $MODEL_CKPT \
--batch_size 1 \
--do_sample True \
--temperature 0.6 \
--top_p 0.95 \
--n_samples 200 \
--output_file $OUTPUT_file2

python  code_eval.py --model_ckpt $MODEL_CKPT \
--batch_size 1 \
--do_sample True \
--temperature 0.8 \
--top_p 0.95 \
--n_samples 200 \
--output_file $OUTPUT_file3
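
To sanity-check what a run produced before moving on to evaluation, a minimal Python sketch such as the one below can load the two output files. It assumes generations.json holds one list of 200 completions per HumanEval problem and references.json one test string per problem; the exact layout is whatever code_eval.py writes, so adjust accordingly.

import json

# Hypothetical inspection snippet -- not part of the repository.
output_dir = "code_generations_exp1"  # one of the OUTPUT_file* folders

with open(f"{output_dir}/generations.json") as f:
    generations = json.load(f)  # assumed: list of lists, 200 candidates per problem
with open(f"{output_dir}/references.json") as f:
    references = json.load(f)   # assumed: list of strings, one set of tests per problem

print(len(generations), "problems,", len(generations[0]), "samples for the first one")
print(generations[0][0])  # first completion for the first prompt
print(references[0])      # reference tests for the first prompt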

Evaluation on GCP

All experiments must be placed in the folder output_file. Setting HF_ALLOW_CODE_EVAL=1 allows executing the code generated by the model. This prints the pass@k scores for each experiment and saves them as JSON files in output_file.

pip install datasets transformers
python run_evaluation.py --HF_ALLOW_CODE_EVAL 1 --output_file bloom --num_tasks 164
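
run_evaluation.py scores all three experiments in one go; to score a single experiment by hand, a sketch along the following lines uses the code_eval metric from the Hugging Face evaluate library (an extra pip install evaluate). Whether run_evaluation.py relies on exactly this metric is an assumption here.

import json
import os

from evaluate import load

# Opt in to executing model-generated code, as run_evaluation.py does via its flag.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

output_dir = "code_generations_exp1"
with open(f"{output_dir}/generations.json") as f:
    predictions = json.load(f)  # list of candidate lists, one per HumanEval task
with open(f"{output_dir}/references.json") as f:
    references = json.load(f)   # list of test strings, one per task

code_eval = load("code_eval")
pass_at_k, _ = code_eval.compute(predictions=predictions, references=references, k=[1, 10, 100])
print(pass_at_k)  # {"pass@1": ..., "pass@10": ..., "pass@100": ...}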

As a final score, we take the best result across the three experiments for each of pass@1, pass@10 and pass@100.
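
This aggregation is just a max over the three runs. A minimal sketch, assuming each experiment's scores were saved as a JSON dict of the form {"pass@1": ..., "pass@10": ..., "pass@100": ...} (the file names below are placeholders, not necessarily the ones run_evaluation.py writes):

import json

# Placeholder paths -- substitute the score files produced by run_evaluation.py.
score_files = ["bloom/scores_exp1.json", "bloom/scores_exp2.json", "bloom/scores_exp3.json"]
scores = [json.load(open(path)) for path in score_files]

final = {k: max(s[k] for s in scores) for k in ("pass@1", "pass@10", "pass@100")}
print(final)  # best pass@1, pass@10 and pass@100 across the three temperatures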

Note: If you are evaluating the existing generations from bloom in this repo, please set replace_eos=True in run_evaluation.py.
