CALVIN is a benchmark for evaluating vision-language models on long-horizon robotic manipulation tasks.
| Method | Mode | Setting | Avg. Len. | CKPT |
|---|---|---|---|---|
| UniVLA | video sft | ABCD->D | 4.63 (5x:4.71) | huggingface |
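The Avg. Len. column is CALVIN's average sequence length. Each evaluation rollout chains five language instructions, and a rollout scores the number of consecutive subtasks it completes before its first failure. The sketch below illustrates the arithmetic: the average length equals the sum of the success rates SR_k (fraction of rollouts completing at least k subtasks). It assumes, hypothetically, that each GPU's result file can be reduced to a list of per-rollout completed counts. The names `success_rates`, `average_length`, and `merge_shards` are illustrative and are not the API of `tools/evaluation/calvin_score.py`.

```python
from __future__ import annotations

from typing import Sequence

CHAIN_LENGTH = 5  # CALVIN chains five language instructions per rollout


def success_rates(completed: Sequence[int], chain_length: int = CHAIN_LENGTH) -> list[float]:
    """SR_k for k = 1..chain_length: fraction of rollouts completing >= k subtasks."""
    n = len(completed)
    return [sum(c >= k for c in completed) / n for k in range(1, chain_length + 1)]


def average_length(completed: Sequence[int]) -> float:
    """Mean number of consecutive subtasks completed per rollout."""
    return sum(completed) / len(completed)


def merge_shards(shards: Sequence[Sequence[int]]) -> list[int]:
    """Hypothetical merge: each GPU evaluated a disjoint subset of rollouts."""
    return [c for shard in shards for c in shard]


if __name__ == "__main__":
    # Toy per-GPU results (completed-subtask counts), not real evaluation data.
    merged = merge_shards([[5, 5, 4], [3, 5, 0]])
    print(success_rates(merged))
    print(average_length(merged), sum(success_rates(merged)))
```

Pooling all rollouts before averaging gives the exact global mean; averaging the per-GPU averages only matches it when every GPU evaluated the same number of rollouts.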
We follow the RoboVLMs repository for environment setup. Note that this environment is only needed for evaluation. Set it up as follows:
```bash
# Install dependencies
cd reference/RoboVLMs
# This will install the required environment and download the CALVIN dataset.
bash scripts/setup_calvin.sh
# Only needed for the rendering environment.
bash scripts/setup_calvin_vla.sh
# Check that the environment is set up correctly
python eval/calvin/env_test.py
```

To prepare the training data:

```bash
# 1. Process the dataset
python tools/process/calvin_process.py
# 2. Extract the VQ tokens (edit the dataset and output paths in the script first)
bash scripts/tokenizer/extract_vq_emu3.sh
# 3. Generate the pickle files used for training
python tools/pickle_gen/pickle_generation_calvin.py
```

You can fit the FAST tokenizer on the corresponding dataset. You can also adjust the tokenizer's scale for more fine-grained tokenization.

```bash
python tools/action_tokenizer/fit_fast.py
```

To launch training:

```bash
bash scripts/simulator/calvin/train_calvin_abcd_video.sh
```

To evaluate a checkpoint:

```bash
cd reference/RoboVLMs
# Inference on 8 GPUs
bash scripts/run_eval_calvin_univla.sh ${CKPT_PATH}
# The command above writes one result per GPU (8 results with 8 GPUs) to the `results` folder.
# Compute the final average score from them:
python tools/evaluation/calvin_score.py
```
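The training steps mention adjusting the FAST tokenizer's scale for finer-grained tokenization. FAST (Frequency-space Action Sequence Tokenization) applies a discrete cosine transform (DCT) to each action dimension over an action chunk, multiplies the coefficients by a scale, rounds them to integers, and then compresses the result with BPE. The sketch below covers only the DCT and scale-and-round steps (BPE is omitted) to show why a larger scale keeps more nonzero coefficients and reconstructs actions more precisely. It is a self-contained illustration, not the code in `tools/action_tokenizer/fit_fast.py`; the action values and scale values are made up.

```python
import math
from typing import Sequence


def _norm(k: int, n: int) -> float:
    # Orthonormal DCT-II scaling factor for coefficient k.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(x: Sequence[float]) -> list[float]:
    """Orthonormal DCT-II of one action dimension over a chunk."""
    n = len(x)
    return [
        _norm(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct_ii(c: Sequence[float]) -> list[float]:
    """Inverse of the orthonormal DCT-II (i.e. the orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_norm(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]


def quantize(coeffs: Sequence[float], scale: float) -> list[int]:
    """Scale and round: a larger scale means finer integer tokens."""
    return [round(c * scale) for c in coeffs]


def reconstruct(x: Sequence[float], scale: float) -> list[float]:
    """Quantize in frequency space, then map back to the action space."""
    tokens = quantize(dct_ii(x), scale)
    return idct_ii([t / scale for t in tokens])


if __name__ == "__main__":
    action = [0.0, 0.1, 0.25, 0.3, 0.28, 0.2, 0.1, 0.05]  # toy single-dimension chunk
    for scale in (10, 100):
        tokens = quantize(dct_ii(action), scale)
        err = max(abs(a - b) for a, b in zip(action, reconstruct(action, scale)))
        print(scale, tokens, round(err, 4))
```

Because the transform is orthonormal, rounding error of at most 0.5/scale per coefficient bounds the reconstruction error by sqrt(n) * 0.5 / scale. Raising the scale therefore trades a longer, less compressible token sequence for more precise actions.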