We release our pre-training annotations built on top of Open-X Embodiment. We extract the demonstrations in RLDS format and convert them to the annotation format used in LLaVA. You can download images.tar.gz and the instruction-tuning JSON files (train/validation annotations). The instruction-tuning data format is as follows:
train/val.json
│
├── image-instruction pair 1
│   ├── conversations
│   │   ├── human
│   │   │   └── (A string. Instruction from the human, including conditions such as robot type, robot state, and task, asking the agent to predict n-step actions and visual traces.)
│   │   │
│   │   └── gpt
│   │       └── (A string. Answer from the agent with the predicted actions and visual traces.)
│   │
│   ├── image
│   │   └── (A string. Image path.)
│   │
│   └── id
│       └── (An int. Annotation index.)
│
├── image-instruction pair 2
...
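To make the layout above concrete, here is a minimal sketch of reading one image-instruction pair. It assumes the standard LLaVA convention in which `conversations` is a list of turns with `from` ("human"/"gpt") and `value` fields; the sample strings and image path below are hypothetical placeholders, so check train/val.json for the exact schema.

```python
import json

# Hypothetical entry mirroring the tree above; the strings and the image
# path are illustrative placeholders, not real annotation content.
sample = {
    "id": 0,
    "image": "images/example_frame.jpg",
    "conversations": [
        {"from": "human", "value": "Robot type: ... Predict the next n actions and visual traces."},
        {"from": "gpt", "value": "Predicted actions: ... Visual trace: ..."},
    ],
}

def split_turns(pair):
    """Return (human instruction, gpt answer) from one annotation entry."""
    human = next(t["value"] for t in pair["conversations"] if t["from"] == "human")
    gpt = next(t["value"] for t in pair["conversations"] if t["from"] == "gpt")
    return human, gpt

# In practice you would use json.load(open("train.json")) on the released file.
annotations = json.loads(json.dumps([sample]))
instruction, answer = split_turns(annotations[0])
```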
If you want to adapt your own dataset to the instruction-tuning format, you may need to run our end-effector detector to obtain the visual traces. The detector is built on detectron2, and we release the detector and its weights below.
For instructions on using the end-effector detector, please see Gripper_detector.
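As a rough illustration of the post-processing step, the sketch below turns per-frame end-effector detections (bounding boxes) into a 2D visual trace. The box format and the trace serialization here are assumptions for illustration only; see Gripper_detector for the real detector interface and output format.

```python
# Hypothetical post-processing: convert per-frame end-effector bounding
# boxes into a visual trace of 2D center points. The (x1, y1, x2, y2)
# box format and the string serialization are assumptions.

def box_center(box):
    """Center (x, y) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def boxes_to_trace(boxes):
    """Serialize per-frame box centers into a trace string like '(x,y)->(x,y)'."""
    centers = [box_center(b) for b in boxes]
    return "->".join(f"({x:.1f},{y:.1f})" for x, y in centers)

# Example: detections over three consecutive frames.
frames = [(100, 40, 120, 60), (110, 50, 130, 70), (125, 60, 145, 80)]
trace = boxes_to_trace(frames)  # "(110.0,50.0)->(120.0,60.0)->(135.0,70.0)"
```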
