
Vision-Action Tuning Dataset

For pre-training LLARVA, we generate 8.5M image-visual trace pairs from the Open X-Embodiment (OXE) dataset. Our dataset consists of images from a diverse collection of 37 OXE subsets with 13 different robots, including a wide assortment of tasks, environments, cameras (and thus images), and end-effectors, among other factors. For each image in an episode, we calculate the 2-D visual trace of the end-effector. For this purpose, we use a bounding box detector that is trained specifically on each of the different end-effectors in OXE.
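The per-frame trace computation can be sketched as follows: take the detector's bounding box for the end-effector in each frame and record its center, so an episode yields a sequence of 2-D points. This is a minimal illustration, not the released pipeline; the coordinates and box format `(x1, y1, x2, y2)` are assumptions.

```python
def box_center(box):
    """Center (x, y) of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def visual_trace(detections):
    """2-D visual trace: the end-effector box center in every frame of an episode."""
    return [box_center(b) for b in detections]

# Hypothetical detector outputs for a 3-frame episode (pixel coordinates).
episode = [(10, 20, 30, 40), (12, 22, 32, 42), (15, 25, 35, 45)]
print(visual_trace(episode))  # [(20.0, 30.0), (22.0, 32.0), (25.0, 35.0)]
```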

Instruction Tuning Dataset

We release our pre-training annotations built on top of Open X-Embodiment. We extract the RLDS-format demonstrations and convert them to the annotation format used in LLaVA. You can download images.tar.gz and the instruction-tuning JSON files (train/validation annotations). The instruction tuning data format is as follows:

train/val.json
│
├── image-instruction pair 1
│   ├── conversations
│   │   ├── human
│   │   │   └── (A string. Instruction from the human, including conditions such as robot type, robot state, and task, asking the agent to predict the next n actions and visual traces.)
│   │   │
│   │   └── gpt
│   │       └── (A string. The agent's answer with the predicted actions and visual traces.)
│   │
│   ├── image
│   │   └── (A string. Image path.)
│   │
│   └── id
│       └── (An int. Annotation index.)
│
└── image-instruction pair 2
    ...
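The layout above maps directly to LLaVA-style JSON. A minimal sketch of loading the annotation file and reading one pair; the example values are hypothetical, and the `"from"`/`"value"` key names follow the common LLaVA convention, so check the released file for the exact schema:

```python
import json

def load_pairs(path):
    """Load train/val.json as a list of image-instruction pairs."""
    with open(path) as f:
        return json.load(f)

# One pair matching the tree above (hypothetical values).
example = {
    "id": 0,
    "image": "images/episode_0001/frame_000.jpg",  # hypothetical path
    "conversations": [
        {"from": "human",
         "value": "Robot type: ... Robot state: ... Task: ... "
                  "Predict the next n actions and the visual trace."},
        {"from": "gpt",
         "value": "Actions: ... Visual trace: ..."},
    ],
}
human_turn = example["conversations"][0]["value"]  # the instruction text
```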

If you want to adapt your own dataset to the instruction tuning format, you may need to run our end-effector detector to obtain the visual traces. The detector is built on detectron2, and we release the detector and its weights below.
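Once you have per-frame end-effector detections for your own data, assembling a pair in the format above is straightforward. This is a sketch under assumptions: the trace serialization shown is hypothetical, so match whatever string format the released annotations actually use.

```python
def format_trace(trace):
    """Serialize a 2-D trace as a string (hypothetical format; match the
    serialization used in the released annotations)."""
    return "; ".join(f"({x:.1f}, {y:.1f})" for x, y in trace)

def make_pair(idx, image_path, instruction, answer):
    """Assemble one image-instruction pair matching the layout above."""
    return {
        "id": idx,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": answer},
        ],
    }

# Hypothetical example: one frame plus a two-point detected trace.
trace = [(20.0, 30.0), (22.0, 32.0)]
pair = make_pair(0, "my_dataset/frame_000.jpg",
                 "Predict the next 2 actions and the visual trace.",
                 "Visual trace: " + format_trace(trace))
```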

End-effector Detector

For instructions on using the end-effector detector, please see Gripper_detector.