Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha
- [2025/02]: We submitted the paper to IROS 2025.
- [2025/06]: Our paper was accepted by IROS 2025.
We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for the semantic predictions. By designing a triplane-based deformable attention mechanism, our approach achieves a better geometric understanding of the scene than other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps can help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the SemanticKITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the SemanticKITTI test set, with a notably low training memory consumption of 10.9 GB.
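The triplane representation mentioned above can be illustrated with a minimal toy sketch (illustrative only, not the paper's architecture; the plane resolution, channel count, and nearest-neighbor lookup are assumptions): a 3D query point gathers features from three axis-aligned feature planes (XY, XZ, YZ) and aggregates them, so a full 3D feature volume never has to be stored.

```python
import numpy as np

R, C = 32, 8  # assumed plane resolution and channel count (toy values)
rng = np.random.default_rng(0)
planes = {k: rng.random((R, R, C)) for k in ("xy", "xz", "yz")}

def triplane_feature(p):
    """p: 3D point with coordinates in [0, 1); nearest-neighbor lookup
    on the three axis-aligned planes, aggregated by summation."""
    x, y, z = np.clip((np.asarray(p) * R).astype(int), 0, R - 1)
    return planes["xy"][x, y] + planes["xz"][x, z] + planes["yz"][y, z]

feat = triplane_feature([0.2, 0.5, 0.9])
print(feat.shape)  # one C-dimensional feature per query point
```

The memory cost scales with three 2D planes rather than one 3D grid, which is the intuition behind the low GPU footprint reported above.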
conda create -n etformer python=3.10
conda activate etformer
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip3 install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0/index.html
pip3 install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip3 install spconv-cu118
pip3 install -r requirements.txt
# Check the correct version: https://shi-labs.com/natten/wheels/
pip3 install natten==0.14.6+torch200cu118 -f https://shi-labs.com/natten/wheels/
pip3 install -v -e submodules/deform_attn_3d/
- SemanticKITTI
Download the datasets:
- The semantic scene completion dataset v1.1, odometry data, and poses (SemanticKITTI voxel data, 700 MB) from the SemanticKITTI website.
- The RGB images (odometry data set, color, 65 GB) from the KITTI Odometry website.
- The calibration and pose files from voxformer/preprocess/data_odometry_calib/sequences.
- The preprocessed ground truth (~700 MB) from labels.
- The voxelized pseudo point clouds and query proposals (~400 MB), generated with MobileStereoNet, from sequences_msnet3d_sweep10.
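After downloading, a directory layout along the following lines is expected (a sketch inferred from the download list above and VoxFormer's conventions; verify the exact folder names against the preprocessing scripts):

```
DATA_ROOT_kitti_folder/
├── sequences/                    # per sequence: calib.txt, poses.txt, image_2/, voxels/
│   ├── 00/
│   ├── ...
│   └── 21/
├── labels/                       # preprocessed ground truth
└── sequences_msnet3d_sweep10/    # pseudo point clouds and query proposals
```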
Preprocess the targets:
python preprocess_data.py --data_root="DATA_ROOT_kitti_folder" --data_config="data/semantic_kitti/semantic-kitti.yaml" --batch=1 --index=0 --type=0
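The `--batch`/`--index` flags appear to shard the preprocessing work across parallel invocations (an assumption about their semantics; the `shard` helper below is a hypothetical illustration, not part of the repo). With `--batch=1 --index=0` a single worker processes everything; with `--batch=N`, worker `--index=i` would handle every N-th item:

```python
# Hypothetical illustration of --batch/--index style sharding:
# worker `index` out of `num_shards` processes every num_shards-th file.
def shard(items, num_shards, index):
    return items[index::num_shards]

files = [f"{i:06d}.bin" for i in range(10)]
print(shard(files, 1, 0))  # --batch=1 --index=0: one worker handles all files
print(shard(files, 4, 1))  # worker 1 of 4 handles files 1, 5, 9
```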
mkdir pretrained_models
cd pretrained_models
Download the StageOne and StageTwo pretrained models into this folder.
Train stage one:
python3 -m torch.distributed.launch --nproc_per_node=4 main.py --only_load_model --model_type=3 --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --snapshot="pretrained_models/stage1.pth" --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --wandb_api="W&B api" --wandb_proj="W&B project name"
Generate data for the 2nd stage:
python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=3 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage1.pth" --generate_data
Train stage two:
python3 -m torch.distributed.launch --nproc_per_node=4 main.py --only_load_model --model_type=2 --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --snapshot="pretrained_models/stage2.pth" --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --wandb_api="W&B api" --wandb_proj="W&B project name"
Evaluate stage one:
python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=3 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage1.pth"
Evaluate stage two:
python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=2 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage2.pth"
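The IoU and mIoU numbers reported for ET-Former follow the standard definitions over a per-class confusion matrix; the sketch below shows those definitions (a minimal illustration, not the repo's exact evaluation code):

```python
import numpy as np

def iou_per_class(conf):
    """conf[i, j] = number of voxels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp   # belonging to the class, but missed
    denom = tp + fp + fn
    return np.where(denom > 0, tp / np.maximum(denom, 1), 0.0)

conf = np.array([[50, 2],
                 [3, 45]])
ious = iou_per_class(conf)
print(ious)          # per-class IoU
print(ious.mean())   # mIoU: mean over classes
```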
If this work is helpful for your research, please cite the following BibTeX entry.
@inproceedings{liang2024etformer,
title={ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera},
author={Jing Liang and He Yin and Xuewei Qi and Jong Jin Park and Min Sun and Rajasimman Madhivanan and Dinesh Manocha},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year={2025},
organization={IEEE}
}
Many thanks to these excellent open source projects:
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.