
amazon-science/ET-Former

ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha

[PDF] [Project] [Intro Video]

News

  • [2025/02]: We submitted the paper to IROS 2025;
  • [2025/06]: Our paper was accepted by IROS 2025.

Abstract

We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for the semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene compared to other SOTA approaches and reduces noise in the semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps can help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the SemanticKITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the SemanticKITTI test set, with a notably low training memory consumption of 10.9 GB.

Method

Figure 1. Overall Architecture of ET-Former. We present a two-stage pipeline that processes mono-cam images and generates both a semantic occupancy map m_s and its corresponding uncertainty map m_u. In stage 1, we introduce a novel triplane-based deformable attention model to generate the occupancy queries m_o from the given mono-cam images, which reduces high-dimensional 3D feature processing to 2D computations. In stage 2, we employ the efficient triplane-based deformable attention mechanism to generate the semantic map, taking the inferred voxels from stage 1 as input and conditioning on the RGB image. To estimate the uncertainty in the semantic map, we incorporate a CVAE and quantify the uncertainty using the variance of the CVAE latent samples.
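To make the triplane idea and the uncertainty estimate concrete, here is a minimal PyTorch sketch (an illustration only, not the repository's implementation; all function names, tensor shapes, and the decoder interface are assumptions). It shows how 3D query points can be projected onto three axis-aligned feature planes and sampled in 2D, and how the variance over several CVAE latent samples can serve as a per-voxel uncertainty.

import torch
import torch.nn.functional as F

def sample_triplane_features(planes, coords):
    # planes: dict with 'xy', 'xz', 'yz' feature maps of shape (B, C, H, W)  [assumed layout]
    # coords: (B, N, 3) query coordinates normalized to [-1, 1]
    feats = []
    for axes, key in (((0, 1), "xy"), ((0, 2), "xz"), ((1, 2), "yz")):
        # Project each 3D point onto a plane by dropping one axis, then
        # bilinearly sample that plane's 2D feature map, so the lookup stays
        # 2D instead of touching a dense 3D volume.
        grid = coords[..., list(axes)].unsqueeze(2)                 # (B, N, 1, 2)
        f = F.grid_sample(planes[key], grid, align_corners=False)   # (B, C, N, 1)
        feats.append(f.squeeze(-1).permute(0, 2, 1))                # (B, N, C)
    return torch.cat(feats, dim=-1)                                 # (B, N, 3C)

def cvae_uncertainty(decoder, condition, latent_dim=32, num_samples=8):
    # Decode several latent draws conditioned on the same input and use the
    # variance of the resulting class probabilities as the uncertainty map.
    logits = torch.stack(
        [decoder(torch.randn(condition.shape[0], latent_dim, device=condition.device), condition)
         for _ in range(num_samples)], dim=0)
    return logits.softmax(dim=-1).var(dim=0)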

Installation

conda create -n etformer python=3.10
conda activate etformer
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip3 install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0/index.html
pip3 install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip3 install spconv-cu118
pip3 install -r requirements.txt

# Check the correct version: https://shi-labs.com/natten/wheels/
pip3 install natten==0.14.6+torch200cu118 -f https://shi-labs.com/natten/wheels/

pip3 install -v -e submodules/deform_attn_3d/
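As an optional sanity check after installation (a suggestion, not part of the official setup), the following Python lines, run inside the etformer environment, confirm that the CUDA-specific wheels installed above import consistently.

# Optional sanity check: run inside the etformer environment after the steps above.
import torch, mmcv, spconv, natten, torch_scatter
print("torch", torch.__version__, "| CUDA build", torch.version.cuda, "| CUDA available", torch.cuda.is_available())
print("mmcv", mmcv.__version__)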

Dataset

  • SemanticKITTI

Download datasets:

  • The semantic scene completion dataset v1.1, odometry data, and poses (SemanticKITTI voxel data, 700 MB) from the SemanticKITTI website.
  • The RGB images (Download odometry data set (color, 65 GB)) from KITTI Odometry website.
  • The calibration and pose files from voxformer/preprocess/data_odometry_calib/sequences.
  • The preprocessed ground truth (~700MB) from labels.
  • The voxelized pseudo point cloud and query proposals (~400MB) based on MobileStereoNet from sequences_msnet3d_sweep10.

Preprocess targets:

python preprocess_data.py --data_root="DATA_ROOT_kitti_folder" --data_config="data/semantic_kitti/semantic-kitti.yaml" --batch=1 --index=0 --type=0

Download Pretrained Models

mkdir pretrained_models
cd pretrained_models

Download StageOne and StageTwo into the pretrained_models folder.

Training

Train stage one:

python3 -m torch.distributed.launch --nproc_per_node=4 main.py --only_load_model --model_type=3 --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --snapshot="pretrained_models/stage1.pth" --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --wandb_api="W&B api" --wandb_proj="W&B project name"

Generate data for the 2nd stage:

python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=3 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage1.pth" --generate_data

Train stage two:

python3 -m torch.distributed.launch --nproc_per_node=4 main.py --only_load_model --model_type=2 --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --snapshot="pretrained_models/stage2.pth" --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --wandb_api="W&B api" --wandb_proj="W&B project name"

Evaluation

Evaluate stage one:

python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=3 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage1.pth"

Evaluate stage two:

python3 evaluate.py --data_cfg="data/semantic_kitti/semantic-kitti.yaml" --model_type=2 --data_root="DATA_ROOT_kitti_folder" --batch_size=1 --snapshot="pretrained_models/stage2.pth"
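For reference, the reported numbers follow the standard semantic scene completion metrics: a binary occupancy IoU for scene completion and a mean IoU over the semantic classes. The sketch below illustrates those definitions only; it is not the repository's evaluation code, and the class count and ignore label are assumptions.

import numpy as np

def ssc_metrics(pred, gt, num_classes=20, ignore_index=255):
    # pred, gt: integer voxel label arrays of the same shape; 0 = empty (assumed convention).
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]

    # Scene completion IoU: occupied vs. empty, regardless of class.
    tp = np.sum((pred > 0) & (gt > 0))
    fp = np.sum((pred > 0) & (gt == 0))
    fn = np.sum((pred == 0) & (gt > 0))
    iou = tp / max(tp + fp + fn, 1)

    # Semantic mIoU: per-class IoU averaged over the non-empty classes.
    ious = []
    for c in range(1, num_classes):
        tp_c = np.sum((pred == c) & (gt == c))
        fp_c = np.sum((pred == c) & (gt != c))
        fn_c = np.sum((pred != c) & (gt == c))
        denom = tp_c + fp_c + fn_c
        if denom > 0:
            ious.append(tp_c / denom)
    return iou, (float(np.mean(ious)) if ious else 0.0)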

Bibtex

If this work is helpful for your research, please cite the following BibTeX entry.

@inproceedings{liang2024etformer,
  title={ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera}, 
  author={Jing Liang and He Yin and Xuewei Qi and Jong Jin Park and Min Sun and Rajasimman Madhivanan and Dinesh Manocha},
  booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2025},
  organization={IEEE}
}

Acknowledgement

Many thanks to these excellent open source projects:

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
