We introduce SoundVista: a neural network pipeline that generates the ambient sound of an arbitrary scene at novel viewpoints, without requiring constraints on, or prior knowledge of, the sound sources.
Please watch with headphones or speakers that support binaural audio!
👉 Click here to watch the demo video
Data folder structure (mp3d):
- sim_scenes: panoramic RGB-D pkl files rendered with sound-spaces; see scripts/demo/mp3d_continuous_pano_render.py
- benchmark_pkl
- metadata sound-spaces
- sim_audios
- sounds sound-spaces
- 1s_all
- semantic_splits
- acoustic_params echo t60 npy
- binaural_rirs
- ambisonic_rirs
- benchmark index files: mp3d_mulv3_sparse_new.pkl (train)
- budget number: ref_sampler_budget.pkl
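Before training, it can help to confirm that the entries listed above are actually present. The following is a minimal sketch, not part of the repository: `find_missing_entries` and `EXPECTED_ENTRIES` are hypothetical names, and the search is recursive because the listing does not fix how the folders are nested.

```python
from pathlib import Path

# Entry names copied from the listing above; their nesting is NOT assumed,
# so each name is looked up anywhere under the data root.
EXPECTED_ENTRIES = [
    "sim_scenes",
    "benchmark_pkl",
    "metadata",
    "sim_audios",
    "sounds",
    "binaural_rirs",
    "ambisonic_rirs",
]


def find_missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected names that do not appear anywhere under root."""
    root = Path(root)
    # Collect the base name of every file and directory below root.
    present = {p.name for p in root.rglob("*")}
    return [name for name in expected if name not in present]
```

Calling `find_missing_entries("data/mp3d")` (the path is an assumption; use your own data root) returns the names that still need to be rendered or downloaded.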
Compile sound-spaces first to render the SoundSpace-Ambient Matterport data.
CUDA_VISIBLE_DEVICES=0 python3 tools/train_vab.py --cfg configs/vab_mp3d.yaml
CUDA_VISIBLE_DEVICES=0 python3 tools/ref_sample_mp3d.py --cfg configs/vab_mp3d.yaml --visualize-path output/ref_sampling/ --eval-metrics model.resume_path data/pretrained_weights/vab_pretrain.pth
CUDA_VISIBLE_DEVICES=0 python3 tools/train_mp3d.py --cfg configs/soundvista_mp3d.yaml --visualize-path output/ref_sampling/ train.pretrained data/pretrained_weights/vab_pretrain.pth dataset.img_num_per_gpu 16 output_dir soundvista_mp3d
CUDA_VISIBLE_DEVICES=0 python3 tools/eval_mp3d.py --cfg configs/soundvista_mp3d.yaml --visualize-path output/ref_sampling/ --eval-scenes unseen model.resume_path data/pretrained_weights/soundvista_mp3d.pth output_dir soundvista_mp3d_eval
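The commands above mix ordinary `--flag` arguments with trailing `key value` pairs such as `model.resume_path ...` and `output_dir ...`, which appear to be dotted overrides of entries in the YAML config. The sketch below illustrates how such pairs could be merged into a nested dictionary. It is an illustration under that assumption, not the repository's actual parser, and `apply_overrides` is a hypothetical helper.

```python
def apply_overrides(config, overrides):
    """Apply a flat [key, value, key, value, ...] list to a nested dict.

    Dotted keys such as "model.resume_path" address nested entries.
    Values are stored as given (strings from the command line); no type
    conversion is attempted in this sketch.
    """
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must come in key/value pairs")
    for key, value in zip(overrides[0::2], overrides[1::2]):
        node = config
        *parents, leaf = key.split(".")
        # Walk (or create) the intermediate dictionaries.
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config
```

For example, `apply_overrides({}, ["model.resume_path", "data/pretrained_weights/vab_pretrain.pth"])` places the checkpoint path under `config["model"]["resume_path"]`.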
# step 1: render route and reference pano RGBD and audio
python3 scripts/demo/mp3d_demovis.py
# step 2: render continuous target pano RGBD and video
python3 scripts/demo/mp3d_continuous_pano_render.py #pano RGB-D pkl file
python3 scripts/demo/mp3d_continuous_video_render.py #video
# step 3: render demo audio with SoundVista
CUDA_VISIBLE_DEVICES=0 python3 tools/demo_mp3d.py --cfg configs/soundvista_mp3d.yaml --visualize-path output/ref_sampling/ model.resume_path data/pretrained_weights/soundvista_mp3d.pth
# step 4: combine audio and video for the final demo video (fps=18)
e.g.:
ffmpeg -i 'demo_files/sT4fr6TAbpF_continuous_vis.mp4' -i 'demo_files/sT4fr6TAbpF.wav' -c:v copy -c:a aac 'demo_files/sT4fr6TAbpF_output.mp4'
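To repeat step 4 for several scenes, the ffmpeg call can be wrapped in a small helper. This is a minimal sketch that assumes the file naming seen in the example above (`<scene>_continuous_vis.mp4`, `<scene>.wav`, `<scene>_output.mp4`) holds for other scenes too; `build_mux_command` and `mux_scene` are hypothetical names, not scripts shipped with the repository.

```python
import subprocess
from pathlib import Path


def build_mux_command(scene_id, demo_dir="demo_files"):
    """Build the ffmpeg argument list that muxes a scene's video and audio."""
    demo_dir = Path(demo_dir)
    video = demo_dir / f"{scene_id}_continuous_vis.mp4"
    audio = demo_dir / f"{scene_id}.wav"
    output = demo_dir / f"{scene_id}_output.mp4"
    return [
        "ffmpeg",
        "-i", str(video),   # rendered video from step 2
        "-i", str(audio),   # rendered audio from step 3
        "-c:v", "copy",     # keep the video stream as-is (no re-encode)
        "-c:a", "aac",      # encode the audio as AAC
        str(output),
    ]


def mux_scene(scene_id, demo_dir="demo_files"):
    """Run ffmpeg for one scene; raises CalledProcessError on failure."""
    subprocess.run(build_mux_command(scene_id, demo_dir), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `mux_scene("sT4fr6TAbpF")` runs the same command as the example above, provided ffmpeg is on the `PATH`.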
If you find this repository and dataset useful in your research, please consider giving it a star ⭐ and citing our paper using the following BibTeX entry.
@inproceedings{chen2025soundvista,
title={SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding},
author={Chen, Mingfei and Gebru, Israel D and Ananthabhotla, Ishwarya and Richardt, Christian and Markovic, Dejan and Sandakly, Jake and Krenn, Steven and Keebler, Todd and Shlizerman, Eli and Richard, Alexander},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={8331--8341},
year={2025}
}
The code and dataset are released under the CC-NC 4.0 International license.
