
Visual Acoustic Fields

Yuelei Li1,  Hyunjin Kim1,  Fangneng Zhan2,3,  Ri-Zhao Qiu1,  Mazeyu Ji1,   Xiaojun Shan1,
Xueyan Zou1,   Paul Liang3,   Hanspeter Pfister2,   Xiaolong Wang1

1UC San Diego    2Harvard University    3MIT


Paper    Poster    Video    Dataset    GitHub

Unmute the audio to hear the interactive sound samples.




Abstract

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources.


Method

Overview of the Visual Acoustic Fields framework. The model consists of two main components: sound generation and sound localization. Given multiview images, a feature-augmented 3D Gaussian Splatting (feature 3DGS) representation is constructed. For sound generation, localized multi-level features queried from the feature 3DGS are used as conditions to fine-tune a pretrained Stable Audio diffusion model to synthesize impact sounds. For sound localization, a fine-tuned AudioCLIP encoder maps input audio queries to the feature 3DGS, allowing the model to localize the corresponding impact location by computing feature similarity. Trainable, frozen, and fine-tuned components are indicated in the diagram.
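The localization step described above can be sketched as a nearest-feature query: embed the audio, compare it against per-Gaussian features, and return the best-matching 3D position. The function name, array shapes, and cosine-similarity choice below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def localize_impact(audio_embedding, gaussian_features, gaussian_positions):
    """Return the 3D position whose feature best matches the audio query.

    audio_embedding:    (D,)   embedding of the query sound, e.g. from an
                               AudioCLIP-style encoder (hypothetical shape).
    gaussian_features:  (N, D) per-Gaussian features from the feature 3DGS.
    gaussian_positions: (N, 3) Gaussian centers in world coordinates.
    """
    # Cosine similarity between the audio query and every Gaussian's feature.
    a = audio_embedding / np.linalg.norm(audio_embedding)
    f = gaussian_features / np.linalg.norm(gaussian_features, axis=1, keepdims=True)
    sims = f @ a
    # The predicted impact location is the highest-similarity Gaussian.
    return gaussian_positions[np.argmax(sims)], sims

# Toy example: 3 Gaussians with 4-D features.
feats = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 1., 0.]])
pos = np.array([[0., 0., 0.],
                [1., 0., 0.],
                [0., 1., 0.]])
query = np.array([0.1, 0.9, 0.0, 0.0])   # most similar to the second feature
hit, _ = localize_impact(query, feats, pos)
```

In practice the similarity would more likely be computed over rendered feature maps (pixel-wise) rather than raw Gaussian centers, but the argmax-over-similarity idea is the same.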


Dataset

Pipeline for data collection.



Collected Samples


Results

Visualization of sound localization results.
                     
Mel Spectrogram of generated sounds.

