Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

University of Science and Technology of China
arXiv, 2025


TL;DR

Sonic4D is a novel training-free framework that enables spatial audio generation for immersive exploration of 4D scenes. It provides plausible spatial audio that varies across different viewpoints and timestamps.

Pipeline


Pipeline of Sonic4D. We propose a three-stage framework for spatial audio generation: 1) Dynamic Scene and Monaural Audio Generation, which extracts semantically aligned visual and audio priors from a single video; 2) 3D Sound-Source Localization and Tracking, which recovers the sound source's 3D trajectory for precise acoustic simulation; and 3) Physics-Driven Spatial Audio Synthesis, which renders dynamic, viewpoint-adaptive binaural audio via physics-based room impulse response simulation.
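As a rough illustration of the third stage, the sketch below renders a two-channel signal from a monaural track by simulating room impulse responses with pyroomacoustics. The library choice, shoebox geometry, absorption, and positions are illustrative assumptions rather than the authors' configuration, and a full binaural renderer would additionally apply a head/HRTF model, which is omitted here.

```python
import numpy as np
import pyroomacoustics as pra

fs = 44100
mono = np.random.randn(2 * fs)                 # placeholder for the generated monaural track

# Shoebox room with the estimated source and a two-capsule "listener"
# standing in for the left/right ears at the current camera pose.
room = pra.ShoeBox([8.0, 6.0, 3.5], fs=fs, materials=pra.Material(0.25), max_order=10)
room.add_source([4.0, 3.0, 1.5], signal=mono)  # source position from stage 2 (placeholder)
ears = np.array([[3.0, 3.2],                   # x: capsules ~20 cm apart
                 [1.5, 1.5],                   # y
                 [1.6, 1.6]])                  # z: ear height
room.add_microphone_array(pra.MicrophoneArray(ears, fs))

room.simulate()                                # convolves the source with simulated RIRs
two_channel = room.mic_array.signals           # shape (2, n_samples): left / right channel
```

Re-running such a simulation with the source and listener positions updated at each timestep is one straightforward way to obtain audio that changes with the viewpoint.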

🎧 Please wear headphones and turn up the volume to enjoy the examples. 🎧
Fixed View Rendering
Sonic4D enables users to observe dynamic 4D scenes from static camera viewpoints and synthesizes binaural audio that spatially aligns with the subject's off-center position; a short note on the resulting interaural cues follows the examples below.
Source Video + MMAudio
Sonic4D
A blue crow in the forest.
[Fix the camera on the right while rotating it slightly | x+1.5 & φ+5°]
A man playing ukulele in the street.
[Fix the camera on the left | x-1.0]
A monk is beating a wooden fish.
[Fixed camera (no movement)]
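The room-acoustics simulation sketched above encodes left/right cues implicitly. Purely as intuition (not the paper's method), a spherical-head model shows why an off-center subject should produce an interaural time difference between the two channels:

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Spherical-head (Woodworth) ITD for a far-field source at the given azimuth."""
    az = np.radians(azimuth_deg)
    return (head_radius / c) * (az + np.sin(az))

print(woodworth_itd(0.0))   # centred source    -> 0 s
print(woodworth_itd(30.0))  # off-centre source -> ~2.6e-4 s (about 260 microseconds)
```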
Dynamic Sound Source Tracking
Sonic4D tracks the 3D location and trajectory of the sound source, enabling the synthesis of spatial audio that consistently follows the movement of the scene's subject; a minimal tracking sketch follows the examples below.
Source Video + MMAudio
Sonic4D
A helicopter hovering in the air.
[Fixed camera (no movement)]
F1 car speeding on the track.
[Fixed camera (no movement)]
The tractor moves from left to right.
[Fixed camera (no movement)]
Train whistles as it enters the station.
[Fixed camera (no movement)]
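A minimal sketch of how per-frame source positions could be recovered and smoothed is shown here. It assumes a pinhole camera with known intrinsics; the grounded pixels and depths stand in for the outputs of a visual-grounding model and the reconstructed 4D scene, and all values are placeholders rather than Sonic4D's actual interface.

```python
import numpy as np

def unproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth z into camera-space XYZ (pinhole model)."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Placeholder intrinsics and per-frame observations of the sound-emitting subject.
fx = fy = 720.0
cx, cy = 640.0, 360.0
pixels = [(612, 340), (640, 352), (671, 349), (702, 345)]   # grounded (u, v) per frame
depths = [3.1, 3.0, 2.9, 2.8]                                # metres, sampled from the scene

traj = np.stack([unproject(u, v, z, fx, fy, cx, cy)
                 for (u, v), z in zip(pixels, depths)])

# Light temporal smoothing (moving average) to suppress per-frame localization jitter
# before the trajectory is handed to the acoustic simulator.
kernel = np.ones(3) / 3.0
smooth = np.stack([np.convolve(traj[:, i], kernel, mode="same") for i in range(3)], axis=1)
```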
Dynamic Camera Trajectories
Sonic4D supports diverse and customizable camera trajectories while maintaining spatial audio consistency with changing viewpoints, delivering an immersive audiovisual experience; a sketch of mapping camera parameters to listener poses follows the examples below.
Source Video + MMAudio
Sonic4D
Drummer playing African drums.
[Orbit the camera around the subject | φ-20° → φ+0° → φ+20°]
A car speeds away.
[Pan the camera to the right | x+0 → x+2.0]
A man is playing the piano.
[Sweeping from the top-left to bottom-center and up to the top-right | φ-30°, θ+20° → φ+0°, θ+0° → φ+30°, θ+20°]
A woman playing the flute.
[Pan the camera to the right | x+0 → x+2.5]
The train is moving forward.
[Orbit the camera around the subject | φ-20° → φ+0° → φ+20°]
Fountain water splashing.
[Pan the camera to the right | x+0 → x+2.0]
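The bracketed trajectory annotations above (x for lateral translation, φ/θ for azimuth/elevation, r for radius) suggest a simple orbit parameterization. The sketch below maps such parameters to listener positions around the subject; the angle convention and all values are assumptions for illustration, not Sonic4D's actual camera controls.

```python
import numpy as np

def orbit_listener(center, r, theta_deg, phi_deg):
    """Listener position on a sphere of radius r around `center`
    (phi = azimuth, theta = elevation; convention assumed here)."""
    theta, phi = np.radians(theta_deg), np.radians(phi_deg)
    return center + r * np.array([np.cos(theta) * np.sin(phi),
                                  np.sin(theta),
                                  np.cos(theta) * np.cos(phi)])

subject = np.array([4.0, 1.5, 3.0])                 # estimated source position (placeholder)
phis = np.linspace(-20.0, 20.0, num=49)             # e.g. the orbit example: phi -20° -> +20°
listener_path = [orbit_listener(subject, r=2.0, theta_deg=0.0, phi_deg=p) for p in phis]
# Pairing each listener pose with the corresponding rendered frame and re-simulating the
# acoustics keeps the binaural audio consistent with the changing viewpoint.
```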
Push in / Pull out
Sonic4D enables camera push-in and pull-out effects relative to the subject, creating natural variations in audio amplitude (i.e., perceived loudness) as the camera-to-subject distance changes; a toy distance-attenuation sketch follows the examples below.
Source Video + MMAudio
Sonic4D
A baby is crying.
[Pull out the camera | r+0 → r-0.8]
The volcano is erupting.
[Push in the camera | r+0 → r+0.8]
Pianist playing the piano close-up.
[Pull out the camera | r+0 → r-0.8]
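In the physics-based simulation, loudness changes with distance fall out of the room acoustics automatically; the sketch below only illustrates the underlying intuition with a free-field 1/d amplitude falloff, and the distances are placeholders.

```python
import numpy as np

def distance_gain(d, d_ref=1.0, d_min=0.2):
    """Free-field 1/d amplitude falloff relative to a reference distance (clamped near zero)."""
    return d_ref / np.maximum(d, d_min)

mono = np.random.randn(44100)               # placeholder for one second of the monaural track
d = np.linspace(1.0, 2.5, mono.size)        # pull-out: camera-to-subject distance grows
attenuated = mono * distance_gain(d)        # perceived loudness falls smoothly with distance
```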

Abstract

Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users.
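The references below include DBSCAN, which suggests density-based clustering plays a role in turning the grounded points into a robust source location. The sketch below shows one plausible use of it: clustering back-projected 3D points and taking the densest cluster's centroid. The data and parameters are synthetic assumptions, not the paper's.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for 3D points back-projected from pixels the grounding model
# attributes to the sound source, contaminated with scattered background points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([4.0, 1.5, 3.0], scale=0.05, size=(200, 3)),  # dense cluster on the source
    rng.uniform(0.0, 8.0, size=(30, 3)),                      # outliers
])

labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(points)
largest = max(set(labels) - {-1}, key=lambda k: int(np.sum(labels == k)))
source_xyz = points[labels == largest].mean(axis=0)           # robust source-location estimate
```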


References

  1. GroundingGPT: Language Enhanced Multi-modal Grounding Model
  2. A density-based algorithm for discovering clusters in large spatial databases with noise
  3. TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
  4. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis