Linfeng Feng, Lei Zhao, Boyu Zhu, Xiao-Lei Zhang, Xuelong Li
Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating a sense of spatiality, has not yet been explored. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is additionally provided. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention (FMHA) to integrate text tokens, which enhances the model's multimodal generation capability. Additionally, we propose a binaural source localization model to assess the quality of the generated audio. Finally, we design a data augmentation strategy to build diverse datasets, which enables the model to spatialize sound events across various spatial positions. Experimental results demonstrate that our model places sounds at the specified locations accurately, achieving competitive performance in both localization accuracy and signal distortion.
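The abstract does not spell out the FMHA design. As a rough illustration only, the sketch below shows one plausible way text tokens could be fused into an audio token stream via cross-attention in PyTorch; the class name, dimensions, and residual fusion are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class FusionMultiHeadAttention(nn.Module):
    """Hypothetical sketch of fusion multi-head attention (FMHA).

    Audio tokens act as queries; text tokens act as keys/values, so each
    audio frame can absorb directional cues from the caption. The actual
    AudioSpa design may differ; this only illustrates the fusion idea.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, T_audio, dim); text_tokens: (batch, T_text, dim)
        fused, _ = self.attn(query=audio_tokens, key=text_tokens, value=text_tokens)
        return self.norm(audio_tokens + fused)  # residual fusion of the two streams

# Toy usage: a clip of 250 audio frames fused with a 12-token caption.
audio = torch.randn(1, 250, 512)
text = torch.randn(1, 12, 512)
print(FusionMultiHeadAttention()(audio, text).shape)  # torch.Size([1, 250, 512])
```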
We present generation results from our proposed AudioSpa across multiple conditions. Fig. 1 illustrates the directional angles referenced in this work, where 0 degrees corresponds to the front of the listener, 90 degrees to the left, and so on.
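For readers following along in code, a minimal sketch of this angle convention, assuming angles increase counterclockwise from the front as Fig. 1 suggests (the function name is ours):

```python
import math

def azimuth_to_unit_vector(deg: float) -> tuple[float, float]:
    """Map an azimuth in degrees to a listener-frame direction.

    Assumed convention from Fig. 1: 0 degrees = front (+x),
    90 degrees = left (+y), increasing counterclockwise.
    """
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))

print(azimuth_to_unit_vector(90.0))  # ~(0.0, 1.0): directly to the listener's left
```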
Since the setup of a single sound source in a clean environment is relatively simple, we use only one sound for comparison. We selected a violin recording, segmented the entire track evenly, and placed the segments uniformly from 0 to 360 degrees. Listeners can use these samples to experience the directionality described in Fig. 1. The baseline is simulated using head-related impulse responses (HRIRs).
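As a sketch of what such an HRIR baseline involves, assuming per-azimuth left/right impulse responses from a measured HRIR set (function and variable names here are placeholders):

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Render a mono signal binaurally for one azimuth.

    hrir_left / hrir_right are the impulse responses measured at that
    azimuth for the left and right ears (placeholder inputs here).
    """
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right])  # shape: (2, len(mono) + len(hrir) - 1)
```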
*(Audio samples: Mono | Baseline | AudioSpa.)*
Previous methods typically place all sound events in a single direction. By using text captions, AudioSpa can control the direction of a specific sound event.
| Text caption | Mono | Baseline | AudioSpa |
|---|---|---|---|
| At 250 degrees, the sound of drums breaks through the quiet. | | | |
| A faint run can be heard approaching from 270 degrees. | | | |
| Bass echoes from the 130 degrees angle. | | | |
| Camera can be distinctly heard from 90 degrees. | | | |
| From the 210 degrees angle, the wind instrument and woodwind instrument steadily increases in volume. | | | |
| From 140 degrees, the music resonates sharply. | | | |
| Piano emerges from 140 degrees. | | | |
| The sound of rattle (instrument) resonates at 240 degrees. | | | |
| You hear a distant wind instrument and woodwind instrument originating from 250 degrees. | | | |
| You hear the sound of guitar coming from 280 degrees. | | | |
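The captions above follow a small set of recurring patterns. A hypothetical template-based generator in that spirit is sketched below; the templates and names are illustrative, not the paper's actual data augmentation pipeline.

```python
import random

# Illustrative caption templates echoing the demo prompts above;
# the paper's augmentation strategy may use different templates.
TEMPLATES = [
    "At {angle} degrees, the sound of {event} breaks through the quiet.",
    "{event} can be distinctly heard from {angle} degrees.",
    "You hear the sound of {event} coming from {angle} degrees.",
]

def make_caption(event: str, angle: int) -> str:
    """Pair a sound event label with an azimuth in a random template."""
    return random.choice(TEMPLATES).format(event=event, angle=angle)

print(make_caption("guitar", 280))
# e.g. "You hear the sound of guitar coming from 280 degrees."
```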