Towards a flexible and unified architecture for speech enhancement
Deploying neural networks on devices with vastly different computational budgets is critical yet challenging. This paper aims to create a single network that can be sliced into sub-networks of varying sizes without fine-tuning or distillation, so that it can run directly under varying resource constraints. To scale broadly, we make both the width and depth of the network flexible. For width, we introduce FlexLinear, a linear layer with an adjustable neuron count, and extend it to FlexAttention, which supports an adjustable number of attention heads. We also propose FlexRMSNorm, a normalization layer that adapts to different widths. Combined with early-exit strategies, these components form a network that scales in both width and depth. Built from these flexible modules, we present SEFlow, a causal and sampling-rate-agnostic model that handles a wide range of speech enhancement tasks, including denoising, dereverberation, declipping, and packet loss concealment. Experimental results demonstrate that SEFlow is comparable to state-of-the-art task-specific models across multiple speech enhancement tasks. Remarkably, even sub-networks as small as 1% of the full network remain effective in low-resource scenarios.
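To make the width-flexible idea concrete, the sketch below shows one plausible way such layers could be implemented in PyTorch: the full-size weights are trained once, and at inference a sub-network simply slices the leading rows and columns to the requested width. The class names mirror the paper's FlexLinear and FlexRMSNorm, but the code is an illustrative assumption, not SEFlow's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexLinear(nn.Module):
    """Linear layer whose input/output neuron counts can be reduced at run time."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * max_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_features: int) -> torch.Tensor:
        in_features = x.shape[-1]
        # Slice the full weight matrix down to the active sub-network width.
        w = self.weight[:out_features, :in_features]
        b = self.bias[:out_features]
        return F.linear(x, w, b)


class FlexRMSNorm(nn.Module):
    """RMSNorm whose learnable scale adapts to whatever width it receives."""

    def __init__(self, max_dim: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(max_dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dim = x.shape[-1]
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale[:dim]


# The same parameters serve both a full-width and a half-width sub-network.
layer = FlexLinear(max_in=512, max_out=512)
norm = FlexRMSNorm(max_dim=512)
y_full = norm(layer(torch.randn(2, 100, 512), out_features=512))   # full model
y_small = norm(layer(torch.randn(2, 100, 256), out_features=256))  # sliced sub-network
```

A FlexAttention layer would follow the same pattern, slicing the projection weights head-wise so that only the first few attention heads are computed.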
Description
We present multiple demonstrations covering clipping, noise, reverberation, packet loss, and complex distortions. The notation "f-B-H" indicates a model with B residual blocks and H attention heads. For example, "f-1-1" refers to the smallest model with 1 block and 1 head, while "f-12-4" denotes the full model with 12 blocks and 4 heads. The results show that even extremely small models are effective for simpler distortions, and performance improves consistently with model size; see the sketch below for how such a tag could map onto a sub-network.
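The following minimal sketch shows how an "f-B-H" tag could be parsed and applied: the depth dimension is realized by early-exiting after the first B residual blocks, while H would control how many attention heads the FlexAttention layers keep. The parsing helper and the toy block stack are hypothetical and only illustrate the naming convention used on this page.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class FlexConfig:
    num_blocks: int  # "B" in f-B-H (depth / early exit)
    num_heads: int   # "H" in f-B-H (width of attention)


def parse_tag(tag: str) -> FlexConfig:
    """Parse a demo tag such as 'f-12-4' into a sub-network configuration."""
    _, blocks, heads = tag.split("-")
    return FlexConfig(num_blocks=int(blocks), num_heads=int(heads))


# Toy stack of 12 residual blocks standing in for the full model.
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(12)])
cfg = parse_tag("f-3-1")

x = torch.randn(1, 10, 64)
# Early exit in depth: only the first B blocks are executed.
for block in blocks[: cfg.num_blocks]:
    x = x + torch.relu(block(x))
```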