Audio and acoustics

Video

Distant conversational speech recognition: Challenges and Opportunities

October 17, 2025 | Dr. Samuele Cornell, Sunit Sivasankaran

State-of-the-art ASR systems excel on close-talk benchmarks but struggle with far-field conversational speech, where error rates remain above 20%. Current benchmark datasets inadequately assess generalization across domains and real-world conditions, often relying on oracle segmentation…

01:28:41

Video

FOA Tokenizer: Learning Discrete Representations of Spatial Audio with Multichannel VQ-GAN

October 17, 2025 | Parthasaarathy Sudarsanam, Hannes Gamper

Spatial audio captures the directional and environmental characteristics of sound, enabling immersive listening experiences. First-Order Ambisonics (FOA) provides a compact representation of spatial audio by encoding the sound field’s directional components across four channels, allowing…

54:08

Publication

Distributed Asynchronous Device Speech Enhancement via Windowed Cross-Attention

Gene-Ping Yang, Sebastian Braun

Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) | October 2025

Publication

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, Zhizheng Wu

ICLR 2026 | September 2025

Publication

OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

Wei Chu, Yuanzhe Dong, Ke Tan, Dong Han, Xavier Menendez-Pidal, Ruchao Fan, Chenfeng Miao, Chanwoo Kim, Bhiksha Raj, Rita Singh

September 2025

Publication

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Ali Vosoughi, Hannes Gamper, Dimitra Emmanouilidou

Proc. Eur. Signal Process. Conf. (EUSIPCO) | September 2025

ISBN: 978-9-46-459362-4

Video

Make some noise: Teaching the language of audio to an LLM using sound tokens

July 28, 2025 | Shivam Mehta

We investigate the use of low bitrate causal quantized audio representations to fine-tune large language models (LLMs) using LoRA for comprehending and generating audio. Differing from earlier approaches that depend on continuous audio representations for…

44:54

Video

Final intern talk: Distilling Self-Supervised-Learning-Based Speech Quality Assessment into Compact Models

July 20, 2025 | Benjamin Stahl

In this talk, we explore advancements in computational models for speech quality assessment. Self-supervised learning models have emerged as powerful front-ends, outperforming supervised-only models. However, their large size renders them impractical for production tasks. We…

42:02

Publication

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li

NeurIPS 2025 | June 2025

Publication

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao

NeurIPS 2025 | May 2025