I’m a 4th-year (final-year) PhD student at UT Austin, advised by David Harwath, and a part-time student researcher at FAIR, Meta, working with Wei-Ning Hsu. I will graduate in May 2025 and am on the job market; please reach out if interested! I have published first-authored papers on speech and audio recognition and generation at Interspeech, ICASSP, ACL, and ECCV. In particular, the tasks I have published on include:

  1. Generation: Text-to-Speech, Speech Editing, Video-to-Audio
  2. Recognition: Automatic Speech Recognition, Speech Translation, Audio-Visual Speech Recognition, Speech-Image Retrieval, Speech Segmentation, Speech Quantization, Speech Representation Learning

Research Highlights:

  • VoiceCraft (ACL 2024), a zero-shot TTS and speech editing model, garnered 7.5k stars on GitHub within five months of its release and reached #1 on GitHub’s global trending list.
  • Audio-Visual Latent Diffusion Model (ECCV 2024) generates realistic action sounds for silent egocentric videos and demonstrates zero-shot transfer to VR games.
  • PromptingWhisper (Interspeech 2023) pioneered the application of prompt-based techniques to large speech models, enabling zero-shot tasks such as audio-visual speech recognition and speech translation without fine-tuning.
  • Visually Grounded Speech Research (Interspeech 2023, 2022, ICASSP 2022, ASRU 2023, AAAI Workshop 2022) not only achieves state-of-the-art performance in speech-image retrieval, zero-resource speech recognition, and data-efficient representation learning, but also draws parallels to human language development, analyzed at the Annual Meeting of the Cognitive Science Society (CogSci).

In addition to my advisor, I have had the pleasure of working with and learning from many amazing senior researchers, including (in chronological order): Karen Livescu (TTIC/UChicago), Raymond Mooney (UT), James Glass (MIT), Yoon Kim (MIT), Abdelrahman Mohamed (Rembrand), Jonathan Le Roux (MERL), Shinji Watanabe (CMU), Hung-yi Lee (NTU), Kristen Grauman (UT/Meta), and Wei-Ning Hsu (Meta), among others.

I have a Master’s degree in Statistics from The University of Chicago, and a Bachelor’s degree in Mathematics from Beijing Normal University.

Contact: pyp@utexas.edu

Papers

(The asterisk ‘*’ denotes equal contribution)

  1. VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
    ACL, 2024 (Oral)
    pdf website interactive demo code
  2. Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
    Changan Chen*, Puyuan Peng*, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
    ECCV, 2024 (Oral)
    pdf website code data
  3. BAT: Learning to Reason about Spatial Sounds with Large Language Models
    Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
    ICML, 2024
    pdf website Spatial-AST code BAT code
  4. Neural Codec Language Models for Disentangled and Textless Voice Conversion
    Alan Baade, Puyuan Peng, David Harwath
    Interspeech, 2024
    pdf code
  5. AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    ICASSP, 2024
    pdf code website
  6. Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
    Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
    Interspeech, 2023
    pdf code
  7. Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
    Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
    Interspeech, 2023
    pdf code
  8. Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
    Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
    Interspeech, 2023
    pdf
  9. Audio-Visual Neural Syntax Acquisition
    Cheng-I Jeff Lai*, Freda Shi*, Puyuan Peng*, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
    ASRU, 2023
    pdf code
  10. Zero-shot Video Moment Retrieval With Off-the-Shelf Models
    Anuj Diwan*, Puyuan Peng*, Raymond J. Mooney
    Workshop on Transfer Learning for Natural Language Processing, 2022
    pdf
  11. Word Discovery in Visually Grounded, Self-Supervised Speech Models
    Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  12. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
    Alan Baade, Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  13. Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
    Puyuan Peng, David Harwath
    The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing at AAAI, 2022
    pdf code
  14. Fast-Slow Transformer for Visually Grounding Speech
    Puyuan Peng, David Harwath
    ICASSP, 2022
    pdf code
  15. A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
    Puyuan Peng, Herman Kamper, Karen Livescu
    The 1st Workshop on Self-Supervised Learning for Speech and Audio Processing at NeurIPS, 2020
    pdf

Talks

May 2024 at Meta AI, New York, USA
May 2022 at Developmental Intelligence Laboratory, Department of Psychology, UT Austin, USA
Jan 2022 at Karen Livescu Group, Toyota Technological Institute at Chicago, USA
Jan 2022 at Cognitive Machine Learning Group, Département d’Études Cognitives, École Normale Supérieure, France