I’m a 4th-year (final-year) PhD student at UT Austin, advised by David Harwath, and a part-time student researcher at FAIR, Meta, working with Wei-Ning Hsu. I’ll be graduating in May 2025 and will be on the job market, so please reach out if interested! I have published first-authored papers on speech and audio recognition and generation at Interspeech, ICASSP, ACL, and ECCV. In particular, the tasks I have published on include:
- Generation: Text-to-Speech, Speech Editing, Video-to-Audio
- Recognition: Automatic Speech Recognition, Speech Translation, Audio-Visual Speech Recognition, Speech-Image Retrieval, Speech Segmentation, Speech Quantization, Speech Representation Learning
Research Highlights:
- VoiceCraft (ACL 2024), a zero-shot TTS and speech editing model, garnered 7.5k stars on GitHub within five months of its release and trended #1 globally.
- Audio-Visual Latent Diffusion Model (ECCV2024) generates realistic action sounds for silent egocentric videos and demonstrates zero-shot transfer capabilities in VR games.
- PromptingWhisper (Interspeech2023) pioneered the application of prompt-based techniques to large speech models for zero-shot tasks such as audio-visual speech recognition and speech translation without fine-tuning.
- Visually Grounded Speech Research (Interspeech 2023 and 2022, ICASSP 2022, ASRU 2023, AAAI Workshop 2022) sets state-of-the-art performance in speech-image retrieval, zero-resource speech recognition, and data-efficient representation learning. It also draws parallels to human language development, analyzed at the Annual Meeting of the Cognitive Science Society (CogSci).
In addition to my advisor, I have had the pleasure of working with and learning from many amazing senior researchers, including (in chronological order): Karen Livescu (TTIC/UChicago), Raymond Mooney (UT), James Glass (MIT), Yoon Kim (MIT), Abdelrahman Mohamed (Rembrand), Jonathan Le Roux (MERL), Shinji Watanabe (CMU), Hung-yi Lee (NTU), Kristen Grauman (UT/Meta), and Wei-Ning Hsu (Meta).
I have a Master’s degree in Statistics from The University of Chicago, and a Bachelor’s degree in Mathematics from Beijing Normal University.
Contact: pyp@utexas.edu
Papers
(The asterisk ‘*’ denotes equal contribution)
- VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
  Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
  ACL, 2024 (Oral)
  pdf website interactive demo code
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
  Changan Chen*, Puyuan Peng*, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
  ECCV, 2024 (Oral)
  pdf website code data
- BAT: Learning to Reason about Spatial Sounds with Large Language Models
  Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
  ICML, 2024
  pdf website Spatial-AST code BAT code
- Neural Codec Language Models for Disentangled and Textless Voice Conversion
  Alan Baade, Puyuan Peng, David Harwath
  Interspeech, 2024
  pdf code
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
  Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
  ICASSP, 2024
  pdf code website
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
  Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
  Interspeech, 2023
  pdf code
- Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
  Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
  Interspeech, 2023
  pdf code
- Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
  Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
  Interspeech, 2023
  pdf
- Audio-Visual Neural Syntax Acquisition
  Cheng-I Jeff Lai*, Freda Shi*, Puyuan Peng*, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
  ASRU, 2023
  pdf code
- Zero-shot Video Moment Retrieval With Off-the-Shelf Models
  Anuj Diwan*, Puyuan Peng*, Raymond J. Mooney
  Workshop on Transfer Learning for Natural Language Processing, 2022
  pdf
- Word Discovery in Visually Grounded, Self-Supervised Speech Models
  Puyuan Peng, David Harwath
  Interspeech, 2022
  pdf code
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
  Alan Baade, Puyuan Peng, David Harwath
  Interspeech, 2022
  pdf code
- Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
  Puyuan Peng, David Harwath
  The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing at AAAI, 2022
  pdf code
- Fast-Slow Transformer for Visually Grounding Speech
  Puyuan Peng, David Harwath
  ICASSP, 2022
  pdf code
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
  Puyuan Peng, Herman Kamper, Karen Livescu
  The 1st Workshop on Self-Supervised Learning for Speech and Audio Processing at NeurIPS, 2020
  pdf
Talks
May 2024 at Meta AI, New York, USA
May 2022 at Developmental Intelligence Laboratory, Department of Psychology, UT Austin, USA
Jan 2022 at Karen Livescu Group, Toyota Technological Institute at Chicago, USA
Jan 2022 at Cognitive Machine Learning Group, Département d’Études Cognitives, École Normale Supérieure, France