Puyuan Peng

Hi! I’m Puyuan Peng, a third-year Computer Science PhD student at UT Austin. I mainly work on speech/audio recognition, understanding, and generation, usually in a multimodal context (e.g., text, vision).

I’m very fortunate to have David Harwath as my advisor. I have also had the pleasure of working with and learning from many amazing senior researchers, including (in chronological order): Karen Livescu (TTIC/UChicago), Raymond Mooney (UT), James Glass (MIT), Yoon Kim (MIT), Abdelrahman Mohamed (Rembrand), Jonathan Le Roux (MERL), Shinji Watanabe (CMU), Hung-yi Lee (NTU), Kristen Grauman (UT/Meta), and Wei-Ning Hsu (Meta).

I have a Master’s degree in Statistics from The University of Chicago, and a Bachelor’s degree in Mathematics from Beijing Normal University.

In the summer of 2024, I’ll be interning at Fundamental AI Research (FAIR) at Meta, working with Wei-Ning Hsu.

In my free time, I like to work out and sing (here is a funny video).

contact: pyp@utexas.edu

Papers

(The asterisk ‘*’ denotes equal contribution)

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath
preprint, 2024
pdf code website
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen*, Puyuan Peng*, Ami Baid, Sherry Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
preprint, 2024
pdf (coming soon) code (coming soon)
BAT: Learning to Reason about Spatial Sounds with Large Language Models
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
preprint, 2024
pdf code
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
ICASSP, 2024
pdf code website
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
Interspeech, 2023
pdf code
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
Interspeech, 2023
pdf code
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
Interspeech, 2023
pdf
Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai*, Freda Shi*, Puyuan Peng*, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
ASRU, 2023
pdf code
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
Anuj Diwan*, Puyuan Peng*, Raymond J. Mooney
Workshop on Transfer Learning for Natural Language Processing, 2022
pdf
Word Discovery in Visually Grounded, Self-Supervised Speech Models
Puyuan Peng, David Harwath
Interspeech, 2022
pdf code
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade, Puyuan Peng, David Harwath
Interspeech, 2022
pdf code
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
Puyuan Peng, David Harwath
The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing at AAAI, 2022
pdf code
Fast-Slow Transformer for Visually Grounding Speech
Puyuan Peng, David Harwath
ICASSP, 2022
pdf code
A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
Puyuan Peng, Herman Kamper, Karen Livescu
The 1st Workshop on Self-Supervised Learning for Speech and Audio Processing at NeurIPS, 2020
pdf

Talks

May 2022 at Developmental Intelligence Laboratory, Department of Psychology, UT Austin, USA.
Jan 2022 at Karen Livescu Group, Toyota Technological Institute at Chicago, USA.
Jan 2022 at Cognitive Machine Learning Group, Departement d’Etudes Cognitives, Ecole Normale Supérieure, France.