I’m a 4th-year (final-year) PhD student at UT Austin, advised by David Harwath, and a part-time student researcher at FAIR, Meta, working with Wei-Ning Hsu. I will graduate in May 2025 and am on the job market; please reach out if interested! I have published first-authored papers on speech and audio recognition and generation at Interspeech, ICASSP, ACL, and ECCV. In particular, the tasks I have published on include:

  1. Generation: Text-to-Speech, Speech Editing, Video-to-Audio
  2. Recognition: Automatic Speech Recognition, Speech Translation, Audio-Visual Speech Recognition, Speech-Image Retrieval, Speech Segmentation, Speech Quantization, Speech Representation Learning

Research Highlights:

  • VoiceCraft (ACL 2024), a zero-shot TTS and speech editing model, garnered 7.5k stars on GitHub within five months of its release and reached #1 on GitHub’s global trending list.
  • Audio-Visual Latent Diffusion Model (ECCV 2024) generates realistic action sounds for silent egocentric videos and demonstrates zero-shot transfer to VR games.
  • PromptingWhisper (Interspeech 2023) pioneered the application of prompt-based techniques to large speech models, enabling zero-shot tasks such as audio-visual speech recognition and speech translation without fine-tuning.
  • Visually Grounded Speech Research (Interspeech 2023, 2022, ICASSP 2022, ASRU 2023, AAAI Workshop 2022) not only achieves state-of-the-art performance in speech-image retrieval, zero-resource speech recognition, and data-efficient representation learning, but also draws parallels to human language development, analyzed at the Annual Meeting of the Cognitive Science Society (CogSci).

In addition to my advisor, I have had the pleasure of working with and learning from many amazing senior researchers, including (in chronological order): Karen Livescu (TTIC/UChicago), Raymond Mooney (UT), James Glass (MIT), Yoon Kim (MIT), Abdelrahman Mohamed (Rembrand), Jonathan Le Roux (MERL), Shinji Watanabe (CMU), Hung-yi Lee (NTU), Kristen Grauman (UT/Meta), and Wei-Ning Hsu (Meta), among others.

I have a Master’s degree in Statistics from The University of Chicago, and a Bachelor’s degree in Mathematics from Beijing Normal University.

Contact: pyp@utexas.edu

Papers

(The asterisk ‘*’ denotes equal contribution)

  1. VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
    ACL, 2024 (Oral)
    pdf website interactive demo code
  2. Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
    Changan Chen*, Puyuan Peng*, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
    ECCV, 2024 (Oral)
    pdf website code data
  3. BAT: Learning to Reason about Spatial Sounds with Large Language Models
    Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
    ICML, 2024
    pdf website Spatial-AST code BAT code
  4. Neural Codec Language Models for Disentangled and Textless Voice Conversion
    Alan Baade, Puyuan Peng, David Harwath
    Interspeech, 2024
    pdf code
  5. AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    ICASSP, 2024
    pdf code website
  6. Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
    Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
    Interspeech, 2023
    pdf code
  7. Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
    Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
    Interspeech, 2023
    pdf code
  8. Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
    Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
    Interspeech, 2023
    pdf
  9. Audio-Visual Neural Syntax Acquisition
    Cheng-I Jeff Lai*, Freda Shi*, Puyuan Peng*, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
    ASRU, 2023
    pdf code
  10. Zero-shot Video Moment Retrieval With Off-the-Shelf Models
    Anuj Diwan*, Puyuan Peng*, Raymond J. Mooney
    Workshop on Transfer Learning for Natural Language Processing, 2022
    pdf
  11. Word Discovery in Visually Grounded, Self-Supervised Speech Models
    Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  12. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
    Alan Baade, Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  13. Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
    Puyuan Peng, David Harwath
    The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing at AAAI, 2022
    pdf code
  14. Fast-Slow Transformer for Visually Grounding Speech
    Puyuan Peng, David Harwath
    ICASSP, 2022
    pdf code
  15. A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
    Puyuan Peng, Herman Kamper, Karen Livescu
    The 1st Workshop on Self-Supervised Learning for Speech and Audio Processing at NeurIPS, 2020
    pdf

Talks

May 2024 at Meta AI, New York, USA
May 2022 at Developmental Intelligence Laboratory, Department of Psychology, UT Austin, USA
Jan 2022 at Karen Livescu Group, Toyota Technological Institute at Chicago, USA
Jan 2022 at Cognitive Machine Learning Group, Département d’Études Cognitives, École Normale Supérieure, France