Hi! I’m Puyuan Peng, a third-year Computer Science PhD student at UT Austin. I mainly work on speech/audio recognition, understanding, and generation, usually in multimodal contexts (e.g. text, vision).

I’m very fortunate to have David Harwath as my advisor. I have also had the pleasure of working with and learning from many amazing senior researchers, including (in chronological order): Karen Livescu (TTIC/UChicago), Raymond Mooney (UT), James Glass (MIT), Yoon Kim (MIT), Abdelrahman Mohamed (Rembrand), Jonathan Le Roux (MERL), Shinji Watanabe (CMU), Hung-yi Lee (NTU), Kristen Grauman (UT/Meta), and Wei-Ning Hsu (Meta).

I have a Master’s degree in Statistics from The University of Chicago, and a Bachelor’s degree in Mathematics from Beijing Normal University.

In the summer of 2024, I’ll be interning at Fundamental AI Research (FAIR) at Meta, working with Wei-Ning Hsu.

In my free time, I like to work out and sing (here is a funny video).

contact: pyp@utexas.edu

Papers

(The asterisk ‘*’ denotes equal contribution)

  1. VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
    Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath
    preprint, 2024
    pdf code website
  2. Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
    Changan Chen*, Puyuan Peng*, Ami Baid, Sherry Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman
    preprint, 2024
    pdf (coming soon) code (coming soon)
  3. BAT: Learning to Reason about Spatial Sounds with Large Language Models
    Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
    preprint, 2024
    pdf code
  4. AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    ICASSP, 2024
    pdf code website
  5. Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
    Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
    Interspeech, 2023
    pdf code
  6. Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
    Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath
    Interspeech, 2023
    pdf code
  7. Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
    Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
    Interspeech, 2023
    pdf
  8. Audio-Visual Neural Syntax Acquisition
    Cheng-I Jeff Lai*, Freda Shi*, Puyuan Peng*, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
    ASRU, 2023
    pdf code
  9. Zero-shot Video Moment Retrieval With Off-the-Shelf Models
    Anuj Diwan*, Puyuan Peng*, Raymond J. Mooney
    Workshop on Transfer Learning for Natural Language Processing, 2022
    pdf
  10. Word Discovery in Visually Grounded, Self-Supervised Speech Models
    Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  11. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
    Alan Baade, Puyuan Peng, David Harwath
    Interspeech, 2022
    pdf code
  12. Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling
    Puyuan Peng, David Harwath
    The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing at AAAI, 2022
    pdf code
  13. Fast-Slow Transformer for Visually Grounding Speech
    Puyuan Peng, David Harwath
    ICASSP, 2022
    pdf code
  14. A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings
    Puyuan Peng, Herman Kamper, Karen Livescu
    The 1st Workshop on Self-Supervised Learning for Speech and Audio Processing at NeurIPS, 2020
    pdf

Talks

May 2022 at Developmental Intelligence Laboratory, Department of Psychology, UT Austin, USA.
Jan 2022 at Karen Livescu Group, Toyota Technological Institute at Chicago, USA.
Jan 2022 at Cognitive Machine Learning Group, Departement d’Etudes Cognitives, Ecole Normale Supérieure, France.