Jiaqi Li's Homepage

jiaqili3[at]link.cuhk.edu.cn

I am a third-year Ph.D. student in SDS at the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), supervised by Professor Zhizheng Wu. Before that, I received my B.S. degree from CUHK-Shenzhen.

My research interests include speech language models, text-to-speech synthesis, and neural audio coding. I am one of the main contributors and leaders of the open-source Amphion toolkit. My recent work includes DualCodec and FlexiCodec, two neural audio codec systems for low-frame-rate speech generation.

news

Apr 26, 2026	I presented our FlexiCodec paper at ICLR 2026 in Brazil.
May 17, 2025	Our DualCodec paper was accepted to Interspeech 2025!
Feb 01, 2025	We released the Amphion v0.2 technical report, summarizing our development of Amphion in 2024.
Dec 03, 2024	I presented our new paper, Investigating neural audio codecs for speech language model-based speech generation, at SLT 2024.
Aug 25, 2024	🎉 Our papers, Amphion and Emilia, were accepted to IEEE SLT 2024!
Jul 28, 2024	🔥 We released Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation, with 101k hours of speech in six languages, covering diverse speaking styles.
Apr 19, 2024	I presented our paper, An initial investigation of neural replay simulator for over-the-air adversarial perturbations to automatic speaker verification, at ICASSP 2024 in Korea!
Nov 26, 2023	🔥 We released Amphion v0.1, an open-source toolkit for audio, music, and speech generation.

selected publications

ICLR 2026

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, and Zhizheng Wu

In International Conference on Learning Representations (ICLR), 2026

arXiv Bib Code Website

@inproceedings{li2026flexicodec,
  title = {FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates},
  author = {Li, Jiaqi and Qian, Yao and Hu, Yuxuan and Zhang, Leying and Wang, Xiaofei and Lu, Heng and Thakker, Manthan and Li, Jinyu and Zhao, Sheng and Wu, Zhizheng},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2026},
  tldr = {We develop a dynamic neural audio codec with controllable low frame rates for efficient speech generation.},
}

TL;DR: We develop a dynamic neural audio codec with controllable low frame rates for efficient speech generation.

Interspeech 2025

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, and Zhizheng Wu

In Proceedings of Interspeech 2025, 2025

arXiv Bib Code Website

@inproceedings{li2025dualcodec,
  title = {DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation},
  author = {Li, Jiaqi and Lin, Xiaolong and Li, Zhekai and Huang, Shixi and Wang, Yuancheng and Wang, Chaoren and Zhan, Zhenpeng and Wu, Zhizheng},
  booktitle = {Proceedings of Interspeech 2025},
  year = {2025},
  tldr = {We propose a low-frame-rate neural audio codec that strengthens semantic information for speech generation.},
}

TL;DR: We propose a low-frame-rate neural audio codec that strengthens semantic information for speech generation.

Tech Report

Overview of the Amphion Toolkit (v0.2)

Jiaqi Li, Xueyao Zhang, Yuancheng Wang, Haorui He, Chaoren Wang, Li Wang, Huan Liao, Junyi Ao, Zeyu Xie, Yiqiao Huang, and others

arXiv preprint arXiv:2501.15442, 2025

arXiv Bib Code Huggingface

@article{li2025overview,
  title = {Overview of the Amphion Toolkit (v0.2)},
  author = {Li, Jiaqi and Zhang, Xueyao and Wang, Yuancheng and He, Haorui and Wang, Chaoren and Wang, Li and Liao, Huan and Ao, Junyi and Xie, Zeyu and Huang, Yiqiao and others},
  journal = {arXiv preprint arXiv:2501.15442},
  year = {2025},
  huggingface = {https://huggingface.co/amphion},
  tldr = {This is the technical report for the second version of the Amphion toolkit.},
}

TL;DR: This is the technical report for the second version of the Amphion toolkit.

SLT 2024

Investigating neural audio codecs for speech language model-based speech generation

Jiaqi Li, Dongmei Wang, Xiaofei Wang, Yao Qian, Long Zhou, Shujie Liu, Midia Yousefi, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, and others

In 2024 IEEE Spoken Language Technology Workshop (SLT), 2024

arXiv Bib

@inproceedings{li2024investigating,
  title = {Investigating neural audio codecs for speech language model-based speech generation},
  author = {Li, Jiaqi and Wang, Dongmei and Wang, Xiaofei and Qian, Yao and Zhou, Long and Liu, Shujie and Yousefi, Midia and Li, Canrun and Tsai, Chung-Hsien and Xiao, Zhen and others},
  booktitle = {2024 IEEE Spoken Language Technology Workshop (SLT)},
  pages = {554--561},
  year = {2024},
  organization = {IEEE},
}

SLT 2024

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu

In 2024 IEEE Spoken Language Technology Workshop (SLT), 2024

arXiv Bib Huggingface

@inproceedings{he2024emilia,
  title = {Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
  author = {He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  booktitle = {2024 IEEE Spoken Language Technology Workshop (SLT)},
  year = {2024},
  huggingface = {https://huggingface.co/datasets/amphion/Emilia},
  tldr = {We collect a 100k-hour in-the-wild speech dataset for speech generation.},
}

TL;DR: We collect a 100k-hour in-the-wild speech dataset for speech generation.

SLT 2024

Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit

Xueyao Zhang^*, Liumeng Xue^*, Yicheng Gu^*, Yuancheng Wang^*, Jiaqi Li, Haorui He, Chaoren Wang, Songting Liu, Xi Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, and Zhizheng Wu

In 2024 IEEE Spoken Language Technology Workshop (SLT), 2024

arXiv Bib Code Huggingface

@inproceedings{zhang2024amphion,
  title = {Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit},
  author = {Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Liu, Songting and Chen, Xi and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
  booktitle = {2024 IEEE Spoken Language Technology Workshop (SLT)},
  year = {2024},
  huggingface = {https://huggingface.co/amphion},
  tldr = {We develop a unified toolkit for audio, music, and speech generation.},
}

TL;DR: We develop a unified toolkit for audio, music, and speech generation.

ICASSP 2024

An initial investigation of neural replay simulator for over-the-air adversarial perturbations to automatic speaker verification

Jiaqi Li, Li Wang, Liumeng Xue, Lei Wang, and Zhizheng Wu

In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

arXiv Bib

@inproceedings{li2024initial,
  title = {An initial investigation of neural replay simulator for over-the-air adversarial perturbations to automatic speaker verification},
  author = {Li, Jiaqi and Wang, Li and Xue, Liumeng and Wang, Lei and Wu, Zhizheng},
  booktitle = {ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages = {4635--4639},
  year = {2024},
  organization = {IEEE},
}

ICASSP 2024

AdvSV: An over-the-air adversarial attack dataset for speaker verification

Li Wang, Jiaqi Li, Yuhao Luo, Jiahao Zheng, Lei Wang, Hao Li, Ke Xu, Chengfang Fang, Jie Shi, and Zhizheng Wu

In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

arXiv Bib

@inproceedings{wang2024advsv,
  title = {{AdvSV}: An over-the-air adversarial attack dataset for speaker verification},
  author = {Wang, Li and Li, Jiaqi and Luo, Yuhao and Zheng, Jiahao and Wang, Lei and Li, Hao and Xu, Ke and Fang, Chengfang and Shi, Jie and Wu, Zhizheng},
  booktitle = {ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages = {4555--4559},
  year = {2024},
  organization = {IEEE},
}

internships

Microsoft Research	Research Intern · Redmond, Washington (Remote) · 2025.04 - 2025.08 Dynamic Frame Rate Neural Audio Coding
Baidu Inc., Shenzhen	Research Intern · Shenzhen, China · 2024.06 - 2025.01 Efficient, High-Quality Neural Audio Codec for Zero-Shot TTS Neural Audio Codec Speech Synthesis
Microsoft Research	Research Intern · Redmond, Washington (Remote) · 2023.10 - 2024.04 Zero-Shot TTS with Neural Audio Codecs Neural Audio Codec Speech Synthesis