ESPnet | End To End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit

system/pytorch ver. 1.13.1 2.0.1 2.1.2 2.2.2 2.3.1 2.4.0
ubuntu/python3.10/pip ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu
ubuntu/python3.9/pip ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu
ubuntu/python3.8/pip ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu ci on ubuntu
ubuntu/python3.7/pip ci on ubuntu
debian11/python3.10/conda ci on debian11
windows/python3.10/pip ci on windows
macos/python3.10/pip ci on macos
macos/python3.10/conda ci on macos


Docs | Example | Example (ESPnet2) | Docker | Notebook


ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Tutorial Series

Key Features

Kaldi-style complete recipe

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
  • Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT’18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT’14, IWSLT’16, the above ST recipes, etc.)
  • Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Supports a voice conversion recipe (VCC2020 baseline)
  • Supports speaker diarization recipes (mini_librispeech, librimix)
  • Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
    • Decoder: RNN (LSTM/GRU), Transformer, or S4
  • Attention: Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
      • Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
      • Pre-encoder: VGG2L or Conv2D available.
    • Search algorithms:
    • Features:
      • Unified interface for offline and streaming speech recognition.
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with an acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

Please refer to the tutorial page for complete documentation.
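
The hybrid CTC/attention items in the list above (multitask training and joint decoding) both come down to a weighted interpolation of two scores. The following is a minimal sketch of that idea in plain Python, not ESPnet's implementation: the function names, the tuple layout of a hypothesis, and the default weight of 0.3 are illustrative assumptions, and real joint decoding scores partial hypotheses inside beam search rather than re-ranking finished ones.

```python
def hybrid_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Multitask objective: weighted sum of the CTC and attention losses.

    ctc_weight is a hypothetical default chosen for illustration.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(ctc_logp, att_logp, lm_logp=0.0, ctc_weight=0.3, lm_weight=0.0):
    """Joint decoding score: interpolated log-probabilities plus an optional LM term."""
    return (
        ctc_weight * ctc_logp
        + (1.0 - ctc_weight) * att_logp
        + lm_weight * lm_logp
    )


def rerank(hyps, ctc_weight=0.3, lm_weight=0.0):
    """Sort hypotheses best-first.

    Each hypothesis is an (assumed) tuple: (text, ctc_logp, att_logp, lm_logp).
    """
    return sorted(
        hyps,
        key=lambda h: joint_score(h[1], h[2], h[3], ctc_weight, lm_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy numbers: "a" looks better to the attention decoder,
    # but CTC strongly prefers "b".
    hyps = [("a", -5.0, -1.0, 0.0), ("b", -2.0, -2.0, 0.0)]
    print([h[0] for h in rerank(hyps, ctc_weight=0.0)])  # attention only
    print([h[0] for h in rerank(hyps, ctc_weight=0.3)])  # joint score
```

With the toy numbers above, the attention-only ranking and the joint ranking disagree, which is the effect the CTC term is meant to have: it penalizes hypotheses whose alignment to the audio is implausible.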

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
    • Set frontend to s3prl
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer Learning :
  • Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
  • Restricted Self-Attention based on Longformer as an encoder for long sequences
  • OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning
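
For the S3PRL bullets above (set the frontend to s3prl, then pick an upstream model through frontend_conf), the configuration can be pictured as a small nested mapping. The sketch below builds it as a Python dictionary. Only the `frontend` and `frontend_conf` keys come from the text above; the inner `upstream` key, the nesting, and the example model name `wavlm_large` are assumptions that should be checked against the recipe configs and the S3PRL upstream list.

```python
import json


def s3prl_frontend_config(upstream):
    """Build a training-config fragment that selects an S3PRL upstream model.

    The nested layout (frontend_conf -> frontend_conf -> upstream) is an
    assumption for illustration, not a confirmed schema.
    """
    if not upstream:
        raise ValueError("upstream model name must be non-empty")
    return {
        "frontend": "s3prl",
        "frontend_conf": {
            "frontend_conf": {"upstream": upstream},
        },
    }


if __name__ == "__main__":
    # "wavlm_large" is only an example name; substitute any supported upstream.
    print(json.dumps(s3prl_frontend_config("wavlm_large"), indent=2))
```

The printed structure maps one-to-one onto the keys of a YAML training config, so the same fragment can be copied into a recipe's config file once the key names are confirmed.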

Demonstration

TTS: Text-to-speech

  • Architecture
    • Tacotron2
    • Transformer-TTS
    • FastSpeech
    • FastSpeech2
    • Conformer FastSpeech & FastSpeech2
    • VITS
    • JETS
  • Multi-speaker & multi-language extension
    • Pre-trained speaker embedding (e.g., X-vector)
    • Speaker ID embedding
    • Language ID embedding
    • Global style token (GST) embedding
    • Mix of the above embeddings
  • End-to-end training
    • End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
    • Joint training of text2mel and vocoder
  • Various language support
    • En / Jp / Zh / De / Ru / And more…
  • Integration with neural vocoders
    • Parallel WaveGAN
    • MelGAN
    • Multi-band MelGAN
    • HiFiGAN
    • StyleMelGAN
    • Mix of the above models

Demonstration

To train the neural vocoder, please check the following repositories:

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy to import pre-trained models from Asteroid
    • Both the pre-trained models from Asteroid and the specific configuration are supported.

Demonstration

  • Interactive SE demo with ESPnet2
  • Streaming SE demo with ESPnet2

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer-based end-to-end MT (new!)

VC: Voice conversion

  • Transformer and Tacotron2-based parallel VC using Mel spectrogram
  • End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

SLU: Spoken Language Understanding

  • Architecture
    • Transformer-based Encoder
    • Conformer-based Encoder
    • Branchformer based Encoder
    • E-Branchformer based Encoder
    • RNN based Decoder
    • Transformer-based Decoder
  • Support Multitasking with ASR
    • Predict both intent and ASR transcript
  • Support Multitasking with NLU
    • Deliberation encoder based 2 pass model
  • Support using pre-trained ASR models
    • Hubert
    • Wav2vec2
    • VQ-APC
    • TERA and more …
  • Support using pre-trained NLP models
    • BERT
    • MPNet
    • And more…
  • Various language support
    • En / Jp / Zh / Nl / And more…
  • Supports using context from previous utterances
  • Supports using other tasks like SE in a pipeline manner
  • Supports two-pass SLU that combines audio and ASR transcript

Demonstration

  • Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model.
  • Performing two-pass spoken language understanding where the second pass model attends to both acoustic and semantic information.
  • Integrated with Hugging Face Spaces using Gradio. See the SLU demo on multiple languages: Hugging Face Spaces

SUM: Speech Summarization

  • End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [Sharma et al., 2022]

SVS: Singing Voice Synthesis

  • Framework merged from Muskits
  • Architecture
    • RNN-based non-autoregressive model
    • Xiaoice
    • Tacotron-singing
    • DiffSinger (in progress)
    • VISinger
    • VISinger 2 (and its variations with different vocoder architectures)
  • Support multi-speaker & multilingual singing synthesis
    • Speaker ID embedding
    • Language ID embedding
  • Various language support
    • Jp / En / Kr / Zh
  • Tight integration with neural vocoders (the same as TTS)

SSL: Self-supervised Learning

UASR: Unsupervised ASR (EURO: ESPnet Unsupervised Recognition - Open-source)

  • Architecture
    • wav2vec-U (with different self-supervised models)
    • wav2vec-U 2.0 (in progress)
  • Support PrefixBeamSearch and K2-based WFST decoding

S2T: Speech-to-text with Whisper-style multilingual multitask models

  • Reproduces Whisper-style training from scratch using public data: OWSM
  • Supports multiple tasks in a single model
    • Multilingual speech recognition
    • Any-to-any speech translation
    • Language identification
    • Utterance-level timestamp prediction (segmentation)

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard-based monitoring

ESPnet2

See ESPnet2.

  • Independent from Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing when training
  • Supporting both DistributedDataParallel and DataParallel
  • Supporting multi-node training, integrated with Slurm or MPI
  • Supporting Sharded Training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Possible to train any size of corpus without CPU memory error
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments, including DNN training, then see Installation.
  • If you only need the Python module:

We recommend installing PyTorch before installing espnet, following https://pytorch.org/get-started/locally/

pip install espnet
# To install the latest
# pip install git+https://github.com/espnet/espnet
# To install additional packages
# pip install "espnet[all]"

If you use ESPnet1, please install chainer and cupy.

pip install chainer==6.0.0 cupy==6.0.0 # [Option]

You might need to install some packages depending on each task. We prepared various installation scripts at tools/installers.

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable tracking runs using W&B.
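
After a pip-based install like the one described above, one quick sanity check is to ask Python whether the expected modules can be located, without actually importing them. This is a small sketch using only the standard library; the helper name `check_modules` is hypothetical, and the module names `torch`, `espnet`, and `espnet2` are assumed to be the top-level packages of interest.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module is installed.

    find_spec locates a module without importing it, so this stays fast
    even for heavy packages.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names below are assumptions about what the install provides.
    for name, found in check_modules(["torch", "espnet", "espnet2"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry for torch suggests installing PyTorch first, as recommended above, before reinstalling espnet.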

Docker Container

Go to docker/ and follow the instructions there.

Contribution

Thank you for taking the time to contribute to ESPnet! Any contributions are welcome, and feel free to post questions or requests in the issues. If this is your first ESPnet contribution, please follow the contribution guide.

ASR results

ASR demo

SE results

SE demos

ST results

ST demo

MT results

TTS results

ESPnet2

ESPnet1

TTS demo

ESPnet2

ESPnet1

VC results

SLU results

CTC Segmentation demo

ESPnet1

ESPnet2

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
@article{hayashi2021espnet2,
  title={{ESP}net2-{TTS}: Extending the edge of {TTS} research},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Yoshimura, Takenori and Wu, Peter and Shi, Jiatong and Saeki, Takaaki and Ju, Yooncheol and Yasuda, Yusuke and Takamichi, Shinnosuke and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2110.07840},
  year={2021}
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
@inproceedings{arora2021espnet,
  title={{ESPnet-SLU}: Advancing Spoken Language Understanding through ESPnet},
  author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7167--7171},
  year={2022},
  organization={IEEE}
}
@inproceedings{shi2022muskits,
  author={Shi, Jiatong and Guo, Shuai and Qian, Tao and Huo, Nan and Hayashi, Tomoki and Wu, Yuning and Xu, Frank and Chang, Xuankai and Li, Huazhe and Wu, Peter and Watanabe, Shinji and Jin, Qin},
  title={{Muskits}: an End-to-End Music Processing Toolkit for Singing Voice Synthesis},
  year={2022},
  booktitle={Proceedings of Interspeech},
  pages={4277--4281},
  url={https://www.isca-speech.org/archive/pdfs/interspeech_2022/shi22d_interspeech.pdf}
}
@inproceedings{lu22c_interspeech,
  author={Yen-Ju Lu and Xuankai Chang and Chenda Li and Wangyou Zhang and Samuele Cornell and Zhaoheng Ni and Yoshiki Masuyama and Brian Yan and Robin Scheibler and Zhong-Qiu Wang and Yu Tsao and Yanmin Qian and Shinji Watanabe},
  title={{ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding}},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={5458--5462},
}
@inproceedings{gao2023euro,
  title={{EURO: ESP}net unsupervised {ASR} open-source toolkit},
  author={Gao, Dongji and Shi, Jiatong and Chuang, Shun-Po and Garcia, Leibny Paola and Lee, Hung-yi and Watanabe, Shinji and Khudanpur, Sanjeev},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
@inproceedings{peng2023reproducing,
  title={Reproducing {W}hisper-style training using an open-source toolkit and publicly available data},
  author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and others},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@inproceedings{sharma2023espnet,
  title={ESPnet-{SUMM}: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems},
  author={Sharma, Roshan and Chen, William and Kano, Takatomo and Sharma, Ruchira and Arora, Siddhant and Watanabe, Shinji and Ogawa, Atsunori and Delcroix, Marc and Singh, Rita and Raj, Bhiksha},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@article{jung2024espnet,
  title={{ESPnet-SPK}: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
  author={Jung, Jee-weon and Zhang, Wangyou and Shi, Jiatong and Aldeneh, Zakaria and Higuchi, Takuya and Theobald, Barry-John and Abdelaziz, Ahmed Hussen and Watanabe, Shinji},
  journal={Proc. Interspeech 2024},
  year={2024}
}
@inproceedings{yan-etal-2023-espnet,
    title = "{ESP}net-{ST}-v2: Multipurpose Spoken Language Translation Toolkit",
    author = "Yan, Brian  and
      Shi, Jiatong  and
      Tang, Yun  and
      Inaguma, Hirofumi  and
      Peng, Yifan  and
      Dalmia, Siddharth  and
      Pol{\'a}k, Peter  and
      Fernandes, Patrick  and
      Berrebbi, Dan  and
      Hayashi, Tomoki  and
      Zhang, Xiaohui  and
      Ni, Zhaoheng  and
      Hira, Moto  and
      Maiti, Soumi  and
      Pino, Juan  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    pages = "400--411",
}
