HiFi-GAN vocoder for discrete tokens #2571
… sb_discrete_ssl_new_vocoder
…nks/speechbrain into sb_discrete_ssl_new_vocoder
|
Thanks @Chaanks for this PR. It looks really good. I just added a couple of comments. Apart from them:
|
| logger.info(f"Loading K-means model from {kmeans_ckpt} ...") | ||
| kmeans_model = joblib.load(open(kmeans_ckpt, "rb")) | ||
| kmeans_model.verbose = False | ||
| discrete_encoder = DiscreteSSL( |
Better to define this in the YAML file and pass it as an argument.

See my other comment. I added a parameter for the save path.
| @@ -500,11 +505,14 @@ def audio_pipeline(utt_id, wav, segment): | |||
| "data_folder": hparams["save_folder"], | |||
Isn't it possible to do it in the forward function?
| Task,Dataset,Script_file,Hparam_file,Data_prep_file,Readme_file,Result_url,HF_repo,test_debug_flags,test_debug_checks | ||
| TTS,LibriTTS,recipes/LibriTTS/vocoder/hifigan/train.py,recipes/LibriTTS/vocoder/hifigan/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/gjs1kslxkxz819q/AABPriN4dOoD1qL7NoIyVk0Oa?dl=0 ,https://huggingface.co/speechbrain/tts-hifigan-libritts-16kHz,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000, | ||
| TTS,LibriTTS,recipes/LibriTTS/TTS/mstacotron2/train.py,recipes/LibriTTS/TTS/mstacotron2/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/ti2vk7sce8f9fgd/AABcDGWCrBvLX_ZQs76mlJRYa?dl=0,,--batch_size=1 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000, | ||
| TTS,LJSpeech,recipes/LibriTTS/vocoder/hifigan_unit/train.py,recipes/LibriTTS/vocoder/hifigan_unit/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]" |
The tests fail with the following errors:
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/env.log
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_spk.py
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/hyperparams.yaml
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/target.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/target.wa
| TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0,https://huggingface.co/speechbrain/tts-hifigan-ljspeech,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]" | ||
| TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/diffwave/train.py,recipes/LJSpeech/TTS/vocoder/diffwave/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--number_of_epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000 --num_workers 0,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml]" | ||
| TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --kmeans_folder=null,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]" | ||
| TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]" |
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/train_log.txt
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/synthesized.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/target.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/synthesized.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/target.wav
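The errors above relate to the `file_exists=[...]` entry at the end of the recipe test rows. As a rough illustration of that kind of check, here is a minimal sketch. It is a hypothetical re-implementation, not SpeechBrain's actual test code; `parse_file_exists` and `missing_files` are made-up names, and the error wording is simplified.

```python
import tempfile
from pathlib import Path


def parse_file_exists(check: str) -> list[str]:
    """Parse a 'file_exists=[a,b,c]' string into a list of relative paths."""
    prefix = "file_exists=["
    if not (check.startswith(prefix) and check.endswith("]")):
        raise ValueError(f"Unsupported check: {check!r}")
    body = check[len(prefix):-1]
    return [item.strip() for item in body.split(",") if item.strip()]


def missing_files(output_folder: Path, check: str) -> list[Path]:
    """Return the expected files that are absent from output_folder."""
    return [
        output_folder / rel
        for rel in parse_file_exists(check)
        if not (output_folder / rel).exists()
    ]


if __name__ == "__main__":
    check = "file_exists=[train_log.txt,log.txt,samples/1/target.wav]"
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "LJSpeech_row_07"
        out.mkdir()
        (out / "log.txt").touch()  # only one of the three files is produced
        for path in missing_files(out, check):
            print(f"ERROR: expected file {path} not found")
```

Under this reading, a run that crashes before training starts would leave most of the listed files missing, which is consistent with the long error lists reported here.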
| VALID_JSON = "valid.json" | ||
| TEST_JSON = "test.json" | ||
|
|
||
| ENCODER_CLASSES = { |
You could use this in the YAML:
ssl_model: !apply:speechbrain.utils.hparams.choice
    value: !ref <ssl_model_type>
    choices:
        wavlm: !new:speechbrain.lobes.models.huggingface_transformers.wavlm.WavLM
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        hubert: !new:speechbrain.lobes.models.huggingface_transformers.hubert.HuBERT
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
I prefer the way I'm doing it currently. The ssl_model is only used for token extraction and not during training, so I don't see why we need to instantiate the model in the YAML. Doing so would also require manual cleanup to free memory for HiFi-GAN training.

I don't see what the problem is with defining it in the YAML file instead of the Python file. Why would there be a difference in manual cleanup?
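The lifetime question in this thread can be illustrated with a small standard-library sketch. Everything in it is a hypothetical stand-in: `DummyEncoder`, `ENCODER_FACTORIES`, and `extract_tokens` are made-up names, and no SpeechBrain or PyTorch API is used. It only shows that, in CPython, an object created inside the extraction function is released when the function returns, while an object stored in a long-lived dictionary (such as loaded hyperparameters) stays alive until it is removed explicitly.

```python
import gc
import weakref


class DummyEncoder:
    """Stand-in for a large SSL model (no real weights are loaded)."""

    def __init__(self, name):
        self.name = name

    def encode(self, wav):
        # Toy "quantization": one binary token per sample.
        return [0 if x < 0 else 1 for x in wav]


# Name -> factory mapping, in the spirit of a Python-side class registry.
ENCODER_FACTORIES = {
    "wavlm": lambda: DummyEncoder("wavlm"),
    "hubert": lambda: DummyEncoder("hubert"),
    "wav2vec2": lambda: DummyEncoder("wav2vec2"),
}


def extract_tokens(model_type, wav, tracker):
    """Build the selected encoder, use it, and let it go out of scope."""
    encoder = ENCODER_FACTORIES[model_type]()
    tracker.append(weakref.ref(encoder))  # lets us observe its lifetime
    return encoder.encode(wav)


tracker = []
tokens = extract_tokens("hubert", [0.1, -0.25, 0.9], tracker)
gc.collect()
local_encoder_freed = tracker[0]() is None

# An encoder held in a long-lived dict stays alive until it is removed.
hparams = {"ssl_model": DummyEncoder("hubert")}
held = weakref.ref(hparams["ssl_model"])
gc.collect()
alive_while_referenced = held() is not None
del hparams["ssl_model"]
gc.collect()
freed_after_del = held() is None

print(tokens, local_encoder_freed, alive_while_referenced, freed_after_del)
```

With real models on a GPU, releasing cached memory may need an additional framework-specific step, which this sketch does not cover.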
|
Thank you @Chaanks! Here are some comments:
I would recommend double-checking that all the vocoder training runs when following the instructions in the README file step by step.
|
1. Yes, it comes from this work: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations, and was then used for S2ST. What do you think of renaming it
6. Yes, we have it uploaded for the Benchmark paper. Should we directly add a reference to this?
7. Deleted. I also added backward compatibility to the Vocoder for previous models without scalability (https://huggingface.co/speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech).
1. I personally prefer the name hifi_gan_discrete.
|
Thank you @Chaanks! The README.md says, "The kmeans_folder, kmeans_dataset and num_clusters should be specified based on SSL_Quantization." This part should be elaborated further. To simplify running the recipe, we created an HF repository with all the needed models. However, it is also important to describe what users should do to use their own quantization model, or to replicate what we did from scratch. I think they should go to speechbrain/recipes/LJSpeech/quantization/, run the quantizers, and store the resulting models somewhere. These steps should be described and tested, as I feel the documentation needs to better connect the vocoder and quantization parts.
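As a rough sketch of how the vocoder recipe might be pointed at a user-trained quantizer, the snippet below only assembles a command line and does not run anything. The script and hparams paths are taken from the test rows in this PR, and `--key=value` overrides of YAML entries are standard SpeechBrain behavior. The folder, dataset name, and cluster count are placeholder values, and the assumption that these three keys are sufficient (and that `num_clusters` is a single integer) rests only on the README sentence quoted above.

```python
import shlex


def build_vocoder_command(
    kmeans_folder: str, kmeans_dataset: str, num_clusters: int
) -> list[str]:
    """Assemble a training command with hyperparameter overrides."""
    recipe = "recipes/LJSpeech/TTS/vocoder/hifi_gan_unit"
    return [
        "python",
        f"{recipe}/train.py",
        f"{recipe}/hparams/train.yaml",
        f"--kmeans_folder={kmeans_folder}",    # where the trained k-means lives
        f"--kmeans_dataset={kmeans_dataset}",  # dataset the k-means was fit on
        f"--num_clusters={num_clusters}",      # must match the trained model
    ]


# Placeholder values; replace with the outputs of your own quantization run.
cmd = build_vocoder_command("path/to/my_kmeans", "LJSpeech", 1000)
print(shlex.join(cmd))
```

Documenting a concrete command of this kind in the README, once verified against the actual recipe, would address the gap between the quantization and vocoder steps raised in this comment.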
|
Thank you @Chaanks!
What does this PR do?
This PR adds support for HiFi-GAN to work with the new SSL discrete tokens, as well as support for bitrate-scalable training.
It includes recipes for LJSpeech and LibriTTS.
I made a few changes to the LibriTTS data preparation script. Please check them.