HiFi-GAN vocoder for discrete tokens by Chaanks · Pull Request #2571 · speechbrain/speechbrain

Chaanks · 2024-06-11T15:41:53Z

What does this PR do?

This PR adds support for training HiFi-GAN on the new SSL discrete tokens, including bitrate-scalable training.
It includes recipes for LJSpeech and LibriTTS.

I made a few changes to the LibriTTS data preparation script. Please check them.

Before submitting

[x] Did you read the contributor guideline?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified
Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
Review the self-review checklist to ensure the code is ready for review

poonehmousavi · 2024-07-04T22:11:29Z

Thanks @Chaanks for this PR. It looks really good. I just added a couple of comments. Apart from them:

Make sure that all the tests pass. You could use the following command to run them:
python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Task"], filters=[["TTS"]], do_checks=True, run_opts="--device=cuda")) else print("TEST PASSED")'
Please transfer all the vocoder checkpoints from your HF to SpeechBrain and fix the references in the README and docstring.
The kmeans repository has been transferred to speechbrain/SSL_Quantization; please update it in your recipe.
Also, once this PR is merged, you might need to change the path to discrete_ssl.py, since it has been moved to a different folder.
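
The one-line test command above can be easier to read and adjust as a small script. A minimal sketch, assuming it is run from the repository root so that `tests.utils.recipe_tests` is importable; the helpers `build_test_kwargs` and `main` are hypothetical, and only the keyword arguments already shown in the command are used:

```python
def build_test_kwargs(task="TTS", device="cuda"):
    """Collect the arguments used by the one-liner (hypothetical helper)."""
    return {
        "filters_fields": ["Task"],       # filter on the "Task" CSV column
        "filters": [[task]],              # keep only rows whose Task matches
        "do_checks": True,                # run the test_debug_checks column too
        "run_opts": f"--device={device}",
    }


def main():
    # Imported lazily: this module only exists inside the SpeechBrain repo.
    from tests.utils.recipe_tests import run_recipe_tests

    passed = run_recipe_tests(**build_test_kwargs())
    print("TEST PASSED" if passed else "TEST FAILED!")
    return passed
```

Calling `main()` from the repository root should behave like the one-liner; passing `device="cpu"` to `build_test_kwargs` is one way to try the same filter on a machine without a GPU.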

poonehmousavi · 2024-07-04T21:49:56Z

-    logger.info(f"Loading K-means model from {kmeans_ckpt} ...")
-    kmeans_model = joblib.load(open(kmeans_ckpt, "rb"))
-    kmeans_model.verbose = False
+    discrete_encoder = DiscreteSSL(


Better to define this in the YAML file and pass it as an argument.

See my other comment. I added a parameter for the save path.

poonehmousavi · 2024-07-04T21:58:37Z

@@ -500,11 +505,14 @@ def audio_pipeline(utt_id, wav, segment):
            "data_folder": hparams["save_folder"],


Isn't it possible to do it in the forward function?

poonehmousavi · 2024-07-04T22:04:41Z

 Task,Dataset,Script_file,Hparam_file,Data_prep_file,Readme_file,Result_url,HF_repo,test_debug_flags,test_debug_checks
 TTS,LibriTTS,recipes/LibriTTS/vocoder/hifigan/train.py,recipes/LibriTTS/vocoder/hifigan/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/gjs1kslxkxz819q/AABPriN4dOoD1qL7NoIyVk0Oa?dl=0 ,https://huggingface.co/speechbrain/tts-hifigan-libritts-16kHz,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000,
 TTS,LibriTTS,recipes/LibriTTS/TTS/mstacotron2/train.py,recipes/LibriTTS/TTS/mstacotron2/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/ti2vk7sce8f9fgd/AABcDGWCrBvLX_ZQs76mlJRYa?dl=0,,--batch_size=1 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000,
+TTS,LJSpeech,recipes/LibriTTS/vocoder/hifigan_unit/train.py,recipes/LibriTTS/vocoder/hifigan_unit/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"


The tests fail with the following errors:
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/env.log
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_spk.py
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/hyperparams.yaml
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/target.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/target.wa

poonehmousavi · 2024-07-04T22:06:29Z

 TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0,https://huggingface.co/speechbrain/tts-hifigan-ljspeech,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"
 TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/diffwave/train.py,recipes/LJSpeech/TTS/vocoder/diffwave/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--number_of_epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000 --num_workers 0,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml]"
-TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --kmeans_folder=null,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"
+TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"


ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/train_log.txt
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/synthesized.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/target.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/synthesized.wav
ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/target.wav
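
These errors come from the `file_exists=[...]` entry in the last CSV column: each listed file is expected under the recipe's output folder (here `tests/tmp/<recipe id>`). A minimal sketch of that kind of check, which can help when inspecting an output folder by hand; `parse_file_exists` and `missing_files` are hypothetical helpers, not the test suite's actual implementation:

```python
import os
import tempfile


def parse_file_exists(check):
    """Turn 'file_exists=[a,b,c]' into ['a', 'b', 'c']."""
    prefix = "file_exists=["
    if not (check.startswith(prefix) and check.endswith("]")):
        raise ValueError(f"Unsupported check: {check!r}")
    inner = check[len(prefix):-1]
    return [name.strip() for name in inner.split(",") if name.strip()]


def missing_files(output_folder, check):
    """Return the expected files that are absent from output_folder."""
    return [
        name
        for name in parse_file_exists(check)
        if not os.path.exists(os.path.join(output_folder, name))
    ]


# Demo on a temporary folder that contains only the training log.
check = "file_exists=[train_log.txt,samples/1/synthesized.wav]"
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "train_log.txt"), "w").close()
    missing = missing_files(folder, check)

print(missing)  # ['samples/1/synthesized.wav']
```

When every expected file is reported missing, including `train_log.txt`, it is usually worth checking first whether the recipe stopped before training started.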

poonehmousavi · 2024-07-04T23:23:14Z

 VALID_JSON = "valid.json"
 TEST_JSON = "test.json"

+ENCODER_CLASSES = {


You could use this in the YAML:

ssl_model: !apply:speechbrain.utils.hparams.choice
    value: !ref <ssl_model_type>
    choices:
        wavlm: !new:speechbrain.lobes.models.huggingface_transformers.wavlm.WavLM
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        hubert: !new:speechbrain.lobes.models.huggingface_transformers.hubert.HuBERT
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True

I prefer the way I'm doing it currently. The ssl_model is only used for token extraction and not during training, so I don't see why we need to instantiate the model in the YAML. Doing so would also require manual cleanup to free memory for HiFi-GAN training.

I don't see what the problem is with defining it in the YAML file instead of the Python file. Why would there be a difference for manual cleanup?
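
The Python-side option in this exchange can be illustrated with a small sketch: a name-to-class mapping (like the `ENCODER_CLASSES` dict in the diff), with the encoder created only for token extraction and released afterwards. `DummyHuBERT`, `DummyWavLM`, and `extract_tokens` are hypothetical stand-ins, not the recipe's actual code:

```python
import gc
import weakref


class DummyHuBERT:
    """Stand-in for a large SSL encoder (hypothetical)."""

    def __init__(self, source):
        self.source = source

    def tokens(self, wav):
        # Placeholder "tokenization": one fake token per 4 samples.
        return [0] * (len(wav) // 4)


class DummyWavLM(DummyHuBERT):
    """Second stand-in, only to show the name-based selection."""


ENCODER_CLASSES = {"hubert": DummyHuBERT, "wavlm": DummyWavLM}


def extract_tokens(encoder_type, source, wavs):
    """Build the encoder on demand, extract tokens, then release it."""
    encoder = ENCODER_CLASSES[encoder_type](source)
    ref = weakref.ref(encoder)  # lets us observe whether it was freed
    tokens = []
    for wav in wavs:
        tokens.append(encoder.tokens(wav))
    del encoder  # drop the only strong reference before vocoder training
    gc.collect()
    return tokens, ref() is None


tokens, released = extract_tokens(
    "hubert", "facebook/hubert-large-ll60k", [[0.0] * 8]
)
print(tokens, released)
```

With a model instantiated from the YAML, the loaded hparams dictionary also holds a reference to it, so that reference would have to be dropped as well before the memory can be reclaimed, which is presumably the manual cleanup being discussed.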

mravanelli · 2024-07-16T01:31:26Z

Thank you @Chaanks! Here are some comments:

hifi_gan_unit => Is this name from the literature?
encoder_hub: facebook/hubert-large-ll60k => Add comments in the YAML and in the README file with the options that users need to set for wav2vec2 and WavLM.
In the README.md: "This vocoder is a neural network designed to transform discrete self-supervised representations into waveform data and is suitable for speech-to-speech translation on top of CVSS/S2ST models" => This vocoder is needed not only for speech translation but for generative tasks in general, such as TTS, speech enhancement, separation, voice cloning, etc. We should clarify that. Maybe we can also mention that we use it in our DASB benchmark.
In the YAML, we need to add a comment on "kmeans_folder:" to describe what it is. Is this supposed to be a !PLACEHOLDER variable?
In https://huggingface.co/speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS, we can add the reference to our Interspeech paper (or any other paper) that describes the scalable vocoder well. We need to reference that in the README.md as well.
In README.md, we normally reference the Dropbox folder with the checkpoints and logs. Do we have it?
It looks like recipes/LJSpeech/TTS/quantization is outdated, right? If so, it can be deleted and the README file should be updated with the right instructions.

I would recommend double-checking that the whole vocoder training runs when following the instructions in the README file step by step.

Chaanks · 2024-07-17T13:50:12Z

1. Yes, it comes from this work: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations, and was then used for S2ST. What do you think of renaming it hifi_gan_discrete? It would correspond better to the discrete module of SB.

6. Yes, we have it uploaded for the Benchmark paper. Should we directly add a reference to this?

7. Deleted. I also added backward compatibility to the Vocoder for previous models without scalability (https://huggingface.co/speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech).

poonehmousavi · 2024-07-17T15:37:53Z

1. I personally prefer the name hifi_gan_discrete.
2. We could add the Interspeech paper as a reference, since the main idea of the scalable vocoder is introduced there.
3. The README file says to install extra dependencies, but I could not find them. Also, I think it would be better to add some explanation of train_spk.yaml in the README file.

mravanelli · 2024-07-18T01:09:48Z

Thank you @Chaanks!
It looks like everything is running.
I only have this final comment on the README files (speechbrain/recipes/LJSpeech/TTS/README.md and speechbrain/recipes/LJSpeech/TTS/README.md).

In the README.md it is written, "The kmeans_folder, kmeans_dataset and num_clusters should be specified based on SSL_Quantization."

This part should be elaborated more. Here, to simplify running the recipe, we created an HF repository with all the needed models.

However, it could also be important to describe what users should do to use their own quantization model, or to replicate from scratch what we did. I think they should go to speechbrain/recipes/LJSpeech/quantization/, run the quantizers, and store the result somewhere. These steps should be described and tested, as I feel the documentation needs to better connect the vocoder and quantization parts.
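
To make the link between the quantization and vocoder parts concrete: a k-means quantizer assigns each SSL feature frame to its nearest centroid, and the resulting cluster indices are the discrete tokens the vocoder is trained on. A toy sketch in plain Python with made-up 2-D centroids and frames; the real recipes use trained k-means models on high-dimensional SSL features, and `quantize` is a hypothetical helper:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    return [
        min(
            range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]),
        )
        for frame in frames
    ]


# Made-up values for illustration only (num_clusters = 3 here).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.6, 0.7]]

tokens = quantize(frames, centroids)
print(tokens)  # [0, 1, 2, 1]
```

The number of centroids plays the role of num_clusters, which is one reason the vocoder settings have to match the quantizer that produced the tokens.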

mravanelli · 2024-07-18T15:17:25Z

Thank you @Chaanks!

poonehmousavi and others added 27 commits April 2, 2024 10:23

remove quntziation recepie for diffrent dataset + transfer to benchmark

6249750

remove unne

de7fe02

remove old discrete ssl models

31cd7f5

remove correcponding test rec for qunatization

e900fe2

add new discrete interface

b6ce3c5

update kmeans recepie +chckpointing

33b010f

fix discrete tokenizer docstring

47ee5a4

fix test

be2916b

fix docstring

5bcf357

Update Unit Hifi-GAN and Add LibriTTS recipe

f3660c2

Update Unit Hifi-GAN LibriTTS

1bc0e3f

Add an option to bypass token embedding layer

450b7a3

Update LibriTTS data preparation

d4c5395

Update LJSPeech hifigan_unit yaml

54f8c62

Merge branch 'develop' of https://github.com/Chaanks/speechbrain into…

6b67937

… sb_discrete_ssl_new_vocoder

Fix bug in discrete_ssl interface

ba8fb41

Update LJSpeech unit_hifigan

7c308b4

Add LibriTTS HiFi-GAN Unit

e9f61ff

Fix bug

4051517

Add tests

76a5711

Update train_spk.yaml

056468d

fix consistency tests

7f9e87e

Merge branch 'sb_discrete_ssl_new_vocoder' of https://github.com/Chaa…

4cf709b

…nks/speechbrain into sb_discrete_ssl_new_vocoder

fix pre-commit tests

59907ef

fix pre-commit tests

992421f

fix doc tests

a56ffc4

fix doc tests

73e7026

mravanelli requested a review from poonehmousavi June 17, 2024 16:19

mravanelli assigned Chaanks Jun 17, 2024

mravanelli added the enhancement New feature or request label Jun 17, 2024

Merge branch 'speechbrain:develop' into sb_discrete_ssl_new_vocoder

74b2fa1

poonehmousavi reviewed Jul 9, 2024

Chaanks added 5 commits July 13, 2024 16:07

Resolve comments

077bb7f

black

d13780e

tests

3df6b05

fix docs

50bf6ec

fix docs & yaml

247c994

Chaanks added 5 commits July 17, 2024 14:26

update README & yaml

2836264

Make pooling optional for backward compatibility

4b77335

Remove old quantization recipe

6c27957

Update for new HiFi-GAN interface

e870154

Merge branch 'develop' into sb_discrete_ssl_new_vocoder

b2d8dd4

Chaanks added 3 commits July 17, 2024 21:55

Refactor vocoder path & resolve conflicts

d714142

Merge branch 'sb_discrete_ssl_new_vocoder' of https://github.com/Chaa…

54ad5ab

…nks/speechbrain into sb_discrete_ssl_new_vocoder

Merge branch 'develop' into sb_discrete_ssl_new_vocoder

d092f63

Chaanks added 3 commits July 18, 2024 16:47

update LJSpeech and LibriTTS README

fdebf80

update LJSpeech and LibriTTS README

caaefcd

update LJSpeech and LibriTTS README

1cdbc76

mravanelli self-requested a review July 18, 2024 15:17

mravanelli approved these changes Jul 18, 2024

mravanelli merged commit 99052e0 into speechbrain:develop Jul 18, 2024
