HiFi-GAN vocoder for discrete tokens #2571

Merged: mravanelli merged 44 commits into speechbrain:develop from Chaanks:sb_discrete_ssl_new_vocoder on Jul 18, 2024

@Chaanks Chaanks commented Jun 11, 2024

Collaborator

What does this PR do?

This PR adds HiFi-GAN support for the new SSL discrete tokens, along with support for bitrate-scalable training.
It includes recipes for LJSpeech and LibriTTS.

I made a few changes to the LibriTTS data preparation script. Please check them.

Before submitting
  • [x] Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@mravanelli mravanelli requested a review from poonehmousavi June 17, 2024 16:19
@mravanelli mravanelli added the enhancement New feature or request label Jun 17, 2024
poonehmousavi commented Jul 4, 2024

Collaborator

Thanks @Chaanks for this PR. It looks really good. I just added a couple of comments. Apart from them:

  1. Make sure that all the tests pass. You can use the following command to run them:
    python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Task"], filters=[["TTS"]], do_checks=True, run_opts="--device=cuda")) else print("TEST PASSED")'

  2. Please transfer all the vocoder checkpoints from your HF to SpeechBrain and fix the references in the README and docstring.

  3. The kmeans repository has been transferred to speechbrain/SSL_Quantization; please update it in your recipe.

  4. Also, once this PR is merged, you might need to change the path to discrete_ssl.py, since it has been moved to a different folder.

Comment thread speechbrain/lobes/models/HifiGAN.py Outdated
Comment thread speechbrain/lobes/models/HifiGAN.py
Comment thread speechbrain/inference/vocoders.py
Comment thread recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/extract_code.py
logger.info(f"Loading K-means model from {kmeans_ckpt} ...")
kmeans_model = joblib.load(open(kmeans_ckpt, "rb"))
kmeans_model.verbose = False
discrete_encoder = DiscreteSSL(

Collaborator

better to define in yaml file and pass as an argument

Collaborator Author

See my other comment. I added a parameter for the save path.

@@ -500,11 +505,14 @@ def audio_pipeline(utt_id, wav, segment):
"data_folder": hparams["save_folder"],

Collaborator

Isn't it possible to do it in the forward function?

Comment thread recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml
Comment thread tests/recipes/LibriTTS.csv Outdated
Task,Dataset,Script_file,Hparam_file,Data_prep_file,Readme_file,Result_url,HF_repo,test_debug_flags,test_debug_checks
TTS,LibriTTS,recipes/LibriTTS/vocoder/hifigan/train.py,recipes/LibriTTS/vocoder/hifigan/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/gjs1kslxkxz819q/AABPriN4dOoD1qL7NoIyVk0Oa?dl=0 ,https://huggingface.co/speechbrain/tts-hifigan-libritts-16kHz,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000,
TTS,LibriTTS,recipes/LibriTTS/TTS/mstacotron2/train.py,recipes/LibriTTS/TTS/mstacotron2/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,https://www.dropbox.com/sh/ti2vk7sce8f9fgd/AABcDGWCrBvLX_ZQs76mlJRYa?dl=0,,--batch_size=1 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000,
TTS,LJSpeech,recipes/LibriTTS/vocoder/hifigan_unit/train.py,recipes/LibriTTS/vocoder/hifigan_unit/hparams/train.yaml,recipes/LibriTTS/libritts_prepare.py,recipes/LibriTTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"

Collaborator

The tests failed with the following errors:
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/log.txt
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/env.log
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/train_spk.py
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/hyperparams.yaml
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/1/target.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/synthesized.wav
ERROR: The recipe LibriTTS_row_05 does not contain the expected file tests/tmp/LibriTTS_row_05/samples/2/target.wa
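
These errors correspond to the `file_exists=[...]` entry in the `test_debug_checks` column of the CSV row. As a rough illustration only, the sketch below shows how such an entry could be parsed and checked against an output folder. The helpers `parse_file_exists` and `missing_files` are hypothetical names introduced here and are not the actual SpeechBrain test utilities.

```python
import os
import tempfile


def parse_file_exists(check: str) -> list[str]:
    """Parse a 'file_exists=[a,b,c]' entry into a list of relative paths."""
    prefix = "file_exists=["
    if not (check.startswith(prefix) and check.endswith("]")):
        raise ValueError(f"Unsupported check: {check!r}")
    body = check[len(prefix):-1]
    return [item.strip() for item in body.split(",") if item.strip()]


def missing_files(output_folder: str, check: str) -> list[str]:
    """Return the expected files that are not present under output_folder."""
    return [
        rel
        for rel in parse_file_exists(check)
        if not os.path.isfile(os.path.join(output_folder, rel))
    ]


if __name__ == "__main__":
    # Hypothetical example: only train_log.txt exists in the output folder.
    check = "file_exists=[train_log.txt,samples/1/target.wav]"
    with tempfile.TemporaryDirectory() as out:
        open(os.path.join(out, "train_log.txt"), "w").close()
        for rel in missing_files(out, check):
            print(f"ERROR: missing expected file {os.path.join(out, rel)}")
```

Under this reading, every file listed in the check is reported when the recipe stops before writing any output, which would match a report where even `train_log.txt` is missing.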

Comment thread tests/recipes/LJSpeech.csv Outdated
TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0,https://huggingface.co/speechbrain/tts-hifigan-ljspeech,--epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"
TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/diffwave/train.py,recipes/LJSpeech/TTS/vocoder/diffwave/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--number_of_epochs=2 --data_folder=tests/samples/ASR --train_json=tests/samples/annotation/ASR_train.json --valid_json=tests/samples/annotation/ASR_dev.json --test_json=tests/samples/annotation/ASR_dev.json --skip_prep=True --sample_rate=16000 --num_workers 0,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml]"
TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --kmeans_folder=null,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"
TTS,LJSpeech,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/train.py,recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml,recipes/LJSpeech/ljspeech_prepare.py,recipes/LJSpeech/TTS/README.md,,,--batch_size=2 --epochs=2 --data_folder=tests/samples/TTS --train_json=tests/samples/annotation/TTS_train.json --valid_json=tests/samples/annotation/TTS_train.json --test_json=tests/samples/annotation/TTS_train.json --skip_prep=True --sample_rate=16000 --codes_folder=tests/samples/TTS/codes --skip_extract=True,"file_exists=[train_log.txt,log.txt,env.log,train.py,hyperparams.yaml,samples/1/synthesized.wav,samples/1/target.wav,samples/2/synthesized.wav,samples/2/target.wav]"

Collaborator

    ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/train_log.txt
    ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/synthesized.wav
    ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/1/target.wav
    ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/synthesized.wav
    ERROR: The recipe LJSpeech_row_07 does not contain the expected file tests/tmp/LJSpeech_row_07/samples/2/target.wav

VALID_JSON = "valid.json"
TEST_JSON = "test.json"

ENCODER_CLASSES = {

Collaborator

You could use this in the YAML:

ssl_model: !apply:speechbrain.utils.hparams.choice
    value: !ref <ssl_model_type>
    choices:
        wavlm: !new:speechbrain.lobes.models.huggingface_transformers.wavlm.WavLM
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        hubert: !new:speechbrain.lobes.models.huggingface_transformers.hubert.HuBERT
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
        wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
            source: !ref <token_model_src>
            save_path: !ref <pretrained_model_save_folder>
            freeze: !ref <freeze_token_model>
            output_all_hiddens: True
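
For readers unfamiliar with this pattern, the selection in the YAML above amounts to a lookup keyed on `ssl_model_type`. The plain-Python sketch below illustrates only that idea. The `choose` helper is a hypothetical stand-in written for this example, not the implementation of `speechbrain.utils.hparams.choice`, and the placeholder strings stand in for the instantiated model objects.

```python
def choose(value, choices, default=None):
    """Return the entry of `choices` keyed by `value`, or `default` if absent."""
    return choices.get(value, default)


# Placeholder strings stand in for the WavLM / HuBERT / Wav2Vec2 objects.
ssl_models = {"wavlm": "WavLM", "hubert": "HuBERT", "wav2vec2": "Wav2Vec2"}

ssl_model = choose("hubert", ssl_models)
print(ssl_model)  # prints "HuBERT"
```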

@Chaanks Chaanks Jul 13, 2024

Collaborator Author

I prefer the way I'm doing it currently. The ssl_model is only used for token extraction and not during training, so I don't see why we need to instantiate the model in the YAML. Doing so would also require manual cleanup to free memory for HiFi-GAN training.
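
A minimal sketch of the approach described in this comment, under stated assumptions: the encoder is looked up by name in a dictionary of classes, instantiated only for token extraction, and released before training. `DummyEncoder` and `extract_tokens` are hypothetical names used for illustration; only the `ENCODER_CLASSES` name and the `facebook/hubert-large-ll60k` source string appear in this PR.

```python
import gc


class DummyEncoder:
    """Stand-in for a large SSL model (hypothetical, for illustration only)."""

    def __init__(self, source):
        self.source = source

    def encode(self, wav):
        # Pretend to emit one discrete token per input sample.
        return [i % 4 for i, _ in enumerate(wav)]


# Map names to classes, not instances, so nothing is loaded until needed.
ENCODER_CLASSES = {"hubert": DummyEncoder, "wavlm": DummyEncoder}


def extract_tokens(encoder_name, source, wavs):
    """Instantiate the encoder, extract tokens, then release it."""
    encoder = ENCODER_CLASSES[encoder_name](source)
    tokens = [encoder.encode(wav) for wav in wavs]
    # Drop the only reference so the model can be garbage-collected
    # before vocoder training starts.
    del encoder
    gc.collect()
    return tokens


tokens = extract_tokens("hubert", "facebook/hubert-large-ll60k", [[0.1, 0.2, 0.3]])
print(tokens)
```

With a real PyTorch model on GPU, one would typically also call `torch.cuda.empty_cache()` after dropping the reference. That step is left out here to keep the sketch dependency-free.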

Collaborator

I don't see what the problem is with defining it in the YAML file instead of the Python file. Why is there a difference for manual cleanup?

@mravanelli

Collaborator

Thank you @Chaanks! Here are some comments:

  1. hifi_gan_unit => Is this name from the literature?

  2. encoder_hub: facebook/hubert-large-ll60k => Add comments in the YAML and in the README file with the options that users need to set for wav2vec2 and wavLM.

  3. In the README.md: "This vocoder is a neural network designed to transform discrete self-supervised representations into waveform data and is suitable for speech-to-speech translation on top of CVSS/S2ST models" => This is needed not only for speech translation but for generative tasks in general, such as TTS, speech enhancement, separation, and voice cloning. We should clarify that. Maybe we can also mention that we use this in our DASB benchmark.

  4. In the yaml, we need to add a comment on "kmeans_folder:" to describe what it is. Is this supposed to be a !PLACEHOLDER variable?

  5. In https://huggingface.co/speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS, we can add a reference to our Interspeech paper (or any other paper) that describes the scalable vocoder well. We need to reference it in README.md as well.

  6. In README.md, we normally reference the Dropbox folder with the checkpoints and logs. Do we have it?

  7. It looks like recipes/LJSpeech/TTS/quantization is outdated, right? If so, it can be deleted and the README file should be updated with the right instructions.

I would recommend double-checking that the whole vocoder training runs when following the instructions in the README file step by step.

Chaanks commented Jul 17, 2024

Collaborator Author

1. Yes, it comes from this work: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations, and was then used for S2ST. What do you think of renaming it hifi_gan_discrete? It would better match the discrete module of SB.

6. Yes, we have it uploaded for the Benchmark paper. Should we directly add a reference to this?

7. Deleted. I also added backward compatibility to the Vocoder for previous models without scalability (https://huggingface.co/speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech).

poonehmousavi commented Jul 17, 2024

Collaborator
  1. Yes it comes from this work: Speech Resynthesis from Discrete Disentangled Self-Supervised Representations and was then used for S2ST. What do you think of renaming it hifi_gan_discrete ? It will better correlate with the discrete module of SB.
  2. Yes, we have it uploaded for the Benchmark paper. Should we directly add a reference to this?
  3. Deleted. I also added backward compatibility to the Vocoder for previous models without scalability (https://huggingface.co/speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech).

1. I personally prefer the name hifi_gan_discrete.
2. We could add the Interspeech paper as a reference, since the main idea of the scalable vocoder is introduced there.
3. The README file says to install extra-dependencies, but I could not find it. Also, I think it would be better to add some explanation of train_spk.yaml to the README file.

@mravanelli

Collaborator

Thank you @Chaanks!
It looks like everything is running.
I only have this final comment on the README files (speechbrain/recipes/LJSpeech/TTS/README.md and speechbrain/recipes/LJSpeech/TTS/README.md).

In the README.md it is written, "The kmeans_folder, kmeans_dataset and num_clusters should be specified based on SSL_Quantization."

This part should be elaborated more. Here, to simplify running the recipe, we created an HF repository with all the needed models.

However, it is also important to describe what users should do to use their own quantization model, or to replicate from scratch what we did. I think they should go to speechbrain/recipes/LJSpeech/quantization/, run the quantizers, and store the result somewhere. These steps should be described and tested, as I feel the documentation needs to connect the vocoder and quantization parts better.

@mravanelli mravanelli self-requested a review July 18, 2024 15:17
@mravanelli

Collaborator

Thank you @Chaanks!

@mravanelli mravanelli merged commit 99052e0 into speechbrain:develop Jul 18, 2024

3 participants