Add TF32 flags + general improvements in DDP backend #2682

Merged
Adel-Moumen merged 33 commits into speechbrain:develop from Adel-Moumen:improve_backend
Sep 14, 2024

@Adel-Moumen Adel-Moumen commented Sep 11, 2024

Collaborator

What does this PR do?

This PR improves our DDP backend. Most of the changes address corner cases in SpeechBrain.

List of additions/modifications:

  • Add a context manager so that the guarded block of code always goes through __enter__ and __exit__. This guarantees that MAIN_PROC_ONLY is always incremented and then decremented, even when a sub-function throws an exception (which could cause an issue in the past).
  • Fix an issue with collect_files(). In DDP, you expect to see a symlink to your pre-trained model in save/. This led to an issue where only the main process could load the LM, while the other processes threw an error because they could not load the collected models. One effective workaround was to set the internal_ddp_handling flag to True, but this is not necessary because run_on_main also works on a system that is not distributed.
  • Add a new get_logger function as a drop-in replacement for logger.get_logger. The key idea is to be able to choose whether to log on all processes or only on the main process. Generally speaking, we only want the main process to log information (e.g. checkpoint loading). In some cases, such as seed_everything, you want to know the seed for each process, but these are corner cases that can be handled by passing main_process_only=False. This new logger also makes it possible to emit a warning only once, as requested in some previous PRs.
  • TensorFloat-32 (TF32) is now enabled by default, which is critical for performance and does not lead to any instabilities.
  • Improve the DDP barrier function to support the case where NCCL is used.
  • Add some new distributed functions that can be helpful.
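
The exception-safety argument in the first item above can be illustrated with a minimal sketch. It is not the code from this PR: the class name `MainProcessContext` and the helpers `failing_step` and `demo` are hypothetical, and `MAIN_PROC_ONLY` is modeled here as a plain module-level integer, which is an assumption about how the counter is stored.

```python
# Stand-in for the MAIN_PROC_ONLY counter mentioned in the PR description.
# Assumption: it is a simple integer that is > 0 while a main-process-only
# section is active.
MAIN_PROC_ONLY = 0


class MainProcessContext:
    """Hypothetical context manager: increment on entry, decrement on exit."""

    def __enter__(self):
        global MAIN_PROC_ONLY
        MAIN_PROC_ONLY += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        global MAIN_PROC_ONLY
        # Runs whether the block finished normally or raised.
        MAIN_PROC_ONLY -= 1
        return False  # do not swallow the exception


def failing_step():
    """Simulates a sub-function that throws inside the guarded block."""
    raise RuntimeError("sub-function failed")


def demo():
    """Returns the counter value inside the block and after the failure."""
    try:
        with MainProcessContext():
            inside = MAIN_PROC_ONLY
            failing_step()
    except RuntimeError:
        pass
    return inside, MAIN_PROC_ONLY
```

With a manual increment before the call and a manual decrement after it, the exception raised by `failing_step` would skip the decrement and leave the counter stuck at 1. With the context manager, `__exit__` restores it before the exception propagates.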
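
The `main_process_only` behavior described in the get_logger item above could look roughly like the following sketch, built only on the standard `logging` module. The names `MultiProcessLogger` and `is_main_process` are hypothetical, and reading the rank from the `RANK` environment variable (as set by launchers such as torchrun) is an assumption. The function actually added by this PR may be implemented differently.

```python
import logging
import os


def is_main_process() -> bool:
    # Assumption: the launcher exports RANK; if it is unset we treat the
    # run as a single, non-distributed process.
    return int(os.environ.get("RANK", "0")) == 0


class MultiProcessLogger(logging.LoggerAdapter):
    """Hypothetical adapter that drops messages on non-main processes."""

    def log(self, level, msg, *args, main_process_only=True, **kwargs):
        # info(), warning(), etc. all funnel through log(), so the extra
        # keyword is accepted by every convenience method.
        if main_process_only and not is_main_process():
            return
        super().log(level, msg, *args, **kwargs)


def get_logger(name: str) -> MultiProcessLogger:
    """Illustrative stand-in for the get_logger described above."""
    return MultiProcessLogger(logging.getLogger(name), {})
```

Under these assumptions, `get_logger(__name__).info("Loading checkpoint")` is emitted once, on rank 0, while `get_logger(__name__).info("seed: 1234", main_process_only=False)` is emitted by every process.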
Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@Adel-Moumen Adel-Moumen changed the title from "Add TF32 flags + some small improvements on DDP" to "Add TF32 flags + general improvements in DDP backend" Sep 12, 2024
Comment thread speechbrain/utils/logger.py
@Adel-Moumen

Collaborator Author

After some checks, I confirm that DDP works in both multi-node and single-node setups on Compute Canada. (This was already the case; the point is that this PR does not introduce a regression in DDP.)

@pplantinga pplantinga left a comment

Collaborator

Looks good to me!

Comment thread setup.py
Comment thread speechbrain/utils/distributed.py Outdated
Comment thread speechbrain/utils/logger.py
@Adel-Moumen Adel-Moumen marked this pull request as ready for review September 14, 2024 11:38
@Adel-Moumen Adel-Moumen merged commit 7037eb2 into speechbrain:develop Sep 14, 2024
@Adel-Moumen Adel-Moumen deleted the improve_backend branch September 14, 2024 17:04

3 participants