25,000 hours of diverse English ASR data (dataset name hidden) (code from Samsung AI Center Cambridge) #2802

Closed
TParcollet wants to merge 41 commits into speechbrain:develop from TParcollet:titou/LargeScaleASR

@TParcollet

Collaborator

This PR introduces the data preparation of [insert hidden name] (anonymous for now due to conference rules).

The 6 subsets are (more details in the readme):

  1. large contains 25,000 hours of read / spontaneous and clean / noisy transcribed speech.
  2. medium contains 2,500 hours of read / spontaneous and clean / noisy transcribed speech.
  3. small contains 250 hours of read / spontaneous and clean / noisy transcribed speech.
  4. clean contains 13,000 hours of read and clean / less noisy transcribed speech.
  5. dev contains 17 hours.
  6. test contains 17 hours.

According to the models already trained on it, this gives the best English ASR model that SpeechBrain has had so far. This PR is just here so that someone can recreate the dataset and upload it to HuggingFace (as the data preparation leads to a properly sharded HuggingFace dataset)...

The code is in progress until the dataset has been uploaded by someone onto HuggingFace.
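
To illustrate what "properly sharded" could mean in practice, here is a minimal sketch that greedily packs utterances into shards bounded by total audio duration. This is not the PR's actual sharding logic, which is not shown in this thread. The function name `shard_by_duration`, the `(utt_id, duration_seconds)` row layout and the 60-second budget are all assumptions made for the example.

```python
def shard_by_duration(rows, max_shard_seconds):
    """Greedily pack (utt_id, duration_seconds) rows into shards.

    Hypothetical helper: a shard is closed as soon as adding the next
    utterance would push its total duration above max_shard_seconds.
    """
    shards, current, current_seconds = [], [], 0.0
    for utt_id, duration in rows:
        # Close the current shard if this utterance would overflow it.
        if current and current_seconds + duration > max_shard_seconds:
            shards.append(current)
            current, current_seconds = [], 0.0
        current.append(utt_id)
        current_seconds += duration
    if current:
        shards.append(current)
    return shards


# Toy manifest; the IDs and durations are made up.
rows = [("a", 40.0), ("b", 30.0), ("c", 50.0), ("d", 10.0)]
shards = shard_by_duration(rows, max_shard_seconds=60.0)
```

Bounding shards by duration rather than by row count keeps them roughly equal in size when utterance lengths vary a lot, which is likely with mixed read and spontaneous speech.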

@TParcollet TParcollet added the recipes Changes to recipes only (add/edit) label Jan 16, 2025
@TParcollet TParcollet self-assigned this Jan 16, 2025
@TParcollet

Collaborator Author

Here I'd love to have opinions from @Adel-Moumen, @pplantinga and @mravanelli about having this kind of recipe in SB. It's quite uncommon to have such a big part of the code devoted to preparing a dataset. I could use an official review from someone as well.

@pplantinga

Collaborator

I think this sort of recipe is sorely needed for open-source research. NeMo has a similar recipe that is not open-sourced for their ASRset. I'm wondering if this recipe could be developed further to include more sophisticated sample filtering: automatically transcribe each sample with multiple ASR systems and keep the samples with low rates of transcription differences. From what I understand, this is a common technique for large-scale ASR systems these days.
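
A minimal sketch of that filtering idea, assuming the transcripts from the different ASR systems are already available as plain strings. The function names, the normalisation by the longer transcript and the `max_rate` threshold are assumptions for illustration, not part of this PR or of any existing SpeechBrain API.

```python
from itertools import combinations


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def disagreement_rate(a, b):
    """Word-level edit distance normalised by the longer transcript."""
    wa, wb = a.lower().split(), b.lower().split()
    longest = max(len(wa), len(wb))
    if longest == 0:
        return 0.0
    return word_edit_distance(wa, wb) / longest


def keep_sample(transcripts, max_rate=0.1):
    """Keep a sample only if every pair of transcripts disagrees by at most max_rate."""
    return all(
        disagreement_rate(a, b) <= max_rate
        for a, b in combinations(transcripts, 2)
    )


# Toy example: each utterance has three hypothetical ASR outputs.
samples = {
    "utt1": ["the cat sat on the mat", "the cat sat on the mat", "the cat sat on a mat"],
    "utt2": ["hello world", "yellow word", "hello word"],
}
kept = [utt for utt, hyps in samples.items() if keep_sample(hyps, max_rate=0.2)]
```

In this toy run only `utt1` survives, since its hypotheses differ by at most one word in six, while two of the `utt2` hypotheses disagree on every word. In a real pipeline the dataset's reference transcript could be added to the list as one more "system", and the threshold would need tuning per data source.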

As for the recipe itself, it looks like it repeats a lot of the dataset preparation for datasets we already have. Is there any way we can re-use some of the scripts already available?

@TParcollet

Collaborator Author

Agreed, @pplantinga. Let me answer the recipe part. All the scripts are different because the CSV rows and the steps / filtering are not the same. There is also some file copying involved, so I cannot reuse the existing data prep. As you can see in the PR, there is also a new TextNormaliser class that I use. The ASR recipe PR will be much easier to review and merge ...

@Adel-Moumen

Collaborator

Hey, what should we do about this PR? If I am not mistaken, at some point you were thinking of closing this PR, right?

@TParcollet

Collaborator Author

Yes, but not now.

@TParcollet TParcollet closed this Oct 9, 2025