Skip to content

SpeechLLM (with LLaMA) and Conformer recipe for speech translation on CoVoST (Code from Samsung AI Center Cambridge) #2865

Merged
Adel-Moumen merged 37 commits into
speechbrain:developfrom
TParcollet:ast_covost
Aug 4, 2025
Merged

SpeechLLM (with LLaMA) and Conformer recipe for speech translation on CoVoST (Code from Samsung AI Center Cambridge) #2865
Adel-Moumen merged 37 commits into
speechbrain:developfrom
TParcollet:ast_covost

Conversation

@TParcollet

@TParcollet TParcollet commented Mar 17, 2025

Copy link
Copy Markdown
Collaborator

This PR introduces two speech translation recipes: one training a conformer encoder decoder with MT training (ASR+ST) from scratch. The other finetuning an XLS-R into an adapter before a LLama 3 7B model. I'd need some help to test these ones. These two recipes are basic things that we can find on the literature, nothing new, but should serve a good entry points for baselines of papers.

This is based on the new LLaMA interface of #2850

I only tried for En - De.

@TParcollet TParcollet added the work in progress Not ready for merge label Mar 17, 2025
@TParcollet

Copy link
Copy Markdown
Collaborator Author

I could use a first quick review on looking briefly at the code (forget about llama.py as it's from another PR). Maybe @Adel-Moumen / @pplantinga ?

@poonehmousavi this is an example as how to refactor multiwoz.

@Adel-Moumen Adel-Moumen self-requested a review April 8, 2025 13:00

@Adel-Moumen Adel-Moumen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round (more details will come later when I will have the chance to run the recipe)

How is the LLM processing Speech embeddings ? I see in one doctstring that the LLM takes [speech embds] + [prompt] + [translation]. I was wondering but in the vLLM literature, its more common to do <image_start> <image embeddings> <image_end> + [prompt] + [translation] and I belive this might be easier for the LLM to reason about what is an image thanks to the <image> token i.e. you can ask him question and relate to the <image> token. My question is more, how easy it is for us to modify the way the model is processing the input? Can we modify it easily to accommodate this? (I understand the code and how you concat embds, but I am wondering if we could maybe think of ways to make this more beautiful and avoid having to copy and paste the recipe elsewhere, maybe a function combine_embeddings could be useful idk).,

Also, I tend to think that some things could be improved. I understnd the point of exposing the forward pass to make it easier to understand, but maybe we want here to hide some ugly tricks. For instance, I tend to think we could just have a custom generate function in our LLM interface + a custom GenerationCofing. We also need to have a pointer to the un-DDP wrapped modules to avoid having to check if module key is present etc.

Comment thread recipes/CoVoST/AST/train.py Outdated
Comment thread recipes/CoVoST/AST/train.py Outdated
Comment thread recipes/CoVoST/AST/train.py
Comment thread recipes/CoVoST/AST/train_w2v2_llama.py Outdated
Comment thread recipes/CoVoST/AST/hparams/w2v2_llama3.yaml Outdated
Comment thread recipes/CoVoST/AST/train_w2v2_llama.py Outdated
Comment on lines +95 to +98
if hasattr(self.modules.lora_llm, "module"):
gen_func = self.modules.lora_llm.module.model.generate
else:
gen_func = self.modules.lora_llm.model.generate

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before this is so ugly xD

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is absurdly ugly. I have strictly no idea how we can fix it despite spending a significant amount of time on the problem. We would have to rethink entirely our DDP structure and even then, not sure that we can do something about that.

Comment thread recipes/CoVoST/AST/train_w2v2_llama.py Outdated
Comment thread recipes/CoVoST/AST/train_w2v2_llama.py
Comment thread recipes/CoVoST/AST/train_w2v2_llama.py
Comment thread recipes/CoVoST/AST/train_w2v2_llama.py Outdated
Comment thread recipes/CoVoST/AST/hparams/w2v2_llama3.yaml
Comment thread recipes/CoVoST/AST/hparams/w2v2_llama3.yaml
Comment thread recipes/CoVoST/AST/hparams/w2v2_llama3.yaml Outdated
@TParcollet TParcollet removed the work in progress Not ready for merge label Jul 15, 2025
@TParcollet TParcollet added the ready to review Waiting on reviewer to provide feedback label Jul 15, 2025
@Adel-Moumen

Copy link
Copy Markdown
Collaborator

FYI, currently working on reviewing / making some edits. Then, will run tests and merge.

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

Added a recipe test for Conformer in the AST recipe. Couldn't do it for LLaMA since it's relying on a large pretrained LLM.

Gonna fix the pre-commit (couldn't run them on CCA).

@Adel-Moumen Adel-Moumen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

We still have minor edits to do such as the generation part, how we construct the prompt etc but I think we need to move on on that topic. Will try to get back to this one and refactor a bit.

Thanks Sensei.

@Adel-Moumen Adel-Moumen merged commit 96b00ca into speechbrain:develop Aug 4, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready to review Waiting on reviewer to provide feedback

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants