SpeechLLM (with LLaMA) and Conformer recipe for speech translation on CoVoST (Code from Samsung AI Center Cambridge) #2865
Conversation
|
I could use a first quick review on looking briefly at the code (forget about llama.py as it's from another PR). Maybe @Adel-Moumen / @pplantinga ? @poonehmousavi this is an example as how to refactor multiwoz. |
There was a problem hiding this comment.
First round (more details will come later when I will have the chance to run the recipe)
How is the LLM processing Speech embeddings ? I see in one doctstring that the LLM takes [speech embds] + [prompt] + [translation]. I was wondering but in the vLLM literature, its more common to do <image_start> <image embeddings> <image_end> + [prompt] + [translation] and I belive this might be easier for the LLM to reason about what is an image thanks to the <image> token i.e. you can ask him question and relate to the <image> token. My question is more, how easy it is for us to modify the way the model is processing the input? Can we modify it easily to accommodate this? (I understand the code and how you concat embds, but I am wondering if we could maybe think of ways to make this more beautiful and avoid having to copy and paste the recipe elsewhere, maybe a function combine_embeddings could be useful idk).,
Also, I tend to think that some things could be improved. I understnd the point of exposing the forward pass to make it easier to understand, but maybe we want here to hide some ugly tricks. For instance, I tend to think we could just have a custom generate function in our LLM interface + a custom GenerationCofing. We also need to have a pointer to the un-DDP wrapped modules to avoid having to check if module key is present etc.
| if hasattr(self.modules.lora_llm, "module"): | ||
| gen_func = self.modules.lora_llm.module.model.generate | ||
| else: | ||
| gen_func = self.modules.lora_llm.model.generate |
There was a problem hiding this comment.
same as before this is so ugly xD
There was a problem hiding this comment.
Yes, this is absurdly ugly. I have strictly no idea how we can fix it despite spending a significant amount of time on the problem. We would have to rethink entirely our DDP structure and even then, not sure that we can do something about that.
|
FYI, currently working on reviewing / making some edits. Then, will run tests and merge. |
|
Added a recipe test for Conformer in the AST recipe. Couldn't do it for LLaMA since it's relying on a large pretrained LLM. Gonna fix the pre-commit (couldn't run them on CCA). |
Adel-Moumen
left a comment
There was a problem hiding this comment.
LGTM!
We still have minor edits to do such as the generation part, how we construct the prompt etc but I think we need to move on on that topic. Will try to get back to this one and refactor a bit.
Thanks Sensei.
This PR introduces two speech translation recipes: one training a conformer encoder decoder with MT training (ASR+ST) from scratch. The other finetuning an XLS-R into an adapter before a LLama 3 7B model. I'd need some help to test these ones. These two recipes are basic things that we can find on the literature, nothing new, but should serve a good entry points for baselines of papers.
This is based on the new LLaMA interface of #2850
I only tried for En - De.