In this article, we cover everything about Meta's latest multilingual speech model, from the basics of how it works to a step-by-step implementation in Python.
Meta, the company that owns Facebook, released a new AI model called Massively Multilingual Speech (MMS) that can convert text to speech and speech to text in over 1,100 languages. It is available for free. It will help not only academics and researchers across the world but also language preservationists and activists who document endangered languages to prevent their extinction.
Uses of the MMS Model
Massively Multilingual Speech (MMS) can be used for a variety of purposes. Some of them are as follows:
- Creating Audiobooks
MMS can be used to convert books and tutorials into audiobooks. This is useful for people who have difficulty reading.
- Analyzing audio
Suppose you have a few speech audio files of a politician and you want to analyze them. You may be interested in knowing the topics they mostly talk about and identifying their main focus areas (see the sketch after this list).
- Creating audio recordings of endangered languages
MMS can be used to create audio recordings of endangered languages that are at risk of being lost forever.
- Creating closed captions
MMS can be used to create closed captions for videos and other audio content.
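As a rough illustration of the audio-analysis use case above, the sketch below counts word frequencies in a transcript. The transcript string is a hypothetical stand-in for the output of the speech-to-text code covered later in this article.

# A minimal sketch: once speech has been converted to text, simple
# word-frequency counts can hint at the speaker's main topics
from collections import Counter

# Hypothetical transcript produced by a speech-to-text model
transcript = "economy jobs economy healthcare education jobs economy"

# Skip short filler words and count the rest
words = [w for w in transcript.lower().split() if len(w) > 3]
print(Counter(words).most_common(3))
# [('economy', 3), ('jobs', 2), ('healthcare', 1)]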
How Does the MMS Model Work?
The Meta AI team made use of religious texts like the Bible, which have been translated into many languages and extensively studied for language translation research.
Although the data mainly consists of male speakers and pertains to religious content, the researchers found that their models performed equally well for male and female voices; the error rates for male and female speakers are almost the same.
Publicly available speech datasets already cover about 100 languages. The team trained a model to align these existing datasets, i.e. to make sure the audio matched up with the text.
They also developed a method called wav2vec 2.0, which helps train speech recognition models with less labeled data.
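To make this concrete, here is a minimal sketch of running a pretrained wav2vec 2.0 checkpoint. It uses the Hugging Face transformers library rather than the MMS toolchain covered later in this article, and audio_file.wav is a placeholder for any 16 kHz mono WAV file.

# !pip install transformers torch torchaudio
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# A wav2vec 2.0 checkpoint pretrained on unlabeled audio and then
# fine-tuned on labeled speech
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder path: any 16 kHz mono WAV file
waveform, sample_rate = torchaudio.load("audio_file.wav")
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each time step
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])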
Python Code: Text to Speech
In this section, we will cover how to convert text to speech with the MMS model. You can also run the code in the accompanying Colab notebook.
You can install the ttsmms library using the pip command below.
!pip install ttsmms
It is important to find the language code (ISO code) of the language whose text you want to convert to speech. You can refer to the table at the end of this article to look up the ISO code for a specific language. In the code below, I am using hin, the ISO code for the Hindi language. Replace hin.tar.gz with eng.tar.gz if your text is in English.
!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz
Extract the files from the .tar.gz archive into the data folder. Make sure you update the language code if you downloaded a different model.
!mkdir -p data && tar -xzf hin.tar.gz -C data/
In this step, we run the MMS model to convert text to speech. Don't forget to modify the language code at each step. The synthesis() call returns a dictionary containing the waveform (wav["x"]) and its sampling rate (wav["sampling_rate"]).
from ttsmms import TTS

# Load the downloaded Hindi model
tts = TTS("data/hin")

# Convert Hindi text to speech
wav = tts.synthesis("आप कैसे हैं?")
In this step, we ask Python to play the audio the model generated.
# Display Audio
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Here is the complete text-to-speech code for reference:

# Install library
!pip install ttsmms

# Download TTS model
!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz

# Extract
!mkdir -p data && tar -xzf hin.tar.gz -C data/

from ttsmms import TTS
tts = TTS("data/hin")
wav = tts.synthesis("आप कैसे हैं?")

# Display Audio
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
You can refer to the Python program below to download the audio file from Google Colab. The audio file will be saved in WAV format and named audio_file.wav.
# Download the audio file
from google.colab import files
from scipy.io import wavfile
import numpy as np

# Convert audio data to 16-bit signed integer format
audio_data = np.int16(wav["x"] * 32767)

# Save the audio data as a WAV file
wavfile.write('audio_file.wav', wav["sampling_rate"], audio_data)

# Download the audio file
files.download('audio_file.wav')
Similarly, you can convert English text to speech. See the code below; the only changes from the previous code are the language code and the input text.
!curl https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz --output eng.tar.gz
!mkdir -p data && tar -xzf eng.tar.gz -C data/

from ttsmms import TTS
tts = TTS("data/eng")
wav = tts.synthesis("It's a lovely day today and whatever you've got to do I'd be so happy to be doing it with you")

from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Python Code: Speech to Text
To convert speech to text using the MMS model, follow the steps below. You can use the accompanying Colab notebook to test the automatic speech recognition (ASR) model quickly.
Fairseq is a sequence modeling toolkit that lets us train custom models for translation, text summarization, language modeling, and other tasks.
import os

# Install audio dependencies
!apt install ffmpeg
!apt install sox

# Install nightly torchaudio build
!pip install -U --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

# Clone and install fairseq
!git clone https://github.com/pytorch/fairseq
os.chdir('fairseq')
!pip install -e .
os.environ["PYTHONPATH"] = "."

# Install easymms, a wrapper around the MMS ASR models
!pip install git+https://github.com/abdeladim-s/easymms
In the code below, we are using the MMS-FL102 model. It was trained on the FLEURS dataset and supports 102 languages. It is less memory-intensive and runs easily on the free version of Google Colab.
# @title Download Model { display-mode: "form" }
model = 'mms1b_fl102' #@param ["mms1b_fl102", "mms1b_l1107", "mms1b_all"] {allow-input: true}

if model == "mms1b_fl102":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'
elif model == "mms1b_l1107":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'
elif model == "mms1b_all":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'
You can also use the mms1b_l1107 model, which supports 1,107 languages.
If you have access to a powerful machine (or a paid version of Colab), you should use the mms1b_all model, which is trained on all the datasets (MMS-lab + FLEURS + CV + VP + MLS) for more accurate speech-to-text conversion. It supports 1,162 languages.
The next step is to download the audio file you want to convert to text. I have prepared a sample audio file and saved it to my GitHub repo. After downloading the audio file, we store it in a folder named audio_samples.
!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav
files = ['./audio_samples/audio_file_test.wav']
Make sure to update the language in the following code. I am using eng as the audio is in English. Refer to the table below to find the language code.
from easymms.models.asr import ASRModel

# Load the downloaded ASR model
asr = ASRModel(model=f'./models/{model}.pt')

# Transcribe the audio files
transcriptions = asr.transcribe(files, lang='eng', align=False)
for i, transcription in enumerate(transcriptions):
    print(f">>> file {files[i]}")
    print(transcription)
Input: /content/fairseq/audio_samples/audio_file_test.wav
Output: It's so lovely day today and what ever you've got to do would be so happy to b doing it with you
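The transcription above is close to the original sentence but not perfect. If you have the reference text, you can quantify the accuracy with the word error rate (WER). The sketch below uses the jiwer library, which is an assumption on my part and not part of the MMS toolchain.

# !pip install jiwer
from jiwer import wer

# Reference text spoken in the sample audio file
reference = "it's a lovely day today and whatever you've got to do i'd be so happy to be doing it with you"

# Transcription produced by the model above
hypothesis = "it's so lovely day today and what ever you've got to do would be so happy to b doing it with you"

# Fraction of words substituted, inserted, or deleted
print(f"WER: {wer(reference, hypothesis):.2f}")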
Here is the complete speech-to-text code for reference:

import os

# Install audio dependencies
!apt install ffmpeg
!apt install sox

# Install nightly torchaudio build
!pip install -U --pre torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

# Clone and install fairseq
!git clone https://github.com/pytorch/fairseq
os.chdir('fairseq')
!pip install -e .
os.environ["PYTHONPATH"] = "."

# Install easymms
!pip install git+https://github.com/abdeladim-s/easymms

# @title Download Model { display-mode: "form" }
model = 'mms1b_fl102' #@param ["mms1b_fl102", "mms1b_l1107", "mms1b_all"] {allow-input: true}

if model == "mms1b_fl102":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'
elif model == "mms1b_l1107":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'
elif model == "mms1b_all":
    !wget -P ./models 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'

# Download sample audio file
!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav
files = ['./audio_samples/audio_file_test.wav']

from easymms.models.asr import ASRModel

# Load the downloaded ASR model and transcribe the audio files
asr = ASRModel(model=f'./models/{model}.pt')
transcriptions = asr.transcribe(files, lang='eng', align=False)
for i, transcription in enumerate(transcriptions):
    print(f">>> file {files[i]}")
    print(transcription)
In case you have an audio file in MP3 format, it is important to convert it to WAV format before using the model. Make sure to set the sample rate to 16 kHz, as shown below.
!pip install pydub
!apt install ffmpeg

from pydub import AudioSegment

# Convert MP3 to WAV
sound = AudioSegment.from_file('./audio_samples/MP3_audio_file_test.mp3', format="mp3")

# Resample to 16 kHz as required by the model
sound = sound.set_frame_rate(16000)
sound.export('./audio_samples/MP3_audio_file_test.wav', format="wav")