Complete Guide to Massively Multilingual Speech (MMS) Model


In this article we cover everything about Meta's latest multilingual speech model, from the basics of how it works to its step-by-step implementation in Python.

Meta, the company that owns Facebook, released a new AI model called Massively Multilingual Speech (MMS) that can convert text to speech and speech to text in over 1,100 languages. It is available for free. It will help not only academicians and researchers across the world, but also language preservationists and activists who document endangered languages to prevent their extinction.

MMS is trained on a large dataset of text and audio in over 1,100 languages. Another strength of the model is that the audio it generates sounds natural, close to human speech. It can also identify more than 4,000 spoken languages.

Meta - Massively Multilingual Speech Model

Possible Uses of MMS Model

Massively Multilingual Speech (MMS) can be used for a variety of purposes. Some of them are as follows :

  1. Creating Audiobooks

    MMS can be used to convert books and tutorials into audiobooks. This is useful for people who have difficulty reading.

  2. Preparing Documentation

    In the workplace, preparing documentation is one of the key tasks of an analyst/coder. Sometimes, we have videos and want to convert them into structured documents so that people don't have to go through lengthy videos to understand a specific topic. We can convert the videos to audio and then use this model to convert the audio into text.

  3. Analyzing audio

    Suppose you have a few speech audio files of a politician and you want to analyze them. You may be interested in knowing the topics they talk about most and identifying their main focus areas.

  4. Creating audio recordings of endangered languages

    MMS can be used to create audio recordings of endangered languages. This is important because endangered languages are in danger of being lost forever. By creating audio recordings of these languages, we can help preserve them for future generations.

  5. Providing closed captioning for videos and other audio content

    We can use this model to provide closed captioning for videos and other audio content. It benefits people who are deaf, hard of hearing, or have learning disabilities.
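Use case 3 above can be sketched in a few lines: once the audio has been transcribed (speech-to-text is covered later in this article), a simple word-frequency pass hints at the speaker's recurring topics. The transcript below is invented for illustration.

```python
from collections import Counter
import re

# Toy sketch for the "analyzing audio" use case: count content words in a
# transcript to spot recurring topics. The transcript text is made up.
transcript = ("the economy needs reform and the economy needs jobs "
              "jobs for the young and the economy of tomorrow")

STOPWORDS = {"the", "and", "of", "for", "needs"}
words = re.findall(r"[a-z']+", transcript.lower())
topics = Counter(w for w in words if w not in STOPWORDS)

print(topics.most_common(2))  # -> [('economy', 3), ('jobs', 2)]
```

In practice you would run this over full transcripts and use a proper topic model, but even raw frequencies surface a speaker's focus areas.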

How Does the MMS Model Work?

There was a challenge in collecting audio data for thousands of languages, as existing speech datasets covered only a limited number of languages. To address this, the Meta AI team made use of religious texts like the Bible, which have been translated into many languages and extensively studied in language translation research. These translations came with publicly available audio recordings of people reading the texts in different languages: approximately 32 hours of data per language on average.

Although the data mainly consists of male speakers and pertains to religious content, the researchers found that their models performed equally well for male and female voices; the error rates for both groups of speakers are almost the same. They also discovered that the model was not excessively biased towards producing religious language, possibly due to their use of a Connectionist Temporal Classification approach.
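Connectionist Temporal Classification (CTC) turns a sequence of per-frame predictions into a transcript by merging consecutive repeats and dropping a special blank symbol. A toy illustration of that greedy decoding step (using `_` as the blank):

```python
BLANK = "_"

def ctc_collapse(frames):
    """Greedy CTC decoding: merge consecutive repeated labels, drop blanks."""
    out, prev = [], None
    for f in frames:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return "".join(out)

print(ctc_collapse("hh_eell_lloo"))  # -> "hello"
```

Note how the blank between the two `l` runs lets CTC output a genuine double letter instead of collapsing it.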

Publicly available speech datasets already cover about 100 languages. The team trained an alignment model over these existing datasets, meaning they made sure each stretch of audio matched up with the corresponding text.

They also built on wav2vec 2.0, Meta's self-supervised method that reduces the amount of labeled data needed to train speech recognition models. Normally, 32 hours of data per language is not enough to train these models effectively. Using this self-supervised approach, however, they pre-trained models on a massive amount of speech data, around 500,000 hours in over 1,400 languages. This allowed them to create models that could recognize speech and identify languages in a multilingual setting.

Python Code : Text to Speech

In this section, we will cover how to convert text to speech with the MMS model. A Colab notebook is also available if you want to try the steps quickly.

Install library

You can install the ttsmms library using pip.

!pip install ttsmms
Download TTS model

It is important to find out the language code (ISO code) for the language whose text you want to convert to speech. You can refer to the table at the end of this article to look up the ISO code for a specific language. In the code below, I am using the hin ISO code for Hindi. Replace hin.tar.gz with eng.tar.gz if your text is in English.

!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz
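All the TTS checkpoints follow the same URL pattern, so a small helper (an illustration, not part of ttsmms) can build the download link for any ISO code:

```python
# Build the TTS model download URL for a given ISO 639-3 code,
# following the pattern of the curl command above.
BASE_URL = "https://dl.fbaipublicfiles.com/mms/tts/{code}.tar.gz"

def model_url(code: str) -> str:
    return BASE_URL.format(code=code)

print(model_url("hin"))  # -> https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz
```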
Extract

Unzip (extract) the files from the .tar.gz archive and move them to the data folder. Make sure you update the language code.

!mkdir -p data && tar -xzf hin.tar.gz -C data/
Run MMS Model

We are running the MMS model in this step to convert text to speech. Don't forget to modify the language code at each step.

from ttsmms import TTS
tts=TTS("data/hin") 
wav=tts.synthesis("आप कैसे हैं?")
Play Audio

In this step, we ask Python to play the audio generated by the model.

# Display Audio 
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Complete Code : Text to Speech
# Install library
!pip install ttsmms

# Download TTS model
!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz

# Extract 
!mkdir -p data && tar -xzf hin.tar.gz -C data/

from ttsmms import TTS
tts=TTS("data/hin") 
wav=tts.synthesis("आप कैसे हैं?")

# Display Audio 
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Download Audio File

You can refer to the Python program below to download the audio file from Google Colab. The audio file will be saved in WAV format as audio_file.wav.

# Download the audio file
from google.colab import files
from scipy.io import wavfile
import numpy as np

# Convert audio data to 16-bit signed integer format
audio_data = np.int16(wav["x"] * 32767)

# Save the audio data as a WAV file
wavfile.write('audio_file.wav', wav["sampling_rate"], audio_data)

# Download the audio file
files.download('audio_file.wav')

Similarly, you can convert English text to speech. See the code below. The only changes from the previous code are the language code and the input text.

!curl https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz --output eng.tar.gz
!mkdir -p data && tar -xzf eng.tar.gz -C data/
  
from ttsmms import TTS
tts=TTS("data/eng") 
wav=tts.synthesis("It's a lovely day today and whatever you've got to do I'd be so happy to be doing it with you")
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])

Python Code : Speech to Text

To convert speech to text using the MMS model, follow the steps below. A Colab notebook is also available for testing the model quickly.

Clone Fairseq repository

Fairseq is a sequence modeling toolkit that lets us train custom models for translation, text summarization, language modeling, etc.

!git clone https://github.com/pytorch/fairseq
Install the required libraries
%cd "/content/fairseq"
!pip install --editable ./ 
!pip install tensorboardX
Download MMS Model

In the code below, we are using the MMS-FL102 model. It is trained on the FLEURS dataset and supports 102 languages. It is less memory-intensive and runs easily on the free version of Google Colab.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

You can also use the model below, trained on the MMS-lab dataset, which supports 1,107 languages.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'

If you have access to a powerful machine (or the paid version of Colab), you should use the MMS-1B-ALL model, which is trained on all the datasets (MMS-lab + FLEURS + CV + VP + MLS) for more accurate conversion of speech to text. It supports 1,162 languages.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'
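To keep the three checkpoint choices straight, here is a small summary dict; the language counts and URLs are the ones given above, and the selection helper is just an illustration.

```python
# The three ASR checkpoints discussed above, with their language coverage.
ASR_MODELS = {
    "mms1b_fl102": {"languages": 102,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt"},
    "mms1b_l1107": {"languages": 1107,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt"},
    "mms1b_all":   {"languages": 1162,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt"},
}

# Pick the checkpoint with the widest language coverage.
best = max(ASR_MODELS, key=lambda m: ASR_MODELS[m]["languages"])
print(best)  # -> mms1b_all
```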
Download the audio file

The next step is to download the audio file that you want to convert to text. I have prepared a sample audio file and saved it in my GitHub repo. After downloading the audio file, we store it in a folder named audio_samples.

!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav
Run the model

Make sure to update the language in the following code. I am using eng as the audio is in English. Refer to the table at the end of this article to find the language code.

# Create temp folder
!mkdir -p /content/temp

# Run Speech to Text Model
import os
os.environ["TMPDIR"] = '/content/temp'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio_file_test.wav"
Input: /content/fairseq/audio_samples/audio_file_test.wav
Output: It's so lovely day today and what ever you've got to do would be so happy to b doing it with you
Complete Code : Speech to Text
# Clone fairseq repo
!git clone https://github.com/pytorch/fairseq

%cd "/content/fairseq"
!pip install --editable ./ 
!pip install tensorboardX

# Download MMS Model
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

# Download Audio File
!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav

# Create temp folder
!mkdir -p /content/temp

# Run Speech to Text Model
import os
os.environ["TMPDIR"] = '/content/temp'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio_file_test.wav"
Convert MP3 to WAV

If your audio file is in MP3 format, it is important to convert it to WAV format before using the model. Make sure the sample rate is set to 16 kHz.

!pip install pydub
!apt install ffmpeg
from pydub import AudioSegment

# convert mp3 to wav, resampled to 16 kHz mono as the model expects
sound = AudioSegment.from_file('./audio_samples/MP3_audio_file_test.mp3', format="mp3")
sound = sound.set_frame_rate(16000).set_channels(1)
sound.export('./audio_samples/MP3_audio_file_test.wav', format="wav")
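Before feeding a converted file to the model, it is worth verifying the header. A quick check with the standard-library wave module (this snippet writes a throwaway file first so it is self-contained; point wave.open at your own WAV instead):

```python
import wave

# Write one second of 16 kHz mono silence, then read the header back.
# The same read-back check works on any WAV you plan to transcribe.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)

with wave.open("check.wav", "rb") as w:
    rate, channels = w.getframerate(), w.getnchannels()

print(rate, channels)  # -> 16000 1
```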

Table : ISO Code and Language Name

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
