Complete Guide to Massively Multilingual Speech (MMS) Model


In this article we cover everything about Meta's latest multilingual speech model, from the basics of how it works to its step-by-step implementation in Python.

Meta, the company that owns Facebook, released a new AI model called Massively Multilingual Speech (MMS) that can convert text to speech and speech to text in over 1,100 languages. It is available for free. It will help not only academicians and researchers across the world, but also language preservationists and activists who document endangered languages to prevent their extinction.

MMS is trained on a large dataset of text and audio in over 1,100 languages. Another strength of the model is that the audio it generates sounds natural, close to human speech. It can also identify more than 4,000 spoken languages.

Meta - Massively Multilingual Speech Model

Possible Uses of MMS Model

Massively Multilingual Speech (MMS) can be used for a variety of purposes. Some of them are as follows :

  1. Creating Audiobooks

    MMS can be used to convert books and tutorials into audiobooks. This is useful for people who have difficulty reading.

  2. Preparing Documentation

    In the workplace, preparing documentation is one of the key tasks of an analyst/coder. Sometimes, we have videos and want to convert them into structured documents so that people don't have to go through lengthy videos to understand a specific topic. We can convert the videos to audio and then use this model to convert the audio into text.

  3. Analyzing audio

    Suppose you have a few speech audio files of a politician and you want to analyze them. You may be interested in knowing the topics they talk about most and identifying their main focus areas.

  4. Creating audio recordings of endangered languages

    MMS can be used to create audio recordings of endangered languages. This is important because endangered languages are in danger of being lost forever. By creating audio recordings of these languages, we can help preserve them for future generations.

  5. Providing closed captioning for videos and other audio content

    We can use this model to provide closed captioning for videos and other audio content. It benefits people who are deaf, hard of hearing, or have learning disabilities.
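Use case 3 above can be sketched in a few lines: once the audio has been transcribed (speech-to-text is covered later in this article), a simple word-frequency pass hints at the speaker's recurring topics. The transcript below is invented for illustration.

```python
from collections import Counter
import re

# Toy sketch for the "analyzing audio" use case: count content words in a
# transcript to spot recurring topics. The transcript text is made up.
transcript = ("the economy needs reform and the economy needs jobs "
              "jobs for the young and the economy of tomorrow")

STOPWORDS = {"the", "and", "of", "for", "needs"}
words = re.findall(r"[a-z']+", transcript.lower())
topics = Counter(w for w in words if w not in STOPWORDS)

print(topics.most_common(2))  # -> [('economy', 3), ('jobs', 2)]
```

In practice you would run this over full transcripts and use a proper topic model, but even raw frequencies surface a speaker's focus areas.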

How Does the MMS Model Work?

There was a challenge in collecting audio data for thousands of languages, as existing speech datasets covered only a limited number of languages. To address this, the Meta AI team made use of religious texts like the Bible, which have been translated into many languages and extensively studied in language translation research. These translations came with publicly available audio recordings of people reading the texts in different languages: approximately 32 hours of data per language on average.

Although the data mainly consists of male speakers and pertains to religious content, the researchers found that their models performed equally well for male and female voices; the error rates for both groups of speakers are almost the same. They also discovered that the model was not excessively biased towards producing religious language, possibly due to their use of a Connectionist Temporal Classification approach.
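Connectionist Temporal Classification (CTC) turns a sequence of per-frame predictions into a transcript by merging consecutive repeats and dropping a special blank symbol. A toy illustration of that greedy decoding step (using `_` as the blank):

```python
BLANK = "_"

def ctc_collapse(frames):
    """Greedy CTC decoding: merge consecutive repeated labels, drop blanks."""
    out, prev = [], None
    for f in frames:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return "".join(out)

print(ctc_collapse("hh_eell_lloo"))  # -> "hello"
```

Note how the blank between the two `l` runs lets CTC output a genuine double letter instead of collapsing it.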

Publicly available speech datasets already cover about 100 languages. The team trained an alignment model over these existing datasets, meaning they made sure each stretch of audio matched up with the corresponding text.

They also built on wav2vec 2.0, Meta's self-supervised method that reduces the amount of labeled data needed to train speech recognition models. Normally, 32 hours of data per language is not enough to train these models effectively. Using this self-supervised approach, however, they pre-trained models on a massive amount of speech data, around 500,000 hours in over 1,400 languages. This allowed them to create models that could recognize speech and identify languages in a multilingual setting.

Python Code : Text to Speech

In this section, we will cover how to convert text to speech with the MMS model. A Colab notebook is also available if you want to try the steps quickly.

Install library

You can install the ttsmms library using pip.

!pip install ttsmms
Download TTS model

It is important to find out the language code (ISO code) for the language whose text you want to convert to speech. You can refer to the table at the end of this article to look up the ISO code for a specific language. In the code below, I am using the hin ISO code for Hindi. Replace hin.tar.gz with eng.tar.gz if your text is in English.

!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz
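All the TTS checkpoints follow the same URL pattern, so a small helper (an illustration, not part of ttsmms) can build the download link for any ISO code:

```python
# Build the TTS model download URL for a given ISO 639-3 code,
# following the pattern of the curl command above.
BASE_URL = "https://dl.fbaipublicfiles.com/mms/tts/{code}.tar.gz"

def model_url(code: str) -> str:
    return BASE_URL.format(code=code)

print(model_url("hin"))  # -> https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz
```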
Extract

Unzip (extract) the files from the .tar.gz archive and move them to the data folder. Make sure you update the language code.

!mkdir -p data && tar -xzf hin.tar.gz -C data/
Run MMS Model

We are running the MMS model in this step to convert text to speech. Don't forget to modify the language code at each step.

from ttsmms import TTS
tts=TTS("data/hin") 
wav=tts.synthesis("आप कैसे हैं?")
Play Audio

In this step, we ask Python to play the audio generated by the model.

# Display Audio 
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Complete Code : Text to Speech
# Install library
!pip install ttsmms

# Download TTS model
!curl https://dl.fbaipublicfiles.com/mms/tts/hin.tar.gz --output hin.tar.gz

# Extract 
!mkdir -p data && tar -xzf hin.tar.gz -C data/

from ttsmms import TTS
tts=TTS("data/hin") 
wav=tts.synthesis("आप कैसे हैं?")

# Display Audio 
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])
Download Audio File

You can refer to the Python program below to download the audio file from Google Colab. The audio file will be saved in WAV format as audio_file.wav.

# Download the audio file
from google.colab import files
from scipy.io import wavfile
import numpy as np

# Convert audio data to 16-bit signed integer format
audio_data = np.int16(wav["x"] * 32767)

# Save the audio data as a WAV file
wavfile.write('audio_file.wav', wav["sampling_rate"], audio_data)

# Download the audio file
files.download('audio_file.wav')

Similarly, you can convert English text to speech. See the code below. The only changes from the previous code are the language code and the input text.

!curl https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz --output eng.tar.gz
!mkdir -p data && tar -xzf eng.tar.gz -C data/
  
from ttsmms import TTS
tts=TTS("data/eng") 
wav=tts.synthesis("It's a lovely day today and whatever you've got to do I'd be so happy to be doing it with you")
from IPython.display import Audio
Audio(wav["x"], rate=wav["sampling_rate"])

Python Code : Speech to Text

To convert speech to text using the MMS model, follow the steps below. A Colab notebook is also available for testing the model quickly.

Clone Fairseq repository

Fairseq is a sequence modeling toolkit that lets us train custom models for translation, text summarization, language modeling, etc.

!git clone https://github.com/pytorch/fairseq
Install the required libraries
%cd "/content/fairseq"
!pip install --editable ./ 
!pip install tensorboardX
Download MMS Model

In the code below, we are using the MMS-FL102 model. It is trained on the FLEURS dataset and supports 102 languages. It is less memory-intensive and runs easily on the free version of Google Colab.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

You can also use the model below, trained on the MMS-lab dataset, which supports 1,107 languages.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt'

If you have access to a powerful machine (or the paid version of Colab), you should use the MMS-1B-ALL model, which is trained on all the datasets (MMS-lab + FLEURS + CV + VP + MLS) for more accurate conversion of speech to text. It supports 1,162 languages.

!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt'
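To keep the three checkpoint choices straight, here is a small summary dict; the language counts and URLs are the ones given above, and the selection helper is just an illustration.

```python
# The three ASR checkpoints discussed above, with their language coverage.
ASR_MODELS = {
    "mms1b_fl102": {"languages": 102,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt"},
    "mms1b_l1107": {"languages": 1107,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_l1107.pt"},
    "mms1b_all":   {"languages": 1162,
                    "url": "https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt"},
}

# Pick the checkpoint with the widest language coverage.
best = max(ASR_MODELS, key=lambda m: ASR_MODELS[m]["languages"])
print(best)  # -> mms1b_all
```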
Download the audio file

The next step is to download the audio file that you want to convert to text. I have prepared a sample audio file and saved it in my GitHub repo. After downloading the audio file, we store it in a folder named audio_samples.

!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav
Run the model

Make sure to update the language in the following code. I am using eng as the audio is in English. Refer to the table at the end of this article to find the language code.

# Create temp folder
!mkdir -p /content/temp

# Run Speech to Text Model
import os
os.environ["TMPDIR"] = '/content/temp'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio_file_test.wav"
Input: /content/fairseq/audio_samples/audio_file_test.wav
Output: It's so lovely day today and what ever you've got to do would be so happy to b doing it with you
Complete Code : Speech to Text
# Clone fairseq repo
!git clone https://github.com/pytorch/fairseq

%cd "/content/fairseq"
!pip install --editable ./ 
!pip install tensorboardX

# Download MMS Model
!wget -P ./models_new 'https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt'

# Download Audio File
!wget -P ./audio_samples/ https://github.com/deepanshu88/Datasets/raw/master/Audio/audio_file_test.wav

# Create temp folder
!mkdir -p /content/temp

# Run Speech to Text Model
import os
os.environ["TMPDIR"] = '/content/temp'
os.environ["PYTHONPATH"] = "."
os.environ["PREFIX"] = "INFER"
os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["USER"] = "micro"

!python examples/mms/asr/infer/mms_infer.py --model "/content/fairseq/models_new/mms1b_fl102.pt" --lang "eng" --audio "/content/fairseq/audio_samples/audio_file_test.wav"
Convert MP3 to WAV

If your audio file is in MP3 format, it is important to convert it to WAV format before using the model. Make sure the sample rate is set to 16 kHz.

!pip install pydub
!apt install ffmpeg
from pydub import AudioSegment

# convert mp3 to wav, resampled to 16 kHz mono as the model expects
sound = AudioSegment.from_file('./audio_samples/MP3_audio_file_test.mp3', format="mp3")
sound = sound.set_frame_rate(16000).set_channels(1)
sound.export('./audio_samples/MP3_audio_file_test.wav', format="wav")
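Before feeding a converted file to the model, it is worth verifying the header. A quick check with the standard-library wave module (this snippet writes a throwaway file first so it is self-contained; point wave.open at your own WAV instead):

```python
import wave

# Write one second of 16 kHz mono silence, then read the header back.
# The same read-back check works on any WAV you plan to transcribe.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)

with wave.open("check.wav", "rb") as w:
    rate, channels = w.getframerate(), w.getnchannels()

print(rate, channels)  # -> 16000 1
```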

Table : ISO Code and Language Name

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
