The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages. Our goal is to make it easier for people to access information and to use devices in their preferred language.
You can find details in the paper Scaling Speech Technology to 1000+ languages and the blog post.
An overview of the languages covered by MMS can be found here.
| Model | Link |
| --- | --- |
| MMS-300M | download |
| MMS-1B | download |
Example commands to finetune the pretrained models can be found here.
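As a quick sanity check after downloading a pretrained checkpoint, you can load it with fairseq's checkpoint utilities. The sketch below is illustrative, assuming fairseq is installed and the checkpoint was saved locally (the file name is a placeholder):

```
# Minimal sketch: load a downloaded MMS pretrained checkpoint with fairseq.
# The checkpoint path below is a placeholder for wherever you saved the download.
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/mms_300m.pt"]
)
model = models[0].eval()  # wav2vec 2.0-style encoder, ready for feature extraction
print(model.__class__.__name__)
```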
| Model | Languages | Dataset | Model | Supported languages |
| --- | --- | --- | --- | --- |
| MMS-1B:FL102 | 102 | FLEURS | download | download |
| MMS-1B:L1107 | 1107 | MMS-lab | download | download |
| MMS-1B-all | 1162 | MMS-lab + FLEURS + CV + VP + MLS | download | download |
Each TTS model archive contains three files: `G_100000.pth`, `config.json`, and `vocab.txt`. `G_100000.pth` is the generator trained for 100K updates, `config.json` is the training config, and `vocab.txt` is the vocabulary for the TTS model. Example download commands:
```
wget https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz # English (eng)
wget https://dl.fbaipublicfiles.com/mms/tts/azj-script_latin.tar.gz # North Azerbaijani (azj-script_latin)
```
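After downloading, the archive can be unpacked in place; a minimal sketch (the exact directory layout inside the archive is an assumption based on the file list above):

```
tar -xzf eng.tar.gz   # unpack the English TTS checkpoint
ls eng/               # expected contents (layout assumed): G_100000.pth  config.json  vocab.txt
```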
| # Languages | Dataset | Model | Dictionary | Supported languages |
| --- | --- | --- | --- | --- |
| 126 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
| 256 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
| 512 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
| 1024 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
| 2048 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
| 4017 | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download | download |
Run this command to transcribe one or more audio files:
```
cd /path/to/fairseq-py/
python examples/mms/asr/infer/mms_infer.py --model "/path/to/asr/model" --lang lang_code \
  --audio "/path/to/audio_1.wav" "/path/to/audio_2.wav"
```
For more advanced configuration, and to calculate CER/WER, you can prepare a manifest folder with the following format:
```
$ ls /path/to/manifest
dev.tsv dev.wrd dev.ltr dev.uid

$ cat dev.tsv
/
/path/to/audio_1	180000
/path/to/audio_2	200000

$ cat dev.ltr
t h i s | i s | o n e |
t h i s | i s | t w o |

$ cat dev.wrd
this is one
this is two

$ cat dev.uid
audio_1
audio_2
```
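The files above can be generated from (audio path, transcript) pairs. The helper below is a hypothetical sketch, not part of the repository, and assumes the `soundfile` package is available for reading sample counts:

```
# Hypothetical helper (not part of fairseq): write dev.tsv/.wrd/.ltr/.uid
# for a list of (audio_path, transcript) pairs.
import os
import soundfile as sf

def write_manifest(pairs, out_dir, split="dev"):
    os.makedirs(out_dir, exist_ok=True)
    with open(f"{out_dir}/{split}.tsv", "w") as tsv, \
         open(f"{out_dir}/{split}.wrd", "w") as wrd, \
         open(f"{out_dir}/{split}.ltr", "w") as ltr, \
         open(f"{out_dir}/{split}.uid", "w") as uid:
        tsv.write("/\n")  # first line of the .tsv is the audio root directory
        for path, text in pairs:
            frames = sf.info(path).frames           # number of samples in the file
            tsv.write(f"{path}\t{frames}\n")
            wrd.write(text + "\n")                  # word-level transcript
            # letter-level transcript: spaces between characters, '|' at word boundaries
            ltr.write(" ".join(text.replace(" ", "|")) + " |\n")
            uid.write(os.path.splitext(os.path.basename(path))[0] + "\n")

write_manifest(
    [("/path/to/audio_1.wav", "this is one"),
     ("/path/to/audio_2.wav", "this is two")],
    "/path/to/manifest",
)
```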
Then run the command below:
```
lang_code=
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m \
  --config-dir examples/mms/config/ --config-name infer_common \
  decoding.type=viterbi dataset.max_tokens=4000000 distributed_training.distributed_world_size=1 \
  "common_eval.path='/path/to/asr/model'" task.data='/path/to/manifest' \
  dataset.gen_subset="${lang_code}:dev" common_eval.post_process=letter
```
Available options:
- To get raw character-based output, change to `common_eval.post_process=none`.
- To maximize GPU efficiency or avoid out-of-memory (OOM) errors, tune the `dataset.max_tokens=???` size.
To run language model decoding, install the flashlight python bindings:
```
git clone --recursive git@github.com:flashlight/flashlight.git
cd flashlight
git checkout 035ead6efefb82b47c8c2e643603e87d38850076
cd bindings/python
python3 setup.py install
```
Train a KenLM language model and prepare a lexicon file in this format.
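As a rough illustration (the entries below are made up, not taken from the repository), a flashlight-style lexicon lists one word per line followed by its spelling in the model's output units, ending with the word-boundary symbol `|`:

```
this    t h i s |
is      i s |
one     o n e |
two     t w o |
```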
```
LANG=
```
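For reference, an LM-decoding invocation could look roughly like the sketch below, adapted from the Viterbi command earlier. `decoding.type=kenlm` and `decoding.lmpath` are assumptions about fairseq's flashlight decoder options, and the weight values are just starting points for the sweep described next:

```
PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m \
  --config-dir examples/mms/config/ --config-name infer_common \
  decoding.type=kenlm distributed_training.distributed_world_size=1 \
  "common_eval.path='/path/to/asr/model'" task.data='/path/to/manifest' \
  dataset.gen_subset="${lang_code}:dev" common_eval.post_process=letter \
  decoding.lexicon='/path/to/lexicon.txt' decoding.lmpath='/path/to/lm.bin' \
  decoding.lmweight=2 decoding.wordscore=-1
```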
We typically sweep `lmweight` in the range of 0 to 5 and `wordscore` in the range of -3 to 3. The output directory will contain the reference and hypothesis outputs from the decoder.

For decoding with character-based language models, use an empty lexicon file (`decoding.lexicon=`), set `decoding.unitlm=True`, and sweep over `decoding.silweight` instead of `wordscore`.
Note: clone and install VITS (https://github.com/jaywalnut310/vits) before running inference.
```
# English (eng)
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/eng \
    --wav ./example.wav --txt "Expanding the language coverage of speech technology \
    has the potential to improve access to information for many more people"

# Maithili (mai)
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/mai \
    --wav ./example.wav --txt "मुदा आइ धरि ई तकनीक सौ सं किछु बेसी भाषा तक सीमित छल जे सात हजार \
    सं बेसी ज्ञात भाषाक एकटा अंश अछी"
```
`example.wav` contains the synthesized audio for the language.
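To quickly inspect the output, a minimal sketch assuming the `soundfile` package is installed:

```
# Minimal sketch: inspect the synthesized waveform.
import soundfile as sf

audio, sample_rate = sf.read("example.wav")
print(f"{len(audio) / sample_rate:.2f}s of audio at {sample_rate} Hz")
```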
Prepare two files in this format:
```
/
/path/to/audio1.wav
/path/to/audio2.wav
/path/to/audio3.wav

eng 1
eng 1
eng 1
```
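A small hypothetical helper for producing both files is sketched below; the label-file name `manifest.lang` and the placeholder `eng 1` labels are assumptions mirroring the example above:

```
# Hypothetical helper (not part of fairseq): write the LID manifest and a
# matching label file for a list of audio paths.
def write_lid_manifest(audio_paths, manifest_path, label_path):
    with open(manifest_path, "w") as tsv:
        tsv.write("/\n")                 # root line, as in the example above
        for path in audio_paths:
            tsv.write(path + "\n")
    with open(label_path, "w") as lab:
        for _ in audio_paths:
            lab.write("eng 1\n")         # placeholder label per audio file

write_lid_manifest(
    ["/path/to/audio1.wav", "/path/to/audio2.wav", "/path/to/audio3.wav"],
    "/path/to/manifest.tsv",
    "/path/to/manifest.lang",
)
```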
Download the model and the corresponding dictionary file for the LID model, then use the following command to run inference:
```
$ PYTHONPATH='.' python3 examples/mms/lid/infer.py /path/to/dict/l126/ --path /path/to/models/mms1b_l126.pt \
    --task audio_classification --infer-manifest /path/to/manifest.tsv --output-path <OUTDIR>
```
The above command assumes there is a file named `dict.lang.txt` in `/path/to/dict/l126/`. `<OUTDIR>/predictions.txt` will contain the predictions from the model for the audio files in `manifest.tsv`.
Official repository: https://github.com/facebookresearch/fairseq/tree/main/examples/mms