A new power tool for AI subtitling? Meta launches a speech recognition model that understands thousands of languages and makes fewer mistakes
Since the AI era began, many people have felt their work come under threat: creative jobs such as painting and screenwriting, and knowledge work such as translation. From ChatGPT's human-like text translation to Whisper's real-time speech transcription, AI has been moving into translation step by step. And today, Meta AI took another step forward.
AI brings new changes to "subtitles"
It is still too early for AI to replace professional translators, but for the volunteer fan-sub groups that run purely on love, any chance to "free up labor" is welcome, the sooner the better. Thanks to the progress of AI and the help of open source, "everyone is a subtitle group, everyone can translate foreign languages into Chinese" is no longer a fantasy.
Ripping the source video, extracting the audio track, transcribing with Whisper and generating the timeline automatically, calling the DeepL or ChatGPT API for translation, then re-encoding the video: every step now has mature tooling, and there are even tutorials on Bilibili walking you through the whole process.
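For the curious, here is a minimal sketch of that pipeline in Python. It assumes ffmpeg and the open-source openai-whisper package are installed; the translate() helper is a hypothetical placeholder for wherever a DeepL or ChatGPT API call would go.

```python
# A minimal sketch of the "one-person subtitle group" pipeline described above.
import subprocess
import whisper


def translate(text: str) -> str:
    """Hypothetical placeholder: call DeepL or ChatGPT here and return the translation."""
    return text  # no-op stand-in


def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm format used by .srt files."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# 1. Extract the audio track from the source video with ffmpeg
subprocess.run(
    ["ffmpeg", "-y", "-i", "episode.mkv", "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)

# 2. Transcribe with Whisper; each segment carries start/end timestamps
model = whisper.load_model("medium")
result = model.transcribe("audio.wav")

# 3. Translate each segment and write an .srt subtitle file
with open("episode.zh.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(translate(seg["text"].strip()) + "\n\n")
```

The resulting .srt file can then be burned into the video during re-encoding, which is essentially what the tutorial workflows above automate.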
If you spend much time on Bilibili, you may have noticed that "one-person subtitle group" and "Whisper machine translation" videos are no longer rare. Enthusiastic uploaders will also hand-polish the key passages on top of the machine translation. A one-hour foreign-language program that once took ages to "roast" (fan-sub slang for producing a subtitled video) has become instant food that one person and one computer can cook up: edible, not delicious, but filling.

This is also what many Bilibili viewers say besides "thanks, uploader": it's a machine translation, and a fairly "high-quality" AI machine translation at that, so what more could you ask for?
A first look at Meta's new speech recognition model
The headline feature of Meta AI's newly open-sourced large-scale multilingual speech model MMS (Massively Multilingual Speech) is that it supports speech-to-text and text-to-speech in more than 1,100 languages, with a lower error rate. According to the official announcement, compared with the well-known OpenAI Whisper, MMS covers 11 times as many languages at less than half Whisper's word error rate, which is impressive.
In spoken language identification, MMS also holds up well, recognizing more than 4,000 languages.
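If you want to try MMS speech-to-text yourself, here is a minimal sketch assuming the Hugging Face transformers integration (Wav2Vec2ForCTC) and the facebook/mms-1b-all checkpoint; the exact API, checkpoint names and language codes should be double-checked against the official documentation.

```python
# A minimal sketch of MMS speech-to-text via Hugging Face transformers (assumption:
# the facebook/mms-1b-all checkpoint and its per-language adapters).
import torch
import librosa
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS uses per-language adapters; switch to French ("fra") as an example
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

# MMS expects 16 kHz mono audio
speech, _ = librosa.load("audio.wav", sr=16_000, mono=True)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```

Swapping the adapter and the language code is, in principle, all it takes to move between the 1,100+ supported languages, which is exactly the kind of breadth discussed below.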
As you have probably noticed, one defining feature of MMS is sheer breadth: it raises the number of languages covered by speech AI by an order of magnitude. That breadth is tied to the project's original motivation. Technological development has historically favored the languages of the societies that built the technology, which is how English became the mainstream; back when our own technology lagged behind, even a language as long-established as Chinese suffered for it.
Lin Yutang and his daughter Lin Taiyi using the Mingkuai typewriter he invented. Its internal mechanism was a work of genius, but the machine was never mass-produced.
MMS is meant to ease this situation. There are more than 7,000 known languages on Earth, and nearly half of them are at risk of disappearing in the foreseeable future. Even as a profit-driven technology company, Meta says it wants people to be able to access information and use technology in the language they prefer, keeping those languages alive, which is also a form of "technology for good".
The way MMS came into being is also rather humanistic. To cover as many languages as possible, Meta AI's team turned to religious texts such as the Bible, which have been translated into a great many languages and for which audio recordings of readings exist in many of them. As part of the project, Meta AI built a dataset of New Testament readings in more than 1,100 languages, averaging 32 hours of audio per language. That approach, combined with the more efficient wav2vec 2.0 self-supervised model, is what made a multilingual speech model covering thousands of languages possible.
What MMS will bring to "subtitling"
MMS gives people a new choice for speech transcription beyond OpenAI Whisper. There are domestic transcription products on the market, such as iFLYTEK's Tingjian and NetEase's Jianwai, but the nature of volunteer subtitle groups dictates that "free" and "good with foreign languages" are their two biggest demands.
So, with help from the open source community, a number of simple Whisper-based transcription tools have appeared, such as Buzz, a small utility that runs Whisper locally on macOS, Windows and Linux. Netizens who tried it report that, once configured, Buzz can chew through 6 hours of audio on an RTX 4090 in under an hour, even with the large Whisper model: considerable productivity.
The MMS versus Whisper comparison above gives us all the more reason to believe that, in the fertile soil of the open source community, MMS will grow as quickly as Whisper did, perhaps even faster.
AI and the "subtitle groups"
In the past couple of days, a post stirred up discussion in the community by criticizing a curious phenomenon: a subtitle group that translates and polishes by hand gets complaints from viewers about "slow releases", while authors of AI-generated subtitles are appreciated and loved by users.
In the end, though, this is really a question of responsibility. I lived through the "subtitle group era", before officially licensed releases became common, when subtitle groups bloomed like a hundred flowers: "fast-food groups" that rushed releases out (their machine translation was real, pre-AI machine translation), and "quality groups" with polished effects whose releases only landed a week or two later. Viewers passed the word along and kept their own score: good translations are still collected and shared, while fast-food subtitles have been forgotten in the corners of the internet.
The old-school subtitle-timing tool Popsub
I too used to do timing late at night in my university dorm, just so domestic audiences could see the works I loved. I suffered through the pain of timing in Popsub back then; an automatic timing tool like Buzz would have taken a real chunk off my workload. And since I barely knew the foreign language, a "golden duo" like Whisper + ChatGPT would have made chasing my favorite shows much easier.
For humans, AI remains a tool and an assistant, and merely relying on a tool is not the same as using it well. From translating sentence by sentence to proofreading AI translations, from punching in timestamps line by line to letting AI detect them automatically, better tools have made the subtitle group's "work" ever easier, but they also test the proofreading and translation skill of the editors downstream, and the technical ability of the team.
In the AI era, using AI is not hard; using AI to produce high-quality work will be the subtitle group's biggest challenge.
And that applies not only to the small world of subtitle groups, but to many broader fields as well...