Nvidia Joins Meta and Google in the Speech AI Race

By Apac CIOOutlook | Monday, November 07, 2022
Nvidia has announced a new speech artificial intelligence (AI) ecosystem, developed in partnership with Mozilla Common Voice.
FREMONT, CA: At the Speech AI Summit, Nvidia (NASDAQ: NVDA) unveiled its new speech artificial intelligence (AI) ecosystem, created in collaboration with Mozilla Common Voice. The ecosystem focuses on crowdsourcing multilingual voice corpora and releasing open-source pretrained models. Together, Nvidia and Mozilla Common Voice aim to accelerate the development of automatic speech recognition models that work for speakers of every language.
Nvidia found that popular voice assistants such as Amazon Alexa and Google Home support less than one per cent of the world's spoken languages. To address this, the company wants to improve linguistic inclusion in speech AI and make speech data more accessible for low-resource languages.
Nvidia has entered a race already joined by Meta (NASDAQ: META) and Google (NASDAQ: GOOGL): both companies have unveiled speech AI models designed to facilitate communication between speakers of different languages. Google's Translation Hub, an AI-powered translation service, can translate large volumes of documents into numerous languages. Google is also developing a universal speech translator trained on more than 400 languages, which the company says is the largest language coverage seen in a speech model to date.
Meta AI's universal speech translator (UST) project is building AI systems that enable real-time speech-to-speech translation across all languages, including those that are spoken but not commonly written.
An Ecosystem for Global Language Users
Nvidia asserts that linguistic inclusion in speech AI improves data health in several ways, for example by helping AI models understand speaker variation and noise characteristics. The new speech AI ecosystem helps develop, maintain, and improve speech AI models, datasets, and user interfaces for linguistic diversity, usability, and experience. Users can train their models on Mozilla Common Voice datasets; the resulting pretrained models are then offered as high-quality automatic speech recognition architectures, which other companies and individuals around the world can adapt and use to build their own speech AI applications.
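As a rough illustration of that workflow, the Python sketch below loads a pretrained ASR checkpoint and transcribes short audio clips. It assumes NVIDIA's NeMo toolkit is installed; the checkpoint name and file paths are illustrative, so consult the current NeMo/NGC catalog before relying on them.

```python
# Minimal sketch: transcribe short clips with a pretrained ASR model.
# Assumes the NVIDIA NeMo toolkit (pip install "nemo_toolkit[asr]").
# The checkpoint name below is illustrative -- check the NGC catalog.
import nemo.collections.asr as nemo_asr

# Download a pretrained English Conformer-CTC model.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_en_conformer_ctc_small"
)

# Transcribe local 16 kHz mono WAV files (placeholder paths).
clips = ["clip_0001.wav", "clip_0002.wav"]
transcripts = asr_model.transcribe(clips)
for path, text in zip(clips, transcripts):
    print(path, "->", text)
```

From here, such a model could be fine-tuned on a Common Voice subset for a target language or accent, which is the adaptation step the ecosystem is meant to enable.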
Underserved dialects, sociolects, pidgins, and accents are among the key factors shaping speech diversity. The companies aim to build a dataset ecosystem that helps communities create speech datasets and models for any language or context. The Mozilla Common Voice platform currently supports 100 languages, with 24,000 hours of speech data from 500,000 contributors. The latest release of the Common Voice dataset also adds six new languages, Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese, along with more speech data from female speakers.
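To see how these crowdsourced corpora are consumed downstream, here is a hedged sketch that pulls one of the newly added languages from the Common Voice dataset as mirrored on the Hugging Face Hub. The dataset identifier reflects the version 11 release and the "yue" (Cantonese) config; both may change in later releases, and accessing the dataset requires accepting its terms and authenticating with a Hugging Face token.

```python
# Sketch: load the Cantonese ("yue") subset of Common Voice v11 from
# the Hugging Face Hub. Requires accepting the dataset's terms and
# logging in with an HF token (huggingface-cli login).
from datasets import load_dataset

cv_yue = load_dataset(
    "mozilla-foundation/common_voice_11_0", "yue", split="train"
)
print(cv_yue)                            # row count and column names
sample = cv_yue[0]
print(sample["sentence"])                # the prompted sentence text
print(sample["audio"]["sampling_rate"])  # decoded audio metadata
```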
Users contribute by recording sentences as brief voice clips, which the Mozilla Common Voice platform aggregates into audio datasets; contributors can check the quality of their recordings before submitting them.
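The quality check itself happens on the platform, but a contributor could run a rough local sanity check along these lines before uploading. The thresholds below are illustrative assumptions, not Common Voice's actual validation rules.

```python
# Rough local sanity check on a recorded clip before donating it.
# Thresholds are illustrative assumptions, not Common Voice's rules.
import numpy as np
import soundfile as sf

audio, sr = sf.read("my_clip.wav")
duration = len(audio) / sr
peak = float(np.max(np.abs(audio)))

assert audio.ndim == 1, "expected a mono recording"
assert 1.0 <= duration <= 10.0, "clips should be short, single sentences"
print(f"{duration:.1f}s at {sr} Hz, peak amplitude {peak:.2f}")
if peak >= 0.99:
    print("warning: recording may be clipped/distorted; consider re-recording")
```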
According to Siddharth Sharma, head of product marketing, AI and deep learning at Nvidia, the speech AI ecosystem focuses not only on the diversity of languages but also on the accents and noise profiles of speakers around the world. This has been Nvidia's particular area of concentration, he said, and the company has developed a system that can be customized for every stage of the speech AI model pipeline.