Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs
Spirit LM Expressive incorporates emotional cues into its speech generation and can detect and reflect anger, surprise, or joy. …
Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More
Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.
As such, it competes directly with OpenAI’s GPT-4o (also natively multimodal) and other multimodal models such as Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.
Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by offering a more expressive and natural-sounding speech generation, while learning tasks across modalities like automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.
Unfortunately for entrepreneurs and business leaders, the model is only currently available for non-commercial usage under Meta’s FAIR Noncommercial Research License, which e grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.
A new approach to text and speech
Traditional AI models for voice rely on automatic speech recognition to process spoken input before synthesizing it with a language model, which is then converted into speech using text-to-speech techniques.
While effective, this process often sacrifices the expressive qualities inherent to human speech, such as tone and emotion. Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations.
Meta has released two versions of Spirit LM:
• Spirit LM Base: Uses phonetic tokens to process and generate speech.
• Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more nuanced emotional states, such as excitement or sadness, and reflect those in the generated speech.
Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks like speech-to-text and text-to-speech, while maintaining the natural expressiveness of speech in its outputs.
Open-source noncommercial — only available for research
In line with Meta’s commitment to open science, the company has made Spirit LM fully open-source, providing researchers and developers with the model weights, code, and supporting documentation to build upon.
Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new methods for integrating speech and text in AI systems.
The release also includes a research paper detailing the model’s architecture and capabilities.
Mark Zuckerberg, Meta’s CEO, has been a strong advocate for open-source AI, stating in a recent open letter that AI has the potential to “increase human productivity, creativity, and quality of life” while accelerating advancements in areas like medical research and scientific discovery.
Applications and future potential
Meta Spirit LM is designed to learn new tasks across various modalities, such as:
• Automatic Speech Recognition (ASR): Converting spoken language into written text.
• Text-to-Speech (TTS): Generating spoken language from written text.
• Speech Classification: Identifying and categorizing speech based on its content or emotional tone.
The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation.
For instance, it can detect and reflect emotional states like anger, surprise, or joy in its output, making the interaction with AI more human-like and engaging.
This has significant implications for applications like virtual assistants, customer service bots, and other interactive AI systems where more nuanced and expressive communication is essential.
A broader effort
Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is releasing to the public. This includes an update to Meta’s Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used across disciplines like medical imaging and meteorology, and research on enhancing the efficiency of large language models.
Meta’s overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both powerful and accessible.
The FAIR team has been sharing its research for more than a decade, aiming to advance AI in a way that benefits not just the tech community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what AI can achieve in natural language processing.
What’s next for Spirit LM?
With the release of Meta Spirit LM, Meta is taking a significant step forward in the integration of speech and text in AI systems.
By offering a more natural and expressive approach to AI-generated speech, and making the model open-source, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications.
Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning, with the potential to power a new generation of more human-like AI interactions.