Who needs GPT-4o Voice Mode? Hume’s EVI 2 is here with emotionally inflected voice AI and API




When we last reported on Hume, the AI startup co-founded and led by former Google DeepMind researcher and computational scientist Alan Cowen, it was the spring of 2024 and the firm had just raised $50 million in a Series B funding round for its unique approach to building a voice AI assistant.

Hume, named for the 18th-century Scottish philosopher David Hume, uses cross-cultural voice recordings of different speakers, matched with self-reported emotional surveys, to train its proprietary AI model, which delivers lifelike vocal expression and understanding across a range of languages and dialects.

Even back then, Hume was also one of the first AI model providers to offer an application programming interface (API) straight out of the gate, letting third-party developers and businesses connect existing apps to its model, build new ones atop it, or simply fold it into a feature such as answering customer service calls and retrieving contextual answers from an organization's database.

Now, in the intervening six months, Hume has been busy building an updated version of that voice AI model and API. The new Empathic Voice Interface 2 (EVI 2), announced last week, introduces a range of enhanced features aimed at improving naturalness, emotional responsiveness, and customizability. It also delivers 40% lower latency and is 30% cheaper than its predecessor through the API.


“We want developers to build this into any application, create the brand voice they want, and adjust it for their users so the voice feels trusted and personalized,” Cowen told VentureBeat in a video call last week.

In fact, Cowen told VentureBeat that he was seeing, and hoped to see, more businesses move beyond routing people out of their apps to a separate EVI-equipped voice assistant for tech and customer support issues.

Instead, he said that thanks to EVI 2's design, it is now possible, and in many cases a better experience for the end user, to embed an EVI 2-powered voice assistant directly within an app. Hooked up to the underlying customer application in the right way using Hume's developer tools, that assistant can fetch information or take actions on the user's behalf without ever connecting them to an outside phone number.

“Developers are starting to realize they don’t have to put the voice on a phone line; they can take it and put it anywhere within their app,” Cowen told VentureBeat.

For example, if I wanted to change the address on an online account, I could simply ask an integrated EVI 2 assistant to change it for me, rather than being directed through all the usual steps and screens.
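The article doesn't show Hume's actual tool-calling schema, but the pattern Cowen describes can be sketched generically: the voice assistant emits a structured tool call, and the host app performs the action itself, so the user never leaves the app or dials a phone line. Every name below (the `update_address` tool, the payload shape, the in-memory account store) is an illustrative assumption, not Hume's API.

```python
# Hypothetical sketch of in-app tool dispatch for a voice assistant.
# None of these names come from Hume's SDK; they only illustrate the
# "assistant emits a tool call, app executes it" pattern described above.
from dataclasses import dataclass


@dataclass
class Account:
    user_id: str
    address: str


# In-memory stand-in for the customer's own database.
ACCOUNTS = {"u123": Account(user_id="u123", address="1 Old Street")}


def handle_tool_call(call: dict) -> str:
    """Dispatch a structured tool call emitted by the voice assistant."""
    if call["name"] == "update_address":
        account = ACCOUNTS[call["arguments"]["user_id"]]
        account.address = call["arguments"]["new_address"]
        return f"Address updated to {account.address}"
    return "Unknown tool"


# A user says "change my address to 42 New Avenue"; the assistant
# (hypothetically) translates that utterance into this structured call:
result = handle_tool_call({
    "name": "update_address",
    "arguments": {"user_id": "u123", "new_address": "42 New Avenue"},
})
print(result)
```

The point of the design is that the app, not the model, holds the database credentials and executes the change; the model only produces a validated request.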

A well-timed launch

The timing of the EVI 2 launch is particularly beneficial for Hume. Though not nearly as well-publicized as OpenAI, or even potential rival Anthropic, which is reportedly working on a retooled version of its investor Amazon's Alexa voice assistant, Hume is ahead of both in shipping a capable, cutting-edge humanlike voice assistant that businesses can tap into right now.

By contrast, the impressive OpenAI ChatGPT Voice Mode powered by its GPT-4o model, shown off back in May, is still available only to a limited number of users on a waitlist basis. In addition, Cowen believes EVI 2 is actually superior at detecting and responding to users' emotions with emotionally inflected utterances of its own.

“EVI 2 is fully end-to-end. It just takes in audio signals and outputs audio signals, which is more like how [OpenAI’s] GPT for voice does it,” he told VentureBeat. That is, EVI 2 and GPT-4o both convert audio waveforms directly into tokens rather than first transcribing them as text and feeding that to a language model. The first EVI model used the latter approach, yet it was still impressively fast and responsive in VentureBeat’s independent demo usage.

For developers and businesses looking to stand apart with voice AI features, or to cut costs by using voice AI in place of, or alongside, human call centers, Hume’s EVI 2 may be a compelling option.

EVI 2’s conversational AI advancements

Cowen and Hume claim EVI 2 allows for faster, more fluent conversations, sub-second response times, and a variety of voice customizations.

They say EVI 2 is designed to anticipate and adapt to user preferences in real time, making it a fit for a wide range of applications, from customer service bots to virtual assistants.

Key improvements in EVI 2 include an advanced voice generation system that enhances the naturalness and clarity of speech, along with emotional intelligence that helps the model understand a user’s tone and adapt its responses accordingly.

EVI 2 also supports features like voice modulation, allowing developers to fine-tune a voice along parameters like pitch, nasality, and gender, making it versatile and customizable without the risks associated with voice cloning.

Here at VentureBeat, we’ve also seen and reported on a number of proprietary and open-source voice AI models. And around the web, people have posted examples of two or more voice AI models talking to each other, with strange, unsettling results such as tortured screaming.

Cowen seemed amused when I asked him about these examples but not overly concerned about them occurring with Hume.

“These are definitely issues that these models have. You have to kind of massage these things out of the model with the right data, and we’re very good at that,” he told me. “Maybe very occasionally, people are trying to game it, but it’s rare.”

In addition, Cowen said Hume has no plans to offer “voice cloning,” that is, replicating a speaker’s voice from a sample just a few seconds long so it can be made to speak any given text.

“We can clone voices with our model, of course, but we haven’t offered it because it’s such a high risk and the benefits are often unclear,” Cowen said. “What people really want is the ability to customize their voice. We’ve developed new voices where you can create different personalities, which seems to be even more exciting to developers than cloning specific voices.”

A whole new feature set

EVI 2 introduces several new features that set it apart from its predecessor:

Faster Response Times: EVI 2 boasts a 40% reduction in latency compared to EVI 1, with average response times now between 500 milliseconds and 800 milliseconds. This improvement enhances the fluidity of conversations, making them feel more natural and immediate.

Emotional Intelligence: By integrating both voice and language into a single model, EVI 2 can better understand the emotional context behind user inputs. This allows it to generate more appropriate and empathetic responses.

Customizable Voices: A new voice modulation method enables developers to adjust various voice parameters, such as gender and pitch, to create unique voices tailored to specific applications or users. This customization feature does not rely on voice cloning, offering a safer alternative for developers who want flexible yet secure voice options.

In-Conversation Prompts: EVI 2 allows users to modify the AI’s speaking style dynamically. For example, users can prompt it to speak faster or sound more excited during a conversation, enabling more engaging interactions.

Multilingual Capabilities: While EVI 2 currently supports English, Hume plans to roll out support for multiple languages, including Spanish, French, and German, by the end of 2024.
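The voice modulation and in-conversation prompting features above can be sketched as a small configuration layer. The parameter names and ranges here are illustrative assumptions, not Hume's documented schema; the point is that a base voice is adjusted along continuous dimensions rather than cloned from a real speaker.

```python
# Hypothetical sketch of a voice-modulation config. Parameter names
# and ranges are assumptions for illustration, not Hume's actual API.
def make_voice_config(pitch: float = 0.0, nasality: float = 0.0,
                      gender: float = 0.0, rate: float = 1.0) -> dict:
    """Clamp each modulation parameter to a plausible range and return
    a config dict a client might send when opening a voice session."""
    def clamp(x: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, x))

    return {
        "pitch": clamp(pitch, -1.0, 1.0),       # lower .. higher voice
        "nasality": clamp(nasality, -1.0, 1.0),
        "gender": clamp(gender, -1.0, 1.0),     # continuous, not binary
        "speaking_rate": clamp(rate, 0.5, 2.0), # "speak faster" prompt
    }


# A slightly higher-pitched, faster voice for an excited brand persona:
config = make_voice_config(pitch=0.3, rate=1.2)
print(config)
```

An in-conversation prompt like "sound more excited" would then map to updating such parameters mid-session rather than creating a new voice.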

Moreover, Cowen told VentureBeat that thanks to its training, EVI 2 actually learned several languages on its own, without being directly instructed or guided by its human creators.

“We didn’t specifically train the model to output certain languages, but it learned to speak French, Spanish, German, Polish, and more, just from the data,” Cowen explained.

Pricing and upgradability

One of the standout benefits of EVI 2 is its cost-effectiveness. Hume AI has reduced pricing for EVI 2 to $0.072 per minute, a 30% reduction compared to the legacy EVI 1 model, which was priced at $0.102 per minute.

Enterprise users also benefit from volume discounts, making the platform scalable for businesses with high-volume needs.

However, OpenAI’s current text-to-speech offerings available through its voice API, which are distinct from the new GPT-4o-powered ChatGPT Voice Mode, appear significantly cheaper than Hume’s EVI 2 by our calculations: OpenAI TTS costs $0.015 per 1,000 characters (approximately $0.015 per minute of speech) versus $0.072 per minute for EVI 2.
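The per-minute figures cited above make the comparison easy to run. The 10,000-minute volume is an arbitrary example, and the OpenAI figure assumes roughly 1,000 characters of text per spoken minute, as the article's approximation does.

```python
# Cost comparison using the per-minute prices cited in the article.
EVI2_PER_MIN = 0.072        # Hume EVI 2, $/minute
EVI1_PER_MIN = 0.102        # legacy EVI 1, $/minute
OPENAI_TTS_PER_MIN = 0.015  # OpenAI TTS, approx. $/minute of speech

minutes = 10_000  # example monthly call volume
print(f"EVI 2:      ${EVI2_PER_MIN * minutes:,.2f}")
print(f"EVI 1:      ${EVI1_PER_MIN * minutes:,.2f}")
print(f"OpenAI TTS: ${OPENAI_TTS_PER_MIN * minutes:,.2f}")

# Sanity check on the claimed 30% price cut (works out to ~29.4%):
discount = 1 - EVI2_PER_MIN / EVI1_PER_MIN
print(f"EVI 2 vs EVI 1 discount: {discount:.1%}")
```

At these rates, Hume's cut from EVI 1 is real (about 29.4%, which Hume rounds to 30%), but OpenAI's plain TTS remains roughly a fifth of EVI 2's per-minute price, with the caveat that TTS alone does not include EVI 2's speech understanding.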

EVI 2 is currently available in beta and is open for integration via Hume’s API.

Developers can use the same tools and configuration options that were available for EVI 1, allowing for a smooth migration.

Additionally, developers who wish to continue using EVI 1 have until December 2024, when Hume plans to sunset the older API.

EVI 2 represents a major step forward in Hume AI’s mission to optimize AI for human well-being. The model is designed to enhance user satisfaction by aligning its responses with the user’s emotional cues and preferences. Over the coming months, Hume will continue to make improvements to the model, including expanding its language support and fine-tuning its ability to follow complex instructions.

According to Hume AI, EVI 2 is also being positioned to work seamlessly with other large language models (LLMs) and integrate with tools like web search, ensuring developers have access to a full suite of capabilities for their applications.

Expression measurement API and custom models API

In addition to EVI 2, Hume AI continues to offer its Expression Measurement API and Custom Models API, which provide additional layers of functionality for developers looking to build emotionally responsive AI applications.

Expression Measurement API: This API allows developers to measure speech prosody, facial expressions, vocal bursts, and emotional language. Pricing for this API starts at $0.0276 per minute for video with audio, with enterprise customers benefiting from volume discounts.

Custom Models API: For those needing to train and deploy custom AI models, Hume offers free model training, with inference costs matching those of the Expression Measurement API.

What’s next for Hume and EVI 2?

Hume AI plans to make further improvements to EVI 2 over the next few months, including enhanced support for additional languages, more natural voice outputs, and improved reliability.

The company says it wants to ensure developers have the tools they need to build applications that are both highly functional and empathetically responsive.