Emerging technology fields need industrywide metrics to measure progress. When you’re a pun-loving chatbot startup called Pandorabots and you want to call for better metrics, you put on a flashy Bot Battle. The Bot Battle consisted of two virtual beings chatting 24 hours a day, seven days a week for two weeks (unlike humans, AIs never tire). Viewers were invited to vote on the better chatbot.
The first contestant, “Mark Zuckerb0rg,” is based on Facebook’s BlenderBot. He’s a terse figure who wears a “Make Facebook Great Again” hat and doesn’t shy away from intolerant opinions like “I don’t like feminists.” The Pandorabots chatbot Kuki is arguably more eloquent. But she’s a politician, often taking the conversation back to her comfort zone and delivering the same quips again and again. The winner? Kuki, with 79% of the votes and 40,000 views. But Pandorabots says the real aim of the Bot Battle is to spark an industrywide conversation about the need to agree on a chatbot evaluation framework.
“Holding everybody in the field accountable to a set of transparent rules that prevent people from announcing an unvetted breakthrough or that their AI is ‘basically alive’ will go a long way toward helping the public and other companies understand where we are in the journey of creating humanlike chatbots,” Pandorabots CEO Lauren Kunze told VentureBeat.
It has been a banner year for open domain conversational AI, meaning dialogue systems that are supposed to be able to talk about anything. Three multibillion-dollar organizations (Facebook, Google, and OpenAI) have made significant announcements around this technology in that time.
In addition, Facebook and Google have introduced their own evaluation frameworks, with each beating the other on its own metric. While agreed-upon metrics for a variety of discrete NLP benchmarks exist, complete with leaderboards and buy-in from major technology companies, Google and Facebook’s new competing metrics underscore the lack of agreed-upon measurements for open domain AI.
Google’s metric, “Sensibleness and Specificity Average,” asks human evaluators two questions for each chatbot response: “Does it make sense?” and “Is it specific?” Conveniently for Google, its own chatbot scores 79% on this metric, while other chatbots do not clear 56%.
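To make the arithmetic behind such a score concrete, here is a minimal Python sketch of how a two-question average could be computed from evaluator labels. It is an illustration under stated assumptions, not Google’s actual pipeline: the `Label` class, the `ssa` function, and the sample labels are hypothetical, and the sketch assumes a response judged not sensible is also counted as not specific.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """One evaluator judgment of one chatbot response (hypothetical structure)."""
    sensible: bool  # answer to "Does it make sense?"
    specific: bool  # answer to "Is it specific?"


def ssa(labels):
    """Average of the sensibleness rate and the specificity rate."""
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensibleness = sum(label.sensible for label in labels) / n
    # Assumption: a response only counts as specific if it is also sensible.
    specificity = sum(label.sensible and label.specific for label in labels) / n
    return (sensibleness + specificity) / 2


# Made-up labels for four responses, for illustration only.
labels = [
    Label(sensible=True, specific=True),
    Label(sensible=True, specific=False),
    Label(sensible=False, specific=False),
    Label(sensible=True, specific=True),
]
score = ssa(labels)  # sensibleness 3/4, specificity 2/4
```

Because both rates rest on human yes-or-no judgments, the score depends heavily on who the evaluators are and what instructions they receive.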
Facebook’s metric is called “ACUTE-Eval,” and it also asks two questions: “Who would you prefer to talk to for a long conversation?” and “Which speaker sounds more human?” Facebook found that 75% of human evaluators would rather have a long conversation with the Facebook chatbot than the Google chatbot and 67% described it as more human than the Google chatbot. However, Facebook didn’t have anybody actually use its chatbot — the company simply showed judges side-by-side transcripts of the chatbot versus other chatbots and asked them to pick the best one.
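Percentages like these come from tallying pairwise votes. The Python sketch below shows that tally under simple assumptions: each judge sees two transcripts and names one bot per question. The function name `preference_share` and the sample votes are hypothetical and are not taken from Facebook’s tooling.

```python
from collections import Counter


def preference_share(votes, bot):
    """Fraction of pairwise votes that went to `bot`."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    return counts[bot] / len(votes)


# Made-up votes from four judges answering one question,
# e.g. "Who would you prefer to talk to for a long conversation?"
votes = ["bot_a", "bot_a", "bot_b", "bot_a"]
share = preference_share(votes, "bot_a")
```

A tally like this only says which transcript judges preferred; it says nothing about how the bot would fare in a live conversation with those same judges.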
Pandorabots says it’s unfair for a company to crown itself the best open domain AI system based on a metric it made itself.
It’s also problematic that Facebook showed people transcripts of chatbot conversations rather than having people actually chat with BlenderBot, Juji CEO and chatbot entrepreneur Michelle Zhou told VentureBeat. She compared that to judging food based on how the chef described the food on the menu rather than actually trying the food yourself.
Neither Google nor Facebook responded to requests for comment on the critiques of their evaluation metrics.
Kunze and Zhou also spoke of a need to easily access their competitors’ chatbots via an API, citing safety concerns. Google hasn’t released its bot, and OpenAI has allowed very few to access its API.
And while Facebook open-sourced BlenderBot, which allowed Pandorabots to stand up a version of it against Kuki, cost prevented Pandorabots from accessing the most data-rich version of BlenderBot. Training deep learning models requires an astronomical amount of cloud compute power, and Pandorabots had to use the small version of Facebook’s BlenderBot because the large version would have cost $20,000. Google was able to train its chatbot on 2,048 TPU cores for 30 days.
While Pandorabots doesn’t open-source its underlying model, it does offer open API access, and it has a site where anybody can chat with Kuki. This has allowed Facebook and Google to compare their new bots to Kuki, but not the other way around.
“Without industrywide buy-in on an evaluation framework, proclamations about who has the best AI will remain hollow,” Kunze said.
The most iconic evaluation method is the Turing test, in which a human judge chats with a computer and tries to differentiate it from another human. But the Turing test is subjective and hard to replicate, which means it doesn’t hold up to the scientific method. In addition, experts have pointed out that very simple computer programs can deceptively pass the Turing test through clever verbal sleights-of-hand that exploit the human judge’s vanity.
More recent versions of the Turing test are the Loebner Prize and Amazon’s Alexa Prize. For the Loebner Prize, humans must differentiate between chatting with another human and chatting with a chatbot. For the Alexa Prize, humans talk with chatbots for up to 20 minutes and then rate the interaction. But the Alexa Prize is only offered to university students, and the Loebner Prize, which is facing an uncertain future, didn’t even happen in 2020.
“But even asking the user to provide a score at the end of an interaction is not without problems, as you don’t know what expectations the user had or what exactly the user is judging,” Heriot-Watt University professor Verena Rieser said. Rieser is also cofounder of Alana AI, which has competed in the Amazon Alexa challenge. “For example, during the Amazon Alexa challenge, our system got a low score every time the system mentioned Trump,” Rieser said.
Kunze believes that the ideal metric would have humans actually speak to the chatbots and would ask judges to rate the conversations based on many different metrics, such as engagingness, consistency of personality, context awareness, and emotional intelligence or empathy. Instead of asking people to rate the chatbots directly, researchers could observe the conversations. Another way to measure engagingness is based on total chat time, as more back-and-forth messages could mean the human was more engaged.
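As a rough illustration of the chat-time idea, the Python sketch below derives two simple engagement proxies from a conversation log: the number of human turns and the elapsed time. The log format, the `engagement` function, and the sample messages are all hypothetical; a real study would need to decide how to handle idle gaps and abandoned sessions.

```python
from datetime import datetime, timedelta


def engagement(messages):
    """Compute simple engagement proxies from (timestamp, speaker, text) tuples."""
    if not messages:
        return {"human_turns": 0, "duration_seconds": 0.0}
    human_turns = sum(1 for _, speaker, _ in messages if speaker == "human")
    # Elapsed time between the first and last message in the log.
    duration = (messages[-1][0] - messages[0][0]).total_seconds()
    return {"human_turns": human_turns, "duration_seconds": duration}


# Made-up conversation log, for illustration only.
start = datetime(2020, 1, 1, 12, 0, 0)
messages = [
    (start, "human", "Hi"),
    (start + timedelta(seconds=15), "bot", "Hello!"),
    (start + timedelta(seconds=60), "human", "How are you?"),
    (start + timedelta(seconds=90), "bot", "Great, thanks."),
]
stats = engagement(messages)
```

Longer chats are only a proxy: a person might keep typing because they are confused or frustrated, so numbers like these would need to be read alongside the human-rated qualities Kunze lists.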
Zhou said metrics should be human-centered because chatbots are meant to serve humans. She therefore advocates for metrics such as task effectiveness, level of demonstrated empathy, privacy intrusion, and trustworthiness.
Kunze, Zhou, and Rieser all agreed that current evaluation methods for conversational AI are antiquated and that coming up with appropriate evaluation metrics will take a lot of discussion.
So did the Bot Battle succeed in bringing the tech giants into the ring with Kuki? Kunze said that so far, one tech giant has agreed to talk, though she won’t say which one. Google and OpenAI ignored the invite, and Facebook also appears unwilling to formally engage.
“In our minds, Bot Battle will be a ‘win’ not if Kuki literally wins, but if tech giants and startups come together to create a new competition, open to anyone who wants to participate, with a set of mutually agreed-upon rules,” Kunze said. “Of course, we believe our AI is the best, but more importantly, we are asking for a fair fight.”