Yelp: Bunsen lets us test products at scale

In a conversation at Transform 2020, Yelp head of data science Justin Norman detailed Bunsen, the company’s internal product experimentation platform. …



Yelp runs hundreds of experiments to ensure new features within its apps and website remain aligned with key business metrics. To launch, manage, and analyze the results of these experiments, the company’s employees use Bunsen, a proprietary platform developed just under two years ago.

During an interview on Wednesday at VentureBeat’s Transform summit, Justin Norman, head of data science at Yelp, explained that Bunsen was born out of necessity. Historically, Yelp engineers themselves were responsible for experimentation, which meant they had to write custom code to compare the performance of different versions of products. As the company’s portfolio grew over the years, this piecemeal approach became inefficient and expensive.

In 2018, Yelp launched an internal effort to unify the best of its engineers’ experimentation tools and processes into a single solution. This became Bunsen. “Yelp’s culture has always valued quickly learning through experience,” Norman said. “The widespread adoption of rigorous experimentation — especially through a statistical perspective — is in fact relatively recent.”

Today, nearly all data experimentation at Yelp, from products to AI and machine learning, occurs on the Bunsen platform, with more than 700 experiments running at any one time. Bunsen supports the deployment of experiments to large but segmented parts of Yelp’s customer population, and it enables the company’s data scientists to roll back these experiments if need be.
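
Yelp has not published how Bunsen segments users or rolls experiments back. As a rough illustration of the general idea only, the Python sketch below shows one common approach: hashing a user ID into a stable bucket so that a fixed share of users lands in an experiment, with a flag that turns the experiment off for everyone. The function names, the `rollout_pct` parameter, and the `enabled` flag are all hypothetical and are not taken from Bunsen.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> float:
    """Map a user to a stable pseudo-random number in [0, 1) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # The first 8 hex digits form a 32-bit integer; dividing by 2**32 gives [0, 1).
    return int(digest[:8], 16) / 2**32


def assign(user_id: str, experiment: str, rollout_pct: float, enabled: bool = True):
    """Return 'treatment', 'control', or None if the user is outside the experiment.

    rollout_pct is the share of all users exposed to the experiment (hypothetical
    parameter); enabled=False models a rollback that removes everyone at once.
    """
    if not enabled:
        return None
    b = bucket(user_id, experiment)
    if b >= rollout_pct:
        return None
    # Split the exposed segment evenly between the two arms.
    return "treatment" if b < rollout_pct / 2 else "control"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign(uid, "new-search-ranking", rollout_pct=0.2))
```

Because the assignment depends only on the user ID and the experiment name, a user sees the same variant on every visit, and flipping `enabled` to `False` removes the experience without touching stored state.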

“One of the best things about what Bunsen allows us to do is to scale at speed,” Norman said. “The transition to it is the result of a couple of major investments in internally-developed data and product tools as well as statistical infrastructure.”

Bunsen consists of a frontend cheekily dubbed Beaker, which product managers, data scientists, and engineers use to interact with the toolset. A “scorecard” tool facilitates the analysis of experimental run results, while the Bunsen Experiment Analysis Tool — BEAT — packages up all of the underlying statistical models. There’s also a logging system that’s used to track user behavior and to serve as a source of features for AI and machine learning models. (“Features” in this context refer to measurable properties of phenomena being observed.)
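
The statistical models inside BEAT are not public. For readers unfamiliar with what an experiment scorecard typically computes, the sketch below shows a standard two-proportion z-test comparing conversion rates between a control and a treatment group. It is a textbook method chosen for illustration, not a description of Yelp’s implementation, and the function name and sample numbers are made up.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of group A (control) and group B (treatment).

    Returns (z, p_value), where p_value is two-sided.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative numbers only: 100 of 1,000 control users convert vs. 130 of 1,000.
    z, p = two_proportion_z_test(100, 1000, 130, 1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```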

Yelp’s use of AI and machine learning runs the gamut from advertising to restaurant, salon, and hotel recommendations. The app’s Collections feature leverages a combination of machine learning, algorithmic sorting, and manual curation to put local hotspots at users’ fingertips. (Deep learning-powered image analysis automatically identifies the color, texture, and shape of objects in user-submitted photos, allowing Yelp to predict attributes like “good for kids” and “ambiance is classy.”) Yelp optimizes photos on businesses’ listings to serve up the most relevant image for browsing potential customers. And advertisers can opt to have an AI system recommend photos and review content to use in banner ads based on their “impactfulness” with users.

“Bunsen can be used as a deployment and testing tool — it can determine whether products and models have any negative impact on the growth of business metrics or if they actually meet the goals we set out to accomplish,” Norman said. “And the Bunsen logs themselves are a goldmine for feature exploration and development. Not only do Yelp employees get the scale of being able to deploy a model into a cohort of individuals depending on how they want to reach them, but also, during the development process, they have the ability to utilize the logging system and the interface tools to build a unified set of features over and over again as they iterate the model.”

Two years ago, Yelp tapped Bunsen to develop Popular Dishes, a feature that highlights the name, photos, and reviews of most-ordered restaurant menu items. The AI models powering Popular Dishes were trained on over 100 million photos and reviews, and they draw on Yelp-submitted restaurant menus and other signals to make inferences about the top entrées.

Norman says bringing together the different data points that feed into Popular Dishes — i.e., names, photos, and reviews — was a “significant challenge.” They didn’t live in a single database, so multiple Yelp teams had to collaborate to build the feature sets and contribute to testing and development cycles.

“That’s what’s nice about Bunsen — it’s a distributed platform that’s meant to be utilized by a variety of different roles,” Norman said. “Product managers, engineers that are not in the machine learning and AI space, machine learning practitioners, data scientists, analysts, and even folks in PR or our external communications teams are consuming information that either comes from Bunsen or working directly with Beaker to gather the information.”

More recently, Bunsen was instrumental in the launch of new features intended to address challenges brought on by the pandemic. In March, Yelp teamed up with GoFundMe to promote local business fundraisers. Two months later, the company added an information category called virtual service offerings that allows businesses to showcase the fact they’re providing things like virtual consultations, classes, tours, and performances. And in June, Yelp added tools to help reopening businesses indicate whether they’re taking steps like enforcing distancing and sanitizing spaces, employing a combination of human moderation and machine learning to update sections with information businesses have posted elsewhere.

Most of these features were well received, but not all. Yelp’s GoFundMe partnership drew criticism for its implementation, which enrolled an initial group of eligible businesses without notifying them with opt-out instructions. Donation campaigns were created without owners’ permission, and indeed without their knowledge, in what Yelp spokespeople characterized as an “oversight.”

Bunsen helped here, too, by providing a way for Yelp to quickly shut down the donation campaigns until the issues could be rectified. “The platform gives us the flexibility to determine if the functionality that we’re providing is perhaps not optimal or worst-case scenario harmful,” Norman said. “We have a rapid way of turning those experiences off and doing what we need to do to fix them on the backend.”

Bunsen allows Yelp’s C-suite and engineers to look at demographic and geographic thresholds that would be undesirable to cross because they’d cause negative outcomes for business customers or users. According to Norman, in most circumstances, the platform almost immediately shows the positive and negative effects of experiments on business metrics and user experiences.

“Bunsen users can see the effects at the end of experiment runs and in real time as experiments are running through the cohorting engine and the features are being served. Messaging back from the tool and alerting allows us to know if we violated any thresholds,” Norman explained. “In this way, Bunsen is both a visualization solution and operations solution that are put together to give decision-makers both tactical and strategic approaches.”
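
Norman did not describe how Bunsen’s thresholds or alerts are configured. The sketch below is one simple way such a guardrail check could work, offered purely as an assumption: each guardrail names a metric and the largest relative drop against control that is tolerated, and any metric that falls further produces an alert message. The `Guardrail` class, the `check_guardrails` function, and the metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_drop: float  # e.g. 0.05 tolerates a 5% drop versus control


def check_guardrails(control: dict, treatment: dict, guardrails: list) -> list:
    """Return one alert string per guardrail the treatment group has violated."""
    violations = []
    for g in guardrails:
        baseline = control[g.metric]
        if baseline == 0:
            continue  # a relative change is undefined without a baseline
        change = (treatment[g.metric] - baseline) / baseline
        if change < -g.max_relative_drop:
            violations.append(
                f"{g.metric}: {change:+.1%} vs. control "
                f"(limit -{g.max_relative_drop:.1%})"
            )
    return violations


if __name__ == "__main__":
    # Illustrative numbers only.
    control = {"bookings": 200.0, "searches": 1000.0}
    treatment = {"bookings": 180.0, "searches": 1005.0}
    rules = [Guardrail("bookings", 0.05), Guardrail("searches", 0.02)]
    for alert in check_guardrails(control, treatment, rules):
        print("ALERT:", alert)
```

Running a check like this on each refresh of the metrics, rather than only at the end of a run, corresponds to the real-time alerting Norman describes.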
