AI goes multilingual with Hugging Face’s BLOOM

Hugging Face’s Teven Le Scao says the BLOOM large language model “supports 46 human languages and 13 programming languages.” …

With all the excitement and innovations surrounding artificial intelligence (AI) in recent years, one key thing has often been left behind – support for multiple languages, beyond just English.

That is now set to change, thanks in part to the launch of BLOOM (an acronym for BigScience Large Open-science Open-access Multilingual Language Model). BLOOM got its start in 2021, with development led by machine learning startup Hugging Face, which raised $100 million in May.

The BigScience effort also benefits from a wide array of contributors, including Nvidia’s Megatron and Microsoft’s DeepSpeed teams, and receives support from CNRS, the French National Centre for Scientific Research. The BLOOM model was built and trained on the Jean Zay supercomputer, which is located in France.

BLOOM’s architecture is similar to that of OpenAI’s GPT-3 large language model, with one key difference: BLOOM is multilingual.

“GPT-3 is monolingual and BLOOM was designed from the start to be multilingual so it was trained on several languages, and also to incorporate a significant amount of programming language data,” Teven Le Scao, research engineer at Hugging Face, told VentureBeat. “BLOOM supports 46 human languages and 13 programming languages — so that’s a very sizable difference.”

How BLOOM was trained with open-source machine learning models

The BLOOM effort involved multiple components, including collecting a large dataset and then building the software used to train the model.

Le Scao explained that Hugging Face made use of Nvidia’s Megatron and Microsoft’s DeepSpeed open-source projects, which are both efforts designed to enable data scientists to train large language models. Both Megatron and DeepSpeed are based on the open-source PyTorch machine learning framework. For BLOOM, the researchers developed a fork of the Megatron and DeepSpeed projects that enabled the model to look at all the different languages.

In terms of BLOOM itself, the project was developed in the open and makes use of its own open license that is modeled on the Responsible AI license.

“We’re trying to define what open source means in the context of large AI models, because they don’t really work like software does,” Le Scao said.

He explained that the goal of the licensing for BLOOM was to make the model as open as possible, while still retaining a degree of control on the use cases that organizations have for the model.

How large language models fit into natural language processing

Large language models (LLMs) are a subset of the overall field of natural language processing (NLP).

Le Scao said that the language model is like an “atomic unit” for NLP, providing the building-block components on which complex AI interactions and applications can be built.

For example, he noted that it doesn’t make sense for an NLP model to learn how to summarize and how to speak a language at the same time. Le Scao said that a human doesn’t learn how to speak English and write a full research report at the same time. Typically, it makes sense for the human to learn how to speak the language first.
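To make the “atomic unit” idea concrete, here is a minimal sketch of how tasks such as summarization could be layered on top of a general language model through prompting. It is an illustration under stated assumptions, not how the BLOOM team builds applications: the helper names (make_summarizer, make_translator, load_bloom) and the prompt wording are hypothetical, and load_bloom assumes the Hugging Face transformers library is installed and that a small public BLOOM checkpoint named bigscience/bloom-560m is the one you want.

```python
from typing import Callable

# For this sketch, a "language model" is anything that maps a prompt
# to a text continuation. This type alias is an illustrative assumption.
LanguageModel = Callable[[str], str]


def make_summarizer(lm: LanguageModel) -> Callable[[str], str]:
    """Build a summarization task on top of a general language model."""
    def summarize(text: str) -> str:
        # Hypothetical prompt format: the model is asked to continue
        # the text after a "Summary:" cue.
        prompt = f"{text}\n\nSummary:"
        return lm(prompt).strip()
    return summarize


def make_translator(lm: LanguageModel, target_language: str) -> Callable[[str], str]:
    """Build a translation task on top of the same language model."""
    def translate(text: str) -> str:
        # Hypothetical prompt format, chosen only for illustration.
        prompt = f"Translate to {target_language}: {text}\nTranslation:"
        return lm(prompt).strip()
    return translate


def load_bloom(model_name: str = "bigscience/bloom-560m",
               max_new_tokens: int = 40) -> LanguageModel:
    """Wrap a BLOOM checkpoint as a LanguageModel.

    Assumption: the transformers library is installed and the checkpoint
    can be downloaded. The import is lazy so the task builders above can
    be used with any other model, or with a stand-in function.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)

    def lm(prompt: str) -> str:
        # return_full_text=False keeps only the newly generated continuation.
        outputs = generator(prompt, max_new_tokens=max_new_tokens,
                            return_full_text=False)
        return outputs[0]["generated_text"]

    return lm
```

In use, one might call load_bloom() once and pass the result to both make_summarizer and make_translator. The point of the design is that the language model is learned once, and each task is a thin layer on top of it rather than a separate model trained from scratch.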

Use cases for multilanguage models like BLOOM

To date, most AI language models have used either English or Chinese. BLOOM will now extend the use cases, notably for French, Spanish and Arabic speakers, where there has not been an open LLM available before.

In addition to providing a new foundation for multiple spoken human languages, BLOOM could enable a new era for code development as well.

The use of AI for code development is a relatively nascent space, with GitHub’s Copilot, which became generally available at the end of June, being among the early leaders. Le Scao expects that due to the diversity of programming languages that BLOOM understands, it will help to enable new applications for developers.

“BLOOM is going to be a strong platform for coding applications,” Le Scao said.

Now that BLOOM is ready for usage, Le Scao also expects that new and unexpected use cases will emerge.

“This is the fun part, because we’ve done all the hard work of getting BLOOM to run, and now everyone can run whatever crazy experiment they want from a powerful language model,” he said.
