Stability AI unveils new FreeWilly language models trained using minimal — and highly synthetic — data

FreeWilly1 and FreeWilly2 were trained with 600,000 data points — just 10% of the size of the original Orca dataset. …



There’s a new large language model (LLM) in town — two of them, in fact — and 90s kids will immediately recognize their names: FreeWilly1 and FreeWilly2.

Unveiled on Friday by Stability AI, the company behind the Stable Diffusion image generation AI, the two new LLMs are based on versions of Meta’s LLaMA and LLaMA 2 open source models but were trained on an entirely new, smaller dataset that includes synthetic data. Stability AI was founded by former UK hedge funder Emad Mostaque, who has been accused of exaggerating his resume.

Both models excelled in intricate reasoning, linguistic subtleties, and answering complex questions related to specialized domains like law and mathematics.

Stability’s subsidiary CarperAI released the FreeWillys under a “non-commercial license,” meaning they cannot be used for moneymaking, enterprise or business purposes. The models are instead aimed at advancing research and promoting open access in the AI community.


Smaller whales, more environmentally friendly

The names of the models are a play on the “Orca” AI training methodology developed by researchers at Microsoft, which allows “smaller” models (those exposed to more limited data) to achieve the performance of large foundational models exposed to more massive datasets. (Not a reference to the IRL boat-sinking orcas.)

Specifically, FreeWilly1 and FreeWilly2 were trained with 600,000 data points, just 10% of the size of the original Orca dataset, using instructions from four datasets created by Enrico Shippole. That made them far less costly and far more environmentally friendly (using less energy and having a lower carbon footprint) than the original Orca model and most leading LLMs. The models still produced outstanding performance, comparable to and even exceeding ChatGPT running on GPT-3.5 in some cases.

Training on synthetic data shows promise

One issue that has come up as LLMs proliferate: What happens as more content is generated using them, and future updates to these models, along with entirely new models, are then trained on that AI-generated content?

An open-access paper described a process of “model collapse,” wherein LLMs trained on increasing amounts of AI-generated data performed more poorly than predecessors trained on human-generated data.

However, when training the FreeWillys, Stability AI used two other LLMs to generate 500,000 and 100,000 synthetic examples, respectively, and found that the FreeWillys still performed well, showing that synthetic data may be an answer to model collapse, and to avoiding the use of copyrighted or proprietary data.

Swimming into the future

Stability AI envisions these models setting new standards in the field of open access LLMs, empowering natural language understanding and enabling complex tasks.

“We are excited about the endless possibilities that these models will bring to the AI community and the new applications they will inspire,” said the Stability AI team. The team also expressed gratitude to the researchers, engineers, and collaborators whose dedication made this milestone possible.

Researchers and developers can access the weights for FreeWilly2 as-is, while FreeWilly1’s weights are released as deltas over the original model.
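
For readers unfamiliar with delta releases, the general idea is that the published file holds only the difference from the base model, so the usable weights have to be rebuilt by adding that difference back to the original. The sketch below illustrates that idea in plain Python with toy numbers. It assumes the deltas are simple element-wise differences; the function name `apply_delta` and the parameter names are hypothetical and are not taken from Stability AI's release, which may use a different file format and procedure.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Rebuild full weights as base + delta, parameter by parameter.

    Assumption: the delta stores element-wise differences from the base.
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise addition restores the fine-tuned value.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Hypothetical toy checkpoints, not real model weights.
base = {"layer.weight": [1.0, 2.0, 3.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5, 0.0], "layer.bias": [0.25]}

merged = apply_delta(base, delta)
print(merged)
```

Under this scheme the delta file is of little use without the base weights, so anyone reconstructing the model still needs separate access to the original.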
