Zyphra’s new Zyda-2 dataset lets enterprises train small LLMs with high accuracy



Zyphra Technologies, the company working on a multimodal agent system combining advanced research in next-gen SSM hybrid architectures, long-term memory and reinforcement learning, just released Zyda-2, an open pretraining dataset comprising 5 trillion tokens. 

The dataset is the successor to the original Zyda. It is five times larger and covers a wide range of topics and domains to ensure a high level of diversity and quality, which is critical for training robust and competitive language models.

But that is not the unique selling point of Zyda-2. There are many open datasets on Hugging Face for training cutting-edge AI models.

What makes this dataset unique is that it has been distilled to possess the strengths of the top existing datasets and eliminate their weaknesses.

This gives organizations a way to train language models that show high accuracy even when operating across edge and consumer devices on a given parameter budget.

The company trained its Zamba2 small language model on this dataset and found that it performed much better than versions trained on other state-of-the-art open-source language modeling datasets on Hugging Face.

What does Zyda-2 bring to the table?

Earlier this year, as part of the effort to build highly powerful small models that could automate a range of tasks cheaply, Zyphra went beyond model architecture research to start constructing a custom pretraining dataset by combining the best permissively licensed open datasets – often recognized as high-quality within the community.

The first release from this work, Zyda with 1.3 trillion tokens, debuted in June as a filtered and deduplicated mashup of existing premium open datasets, specifically RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o and arXiv.

At the time, Zyda performed better than the datasets it was built upon, giving enterprises a strong open option for training. But, 1.3 trillion tokens was never going to be enough. The company needed to scale and push the benchmark of performance, which led it to set up a new data processing pipeline and develop Zyda-2.

At its core, Zyda-2 builds on Zyda-1, which Zyphra further improved with open-source tokens from DCLM, FineWeb-Edu and the Common Crawl portion of Dolma v1.7. The original version of Zyda was created with the company’s own CPU-based processing pipeline, but for the latest version, it used Nvidia’s NeMo Curator, a GPU-accelerated data curation library. This helped the company reduce the total cost of ownership by 2x and process the data 10x faster, going from three weeks to two days.

“We performed cross-deduplication between all datasets. We believe this increases quality per token since it removes duplicated documents from the dataset. Following on from that, we performed model-based quality filtering on Zyda-1 and Dolma-CC using NeMo Curator’s quality classifier, keeping only the ‘high-quality’ subset of these datasets,” Zyphra wrote in a blog post.
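To make the two steps in that quote concrete, here is a minimal Python sketch of the general idea: drop documents that repeat across sources, then apply a quality filter only to selected sources. It is an illustration under stated assumptions, not Zyphra's pipeline and not NeMo Curator's API. The function names, the exact-match hashing and the toy scoring function are all hypothetical stand-ins; the real pipeline relies on NeMo Curator's GPU-accelerated tooling and a learned quality classifier.

```python
import hashlib
from typing import Callable, Dict, List, Set, Tuple

Doc = Tuple[str, str]  # (source name, document text)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    """Exact-match fingerprint of the normalized text (an assumption;
    production pipelines often use fuzzy matching instead)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cross_deduplicate(sources: Dict[str, List[str]]) -> List[Doc]:
    """Keep the first occurrence of each document across ALL sources."""
    seen: Set[str] = set()
    kept: List[Doc] = []
    for name, docs in sources.items():
        for doc in docs:
            fp = fingerprint(doc)
            if fp in seen:
                continue  # duplicate within or across sources
            seen.add(fp)
            kept.append((name, doc))
    return kept


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a learned classifier:
    the fraction of distinct words in the document."""
    words = normalize(text).split()
    return len(set(words)) / len(words) if words else 0.0


def quality_filter(
    docs: List[Doc],
    filtered_sources: Set[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[Doc]:
    """Apply the quality threshold only to the named sources;
    documents from other sources pass through untouched."""
    return [
        (name, doc)
        for name, doc in docs
        if name not in filtered_sources or score(doc) >= threshold
    ]


# Toy corpora; the source names mirror the article, the texts are made up.
sources = {
    "dclm": [
        "The mitochondria is the powerhouse of the cell.",
        "Gradient descent minimizes a loss function.",
    ],
    "zyda-1": [
        "gradient descent  minimizes a loss function.",  # cross-source duplicate
        "buy buy buy buy now now now",                   # repetitive, low score
        "A sonnet has fourteen lines.",
    ],
}

deduped = cross_deduplicate(sources)
final = quality_filter(deduped, {"zyda-1"}, toy_quality_score, threshold=0.5)
```

In this toy run, the Zyda-1 copy of the gradient descent sentence is dropped during cross-deduplication, and the repetitive document is removed by the filter, which touches only the source named in `filtered_sources`. That mirrors the quote's point that filtering was applied to specific components rather than to every dataset.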

The result is a complementary ensemble of datasets in the form of Zyda-2, leading to improved model performance. As Nvidia noted in a separate developer blog post, the new dataset combines the best elements of the additional datasets used in the pipeline, which contribute many high-quality educational samples for logical reasoning and factual knowledge. Meanwhile, the Zyda-1 component provides more diversity and variety and excels at more linguistic and writing-oriented tasks.

Distilled dataset leads to improved model performance

In an ablation study, training Zamba2-2.7B with Zyda-2 led to the highest aggregate evaluation score on leading benchmarks, including MMLU, HellaSwag, PIQA, WinoGrande, ARC-Easy and ARC-Challenge. This shows that model quality improves when training with the distilled dataset compared to training with the individual open datasets.

Zyda-2 performance

“While each component dataset has its own strengths and weaknesses, the combined Zyda-2 dataset can fill these gaps. The total training budget to obtain a given model quality is reduced compared to the naive combination of these datasets through the use of deduplication and aggressive filtering,” the Nvidia blog added.

Ultimately, the company hopes this work will pave the way for better quality small models, helping enterprises maximize quality and efficiency with specific memory and latency constraints, both for on-device and cloud deployments. 

Teams can already get started with the Zyda-2 dataset by downloading it directly from Hugging Face. It comes with an ODC-By license which enables users to train on or build off of Zyda-2 subject to the license agreements and terms of use of the original data sources.