In a study published on the preprint server arXiv.org, researchers at Microsoft, Peking University, and Nankai University say they’ve developed an approach — Taking Notes on the Fly (TNF) — that makes unsupervised language model pre-training more efficient by noting rare words to help models understand when (and where) they occur. They claim experimental results show TNF “significantly” bolsters pre-training of Google’s BERT while improving the model’s performance, resulting in a 60% decrease in training time.
One of the advantages of unsupervised pre-training is that it doesn’t require annotated data sets. Instead, models train on massive corpora from the web, which improves performance on various natural language tasks but tends to be computationally expensive. Training a BERT-based model on Wikipedia data requires more than five days using 16 Nvidia Tesla V100 graphics cards; even small models like ELECTRA take upwards of four days on a single card.
The researchers’ work aims to improve efficiency through better data utilization, taking advantage of the fact that a large proportion of words appear only a handful of times in training corpora (in around 20% of sentences, according to the team). The embeddings of those words — i.e., the numerical representations from which the models learn — are usually poorly optimized, and the researchers argue these rare words can slow down the training of other model parameters because they don’t carry enough semantic information for models to understand the sentences containing them.
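To make the heavy-tail point concrete, here is a small sketch (not from the paper) that measures what fraction of sentences contain at least one rare word; the corpus and the frequency threshold are illustrative choices, not values the authors used.

```python
from collections import Counter

def rare_word_sentence_fraction(sentences, threshold=10):
    """Return the fraction of sentences containing at least one rare word.

    A word counts as 'rare' here if it appears fewer than `threshold`
    times across the whole corpus; the threshold is an illustrative
    assumption, not a value taken from the TNF paper.
    """
    counts = Counter(word for s in sentences for word in s.split())
    rare = {w for w, c in counts.items() if c < threshold}
    hits = sum(1 for s in sentences if any(w in rare for w in s.split()))
    return hits / len(sentences)

# Toy corpus: only the third sentence contains words seen just once.
corpus = [
    "the cat sat near the dog",
    "the dog sat near the cat",
    "an axolotl regenerates limbs",
    "the cat sat near the cat",
]
print(rare_word_sentence_fraction(corpus, threshold=2))  # → 0.25
```

Even in this toy corpus, the singleton words cluster into a minority of sentences — the paper’s observation is that, at web scale, roughly a fifth of sentences contain such words, and their embeddings get too few gradient updates to train well.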
TNF was inspired by how humans grasp information. Note-taking is a useful skill that can help recall tidbits that’d otherwise be lost; if people take notes after encountering a rare word they don’t know, then the next time it appears, they can refer to the notes to better understand the sentence. Similarly, TNF maintains a note dictionary and saves a rare word’s context information whenever that word occurs. If the same rare word occurs again in training, TNF draws on the note to enrich the semantics of the current sentence.
The researchers say TNF introduces little computational overhead at pre-training because the note dictionary is updated on the fly. Moreover, they note it’s used only to improve training efficiency and doesn’t become part of the model itself: once pre-training is finished, the note dictionary is discarded.
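The note-taking mechanism described above can be sketched as follows. This is a minimal, illustrative reconstruction, not the authors’ implementation: the mean-pooling of context vectors, the moving-average update rule, and the mixing weights are all assumptions made for the sketch.

```python
import numpy as np

class NoteDictionary:
    """Illustrative sketch of TNF-style note-taking.

    Maps each rare word to a running 'note' vector that summarizes the
    contexts in which the word has appeared so far. The update rule
    (exponential moving average of mean-pooled context embeddings) is
    an assumption for this sketch, not taken from the paper.
    """

    def __init__(self, dim, decay=0.9):
        self.dim = dim
        self.decay = decay   # weight kept for the old note on each update
        self.notes = {}      # rare word -> note vector

    def update(self, word, context_embeddings):
        # Summarize the surrounding-word embeddings by mean pooling,
        # then blend the new context into any existing note on the fly.
        note = np.mean(context_embeddings, axis=0)
        if word in self.notes:
            self.notes[word] = self.decay * self.notes[word] + (1 - self.decay) * note
        else:
            self.notes[word] = note

    def enhance(self, word, word_embedding, weight=0.5):
        # When the rare word reappears, mix its note into the input
        # embedding; words without notes pass through unchanged.
        if word in self.notes:
            return (1 - weight) * word_embedding + weight * self.notes[word]
        return word_embedding

# Usage: first sighting takes a note; the next sighting is enhanced by it.
dim = 4
notes = NoteDictionary(dim)
rng = np.random.default_rng(0)
notes.update("axolotl", rng.standard_normal((3, dim)))
enhanced = notes.enhance("axolotl", rng.standard_normal(dim))
# After pre-training, the dictionary is simply discarded; it is never
# part of the model served downstream.
```

The design mirrors the article’s two claims: updates are cheap dictionary writes done during the normal training pass, and nothing here touches the model’s parameters, so throwing the dictionary away at the end costs nothing.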
To evaluate TNF’s efficacy, the coauthors concatenated a Wikipedia corpus and the open-source BookCorpus into a single 16GB data set, which they pre-processed, segmented, and normalized. They used it to pre-train several BERT-based models, which they then fine-tuned on the popular General Language Understanding Evaluation (GLUE) benchmark.
The researchers report that TNF accelerates the BERT-based models throughout the entire pre-training process. The average GLUE scores were higher than the baseline’s through most of pre-training, with one model reaching BERT’s performance within two days, while a TNF-free BERT model took nearly six days. And the BERT-based models with TNF outperformed the baseline model on the majority of GLUE’s sub-tasks (eight tasks in total) by “considerable margins.”
“TNF alleviates the heavy-tail word distribution problem by taking temporary notes for rare words during pre-training,” the coauthors wrote. “Through this way, when rare words appear again, we can leverage the cross-sentence signals saved in their notes to enhance semantics to help pre-training. If trained with the same number of updates, TNF outperforms original BERT pre-training by a large margin in downstream tasks.”