Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments
Apple researchers achieve state-of-the-art results in multimodal AI with MM1 models, combining text and images for breakthroughs in image captioning, visual question answering, and few-shot learning, as the company invests heavily in AI to enhance Siri, Messages, and future products. …
Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.
The work, described in a research paper titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training” that was quietly posted to arxiv.org this week, demonstrates how carefully combining different types of training data and model architectures can lead to state-of-the-art performance on a range of AI benchmarks.
“We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks,” the researchers explain. By training models on a diverse dataset spanning visual and linguistic information, the MM1 models were able to excel at tasks like image captioning, visual question answering, and natural language inference.
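To make the idea of a "careful mix" concrete, here is a minimal sketch of how a training pipeline might draw examples from the three data types in fixed proportions. The weights, source names, and the `sample_sources` helper are illustrative assumptions; the article does not state the proportions the MM1 team used.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen only for illustration.
# The article does not give the proportions used for MM1.
MIXTURE = {
    "image_caption": 0.4,
    "interleaved_image_text": 0.4,
    "text_only": 0.2,
}


def sample_sources(n, mixture=MIXTURE, seed=0):
    """Draw n data-source names in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Check that the empirical proportions track the requested weights.
counts = Counter(sample_sources(10_000))
for name in MIXTURE:
    print(f"{name}: {counts[name] / 10_000:.2f}")
```

Changing the weights in `MIXTURE` is the kind of knob such a study would vary: too little text-only data could weaken language ability, while too little image data could weaken visual grounding.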
Scaling visual components is key
The researchers also found that the choice of image encoder and the resolution of input images had a major impact on model performance. “We show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance,” they said. This suggests that continued scaling and refinement of the visual components of these multimodal models will be key to unlocking further gains.
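A rough illustration of why resolution and image token count are linked: assuming a ViT-style encoder that cuts a square image into non-overlapping square patches and emits one token per patch, the token count grows with the square of the resolution. The patch size of 14 pixels and the `image_token_count` helper below are assumptions made for this sketch, not details taken from the article, and a real connector may pool or drop tokens before they reach the language model.

```python
def image_token_count(resolution, patch_size=14):
    """Patch tokens a ViT-style encoder yields for a square image.

    Assumes non-overlapping square patches and no pooling afterwards.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2


for res in (224, 336, 448):
    print(res, image_token_count(res))
# 224 -> 256 tokens, 336 -> 576 tokens, 448 -> 1024 tokens
```

Doubling the side length from 224 to 448 pixels quadruples the number of image tokens, which is one reason higher resolution is more expensive to train and serve.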
Surprisingly, the largest MM1 model, at 30 billion parameters, exhibited strong in-context learning abilities, allowing it to perform multi-step reasoning over multiple input images using few-shot “chain-of-thought” prompting. This points to the potential for large multimodal models to tackle complex, open-ended problems that require grounded language understanding and generation.
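For readers unfamiliar with the technique, the sketch below shows one plausible way to lay out a few-shot chain-of-thought prompt over several images: worked examples that spell out their reasoning come first, followed by the new question with the reasoning left open for the model to complete. The `Example` class, the `build_prompt` helper, the file names, and the segment format are all hypothetical; the article does not describe MM1's actual input format.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One worked demonstration (hypothetical structure)."""
    image_paths: list
    question: str
    reasoning: str
    answer: str


def build_prompt(examples, query_images, query_question):
    """Return interleaved ("image", path) and ("text", str) segments."""
    segments = []
    for ex in examples:
        segments += [("image", path) for path in ex.image_paths]
        segments.append((
            "text",
            f"Q: {ex.question}\nReasoning: {ex.reasoning}\nA: {ex.answer}",
        ))
    # The query ends at "Reasoning:" so the model writes its own steps.
    segments += [("image", path) for path in query_images]
    segments.append(("text", f"Q: {query_question}\nReasoning:"))
    return segments


demo = Example(
    image_paths=["menu.jpg", "table.jpg"],
    question="How much do the drinks on the table cost in total?",
    reasoning="There are two drinks on the table. The menu lists each at 6. 2 x 6 = 12.",
    answer="12",
)
prompt = build_prompt([demo], ["shelf.jpg", "price_tag.jpg"], "What do the items cost in total?")
for kind, value in prompt:
    print(kind, "|", value)
```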
Apple’s billion-dollar AI bet
The MM1 research comes as Apple ramps up its investments in artificial intelligence in an effort to catch up with rivals like Google, Microsoft, and Amazon, which have raced ahead in integrating generative AI capabilities into their products. The company is on track to spend $1 billion per year on AI development, according to a recent Bloomberg report.
Sources say Apple is working on a large language model framework called “Ajax” as well as a chatbot known internally as “Apple GPT.” The goal is to integrate these technologies into Siri, Messages, Apple Music and other apps and services. For example, AI could be used to auto-generate personalized playlists, assist developers in writing code, or engage in open-ended conversation and task completion.
“We view AI and machine learning as fundamental technologies, and they’re integral to virtually every product that we ship,” Apple CEO Tim Cook said during a recent earnings call. “I’m not going to get into details about what it is, because — as you know, we don’t — we really don’t do that. But you can bet that we’re investing, we’re investing quite a bit, we’re going to do it responsibly and it will — you will see product advancements over time that where the — those technologies are at the heart of them.”
The high stakes of the AI arms race
Apple has a history of being a fast follower rather than a first mover when it comes to major technology shifts. But with AI poised to transform every aspect of the digital landscape, the stakes are high for the iPhone maker to stay competitive. The MM1 research shows that Apple has the talent and resources to make cutting-edge advances. It remains to be seen whether the notoriously secretive company can move quickly enough to keep pace in the escalating AI arms race.
Many eyes will be on Apple’s Worldwide Developers Conference in June, where the company is expected to unveil new AI-powered features and developer tools. In the meantime, smaller AI advances like the Keyframer animation tool and performance enhancements coming out of Apple’s research labs show steady progress is being made behind the scenes.
As Cook recently hinted during a Q1 earnings call: “We’re excited to share details of our ongoing work in AI later this year.” That work, it is now clear, includes ambitious efforts to master multimodal intelligence at the largest scales. The age of pervasively helpful and human-like AI may arrive sooner than we think — and Apple intends to play a major part in shaping it.