Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments

Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.

The work, described in a research paper titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training” that was quietly posted to arXiv.org this week, demonstrates how carefully combining different types of training data and model architectures can lead to state-of-the-art performance on a range of AI benchmarks.

“We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks,” the researchers explain. By training models on a diverse dataset spanning visual and linguistic information, the MM1 models were able to excel at tasks like image captioning, visual question answering, and natural language inference.

Scaling visual components is key

The researchers also found that the choice of image encoder and the resolution of input images had a major impact on model performance. “We show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance,” they said. This suggests that continued scaling and refinement of the visual components of these multimodal models will be key to unlocking further gains.

Surprisingly, the largest MM1 model, with 30 billion parameters, exhibited strong in-context learning abilities, allowing it to perform multi-step reasoning over multiple input images using few-shot “chain-of-thought” prompting. This points to the potential for large multimodal models to tackle complex, open-ended problems that require grounded language understanding and generation.

Apple’s billion-dollar AI bet

The MM1 research comes as Apple ramps up its investments in artificial intelligence in an effort to catch up with rivals such as Google, Microsoft, and Amazon, which have raced ahead in integrating generative AI capabilities into their products. The company is on track to spend $1 billion per year on AI development, according to a recent Bloomberg report.

Sources say Apple is working on a large language model framework called “Ajax” as well as a chatbot known internally as “Apple GPT.” The goal is to integrate these technologies into Siri, Messages, Apple Music and other apps and services. For example, AI could be used to auto-generate personalized playlists, assist developers in writing code, or engage in open-ended conversation and task completion.

“We view AI and machine learning as fundamental technologies, and they’re integral to virtually every product that we ship,” Apple CEO Tim Cook said during a recent earnings call. “I’m not going to get into details about what it is, because — as you know, we don’t — we really don’t do that. But you can bet that we’re investing, we’re investing quite a bit, we’re going to do it responsibly and it will — you will see product advancements over time that where the — those technologies are at the heart of them.”

The high stakes of the AI arms race

Apple has a history of being a fast follower rather than a first mover when it comes to major technology shifts. But with AI poised to transform every aspect of the digital landscape, the stakes are high for the iPhone maker to stay competitive. The MM1 research shows that Apple has the talent and resources to make cutting-edge advances. But it remains to be seen if the notoriously secretive company can move quickly enough to keep pace in the escalating AI arms race.

Many eyes will be on Apple’s Worldwide Developers Conference in June, where the company is expected to unveil new AI-powered features and developer tools. In the meantime, smaller AI advances like the Keyframer animation tool and performance enhancements coming out of Apple’s research labs show steady progress is being made behind the scenes.

As Cook recently hinted during a Q1 earnings call: “We’re excited to share details of our ongoing work in AI later this year.” That work, it is now clear, includes ambitious efforts to master multimodal intelligence at the largest scales. The age of pervasively helpful and human-like AI may arrive sooner than we think — and Apple intends to play a major part in shaping it.
