You can now run the most powerful open source AI models locally on Mac M4 computers, thanks to Exo Labs


When it comes to generative AI, Apple’s efforts have seemed largely concentrated on mobile — namely Apple Intelligence running on iOS 18, the latest operating system for the iPhone.

But as it turns out, the new Apple M4 computer chip, available in the new Mac Mini and MacBook Pro models announced at the end of October 2024, is excellent hardware for running the most powerful open source foundation large language models (LLMs) yet released, including Meta’s Llama-3.1 405B, Nvidia’s Nemotron 70B, and Qwen 2.5 Coder-32B.

In fact, Alex Cheema, co-founder of Exo Labs, a startup founded in March 2024 to (in his words) “democratize access to AI” through open source multi-device computing clusters, has already run such models on M4 hardware.

As he shared on the social network X recently, the Dubai-based Cheema connected four Mac Mini M4 devices (retail value of $599.00 each) plus a single MacBook Pro M4 Max (retail value of $1,599.00) with Exo’s open source software to run Alibaba’s software developer-optimized LLM Qwen 2.5 Coder-32B.

With a total retail cost of around $5,000, Cheema’s cluster is still significantly cheaper than even a single coveted Nvidia H100 GPU (retail of $25,000-$30,000).

The value of running AI on local compute clusters rather than the web

While many AI consumers are used to visiting websites such as OpenAI’s ChatGPT or using mobile apps that connect to the web, there are significant cost, privacy, security, and behavioral benefits to running AI models locally, without a web connection, on devices the user or enterprise controls and owns.

Cheema said Exo Labs is still working on building out its enterprise-grade software offerings, but he’s aware of several companies already using Exo software to run local compute clusters for AI inference, and he believes the practice will spread from individuals to enterprises in the coming years. For now, anyone with coding experience can get started by visiting Exo’s GitHub repository (repo) and downloading the software themselves.

“The way AI is done today involves training these very large models that require immense compute power,” Cheema explained to VentureBeat in a video call interview earlier today. “You have GPU clusters costing tens of billions of dollars, all connected in a single data center with high interconnects, running six-month-long training sessions. Training large AI models is highly centralized, limited to a few companies that can afford the scale of compute required. And even after the training, running these models effectively is another centralized process.”

By contrast, Exo hopes to allow “people to own their models and control what they’re doing. If models are only running on servers in massive data centers, you lose transparency and control over what’s happening.”

As an example, he noted that he fed his own direct and private messages into a local LLM so he could ask it questions about those conversations, without fear of them leaking onto the open web.

“Personally, I wanted to use AI on my own messages to do things like ask, ‘Do I have any urgent messages today?’ That’s not something I want to send to a service like GPT,” he noted.

Using M4’s speed and low power consumption to AI’s advantage

Exo’s recent success has been thanks to Apple’s M4 chip, which is available in regular, Pro and Max models and offers what Apple calls “the world’s fastest CPU core” and the best performance on single-threaded tasks (those operating on a single CPU core, whereas the M4 series has 10 or more).

Because the M4’s specs had been teased and leaked earlier, and a version of the chip was already offered in the iPad, Cheema was confident that the M4 would work well for his purposes.

“I already knew, ‘we’re going to be able to run these models,’” Cheema told VentureBeat.

Indeed, according to figures shared on X, Exo Labs’s Mac Mini M4 cluster operates Qwen 2.5 Coder 32B at 18 tokens per second and Nemotron-70B at 8 tokens per second. (Tokens are the numerical representations of letter, word and numeral strings — the AI’s native language.)
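To put those rates in practical terms, the short Python sketch below estimates how long a response would take to generate at each quoted speed. The 500-token response length is a hypothetical figure chosen for illustration, not one from Exo Labs, and the calculation assumes a steady generation rate and ignores the time spent processing the prompt.

```python
# Throughput figures quoted in the article, in tokens per second.
THROUGHPUT_TOK_PER_S = {
    "Qwen 2.5 Coder 32B": 18.0,
    "Nemotron-70B": 8.0,
}


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Return the time needed to emit num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_length = 500  # hypothetical response size, not from the article
    for model, rate in THROUGHPUT_TOK_PER_S.items():
        elapsed = seconds_to_generate(response_length, rate)
        print(f"{model}: about {elapsed:.1f} s for {response_length} tokens")
```

Under those assumptions, a response of that length would take roughly half a minute on the 32B model and about a minute on the 70B model.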

Exo also saw success using earlier Mac hardware, connecting two MacBook Pro M3 computers to run the Llama 3.1-405B model at more than 5 tokens per second.

This demonstration shows how AI training and inference workloads can be handled efficiently without relying on cloud infrastructure, making AI more accessible for privacy- and cost-conscious consumers and enterprises alike. For enterprises working in highly regulated industries, or even those simply conscious of cost, who still want to leverage the most powerful AI models, Exo Labs’s demos show a viable path forward.

For enterprises with high tolerance for experimentation, Exo is offering bespoke services including installing and shipping its software on Mac equipment. A full enterprise offering is expected in the next year.

The origins of Exo Labs: trying to speed up AI workloads without Nvidia GPUs

Cheema, a University of Oxford physics graduate who previously worked in distributed systems engineering for web3 and crypto companies, was motivated to launch Exo Labs in March 2024 after finding himself stymied by the slow progress of machine learning research on his own computer.

“Initially, it just started off as just a curiosity,” Cheema told VentureBeat. “I was doing some machine learning research and I wanted to speed up my research. It was taking a long time to run stuff on my old MacBook, so I was like, ‘okay, I have a few other devices laying around. Maybe old devices from a few friends here…is there any way I can use their devices?’ And instead of it taking a day to run this thing, ideally, it takes a few hours. So then, that kind of turned into this more general system that allows you to distribute any AI workload over multiple machines. Usually you would run basically something on just one device, but if you want to get the speed up, and deliver more tokens per second from your model, or you want to speed up your training run, then the only option you really have to do that is to go out to more devices.”

However, even once he gathered the requisite devices he had lying around and from friends, Cheema discovered another issue: bandwidth.

“The problem with that is now you have this communication between the devices which is really slow,” he explained to VentureBeat. “So there’s a lot of hard technical problems there that are very similar to the kind of distributed systems problems that I was working on in my past.”

As a result, he and his co-founder Mohamed “Mo” Baioumy developed a new software tool, Exo, that distributes AI workloads across multiple devices for those lacking Nvidia GPUs. They ultimately open sourced it on GitHub in July under a GNU General Public License, which permits commercial or paid usage as long as the user retains and makes available a copy of the source code.
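As a rough illustration of what distributing a model across several machines can involve, the Python sketch below assigns each device a contiguous block of model layers in proportion to its memory. This is one simple approach and is not presented as Exo's actual method, which the interview does not describe. The function name, the device names, the memory sizes and the 64-layer model are all hypothetical.

```python
from typing import Dict


def split_layers_by_memory(num_layers: int, memory_gb: Dict[str, float]) -> Dict[str, range]:
    """Give each device a contiguous block of layers sized in proportion to its memory.

    Illustrative only: a real system would also weigh compute speed and
    the bandwidth of the links between devices.
    """
    if num_layers <= 0 or not memory_gb:
        raise ValueError("need at least one layer and one device")
    total_memory = sum(memory_gb.values())
    devices = list(memory_gb.items())
    assignment: Dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for index, (name, memory) in enumerate(devices):
        cumulative += memory
        if index == len(devices) - 1:
            end = num_layers  # the last device takes whatever remains
        else:
            end = round(num_layers * cumulative / total_memory)
        assignment[name] = range(start, end)
        start = end
    return assignment


if __name__ == "__main__":
    # Hypothetical cluster: memory sizes are assumptions, not reported specs.
    cluster = {"mini-1": 16, "mini-2": 16, "mini-3": 16, "mini-4": 16, "macbook": 64}
    for device, layers in split_layers_by_memory(64, cluster).items():
        print(f"{device}: layers {layers.start} to {layers.stop - 1}")
```

In a split like this, each device holds only its own slice of the model and passes intermediate results to the next one, which is why the speed of the connections between devices matters so much.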

Since then, Exo has seen its popularity climb steadily on GitHub, and the company has raised an undisclosed amount in funding from private investors.

Benchmarks to guide the new wave of local AI innovators

To further support adoption, Exo Labs is preparing to launch a free benchmarking website next week.

The site will provide detailed comparisons of hardware setups, including single-device and multi-device configurations, allowing users to identify the best solutions for running LLMs based on their needs and budget.

Cheema emphasized the importance of real-world benchmarks, pointing out that theoretical estimates often misrepresent actual capabilities.

“Our goal is to provide clarity and encourage innovation by showcasing tested setups that anyone can replicate,” he added.