E-Ink News Daily


Streaming experts

Researchers are achieving breakthroughs in running trillion-parameter AI models on consumer hardware by streaming expert weights from SSD rather than loading the entire model into RAM. The technique has enabled a 1T-parameter model to run on a MacBook Pro with 96 GB of RAM, and even a 397B-parameter model on an iPhone at 0.6 tokens/second. By trading storage bandwidth for memory capacity, the approach shows significant potential for democratizing access to state-of-the-art models.

Background

Mixture-of-Experts (MoE) models route each token through a small subset of specialized sub-networks ('experts'), but traditionally all experts must be resident in memory. Streaming experts instead loads only the weights needed for each token from storage during inference.
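A minimal sketch of the idea, under assumed details (the file layout, class names, and LRU cache here are illustrative, not the researchers' actual implementation): each expert's weights live in a separate file on disk, and only the experts the router selects for a token are loaded, with a small cache keeping recently used experts in RAM.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path

import numpy as np


class StreamingExperts:
    """Hypothetical MoE layer that streams expert weights from storage on demand."""

    def __init__(self, weight_dir, cache_size=2):
        self.weight_dir = Path(weight_dir)
        self.cache = OrderedDict()  # expert_id -> weight matrix, LRU order
        self.cache_size = cache_size

    def _load(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as recently used
        else:
            # Stream this expert's weights from disk instead of preloading all.
            self.cache[expert_id] = np.load(
                self.weight_dir / f"expert_{expert_id}.npy"
            )
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[expert_id]

    def forward(self, x, expert_ids, gates):
        # Gate-weighted sum over the selected experts' outputs (top-k routing).
        return sum(g * (self._load(e) @ x) for e, g in zip(expert_ids, gates))


# Demo: write four tiny "experts" to disk, then route one token through two.
tmp = tempfile.mkdtemp()
for i in range(4):
    np.save(Path(tmp) / f"expert_{i}.npy", np.full((3, 3), float(i + 1)))

moe = StreamingExperts(tmp, cache_size=2)
out = moe.forward(np.ones(3), expert_ids=[1, 3], gates=[0.6, 0.4])
print(len(moe.cache))  # only 2 experts resident in RAM, not all 4
```

In a real system the cache would hold gigabytes of expert weights and eviction policy would matter enormously; the point of the sketch is only that peak RAM scales with the number of *active* experts per token, not with total model size.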

Source
Simon Willison
Published
Mar 24, 2026 at 01:09 PM
Score
7.0 / 10