Researchers are making progress on running massive trillion-parameter AI models on consumer hardware by streaming expert weights from SSD instead of loading the entire model into RAM. This technique enabled running a 1-trillion-parameter model on a MacBook Pro with 96 GB of RAM, and even a 397B-parameter model on an iPhone at 0.6 tokens/second. The approach shows significant potential for democratizing access to state-of-the-art models through careful optimization.
Background
Mixture-of-Experts (MoE) models route each token through a small subset of specialized sub-networks ("experts"), but inference traditionally requires loading all experts into memory. Expert streaming instead loads only the weights of the experts selected for the current token from storage during inference.
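The idea above can be sketched in a few lines. This is an illustrative toy, not the article's implementation: expert matrices are assumed to live on disk as `.npy` files, and `np.load(..., mmap_mode="r")` memory-maps them so only the pages actually touched are read from SSD. The names `top_k_experts`, `stream_expert_forward`, and `expert_files` are hypothetical.

```python
import numpy as np

def top_k_experts(gate_logits, k=2):
    """Pick the k highest-scoring experts for a token (toy router)."""
    return np.argsort(gate_logits)[-k:]

def stream_expert_forward(x, gate_logits, expert_files, k=2):
    """Run only the selected experts, streaming their weights from storage.

    x            : input activation vector for one token
    gate_logits  : router scores, one per expert
    expert_files : paths to per-expert weight matrices saved as .npy
    """
    idx = top_k_experts(gate_logits, k)
    # Softmax over the selected experts' logits to get mixing weights.
    exp_scores = np.exp(gate_logits[idx] - gate_logits[idx].max())
    mix = exp_scores / exp_scores.sum()
    out = np.zeros_like(x)
    for w, i in zip(mix, idx):
        # mmap_mode="r" maps the file instead of reading it all into RAM;
        # unselected experts are never loaded at all.
        W = np.load(expert_files[i], mmap_mode="r")
        out += w * (x @ W)
    return out
```

A real system adds caching of hot experts and asynchronous prefetch, but the core saving is the same: per token, only k expert matrices are ever pulled off the SSD.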
- Source: Simon Willison
- Published: Mar 24, 2026 at 01:09 PM
- Score: 7.0 / 10