Researchers are making progress on running massive trillion-parameter AI models on consumer hardware by streaming expert weights from SSD instead of loading the entire model into RAM. This technique enabled running a 1-trillion-parameter model on a MacBook Pro with 96 GB of RAM, and even a 397B-parameter model on an iPhone at 0.6 tokens/second. The approach shows significant potential for democratizing access to state-of-the-art models through careful optimization.
Background
Mixture-of-Experts (MoE) models route each token through a small subset of specialized sub-networks ("experts"), but inference traditionally requires loading all experts into memory. Expert streaming instead loads only the weights of the experts selected for the current token from storage during inference.
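The idea above can be sketched in a few lines. This is an illustrative toy, not the article's implementation: expert matrices are assumed to live on disk as `.npy` files, and `np.load(..., mmap_mode="r")` memory-maps them so only the pages actually touched are read from SSD. The names `top_k_experts`, `stream_expert_forward`, and `expert_files` are hypothetical.

```python
import numpy as np

def top_k_experts(gate_logits, k=2):
    """Pick the k highest-scoring experts for a token (toy router)."""
    return np.argsort(gate_logits)[-k:]

def stream_expert_forward(x, gate_logits, expert_files, k=2):
    """Run only the selected experts, streaming their weights from storage.

    x            : input activation vector for one token
    gate_logits  : router scores, one per expert
    expert_files : paths to per-expert weight matrices saved as .npy
    """
    idx = top_k_experts(gate_logits, k)
    # Softmax over the selected experts' logits to get mixing weights.
    exp_scores = np.exp(gate_logits[idx] - gate_logits[idx].max())
    mix = exp_scores / exp_scores.sum()
    out = np.zeros_like(x)
    for w, i in zip(mix, idx):
        # mmap_mode="r" maps the file instead of reading it all into RAM;
        # unselected experts are never loaded at all.
        W = np.load(expert_files[i], mmap_mode="r")
        out += w * (x @ W)
    return out
```

A real system adds caching of hot experts and asynchronous prefetch, but the core saving is the same: per token, only k expert matrices are ever pulled off the SSD.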
- Source: Simon Willison
- Published: Mar 24, 2026 at 01:09 PM
- Score: 7.0 / 10