E-Ink News Daily


Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally

A developer ran the 397B-parameter Qwen3.5 MoE model locally on a 48GB MacBook Pro M3 Max at 5.5+ tokens/second by applying techniques from Apple's "LLM in a Flash" paper. The approach streams expert weights from flash storage on demand and pairs that with a custom 2-bit quantization scheme, with the code generated via Claude-assisted autoresearch. The result demonstrates a practical method for running models far larger than available RAM on consumer hardware.
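The post's actual code isn't reproduced here, but a minimal sketch gives the flavor of a grouped 2-bit quantization scheme in Python. The group size, the symmetric four-level codebook, and the function names below are illustrative assumptions, not details from the source:

```python
import numpy as np

GROUP = 64  # hypothetical group size; the post doesn't specify one

def quantize_2bit(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pack a flat weight array into 2-bit codes, one fp16 scale per group.

    Values map onto the 4 symmetric levels {-1.5, -0.5, 0.5, 1.5} * scale
    and are packed four codes per byte.
    """
    assert weights.size % GROUP == 0, "pad weights to a multiple of GROUP"
    w = weights.reshape(-1, GROUP).astype(np.float32)
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5 + 1e-8
    # Nearest of the four levels, expressed as integer codes 0..3.
    codes = (np.clip(np.round(w / scale - 0.5), -2, 1) + 2).astype(np.uint8)
    q = codes.reshape(-1, 4)
    packed = q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)
    return packed, scale.astype(np.float16)

def dequantize_2bit(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Invert quantize_2bit: unpack the codes and rescale to float32."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    w = (codes.reshape(-1, GROUP).astype(np.float32) - 1.5) * scale.astype(np.float32)
    return w.reshape(-1)
```

At 2 bits per weight plus one fp16 scale per 64-value group, each expert shrinks to roughly an eighth of its fp16 size, which is what makes streaming a 397B-parameter model through 48GB of RAM plausible at all.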

Background

Running large language models locally on devices with limited memory is a significant challenge in AI deployment. Apple's "LLM in a Flash" research addresses this by keeping model weights in flash storage and loading only the parameters needed for each inference step into DRAM.
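The paper itself adds refinements like windowing and row-column bundling; the sketch below shows only the core streaming idea for an MoE model, assuming the expert weights have already been packed into a flat file. The file name, shapes, cache policy, and cache budget are all hypothetical:

```python
import numpy as np

# Hypothetical layout: E experts stored contiguously, each an fp16
# [ROWS, COLS] matrix. np.memmap reads lazily, so indexing one expert
# only pages that expert's bytes in from flash.
E, ROWS, COLS = 128, 4096, 1024
experts = np.memmap("experts.bin", dtype=np.float16,
                    mode="r", shape=(E, ROWS, COLS))

cache: dict[int, np.ndarray] = {}  # small DRAM cache of hot experts
CACHE_LIMIT = 8                    # hypothetical budget, not from the paper

def get_expert(idx: int) -> np.ndarray:
    """Return one expert's weights, touching flash only on a cache miss."""
    if idx not in cache:
        if len(cache) >= CACHE_LIMIT:
            cache.pop(next(iter(cache)))       # evict oldest (FIFO) entry
        cache[idx] = np.asarray(experts[idx])  # forces the actual read
    return cache[idx]

def moe_layer(x: np.ndarray, router_logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Route one token through its top-k experts, streaming weights on demand."""
    chosen = np.argsort(router_logits)[-top_k:]
    gates = np.exp(router_logits[chosen])
    gates = gates / gates.sum()
    return sum(g * (x @ get_expert(i).astype(np.float32))
               for g, i in zip(gates, chosen))
```

Because np.memmap reads are lazy and page-granular, only the experts the router actually selects are ever pulled from flash, so resident memory tracks the cache budget rather than the full model size.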

Source: Simon Willison
Published: Mar 19, 2026 at 07:56 AM
Score: 7.0 / 10