Cactus has open-sourced Needle, a highly efficient 26M-parameter model specialized in function calling that reaches 6000 tok/s prefill and 1200 tok/s decode on consumer devices. The model uses a novel architecture built from attention and gating mechanisms alone, with no MLPs, and was trained on 200B tokens plus 2B tokens of synthesized function-calling data. It outperforms larger models such as FunctionGemma-270M on single-shot function-calling tasks while being optimized for resource-constrained devices like phones and watches.
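The summary says the blocks consist only of attention and gating, with no MLPs, but does not spell out the design. A minimal numpy sketch of one plausible attention-plus-gating block, where an elementwise sigmoid gate stands in for the usual MLP (all weight names, shapes, and the gate placement are illustrative assumptions, not Needle's actual architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_block(x, Wq, Wk, Wv, Wg, Wo):
    """One MLP-free block: self-attention whose output is modulated by a
    learned sigmoid gate, then projected back to model width and added
    to the residual stream. Hypothetical layout, not Needle's."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    attn = scores @ v
    gate = sigmoid(x @ Wg)           # elementwise gate replaces the usual MLP
    return x + (gate * attn) @ Wo    # residual connection

# Toy usage: a sequence of 4 tokens at model width 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(5)]
y = gated_attention_block(x, *Ws)
print(y.shape)  # (4, 8)
```

Dropping MLPs removes the largest parameter blocks in a standard transformer, which is one way a 26M-parameter budget could be spent mostly on attention while keeping per-token compute low.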
Background
Function calling (tool use) is a key capability for AI agents, but most existing models are too large for consumer devices. There's growing interest in creating smaller, more efficient models that can run locally on phones and other edge devices.
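Concretely, function calling means the model is shown tool schemas and, instead of a prose answer, emits a structured call that the host application executes. A minimal sketch in the common JSON style; the tool name, schema layout, and output format here are assumptions for illustration, not Needle's actual prompt or response format:

```python
import json

# Hypothetical tool registry the model would be prompted with.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city",
        "parameters": {"city": {"type": "string"}},
        "fn": lambda city: {"city": city, "temp_c": 21},  # stubbed backend
    }
}

def dispatch(model_output: str):
    """Parse a single-shot function call emitted by the model and run it."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]
    return tool["fn"](**call["arguments"])

# Suppose the model, given the schema and a user question, emits:
model_output = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'
print(dispatch(model_output))  # {'city': 'Lagos', 'temp_c': 21}
```

Because the model only has to produce a short, well-formed call rather than long free-form text, this is exactly the workload where a small, fast on-device model can be competitive.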
- Source: Hacker News (RSS)
- Published: May 13, 2026 at 02:03 AM
- Score: 7.0 / 10