Microsoft released VibeVoice, an MIT-licensed, Whisper-style speech-to-text model with built-in speaker diarization. The author tested a 4-bit MLX conversion on a Mac, where it transcribed an hour of audio in under 9 minutes while using more than 30 GB of RAM. The model handles both WAV and MP3 files and outputs detailed JSON with speaker identification and timestamps.
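To put the timing in perspective, the figures above can be turned into a real-time speedup. This is only a back-of-the-envelope sketch: the function name `realtime_speedup` is made up for illustration, and the inputs are the rounded numbers from the summary (60 minutes of audio, 9 minutes as the upper bound on processing time), not a measured benchmark.

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Rounded figures from the summary: one hour of audio, under 9 minutes to process.
# Using the 9-minute upper bound gives a lower bound on the speedup.
speedup = realtime_speedup(audio_seconds=60 * 60, processing_seconds=9 * 60)
print(f"at least {speedup:.1f}x faster than real time")
```

Because the run finished in under 9 minutes, the true speedup on that machine was somewhat higher than this lower bound.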
Background
Microsoft has been developing speech recognition technologies. Whisper, OpenAI's openly released speech-to-text model, is a common reference point for audio transcription, which is why newer models are often described as "Whisper-style". MLX is Apple's machine learning framework, designed to run models efficiently on Apple Silicon hardware.
- Source: Simon Willison
- Published: Apr 28, 2026 at 07:46 AM
- Score: 6.0 / 10