microsoft/VibeVoice

Simon Willison | Apr 28, 2026 at 07:46 AM | 6.0/10

Microsoft released VibeVoice, an MIT-licensed, Whisper-style speech-to-text model with built-in speaker diarization. Simon Willison tested the 4-bit MLX-converted version on a Mac, where it processed an hour of audio in under 9 minutes while using more than 30 GB of RAM. The model handled both WAV and MP3 files and produced detailed JSON output with speaker identification and timestamps.
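To make the reported numbers and the output shape concrete, here is a minimal Python sketch. It assumes a list of segments with `speaker`, `start`, `end`, and `text` fields. Those field names and the sample data are hypothetical illustrations, not VibeVoice's documented schema, and the helper functions are invented for this example.

```python
import json
from collections import defaultdict

# Hypothetical diarized transcript. The field names below are assumptions
# made for illustration; the model's real JSON schema may differ.
SAMPLE = """
[
  {"speaker": 0, "start": 0.0, "end": 4.2, "text": "Welcome back."},
  {"speaker": 1, "start": 4.2, "end": 9.0, "text": "Thanks for having me."},
  {"speaker": 0, "start": 9.0, "end": 12.5, "text": "Let's get started."}
]
"""


def speaking_time(segments):
    """Sum the duration in seconds of each speaker's segments."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


segments = json.loads(SAMPLE)
totals = speaking_time(segments)

# One hour of audio in 9 minutes is the lower bound implied by "under 9 minutes".
factor = realtime_factor(60 * 60, 9 * 60)
```

Under these assumptions, an hour of audio in under 9 minutes works out to a little better than 6.6 times real time.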

Background

Microsoft has been developing speech recognition technologies. Whisper is OpenAI's speech recognition model, and Whisper-style models are widely used for audio transcription. MLX is Apple's machine learning framework, designed to run models efficiently on Apple Silicon hardware.

Source: Simon Willison
Published: Apr 28, 2026 at 07:46 AM
Score: 6.0 / 10
