E-Ink News Daily

microsoft/VibeVoice

Microsoft released VibeVoice, an MIT-licensed, Whisper-style speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion of the model on a Mac: it transcribed an hour of audio in under 9 minutes (more than six times faster than real time) while using over 30 GB of RAM. The model handled both WAV and MP3 input and produced detailed JSON output with speaker identification and timestamps.
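To illustrate what diarized JSON output of this kind makes possible, here is a minimal Python sketch that reads a list of transcript segments, prints them as a readable transcript, and totals talk time per speaker. The field names (`speaker`, `start`, `end`, `text`) and the sample data are assumptions made for illustration only; the summary does not describe the actual VibeVoice schema, so the keys would need adjusting to match real output.

```python
import json
from collections import defaultdict

# Hypothetical sample: the real VibeVoice JSON schema is not described in the
# source, so these keys (speaker, start, end, text) are illustrative only.
SAMPLE = """
[
  {"speaker": 0, "start": 0.0, "end": 4.2, "text": "Welcome to the show."},
  {"speaker": 1, "start": 4.2, "end": 9.7, "text": "Thanks for having me."},
  {"speaker": 0, "start": 9.7, "end": 12.0, "text": "Let's get started."}
]
"""


def talk_time_by_speaker(segments):
    """Sum the duration in seconds of every segment, keyed by speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def format_transcript(segments):
    """Render one line per segment: start time, speaker label, text."""
    return "\n".join(
        f"[{seg['start']:.1f}s] Speaker {seg['speaker']}: {seg['text']}"
        for seg in segments
    )


segments = json.loads(SAMPLE)
totals = talk_time_by_speaker(segments)
print(format_transcript(segments))
for speaker, seconds in sorted(totals.items()):
    print(f"Speaker {speaker}: {seconds:.1f}s of talk time")
```

With real output, the `SAMPLE` string would be replaced by the contents of the JSON file the model writes, and the key names changed to whatever that file actually uses.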

Background

Whisper is OpenAI's open speech recognition model, and "Whisper-style" here describes models that perform the same kind of general-purpose audio transcription. MLX is Apple's machine learning framework, designed to run models efficiently on Apple Silicon hardware.

Source: Simon Willison
Published: Apr 28, 2026 at 07:46 AM
Score: 6.0 / 10