WhisperX provides fast automatic speech recognition (ASR) with word-level timestamps and speaker diarization:

- Batched inference for up to 70x realtime transcription using the Whisper large-v2 model.
- A faster-whisper backend requiring less than 8 GB of GPU memory.
- Accurate word-level timestamps via wav2vec2 forced alignment.
- Multispeaker ASR via speaker diarization from pyannote-audio.
- VAD preprocessing, which reduces hallucination and improves batching without degrading word error rate (WER).

Whisper is an ASR model developed by OpenAI, trained on a large and diverse audio dataset. While it produces highly accurate transcriptions, its native timestamps are at the utterance level and can be off by several seconds. WhisperX addresses these limitations by adding word-level alignment and batching support, built from several components: phoneme-based ASR models such as wav2vec 2.0, forced alignment for phone-level segmentation, voice activity detection (VAD) to find speech regions, and speaker diarization to segment audio by speaker identity.
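Within Galaxy this pipeline is driven by the tool form, but for orientation, here is a minimal sketch of the three-stage WhisperX Python API (batched transcription, wav2vec2 alignment, pyannote diarization). The file path and Hugging Face token are placeholders, and exact import locations have shifted between WhisperX releases, so treat this as illustrative rather than the wrapper's actual code.

```python
import whisperx
from whisperx.diarize import DiarizationPipeline, assign_word_speakers

device = "cuda"             # "cpu" also works, much more slowly
audio_file = "audio.mp3"    # placeholder input path
batch_size = 16             # reduce if GPU memory is tight
compute_type = "float16"    # "int8" lowers memory use at some accuracy cost

# 1. Batched transcription via the faster-whisper backend.
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Forced alignment with a wav2vec2 model to turn utterance-level
#    segments into word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Speaker diarization with pyannote-audio; attaches a speaker label
#    to each segment and word. Needs a Hugging Face token whose account
#    has accepted the gated pyannote model terms ("hf_..." is a placeholder).
diarize_model = DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg["start"], seg["end"], seg.get("speaker"), seg["text"])
```

Running alignment as a separate pass after transcription is what lets WhisperX batch Whisper inference aggressively while still recovering per-word timing.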
The tool can be installed from the Galaxy ToolShed by cloning its Mercurial repository:

```
hg clone https://toolshed.g2.bx.psu.edu/repos/bgruening/whisperx
```
Name | Description | Version | Minimum Galaxy Version
---|---|---|---
whisperx | Transcribe audio or video files to text using OpenAI Whisper and speaker diarization (WhisperX) | 3.4.2+galaxy1 | 25.0