.. _quality:

###########################################################
Optimizing Transcription Quality with Whisper
###########################################################

The transcription performance of Whisper models can be improved through a
combination of audio preprocessing, model optimizations, and post-processing
techniques. Published research indicates that certain methods can reduce the
Word Error Rate (WER) while maintaining processing speed.

.. contents::
   :local:

=======================================
Audio Preprocessing
=======================================

----------------------------------
Voice Activity Detection (VAD)
----------------------------------

Voice Activity Detection improves Whisper transcription by identifying the
segments that actually contain speech, which helps eliminate silent or noisy
sections that may otherwise cause hallucinations [1]_ [2]_ [3]_ [4]_ [5]_ [6]_.

**Reported Benefits:**

* **45% reduction in transcription errors**, according to several studies [4]_ [1]_.
* Elimination of spurious transcriptions in non-vocal segments [3]_ [7]_.
* Significant accuracy improvement on telephone recordings [4]_.
* Reduced computational load, since non-speech segments are never processed [8]_ [1]_.

Recommended VAD models include Silero-VAD and WebRTC VAD, with Silero-VAD
showing superior performance on complex data [2]_ [6]_, as in the sketch below.
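As a minimal sketch, the Silero-VAD filter built into ``faster-whisper`` (the
library itself is covered later in this document) can be enabled directly at
transcription time. The file name, model size, and silence threshold below are
illustrative placeholders, not values prescribed by the cited studies.

.. code-block:: python

   from faster_whisper import WhisperModel

   model = WhisperModel("large-v3", device="cuda", compute_type="float16")

   # vad_filter enables the built-in Silero-VAD pre-filter, which drops
   # non-speech regions before they reach the decoder.
   segments, _ = model.transcribe(
       "audio.wav",
       vad_filter=True,
       vad_parameters={"min_silence_duration_ms": 500},
   )
   for segment in segments:
       print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")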
----------------------------------
Neural Denoising
----------------------------------

Deep learning-based source separation tools, such as **Demucs** from Meta, can
be used to denoise audio before transcription. This approach uses multi-layer
convolutional neural networks to separate speech from background noise [9]_ [10]_.

**Documented Results:**

* Demucs, followed by a low-pass filter, **reduced the performance gap between genres by 25%** for Whisper [10]_.
* Particularly notable improvements on recordings with significant ambient noise [9]_.
* Superior performance compared to traditional signal-based denoising techniques [10]_.

------------------------------------------
Audio Normalization and Parameters
------------------------------------------

Whisper models are pre-trained on audio sampled at 16 kHz, so **optimizing
audio parameters** can yield benefits [11]_ [12]_ [13]_:

* **Recommended format**: mono MP3 at 16 kbps and 12-16 kHz [14]_.
* **Latency reduction**: up to 50% without loss of accuracy [14]_.
* **Audio level normalization**: improves transcription consistency [15]_.

================================
Model Fine-Tuning and Adaptation
================================

-----------------------------------------------
Fine-Tuning with LoRA (Low-Rank Adaptation)
-----------------------------------------------

Fine-tuning with LoRA is an effective technique for adapting Whisper to
specific domains [16]_ [17]_ [18]_ [19]_.

**Reported Performance:**

* **WER reduction from 68.49% to 26.26%** (a 61.7% relative improvement) on aeronautical data [16]_.
* Trains only **0.8% of the model's parameters** [17]_ [16]_.
* **38.49% WER improvement** on Vietnamese with Whisper-Tiny [18]_.
* Maintains generalization on data not seen during training [17]_.

**Identified Hyperparameters** (applied in the sketch below):

* **Learning rate**: 1e-3 for Large, 1e-5 for Turbo [16]_ [17]_.
* **LoRA alpha**: 256 for best performance [17]_.
* **LoRA rank**: 32 as a starting point [17]_.
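The following sketch applies the LoRA rank and alpha above using Hugging Face
``transformers`` and ``peft``, in the spirit of [19]_. The checkpoint name,
target modules, and dropout value are illustrative assumptions rather than
values taken from the cited studies.

.. code-block:: python

   from transformers import WhisperForConditionalGeneration
   from peft import LoraConfig, get_peft_model

   # Base checkpoint to adapt (illustrative choice).
   model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

   # LoRA configuration using the hyperparameters identified above.
   lora_config = LoraConfig(
       r=32,                                 # LoRA rank: 32 as a starting point
       lora_alpha=256,                       # LoRA alpha: 256 for best performance
       target_modules=["q_proj", "v_proj"],  # attention projections (a common choice)
       lora_dropout=0.05,
       bias="none",
   )
   model = get_peft_model(model, lora_config)

   # Reports the trainable-parameter fraction (on the order of 1%,
   # consistent with the figures cited above).
   model.print_trainable_parameters()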
----------------------------------
Transcription Normalization
----------------------------------

Text normalization schemes can improve evaluation metrics. OpenAI provides a
specialized normalizer that [16]_ [20]_:

* Standardizes case and removes punctuation [20]_.
* Handles regional spelling variations [20]_.
* Improves WER scores by an average of 1.78% [21]_.

=====================================
Model Optimization and Acceleration
=====================================

----------------------------------
CTranslate2 and faster-whisper
----------------------------------

The CTranslate2 implementation (``faster-whisper``) is a common method for
performance optimization [22]_ [23]_ [24]_.

**Measured Improvements:**

* **Speed**: up to **4x faster** than the original implementation [23]_ [22]_.
* **Memory**: VRAM usage reduced from 11.3 GB to 4.7 GB for Large-v2 [23]_.
* **Quantization**: further reduction to 3.1 GB with INT8 [23]_.
* **Accuracy maintained**: output quality is identical to the original implementation [22]_.

----------------------------------
Quantization
----------------------------------

Quantization techniques enable deployment on resource-constrained hardware [25]_ [26]_:

* **INT8 quantization**: 19% latency reduction, 45% size reduction [25]_.
* **Accuracy maintained**: 98.4% accuracy with INT4 [25]_.
* **Automatic optimization**: CTranslate2 handles quantization transparently [22]_.

========================================
Segmentation and Decoding Strategies
========================================

----------------------------------
Audio Segmentation
----------------------------------

The audio chunking strategy influences transcription quality [27]_ [28]_ [29]_.

**Recommended Approaches:**

* **VAD-based segmentation**: splitting at natural speech boundaries [30]_ [27]_.
* **Overlap**: 10-20% overlap between segments [31]_ [32]_.
* **Chunk size**: 1-second segments with attention-guided stopping [28]_.

----------------------------------
Decoding Parameter Optimization
----------------------------------

Decoding parameters have a significant impact on quality [33]_ [34]_ [35]_ [36]_.

**Identified Configuration:**

* **Beam size**: 5 provides a balance between quality and speed [34]_ [35]_ [33]_.
* **Temperature**: 0.0 to maximize consistency [35]_ [34]_.
* **Language setting**: explicitly specifying the language, rather than relying on auto-detection, can make processing up to 10x faster [37]_ [34]_.
* ``condition_on_previous_text``: set to ``False`` to prevent hallucinatory loops [38]_ [33]_.

========================
Hallucination Prevention
========================

-------------------------------------------
Detection and Prevention Techniques
-------------------------------------------

Hallucinations can be a challenge, especially in non-vocal segments [39]_ [5]_ [6]_.

**Proposed Solutions:**

* **Calm-Whisper**: selectively fine-tuning 3 attention heads reduces hallucinations by 80% [5]_.
* **Bag of Hallucinations (BoH)**: detects and suppresses recurring phrases [6]_.
* **Adaptive thresholds**: ``compression_ratio_threshold`` and ``log_prob_threshold`` [36]_.
* **Post-processing**: the Aho-Corasick algorithm for pattern detection [6]_.

----------------------------------
Anti-Hallucination Parameters
----------------------------------

Recommended configuration to minimize hallucinations [33]_ [36]_:

* ``no_speech_threshold``: adjust according to the desired sensitivity.
* ``compression_ratio_threshold``: 2.4 by default.
* ``log_prob_threshold``: -1.0 to filter uncertain transcriptions.

===========================================
Evaluation and Benchmarking Methodologies
===========================================

----------------------------------
Standardized Metrics
----------------------------------

Rigorous evaluation requires proper text normalization [21]_ [40]_:

* **Normalized WER**: use the OpenAI normalizer [20]_ [21]_.
* **Realistic datasets**: prefer "in-the-wild" data over academic corpora [40]_.
* **Multilingual consistency**: use language-specific normalization [40]_.

----------------------------------
Deployment Considerations
----------------------------------

Studies show that real-world performance can differ from academic benchmarks.
The FLEURS dataset, for example, may overestimate performance compared to
natural recordings [40]_.

================================
Summary of Methods
================================

An integrated strategy combines several complementary approaches:

1. **Preprocessing**: VAD + Demucs denoising + audio normalization.
2. **Model**: ``faster-whisper`` with INT8 quantization + LoRA fine-tuning for specific domains.
3. **Decoding**: ``beam_size=5``, ``temperature=0``, explicit language specification.
4. **Post-processing**: text normalization + hallucination detection.

This combined approach can reduce WER, as demonstrated in aeronautical and
multilingual case studies. The sketches below illustrate how the pieces fit
together.
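As a concluding sketch, the quantization, VAD, decoding, and anti-hallucination
settings discussed above can be combined in a single ``faster-whisper`` call.
The model size, device, input file, and ``no_speech_threshold`` value are
illustrative assumptions.

.. code-block:: python

   from faster_whisper import WhisperModel

   # INT8-quantized model via CTranslate2.
   model = WhisperModel("large-v3", device="cuda", compute_type="int8")

   segments, info = model.transcribe(
       "meeting.wav",
       language="en",                     # explicit language, skips auto-detection
       beam_size=5,                       # quality/speed balance
       temperature=0.0,                   # deterministic decoding
       condition_on_previous_text=False,  # prevents hallucinatory loops
       vad_filter=True,                   # Silero-VAD pre-filter
       compression_ratio_threshold=2.4,   # flags highly repetitive output
       log_prob_threshold=-1.0,           # filters low-confidence output
       no_speech_threshold=0.6,           # tune to the desired sensitivity
   )
   print("Detected language:", info.language)
   for segment in segments:
       print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")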
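To measure the effect of such a pipeline, the WER can be computed before and
after text normalization, as recommended in the evaluation section above. The
sketch below assumes the ``jiwer`` package and the ``EnglishTextNormalizer``
shipped with the ``openai-whisper`` package; the reference and hypothesis
strings are invented examples.

.. code-block:: python

   import jiwer
   from whisper.normalizers import EnglishTextNormalizer

   normalizer = EnglishTextNormalizer()

   reference = "The quick brown fox, Mr. Smith said, jumped over the lazy dog."
   hypothesis = "the quick brown fox mister smith said jumped over the lazy dog"

   # Raw WER penalizes differences in case, punctuation, and spelling variants.
   raw_wer = jiwer.wer(reference, hypothesis)

   # Normalized WER applies OpenAI's normalizer to both sides first.
   normalized_wer = jiwer.wer(normalizer(reference), normalizer(hypothesis))

   print(f"Raw WER:        {raw_wer:.2%}")
   print(f"Normalized WER: {normalized_wer:.2%}")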
.. rubric:: References

.. [1] https://www.osedea.com/insight/understanding-voice-activity-detection-how-vad-powers-real-time-voice-systems
.. [2] https://github.com/openai/whisper/discussions/2378
.. [3] https://www.f22labs.com/blogs/what-is-vad-and-diarization-with-whisper-models-a-complete-guide/
.. [4] https://docs.phonexia.com/products/speech-platform-4/3.2.0/technologies/speech-to-text/enhanced-speech-to-text-built-on-whisper/comparison
.. [5] https://arxiv.org/html/2505.12969v1
.. [6] https://arxiv.org/html/2501.11378v1
.. [7] https://aclanthology.org/2025.iwsds-1.26.pdf
.. [8] https://arxiv.org/html/2506.01365v1
.. [9] https://github.com/openai/whisper/discussions/2125
.. [10] https://arxiv.org/html/2410.16712v1
.. [11] https://learnopencv.com/fine-tuning-whisper-on-custom-dataset/
.. [12] https://amgadhasan.substack.com/p/whisper-how-to-create-robust-asr-46b
.. [13] https://github.com/openai/whisper/discussions/870
.. [14] https://dev.to/mxro/optimise-openai-whisper-api-audio-format-sampling-rate-and-quality-29fj
.. [15] https://myscale.com/blog/mastering-audio-transcription-with-whisper-ai-step-by-step-guide/
.. [16] https://arxiv.org/html/2506.21990v1
.. [17] https://arxiv.org/pdf/2503.22692.pdf
.. [18] https://trellisdata.com/research/Blog%20Post%20Title%20One-crc24-m7skl
.. [19] https://github.com/Vaibhavs10/fast-whisper-finetuning
.. [20] https://huggingface.co/learn/audio-course/chapter5/evaluation
.. [21] https://mlcommons.org/2025/09/whisper-inferencev5-1/
.. [22] https://github.com/SYSTRAN/faster-whisper
.. [23] https://nikolas.blog/making-openai-whisper-faster/
.. [24] https://ai.gopubby.com/whisper-gets-a-boost-introducing-fast-whisper-506f1901a8b2
.. [25] https://arxiv.org/html/2503.09905v1
.. [26] https://arxiv.org/pdf/2503.09905.pdf
.. [27] https://github.com/openai/whisper/discussions/1977
.. [28] https://arxiv.org/pdf/2406.10052.pdf
.. [29] https://community.groq.com/t/chunking-longer-audio-files-for-whisper-models-on-groq/162
.. [30] https://www.cerebrium.ai/articles/faster-whisper-transcription-how-to-maximize-performance-for-real-time-audio-to-text
.. [31] https://weaviate.io/blog/chunking-strategies-for-rag
.. [32] https://huggingface.co/openai/whisper-large-v2/discussions/67
.. [33] https://github.com/jhj0517/Whisper-WebUI/wiki/Whisper-Advanced-Parameters
.. [34] https://arxiv.org/html/2503.23542v1
.. [35] https.