Whisper explained

Whisper is a popular open-source ASR system. Yet many users are unfamiliar with its inner workings, which can limit contributions and innovation. In this blog, we break down Whisper’s key components in simple terms to make it easier to understand and encourage contributions to open-source ASR projects.

For a tutorial on fine-tuning Whisper, check out our Diabolocom blog post: Fine-tuning an ASR: Focus on Whisper

Language models

While often associated with text, language models also encompass other modalities, such as speech. For instance, the Whisper model is an Automatic Speech Recognition (ASR) model designed to transcribe spoken audio into textual output. Modern language models generally fall into three broad categories:

  • Masked Language Models (MLMs):
    These models learn to predict missing tokens within a text by leveraging bidirectional context, as in the highly influential BERT model (Devlin et al., 2019). Diabolocom trained its own masked language model, EuroBERT (Boizard et al., 2025); for details, see the blog post EuroBERT: A Refreshed Encoder to Tackle Business Problems.

  • Causal Language Models (CLMs):
    Also known as autoregressive models, CLMs generate text one token at a time, conditioning each prediction on all preceding tokens. The GPT family from OpenAI is the most famous representative, optimized for fluent, left‑to‑right language generation.

  • Sequence‑to‑Sequence Models (Seq2Seq):
    Combining an encoder and a decoder, Seq2Seq architectures transform an input sequence into a target sequence. They are particularly well‑suited for tasks like machine translation, text summarization, and speech‑to‑text transcription.
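
To make the three families concrete, here is a small sketch using the Hugging Face transformers library (the checkpoints bert-base-uncased, gpt2, and t5-small are common public models chosen purely for illustration, not models discussed in this post): a masked LM fills in a blank, a causal LM continues a prompt, and a Seq2Seq model maps an input sequence to an output sequence.

```python
from transformers import pipeline

# MLM: bidirectional context, predict the masked token
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Speech recognition turns [MASK] into text.")[0]["token_str"])

# CLM: left-to-right generation, each token conditioned on the previous ones
generate = pipeline("text-generation", model="gpt2")
print(generate("Automatic speech recognition is", max_new_tokens=10)[0]["generated_text"])

# Seq2Seq: the encoder reads the full input, the decoder generates the target sequence
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("Speech recognition turns audio into text.")[0]["translation_text"])
```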

Whisper belongs to this last family. As a Seq2Seq model, it first encodes 30 seconds of raw audio into a latent context representation and then decodes this representation autoregressively, predicting text tokens from left to right.
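
In practice, this Seq2Seq behaviour sits behind a one-line API. The sketch below assumes the Hugging Face transformers library and the openai/whisper-small checkpoint; the audio file name is a placeholder for any 16 kHz mono recording.

```python
from transformers import pipeline

# Whisper as a Seq2Seq ASR model: audio in, text tokens out (generated left to right).
# chunk_length_s=30 mirrors the 30-second windows Whisper was trained on.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)

result = asr("meeting.wav")  # placeholder path to an audio file
print(result["text"])
```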


Figure 1: Different categories of language models: MLM, CLM, Seq2Seq

Seq2Seq Whisper model

In the context of Whisper and similar sequence-to-sequence (Seq2Seq) models, the architecture consists of an encoder (which processes the input audio once) and a decoder (which generates the output text token by token).

  • Encoder: The encoder processes the entire audio sequence (30 seconds in Whisper) in a single forward pass. Its computational cost is therefore incurred only once per input.
  • Decoder: The decoder, however, is autoregressive: it is called n times, where n is the length of the output sequence (i.e., the number of text tokens to be generated). Each time a new token is predicted, the decoder runs again, conditioned on all previously generated tokens.

Whisper was trained on 30-second audio chunks along with their transcripts, learning to predict both regular and special tokens from each audio segment. The decoding process begins with the special token <|startoftranscript|> and continues until the model generates the final token <|endoftranscript|>. To better handle long-range dependencies, Whisper keeps the text of previously transcribed 30-second chunks in its history: during training, these previous text tokens were included with a certain probability, which helps the model maintain context across longer audio files.
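
The asymmetry between the encoder (run once) and the decoder (run once per token) can be made explicit with a minimal greedy-decoding sketch. It assumes the Hugging Face transformers Whisper classes and uses silent dummy audio; real inference would use model.generate with key-value caching, so this loop is illustrative rather than efficient.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.eval()

# 30 s of 16 kHz audio (zeros as a stand-in) -> log-Mel spectrogram features
audio = torch.zeros(16000 * 30).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Encoder: one forward pass over the whole 30-second chunk
    encoder_outputs = model.model.encoder(inputs.input_features)

    # Decoder: one call per generated token, conditioned on all previous tokens
    decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])  # start-of-transcript token
    for _ in range(50):  # cap the sketch at 50 generated tokens
        logits = model(encoder_outputs=encoder_outputs, decoder_input_ids=decoder_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:  # end-of-transcript token
            break

print(processor.batch_decode(decoder_ids, skip_special_tokens=False)[0])
```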


Figure 2: Illustration of the Whisper model from Radford et al., 2022

Whisper’s behavior depends on the presence or absence of special tokens in the pre-training data. This data consists of paired audio segments and textual transcripts, whose text is populated with special tokens such as language tags, silence indicators, and optional timestamps, enabling the model to acquire robust and generalized speech-to-text mappings. Including these special tokens during pre-training conditions the model to handle a wide variety of linguistic and acoustic phenomena, and this extensive multilingual and multitask pretraining ensures the model’s adaptability and broad applicability across numerous downstream tasks.
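
As an illustration of what this conditioning looks like at decoding time, the Hugging Face WhisperProcessor exposes the special-token prefix that precedes the transcribed text, typically the start-of-transcript token followed by a language tag and a task token (a hedged sketch; exact token IDs vary by checkpoint).

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Forced decoder prefix for French transcription, e.g. <|fr|>, <|transcribe|>, <|notimestamps|>
prompt_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")
print(prompt_ids)  # list of (position, token_id) pairs injected at the start of decoding

# Map the IDs back to their special-token strings to see the prefix
print(processor.tokenizer.convert_ids_to_tokens([tid for _, tid in prompt_ids]))
```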

Multitask training

Multitask learning is an approach in machine learning where a model simultaneously learns multiple related tasks. Whisper uses multitask learning by jointly training on tasks such as language identification, speech transcription, translation, silence detection, and timestamp prediction. This shared learning paradigm enriches the model’s ability to generalize across tasks by exploiting the complementarities between them.

Whisper’s multitask capabilities arise from explicit training on data containing language and silence tokens, as well as timestamps aligned between audio segments and their corresponding text tokens.

  • Language prediction: Thanks to its extensive multilingual training data, the model can identify and predict language tags (<|fr|> or <|en|>) from among 99 different languages. These language tags then condition the rest of the transcription process, ensuring appropriate language-specific processing.

  • Silence prediction: The model generates the special token <|nospeech|> when no speech is detected. However, in practice, Whisper sometimes hallucinates content during silent passages, so it is generally recommended to use a dedicated Voice Activity Detection (VAD) model to pre-filter audio chunks before passing them to Whisper.

  • Transcribe or translate: The user can specify these tokens to change the behavior of Whisper:

    • <|transcribe|> forces the model to transcribe the audio in its original language
    • <|translate|> instructs the model to translate the content into English
  • Timestamp prediction: Unless the <|notimestamps|> token is passed to skip timestamp prediction entirely, the model approximates start and end times for each detected group of words, thanks to its multitask training data. These timestamps are quantized to 20-millisecond intervals, balancing precision with computational efficiency. This accuracy is sufficient for general applications, but scenarios requiring finer alignment between audio and text may call for approaches such as Connectionist Temporal Classification (CTC), which directly aligns audio frames with text tokens at a finer temporal granularity.
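
These options are exposed directly in common inference APIs. The sketch below assumes the Hugging Face transformers implementation, where generate accepts language, task, and return_timestamps arguments in recent versions; the silent dummy audio is a placeholder.

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio = np.zeros(16000 * 30, dtype=np.float32)  # stand-in for 30 s of 16 kHz speech
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Transcribe in the source language (<|fr|> + <|transcribe|>), with segment timestamps
ids_fr = model.generate(features, language="fr", task="transcribe", return_timestamps=True)
print(processor.batch_decode(ids_fr, decode_with_timestamps=True)[0])

# Translate the same audio into English instead (<|fr|> + <|translate|>)
ids_en = model.generate(features, language="fr", task="translate")
print(processor.batch_decode(ids_en, skip_special_tokens=True)[0])
```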

Distillation

Distillation is a model compression technique where a smaller “student” model is trained to replicate the behavior of a larger “teacher” model. This is done to reduce the computational requirements for inference, making the model more practical for deployment without sacrificing too much performance.
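
Conceptually, the student is trained to match the teacher’s output distribution over tokens, often mixed with the usual cross-entropy against reference (or pseudo-label) transcripts. The sketch below is a generic distillation loss in PyTorch; the weighting and temperature are illustrative assumptions, not the exact Distil-Whisper settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.8, temperature=2.0):
    """Mix a KL term (imitate the teacher) with cross-entropy on the target tokens."""
    # KL divergence between the softened teacher and student distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Standard next-token cross-entropy of the student against the labels
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * kl + (1.0 - alpha) * ce

# Toy shapes: batch of 2 sequences, 5 tokens each, vocabulary of 100
student = torch.randn(2, 5, 100)
teacher = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(distillation_loss(student, teacher, labels))
```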


Figure 3: Illustration of the distilled Whisper model from Gandhi et al., 2023

Why distill the decoder layers in Whisper?

Because the decoder is executed repeatedly during inference (once per output token), it dominates the model’s computational cost—especially for long transcriptions. Therefore, reducing the number of decoder layers has a much larger impact on speed and efficiency compared to reducing encoder layers.

However, simply removing decoder layers risks harming performance, since both early and late layers contribute distinct knowledge:

  • The first decoder layers are critical for transforming the latent audio representation into linguistic features.
  • The final decoder layers are crucial for producing fluent, contextually accurate output as the model autoregressively generates tokens.

To retain the most valuable aspects of the decoder, a common distillation strategy is to keep the first and the last decoder layers (for example, the first 2 and the last 2 layers) while removing those in the middle (Gandhi et al., 2023). This preserves both the initial processing of audio features and the final generation of coherent output, maintaining much of the original model’s performance while greatly reducing the computational load.
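
Here is a rough sketch of this layer-selection idea, assuming the Hugging Face transformers Whisper implementation (where decoder layers live in model.model.decoder.layers): copy the first two and last two decoder layers of the teacher into a shallow student. This is only the initialisation step; the full Distil-Whisper recipe also relies on large-scale pseudo-labelling and distillation training.

```python
import copy
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Keep the first 2 and last 2 decoder layers (whisper-small has 12 in total)
kept = list(teacher.model.decoder.layers[:2]) + list(teacher.model.decoder.layers[-2:])

student = copy.deepcopy(teacher)                      # full encoder is kept as-is
student.model.decoder.layers = nn.ModuleList(copy.deepcopy(layer) for layer in kept)
student.config.decoder_layers = len(kept)

print(len(teacher.model.decoder.layers), "->", len(student.model.decoder.layers))
```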

This strategy is widely adopted in the distillation of large Seq2Seq models like Whisper to achieve a favorable trade-off between accuracy and speed, enabling faster, more resource-efficient speech-to-text systems.

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.

[2] N. Boizard et al., “EuroBERT: Scaling Multilingual Encoders for European Languages,” Mar. 07, 2025, arXiv: arXiv:2503.05500. doi: 10.48550/arXiv.2503.05500.

[3] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” Dec. 06, 2022, arXiv: arXiv:2212.04356. doi: 10.48550/arXiv.2212.04356.

[4] S. Gandhi, P. von Platen, and A. M. Rush, “Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling,” Nov. 01, 2023, arXiv: arXiv:2311.00430. doi: 10.48550/arXiv.2311.00430.
