Whisper's Broken Record: Why Silence Makes Speech-to-Text Talk to Itself

You feed Whisper an audio clip. The first few seconds go fine, then it hits a long pause. That’s when things unravel—the sentence Whisper just output appears again, then a third time, then a fourth, spinning in place like a scratched CD. If you’re unlucky, you’ll also get “Thanks for watching and Electric Unicorn” or “please leave a like and subscribe to the channel”—phrases that exist in none of the audio you gave it.

I’ve used Whisper extensively over the past few years and have seen this behavior across every variant, from tiny to large-v3, from English to Chinese. The question is whether this is a Whisper-specific flaw or a universal ASR problem. Newer models like Qwen-ASR and GPT-4o’s real-time voice don’t seem to have it—GPT-4o will occasionally confabulate when given truly silent audio, but at least it won’t loop “nice weather today” forty times in a row.

The Broken Window

To understand why Whisper turns into a broken record, you need to see how it handles long audio. Whisper’s effective window is just 30 seconds. For anything longer, it chops the audio into 30-second chunks, transcribes each one, then stitches them together using a clever sliding-window mechanism. The key detail: the transcription from the previous window is fed in as “context” for the next one—specifically, a <|startofprev|> special token followed by the previously transcribed text.
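To make the windowing and context hand-off concrete, here is a minimal sketch in Python. It is a simplification under stated assumptions: real Whisper moves the window forward according to the last timestamp it predicted rather than in fixed 30-second strides, and it works on token IDs rather than strings. The function names are made up for illustration; only the special tokens come from Whisper itself.

```python
SAMPLE_RATE = 16_000     # Whisper operates on 16 kHz audio
WINDOW_SECONDS = 30      # fixed window length
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def split_into_windows(num_samples: int) -> list[tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive 30-second windows."""
    return [
        (start, min(start + WINDOW_SAMPLES, num_samples))
        for start in range(0, num_samples, WINDOW_SAMPLES)
    ]


def build_decoder_prompt(previous_text: str, language: str = "en") -> str:
    """Lay out what the decoder sees before it writes the next window's text.

    Illustrative only: the previous transcript, if any, sits behind
    <|startofprev|>, ahead of the usual start-of-transcript tokens.
    """
    prefix = f"<|startofprev|>{previous_text}" if previous_text else ""
    return f"{prefix}<|startoftranscript|><|{language}|><|transcribe|>"


# A 70-second clip becomes three windows; the last one is only 10 seconds long.
print(split_into_windows(70 * SAMPLE_RATE))
print(build_decoder_prompt("the meeting will start at three"))
```

The point to notice is that the previous transcript enters the decoder as plain conditioning text. Nothing marks it as "already said," so the decoder is free to say it again.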

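This design makes sense in principle. Knowing what was just said gives the next segment useful linguistic context, which mirrors how human speech works. The problem is another design choice: Whisper has no separate VAD (Voice Activity Detection) stage in front of it. Its only built-in defense is a <|nospeech|> token whose probability the decoding loop checks against a threshold, and the Whisper paper itself notes that this probability alone is not enough to tell silence from speech. When that check misses, nothing tells the model “this segment is silent, skip it,” and the decoder generates text for the window regardless.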

When silence arrives, the encoder produces audio embeddings that carry almost no usable signal. The model was trained overwhelmingly on audio paired with text, so it has only a weak sense of what “no speech” looks like, and its one way of saying “this is empty,” the <|nospeech|> token, does not reliably fire. The decoder still has to output something, so it falls back on what it knows best: the language model prior, conditioned on the text it just saw. If the last transcript read “the meeting will start at three,” the decoder, with nothing in the audio to contradict it, treats that sentence as the most likely thing to say next and outputs “the meeting will start at three” again. That output then becomes the “previous context” for the next 30-second window, which copies it again. Once the loop starts, it feeds itself.
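The feedback loop can be reproduced with a toy model. The sketch below is not Whisper: `stub_decode` is a hypothetical stand-in that transcribes when there is speech and echoes its context when there is not, which is the behavior described above reduced to one line.

```python
def stub_decode(has_speech: bool, spoken_text: str, previous_text: str) -> str:
    """Hypothetical stand-in for the decoder, not Whisper itself.

    With speech it returns what was said; without speech it has only the
    previous-text context to lean on, so it echoes that context.
    """
    return spoken_text if has_speech else previous_text


def transcribe_windows(windows: list[tuple[bool, str]]) -> list[str]:
    """Run the stub over successive windows, feeding each output forward."""
    outputs = []
    previous_text = ""
    for has_speech, spoken_text in windows:
        text = stub_decode(has_speech, spoken_text, previous_text)
        outputs.append(text)
        previous_text = text  # this output becomes the next window's context
    return outputs


windows = [
    (True, "the meeting will start at three"),
    (False, ""),  # 30 seconds of silence
    (False, ""),  # 30 more seconds of silence
]
for line in transcribe_windows(windows):
    print(line)  # the same sentence is printed three times
```

In this toy version, resetting `previous_text` to an empty string whenever a window looks suspicious would break the chain, at the cost of losing context across windows that were fine.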

This isn’t my speculation. Section 4.5 of the original Whisper paper (Radford et al., 2022) acknowledges that repetition looping happens more often under greedy decoding during long-form transcription. It recommends beam search with 5 beams, starting at temperature 0 and raising the temperature in steps of 0.2, up to 1.0, whenever the average log probability of the output falls below -1 or its gzip compression ratio exceeds 2.4. But many open-source implementations and local deployments don’t follow these settings.
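The paper's fallback strategy can be sketched in a few lines. The thresholds and temperature schedule below are the ones given in section 4.5 (average log probability below -1, compression ratio above 2.4, steps of 0.2); the `decode` callable is a hypothetical placeholder for whatever actually runs the model, and zlib stands in for gzip.

```python
import zlib

TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4
LOGPROB_THRESHOLD = -1.0


def compression_ratio(text: str) -> float:
    """Raw size divided by compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(decode):
    """Retry at rising temperatures until the output passes both checks.

    `decode` is a hypothetical callable: temperature -> (text, avg_logprob).
    Returns the last text produced and the temperature that produced it.
    """
    for temperature in TEMPERATURES:
        text, avg_logprob = decode(temperature)
        too_repetitive = compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD
        too_unsure = avg_logprob < LOGPROB_THRESHOLD
        if not (too_repetitive or too_unsure):
            break
    return text, temperature
```

A looped transcript such as the same sentence repeated twenty times compresses to a small fraction of its raw size, so its ratio lands far above 2.4, while ordinary prose of similar length stays well below it.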

The Subtitle Ghosts Living in the Training Data

When silence stretches long enough, the repetition evolves into something stranger: Whisper starts producing content that was never in the audio at all. A real example—on audio containing no speech whatsoever, Whisper confidently output: “This is Ritesh Srinivasan and welcome to my channel. In this video, let’s look at WhisperJAX… I hope you enjoyed the video. If you did, please leave a like and subscribe to the channel.” It then appended “For more information visit www.FEMA.gov.” Another case from Cornell’s academic paper is more concise: “Thanks for watching and Electric Unicorn”—the audio was someone telling the Cinderella story. That phrase appeared from nowhere.

These phrases share a common origin: subtitles. OpenAI’s Whisper paper describes training on 680,000 hours of “weakly supervised” audio data. “Weakly supervised” means pairing audio with transcripts found on the internet, filtered by automated heuristics rather than verified by people. The paper doesn’t list its sources, but it is easy to guess where such pairings occur naturally: subtitles and captions on video platforms such as YouTube, podcast transcripts, and closed captions.

Here’s the rub. At the end of many videos—where there’s no speech, just visuals and scrolling credits—the subtitle file isn’t empty. It contains “thanks for watching,” channel subscription calls, sponsor messages, or fansub group credits. In the training samples, these text strings are paired with audio that is nothing but silence (or background music). Whisper learned a rule: when you hear prolonged silence, output whatever text you memorized as belonging to silence. That text happens to be YouTube signoffs. An OpenAI forum user confirmed this directly: “most of the hallucination text I did get was things like credits to movies subtitles sites.”

A 2025 paper (arXiv:2501.11378) provides precise statistics: when Whisper hallucinates on non-speech audio, the most frequent outputs are “thank you” (24.76%), “thanks for watching” (10.32%), and “thank you for watching” (2.58%). That ranking is essentially a frequency-sorted list of YouTube outro captions.

Cornell researchers (Koenecke et al., 2024) validated this pattern systematically using the AphasiaBank dataset. They found that roughly 1% of transcriptions contained full hallucinated sentences, and categorized the content into three types: perpetuating violence (e.g., fabricated murder descriptions), inaccurate associations (e.g., invented patient medication lists), and false authority (e.g., fake YouTube channel personas and website links). More telling: when they ran the same audio through Google, Microsoft, Amazon, AssemblyAI, and RevAI’s ASR services as controls, these hallucinations simply didn’t occur. This isn’t a general speech recognition difficulty—it’s a specific artifact of Whisper’s training pipeline.

How the Next Generation Fixed It

Qwen-ASR and GPT-4o took two different paths to the same destination: avoiding Whisper’s trap.

Qwen-ASR (built on the Qwen3-Omni foundation) took the direct approach. First, cleaner training data. Qwen3-ASR’s technical report describes “tens of millions of hours of multimodal speech data” with explicit silence and non-speech audio filtering during training—not bolted on as post-processing, but built into the model’s own ability to distinguish speech from non-speech. Second, VAD is integrated as a formal pipeline stage, not an afterthought. Third, it supports <|nospeech|> type tokens, allowing the model to directly output “there’s nothing here” rather than forcing it to fabricate text.

GPT-4o took a different route. It’s an end-to-end audio understanding model, not an “audio-to-text” transcription pipeline. GPT-4o’s voice mode doesn’t convert audio to text and then feed it to a language model—it processes audio tokens natively as a primary input modality. This means it doesn’t have Whisper’s fragile cascaded structure of “transcribe text, then use previous text as context.” GPT-4o can distinguish silence from speech, and the Dynamic-SUPERB benchmark shows its ability to judge “does this audio contain human speech” far exceeds pipelined Whisper systems.

That said, GPT-4o isn’t fully immune. When I give it completely silent audio and ask for transcription, it occasionally produces vague fabrications—far less often than Whisper, and never in infinite loops. The underlying cause may be the same: any generative model, deprived of input signal, falls back to its training distribution. But GPT-4o’s alignment training imposes a stronger constraint of “say you don’t know when uncertain,” preventing it from indulging in the cascading repetition loops that Whisper allows itself.

What You Can Do Right Now

If Whisper is a dependency you can’t escape—and in many scenarios, it is, because it’s fast, open-source, and has a mature ecosystem—several tactics can dramatically reduce how often the repetition appears.

The first layer is preprocessing: VAD. Before sending audio to Whisper, use Silero VAD or webrtcvad to strip out silent segments and feed in only the speech portions. This approach has a high hit rate—most repetitions are triggered by prolonged silence, and removing the silence removes the trigger.

The tradeoff is that VAD sensitivity may not match Whisper’s. Some segments contain speech under heavy background noise or at very low volume, and the VAD may classify them as silence and discard them, even though Whisper might have been able to transcribe them. At poor signal-to-noise ratios, the VAD’s speech threshold is hard to tune. You end up balancing “fewer repetitions” against “fewer missed words,” so treat the threshold as an engineering parameter to tune per deployment, not a one-size-fits-all setting.
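The threshold tradeoff is easy to see with a toy detector. The sketch below is a crude energy gate written in pure Python, not Silero VAD or webrtcvad, which use far better models; the function names and threshold values are illustrative assumptions. It only shows how one number decides whether quiet speech survives.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each non-overlapping frame."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]


def speech_regions(samples, frame_len=400, threshold=0.02):
    """Return (start, end) sample ranges whose frame energy reaches `threshold`."""
    regions = []
    start = None
    for index, rms in enumerate(frame_rms(samples, frame_len)):
        offset = index * frame_len
        if rms >= threshold and start is None:
            start = offset                      # a speech region opens
        elif rms < threshold and start is not None:
            regions.append((start, offset))     # the region closes
            start = None
    if start is not None:
        regions.append((start, len(samples)))
    return regions


SAMPLE_RATE = 16_000


def tone(amplitude, seconds):
    """A 400 Hz sine wave standing in for speech at a given loudness."""
    count = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * 400 * n / SAMPLE_RATE) for n in range(count)]


# One second of loud "speech", two seconds of silence, one second of quiet "speech".
audio = tone(0.3, 1) + [0.0] * (2 * SAMPLE_RATE) + tone(0.01, 1)

print(speech_regions(audio, threshold=0.02))   # [(0, 16000)]
print(speech_regions(audio, threshold=0.005))  # [(0, 16000), (48000, 64000)]
```

At the stricter threshold the quiet segment is thrown away along with the silence; at the looser one it survives, but so would any background noise of similar energy.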

The second layer is inference parameters. Start from temperature 0, add beam search (5 beams is a safe choice), then enable the compression ratio and logprob thresholds. These two thresholds are the most direct signals for detecting repetition: looped output is so repetitive that it compresses unusually well, and the tokens’ average log probability tends to be unusually low. If either threshold trips, decode that audio segment again at a higher temperature. Decoding at temperature 0 is effectively deterministic, so re-submitting with identical settings tends to return the same loop; it is the sampling at higher temperatures that usually produces different output.
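For the reference openai-whisper package, these settings map onto keyword arguments of `model.transcribe`. The sketch below assumes that package is installed; the numeric values are the library's documented defaults plus `beam_size=5`. `condition_on_previous_text=False` is an extra step not listed above: it stops the previous window's text from being fed forward, which removes the self-feeding loop at some cost to consistency across windows. The `transcribe` wrapper is a hypothetical helper.

```python
# Keyword arguments for openai-whisper's `model.transcribe`.
DECODE_OPTIONS = {
    # Start greedy/beam at 0.0, fall back to sampling at higher temperatures.
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    # Beam search is only used for the temperature-0 attempt.
    "beam_size": 5,
    # Retry when the text compresses suspiciously well (likely a loop).
    "compression_ratio_threshold": 2.4,
    # Retry when the average token log probability is too low.
    "logprob_threshold": -1.0,
    # Treat the window as silent when <|nospeech|> is likely and logprob is low.
    "no_speech_threshold": 0.6,
    # Assumption for this sketch: do not feed previous text into the next window.
    "condition_on_previous_text": False,
}


def transcribe(path: str, model_name: str = "large-v3") -> dict:
    """Hypothetical wrapper: load a model and transcribe one file."""
    import whisper  # pip install openai-whisper; imported lazily on purpose

    model = whisper.load_model(model_name)
    return model.transcribe(path, **DECODE_OPTIONS)
```

Other runtimes (faster-whisper, whisper.cpp, hosted APIs) expose similar knobs under different names, so check each one's documentation rather than assuming these keys carry over.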

The third layer is post-hoc cleanup. Use regex to catch consecutively repeated phrases or sentences, then delete or flag them on detection. This is what many production systems that depend on Whisper actually do in practice.
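A minimal version of that cleanup might look like the following. The pattern and the choice of "three or more copies in a row" are assumptions for illustration: requiring three copies avoids flattening legitimate doubles such as "no, no," and the 12-word cap keeps the regex from scanning arbitrarily long phrases.

```python
import re

# A phrase of 1 to 12 words, immediately followed by at least two more copies
# of itself, separated only by spaces or light punctuation.
REPEAT_PATTERN = re.compile(
    r"\b((?:\w[\w']*[ ,.!?;:]+){0,11}?\w[\w']*)"  # the phrase, shortest first
    r"(?:[ ,.!?;:]+\1\b){2,}",                    # two or more repeats of it
    re.IGNORECASE,
)


def collapse_repeats(text: str) -> tuple[str, int]:
    """Collapse each run of 3+ identical phrases to one copy.

    Returns the cleaned text and the number of runs that were collapsed,
    so callers can flag the transcript instead of silently trusting it.
    """
    return REPEAT_PATTERN.subn(r"\1", text)


looped = "Okay. " + "the meeting will start at three. " * 4 + "Next item."
cleaned, runs = collapse_repeats(looped)
print(cleaned)  # Okay. the meeting will start at three. Next item.
print(runs)     # 1
```

The returned count matters as much as the cleaned text: a nonzero value is a good trigger for re-decoding the segment or routing it to review, since a loop often means the surrounding transcript is unreliable too.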

If your Chinese ASR requirements are demanding, switching to Qwen3-ASR or Fun-ASR produces immediate improvements. Qwen3-ASR’s WER on Chinese speech is already noticeably lower than Whisper large-v3, and the silence hallucination problem is essentially nonexistent.


Whisper’s repetition problem is a good teaching case. At its core, it’s one instance of a broader trap—“the model left no encoding for blank input.” Language models fabricate answers when they can’t admit ignorance; diffusion models revert to the training set’s average face when given too much noise. The solution isn’t mysterious: during training, expose the model to blank samples and teach it to output a “null” token; during inference, don’t force it to produce output for meaningless input. Between principle and practice, though, lie the hundreds of thousands of hours of audio you trained on, the subtitle credits lingering at the end of every video, and countless repetitions of “thank you for watching.”