Research & Technology Frontiers · Model Architecture

AI Can Pass Visual Tests With Its Eyes Closed: A Decade-Long Crisis in Visual Understanding Evaluation

A Long-Standing Problem

The visual understanding capabilities of multimodal large models are primarily measured through benchmarks. But if you remove the input image and the model still retains 70%–80% of its original accuracy, what exactly is the benchmark score measuring? The Stanford team’s MIRAGE study conducted systematic no-image tests on a range of frontier models, producing the latest quantitative data on this question. The narrative circulating on social media is that models completely bypass the visual channel, but the real issue lies in the benchmarks themselves: the test design makes pure text inference a viable path to correct answers.
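The no-image ablation at the heart of MIRAGE can be sketched in a few lines: score a benchmark twice, once with images and once text-only, then report the ratio of retained accuracy. The harness below is a minimal sketch; `query_model` is a hypothetical stand-in for whatever API call an evaluation stack actually uses.

```python
# Sketch of a no-image ablation for a VQA-style benchmark.
# `query_model(question, image=None)` is a hypothetical model call;
# everything else is plain bookkeeping.

def accuracy(examples, query_model, use_image):
    """Exact-match accuracy over a list of {question, image, answer} dicts."""
    correct = 0
    for ex in examples:
        image = ex["image"] if use_image else None
        pred = query_model(ex["question"], image=image)
        correct += int(pred == ex["answer"])
    return correct / len(examples)

def residual_accuracy(examples, query_model):
    """Fraction of with-image accuracy retained when the image is removed."""
    with_img = accuracy(examples, query_model, use_image=True)
    no_img = accuracy(examples, query_model, use_image=False)
    return no_img / with_img if with_img else float("nan")
```

A residual near 1.0 means the benchmark, for this model, barely exercises the visual channel; MIRAGE's reported 60%–99% range corresponds to residuals of 0.6–0.99.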

More critically, this phenomenon has nearly a decade of research behind it. VQA stands for Visual Question Answering: given an image and a question, the model answers based on the image content; it is one of the most common evaluation paradigms for multimodal models. From 2016, when VQA researchers discovered that language priors could bypass visual input, to 2018, when chest X-ray AI was found to be identifying hospital labels rather than lung pathology, to GPT-5.1 hitting a 93.5% mirage rate without images, there is a clear evolutionary line. MIRAGE is merely the latest data point on that line; studies predating it by years had already revealed the same mechanism.

This article traces that line end to end, helping readers understand three things: why multimodal benchmarks repeatedly fail, why medical imaging is the hardest-hit domain, and what questions to ask when confronted with a benchmark score to judge its credibility.

What’s at Stake

If most questions in a visual question-answering benchmark can be answered correctly without ever seeing the image, then every model ranking, product decision, and research conclusion built on that benchmark rests on unreliable ground. For general AI applications, this means product teams may be overestimating their models’ visual understanding. For medical AI, the consequences are more severe: a chest X-ray diagnostic model that performs well on benchmarks may actually be recognizing portable X-ray machine markers rather than lung pathology, and that kind of error can directly affect patient safety in clinical settings.

The core finding of the MIRAGE paper is that across all tested model-benchmark combinations, the contribution of textual information exceeded that of visual information. If this finding is reproducible, it means the current mainstream multimodal evaluation framework provides extremely limited signal on whether models are actually using the visual channel. For teams making procurement or deployment decisions based on benchmark scores, this is a direct practical concern.

The novelty of the MIRAGE paper itself needs to be placed in proper context. The paper's contribution lies in its systematic quantification across the latest generation of frontier models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro) and currently active benchmarks. The observed phenomenon, models achieving high scores on visual tasks by bypassing the visual channel, has nearly a decade of prior research behind it in both the VQA and medical imaging subfields.

A Decade of Context: From Language Priors to Mirage Reasoning

2016–2018: The Language Prior Problem in VQA

The story begins around 2016 with an embarrassing discovery by VQA researchers. Agrawal et al. were the first to systematically analyze language priors in VQA datasets: when models encountered questions like “What sport is…,” the high-frequency answer “tennis” was often correct, with no need to understand the image at all. In 2017, Goyal et al. quantified the scale of this problem when building VQA v2. They reported that a blind model—one that completely ignores visual input—could achieve 67% accuracy on binary questions and 27% on open-ended ones.
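The blind baseline behind these numbers is almost trivially simple, which is exactly the point. A minimal sketch of the language-prior exploit, assuming nothing but (question, answer) training pairs and treating the first few words of the question as its "type":

```python
from collections import Counter, defaultdict

def train_blind_model(train_pairs, prefix_len=3):
    """Map each question prefix to its most frequent training answer.

    This model never sees an image; it only memorizes the statistical
    association between surface question forms and answers.
    """
    by_prefix = defaultdict(Counter)
    for question, answer in train_pairs:
        prefix = " ".join(question.lower().split()[:prefix_len])
        by_prefix[prefix][answer] += 1
    return {p: c.most_common(1)[0][0] for p, c in by_prefix.items()}

def blind_predict(model, question, prefix_len=3, default="yes"):
    """Answer a question using only its prefix; fall back to a prior."""
    prefix = " ".join(question.lower().split()[:prefix_len])
    return model.get(prefix, default)
```

On VQA v1-style data, a prefix like "what sport is" maps overwhelmingly to "tennis", so this lookup alone recovers a large share of the benchmark score without any visual input.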

The 2018 VQA-CP v2 further exposed the depth of the problem. By changing the test set’s answer distribution to differ from the training set, the then-SOTA model’s accuracy dropped by 24–27 percentage points. This experiment cleanly demonstrated that models had largely learned statistical associations between surface-level question forms and answers, rather than semantic relationships between image content and answers.

2018–2021: Shortcut Learning in Medical Imaging

The language prior problem in VQA has a more dangerous counterpart in medical imaging. In 2018, Zech et al. published a study in PLoS Medicine showing that a deep learning model trained to detect pneumonia had actually learned to identify the source hospital. They found that a trivial model using only hospital-source information could achieve an AUC of 0.861, while a CNN could identify which hospital an image came from with 99.95% accuracy. The mechanism was clear: the pneumonia prevalence rate in the NIH dataset was 1.2%, versus 34.2% at Mount Sinai Hospital. Such a large baseline discrepancy allowed the model to completely ignore image content and achieve seemingly decent performance just by recognizing hospital markers. The paper has over 1,400 citations.

In 2021, DeGrave et al. reported a similar finding in Nature Machine Intelligence: approximately 50% of a COVID-19 chest X-ray detection model’s accuracy came from spurious confounds—features that happen to correlate with labels but are unrelated to the actual diagnosis—including text markers on images, patient positioning, and laterality markers rather than lung pathology itself. The paper’s own phrasing was that the model “selects shortcuts over signal.” This work has accumulated over 430 citations.

Banerjee et al. in their 2023 JACR review systematically catalogued shortcut types in radiology AI: pneumonia detection using ICU portable markers as proxies, pneumothorax detection using chest tubes as proxies. These shortcuts are hidden within overall accuracy figures and only become visible when performance is stratified across specific subgroups.

There is a clear mapping from VQA to medical imaging. In VQA, language priors let models use surface-level question forms to bypass images. In medical imaging, shortcuts let models use non-diagnostic image features to bypass pathological regions. The underlying mechanism is the same: when a path easier than genuinely understanding visual content exists, models take it.

2024–2026: MIRAGE and the Frontier Model Era

The MIRAGE paper’s work occupies the latest node on this timeline. Researchers tested frontier multimodal models including GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro, measuring residual accuracy on benchmarks such as MMMU-Pro, VQA-RAD, and MedXpertQA-MM after removing image inputs. The paper reported that across all model-benchmark combinations, no-image accuracy ranged from 60% to 99% of with-image accuracy, with medical benchmarks consistently at the high end.

The paper also introduced a new concept: mirage reasoning. This describes models generating detailed image descriptions and reasoning based on those descriptions even when no image was provided at all. When using evaluation workflow prompts, the mirage rate—the frequency at which models confidently describe visual content without receiving an image—reached 90%–100%.
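Measured operationally, the mirage rate is just the fraction of no-image queries on which the model nevertheless commits to visual details. The sketch below uses keyword matching as a crude detector; that heuristic is my simplification, and the paper's actual detection protocol may well use a judge model instead.

```python
# Crude mirage-rate estimator: given responses collected WITHOUT images,
# flag those that confidently assert visual content anyway.
# The marker list is a stand-in for a proper LLM-judge classifier.

VISUAL_CLAIM_MARKERS = (
    "the image shows", "in the image", "the picture depicts",
    "we can see", "as shown in the figure",
)

def is_mirage(response: str) -> bool:
    """True if the response claims to describe an image it never received."""
    text = response.lower()
    return any(marker in text for marker in VISUAL_CLAIM_MARKERS)

def mirage_rate(no_image_responses):
    """Fraction of no-image responses that fabricate visual content."""
    flagged = sum(is_mirage(r) for r in no_image_responses)
    return flagged / len(no_image_responses)
```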

Concurrent work corroborated this from different angles. MMMU-Pro applied post-hoc filtering to the original MMMU, removing text-solvable questions, and found model performance dropped by 16.8–26.9 percentage points. ViLP designed a three-image test (presenting the model with three images, only one corresponding to the correct answer) and found GPT-4o’s accuracy was only 66.17%, far below its standard benchmark performance. Wang et al. 2025 revisited chest X-ray model performance and found that when stratifying evaluation by clinical context (such as disease prior probability), SOTA model performance fell well below what overall accuracy suggested.
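Post-hoc filtering of the MMMU-Pro kind can be approximated with a simple rule: ask a text-only model each question several times and discard any question that a majority of attempts answers correctly. A sketch, with `text_only_answer` as a hypothetical model call:

```python
def filter_text_solvable(examples, text_only_answer, runs=5, threshold=0.6):
    """Keep only questions that text-only inference fails on.

    `text_only_answer(question)` is a hypothetical call that queries a
    model WITHOUT the image; a question is discarded when at least
    `threshold` of `runs` attempts hit the gold answer.
    """
    kept = []
    for ex in examples:
        hits = sum(text_only_answer(ex["question"]) == ex["answer"]
                   for _ in range(runs))
        if hits / runs < threshold:
            kept.append(ex)
    return kept
```

Note the limitation the article points out: this removes questions the current model can text-solve, but cannot certify that the survivors inherently require visual input, and the next model generation may text-solve some of them too.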

Even earlier, Jabbour et al. (2020) had studied the exploitation and prevention of shortcuts in chest X-ray AI, using transfer learning to mitigate model reliance on spurious correlations. Together, these studies form a complete evidence chain: from Zech in 2018 to MIRAGE in 2026, they vary in methods, datasets, and model generations but point to the same core problem with high consistency.

Three Driving Mechanisms

Why does this problem keep recurring and apparently worsen as models improve? Based on the existing literature and MIRAGE’s data, three mutually reinforcing mechanisms can be identified.

Information redundancy at the benchmark level. Most VQA benchmark questions inherently carry substantial information usable for inferring the answer: question phrasing, option design, and domain constraints, combined with the model's internalized world knowledge, make the image a redundant input. In information-theoretic terms, the extra information the image I contributes to the answer A, given the question Q and the model's prior knowledge K (the conditional mutual information I(A; I | Q, K)), tends to be low. MMMU-Pro's filtering and MicroVQA's RefineBot (which audits chains of thought and rewrites questions) both attempt to raise this quantity, with limited success: MicroVQA's expert review costs 30–40 minutes per question and is hard to scale, while MMMU-Pro's filtering produces clear performance drops but still cannot guarantee that the remaining questions inherently require visual input.

Architectural shortcut bias. Current mainstream vision-language models (VLMs) pair shallow visual encoders with deep language models. The language model component vastly exceeds the visual encoder in parameter count, training data volume, and reasoning capability. When models face two paths—carefully analyzing visual input (computationally expensive, high uncertainty) versus inferring answers from question text (computationally cheap, often correct)—the architecture itself favors the latter. The HC-M3D benchmark provided direct evidence: even when critical information positions in images were changed (which should theoretically change the answer), GPT-4o’s performance barely shifted, indicating that models frequently bypass the visual channel.

Knowledge scale inversion. This is the most noteworthy of the three mechanisms. As model parameter counts grow and pretraining corpora expand, the model's internalized prior knowledge K grows continuously. For any fixed benchmark, the image's contribution to the answer, I(A; I | Q, K), shrinks as K grows. Questions that GPT-4V needed to see the image to answer, the more knowledgeable GPT-5.1 can solve through text inference alone. MIRAGE's data directly supports this: GPT-4.1's mirage rate was 43%, while the more powerful GPT-5.1 reached 93.5%. The gap reflects stronger language capability making it easier to bypass the visual channel, not a regression in GPT-5.1's visual ability.
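This squeeze can be stated as a simple bound (my formalization, not the paper's): the usable visual contribution is capped by whatever uncertainty about the answer remains after the question text and the model's knowledge are accounted for.

```latex
% Let A = answer, Q = question text, I = image, K = model's prior knowledge.
% The information the image contributes to answering is bounded by the
% residual uncertainty given Q and K, and conditioning on more knowledge
% never increases entropy:
\[
  I(A;\, I \mid Q, K) \;\le\; H(A \mid Q, K),
  \qquad
  H(A \mid Q, K') \;\le\; H(A \mid Q, K) \quad \text{for } K' \supseteq K .
\]
% As K grows across model generations, H(A | Q, K) falls toward zero on
% questions whose answers are determined by text alone, squeezing the
% image's possible contribution out with it.
```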

The implication is a concerning trend: if evaluation methods remain unchanged, stronger models produce more distorted visual evaluation signals. A dynamic tension exists between evaluation precision and model capability, and the evaluation side is currently falling behind.

Mirage and Hallucination: Theoretical Distinction, Practical Convergence

The MIRAGE paper draws an explicit distinction between mirage reasoning and hallucination. By the paper’s definition, hallucination fabricates details within a valid cognitive framework (e.g., seeing a photo of a cat and describing it as black when it’s actually white), while mirage constructs an entirely false cognitive framework (describing image content and reasoning from it when no image was provided at all).

Bai et al. 2024’s MLLM Hallucination Survey defined a related IK (I Know) hallucination: models refuse to acknowledge they cannot answer a question and instead provide a confident response. IK hallucination and mirage are very similar in manifestation—both involve models giving definitive answers when information is insufficient. The same survey’s definition of event hallucination—models fabricating fictitious objects and constructing narratives around them—is nearly isomorphic to mirage reasoning in cognitive process, differing only in that event hallucination occurs when an image exists but is incorrectly described, while mirage extends to scenarios where the image is entirely absent.

This theoretical distinction may matter at the cognitive-science level: hallucination gets details wrong inside a correct cognitive framework, while mirage fabricates the framework itself, a deeper kind of epistemic failure. But in terms of practical consequences, the two converge: both render benchmark scores unreliable, both require detection and mitigation mechanisms, and both pose deployment safety risks. For AI product teams, distinguishing mirage from hallucination matters less than a more fundamental question: how much visual information is the model actually using on this task?

Correcting the Social Media Narrative

Returning to the social media summary. The claim that models completely bypass the visual channel gained traction because it is concise and striking. But it conflates two entirely different propositions.

What MIRAGE’s data shows is: on current mainstream benchmarks, most questions can be answered correctly without providing an image. This is a judgment about benchmark quality, not about model visual capability. A high mirage score on a benchmark means that benchmark fails to effectively distinguish between two behaviors: genuinely using visual information, versus arriving at the answer through text and knowledge inference.

Whether models possess genuine visual understanding requires different testing methods. ViLP’s three-image test is one direction; requiring models to perform low-level visual tasks answerable only through image observation (counting, spatial relationship judgment) is another. Rahmanzadehgervi et al.’s 2024 study (VLMs are Blind) found that current VLMs achieve only 58.57% accuracy on simple visual tasks, suggesting that genuine visual understanding has substantial room for improvement—but this is an entirely different diagnosis from models completely ignoring visual input.

The MIRAGE paper itself uses more cautious language than media coverage. The paper’s phrasing is “the illusion of visual understanding,” pointing to an evaluation-level illusion: we assumed benchmark scores represented visual understanding, but they may primarily reflect language capability and knowledge reserves. This is a judgment about the measurement tool, operating on a different level from judgments about model visual capability itself.

Will This Problem Improve or Worsen as Models Advance?

The knowledge scale inversion mechanism yields a concerning prediction: if evaluation methods remain unchanged, the problem grows worse as models improve. Each generation of stronger language models turns more benchmark questions into ones solvable through pure text inference. MIRAGE’s data—the jump from GPT-4.1 (mirage rate 43%) to GPT-5.1 (mirage rate 93.5%)—is direct evidence.

This means benchmark design needs to become a dynamic process. Each time model capabilities make a significant leap, existing benchmarks lose discriminative validity and require recalibration. A more fundamental direction is ensuring benchmarks are constructed so that questions inherently require visual input: changing image content must change the correct answer. Achieving this is costly (HC-M3D’s attempt shows that even with this design goal, current models can still find workarounds), but it at least pushes the problem to the right level.
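The "changing the image must change the answer" criterion translates directly into a benchmark-construction check: every question ships with a counterfactual image pair whose gold answers differ, so no single text-only guess can be right for both. A sketch of that validity filter, under an assumed item schema of my own devising:

```python
def is_visually_grounded(item):
    """A benchmark item passes only if its counterfactual images carry
    different gold answers. `item` is assumed (hypothetically) to look like:
      {"question": ..., "pairs": [(image_a, answer_a), (image_b, answer_b)]}
    """
    answers = {answer for _, answer in item["pairs"]}
    return len(answers) >= 2

def filter_benchmark(items):
    """Discard items a text-only strategy could answer for every image."""
    return [it for it in items if is_visually_grounded(it)]
```

This enforces the design goal at construction time rather than post hoc, though, as HC-M3D's experience suggests, it still does not guarantee that models will actually consult the image at inference time.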

Ultimately, the field faces an ongoing contest: evaluators work to design tests that genuinely require visual understanding, while increasingly powerful language models continue finding new bypass routes. This dynamic has gone unresolved for the past decade and is unlikely to be solved in one stroke in the foreseeable future. It will more likely remain a permanent topic in multimodal AI development, requiring sustained investment in evaluation methodology.


Key Sources