You get into an argument online. The other side fires back with a wall of AI-generated rebuttal: clean paragraphs, confident tone, a few bullet points that look like evidence. The subtext is clear: see, the AI agrees with me, so you’re wrong.
It is easy to get annoyed. The AI might well be right, but the problem is that the other person has treated its output as their own judgment. They have not explained how they prompted it, what sources it drew on, or which parts they verified on their own. They just dropped a block of fluent text on you. To counter it, you have to reconstruct their entire argument, fact-check it, and sort out what is real from what only looks real.
The trouble is that while this scenario looks ridiculous from the outside, we all do something similar every day. Draft an analysis with AI. Organize your thoughts. Write an email. Reply to a critique. You read it over, tweak the wording, and send it. When it is your own work, this does not feel like deferring to AI authority. It feels like being efficient. You wrote the prompt, you set the direction, you provided the context. The AI output reads like something you already thought through, just expressed more clearly.
The asymmetry here is critical. When someone else uses AI output to argue against you, the absence of judgment is obvious. When you use AI yourself, the line blurs: after the AI’s fluent reasoning has merged with your own thinking, you can no longer tell which parts you filtered and which parts you just went along with. The output feels like yours, but whether a real act of judgment ever happened is no clearer to you than it is to the other person.
This is the real question: did the person, before forwarding or using that text, actually judge it themselves? The same question applies to us. When AI gives an answer, are we independently judging it, or just reading it and finding it plausible?

Shaw and Nave from Wharton designed an experiment to look at exactly that moment (preprint on SSRN, OSF summary, Wharton official write-up).
Participants answered Cognitive Reflection Test (CRT) questions. These questions bypass knowledge: they are designed so that the first answer that comes to mind looks right, while pausing to double-check reveals it is wrong. The classic CRT item: a bat and a ball cost $1.10 together, and the bat costs $1.00 more than the ball; how much does the ball cost? Nearly everyone's first impulse is 10 cents. The test measures whether people are willing to stop and reconsider that impulse once an answer has already presented itself.
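The arithmetic behind the trap takes one line (the bat-and-ball item is the canonical CRT example; the specific items used in the experiment are not listed here). Let $b$ be the ball's price:

$$b + (b + 1.00) = 1.10 \;\Rightarrow\; 2b = 0.10 \;\Rightarrow\; b = 0.05$$

The intuitive answer, $b = 0.10$, fails on substitution: the bat would then cost $1.10 and the total $1.20. The check is trivial, but only if you stop to run it.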
In the experiment, some participants could consult an AI chatbot. The bot sometimes gave correct answers and sometimes gave incorrect ones. The part that mirrors real-world AI use is this: the errors did not always look absurd. They could come with reasoning steps and a confident tone, making them hard to notice without deliberately stopping to verify. This is exactly the scenario where AI most easily lowers guardrails in daily life: it looks right but is not.
The results break into two layers.
The first layer is adoption rates. Across 1,372 people and 9,593 trials, participants who used the chat feature accepted the AI’s answer about 93% of the time when it was correct and about 80% of the time when it was wrong. These numbers are not surprising. People readily adopt AI output; this in itself is well known.
The second layer is more critical. Participants who had access to AI reported self-assessed confidence 11.7 percentage points higher than the no-AI baseline group. The paper specifically notes that this confidence boost persisted even after the AI gave wrong answers. The AI made an error, and the user's confidence did not drop accordingly. They accepted the error and simultaneously grew more confident in their judgment. The calibration signal was severed: they not only accepted the wrong answer, they did not register that anything was wrong.
This means the core problem goes beyond people trusting AI too much. People lose the ability to sense the precision of their own judgment. You can feel that you are right, but you cannot feel that you might be wrong.
Why did AI errors fail to generate enough uncertainty?
Because AI does not just provide answers. It provides the full reasoning process: argument structure, causal chains, a confident tone. What you read is a pre-packaged argument. The material is fluent and complete enough that the brain tags it as correct. Subsequent checking turns into familiarity detection: I have read it, it reads smoothly, I find no obvious contradictions, I can sign off.
This is fundamentally different from independently reconstructing the judgment path. Independent reconstruction means starting from the question and running through the reasoning yourself, pausing at each step to evaluate alternatives and decide direction. Familiarity detection starts from an entry point already established by the AI. Your job becomes checking whether this particular path looks smooth, not deciding which direction to take. The difference: in independent reconstruction, you control the entry point. In familiarity detection, the entry point is already set.
The most common counterargument is the calculator analogy. You do not manually recompute after using a calculator. What is the difference?
The difference runs along three dimensions. First, calculators operate in a closed domain where results are either correct or obviously wrong, never plausible pseudo-arguments wrapped in fluent language. Second, calculator results can always be verified with mental arithmetic; a backup verification channel is always available. Third, a calculator only replaces the execution layer; you still decide what operation to perform and how to use the result. LLMs operate in open domains, package errors in fluent natural language, lack independent verification channels, and in many tasks simultaneously replace problem definition, retrieval, reasoning, and output. The further upstream a tool moves, from the execution layer into judgment and problem definition, the higher the risk of cognitive surrender. Before using any tool, ask yourself: if it were wrong, could I detect that without using it? If the answer is no, you are already surrendering cognition.
If the problem is anchoring, the natural instinct is to make AI output more transparent. But each of the common fixes has its own limitations.
Add explanations. The intuition is that if the AI explains its reasoning, users can evaluate the quality of that reasoning. This question was studied before ChatGPT. Buçinca, Malaya, and Gajos looked at it in 2021 in the context of traditional AI decision-support systems, not today’s generative AI (paper). Their finding offers a useful warning: attaching explanations to AI recommendations did not naturally reduce over-reliance and in some cases increased it. Explanations usually side with the answer; they give users more reasons to agree without providing a new, independent entry point for judgment. More text is not the same as more verification.
Show uncertainty. The intuition is that if the AI signals uncertainty, users will be more cautious. A 2026 study, More Is Not Better (ScienceDirect), found that visual uncertainty cues improved participants’ subjective sense that they could distinguish correct from incorrect answers, but did not improve actual decision behavior. Users felt more alert; their behavior barely budged. This aligns with the finding from Shaw and Nave: confidence inflation is not purely about over-trusting AI, but about a breakdown in the user’s ability to calibrate their own judgment. Displaying uncertainty makes people feel better, but does not fix the calibration itself. Feeling calibrated is not the same as being calibrated.
Offer rewards and feedback. Another intuition: if there is a reward for being right and immediate feedback when wrong, people will try harder. Shaw and Nave’s third experiment tested this. The results did improve: the share of people who rejected a wrong AI answer rose from 20% to 42% (PsyPost coverage). But looking at it the other way, 58% still accepted the wrong advice. Rewards and feedback can pull some people back, but they do not eliminate the problem. Time pressure amplifies the risk further: with a 30-second limit, the tendency to correct an erroneous AI dropped by 12 percentage points (Ars Technica coverage). This suggests the problem is not just about motivation. When the starting point is already the AI’s answer, everything downstream is colored by it.
Review more carefully. The intuition: if the first three approaches are not enough, reading more closely and checking more thoroughly should catch the problem. Careful reading helps, but reading is not independent judgment. The AI has already drawn the path; you are reviewing a path that already exists. You can find grammatical issues and logical contradictions along the way, but it is hard to detect that the path itself is misdirected, because your brain is following the very path the AI laid out. Reviewing can catch errors in AI responses, but this depends on luck (the AI happened to err on a detail you happened to check). It is not a robust method.
All four approaches have their uses, and they share the same limitation: they all come into play after the AI output has already been delivered. By the time you are reviewing, the AI has already occupied the entry point for judgment. Salvage efforts after the fact yield diminishing returns. A more robust approach may need to intervene before the output.
This tension points to a practical distinction.
A double-check means retracing the same path the AI has already laid out: reread the answer, check the steps, verify the facts. When the output can be confirmed in seconds (spell-check, compile errors, formula verification), double-checking is sufficient; the cost of building an independent perspective outweighs the benefit. David Lyell and Enrico Coiera, in their 2017 systematic review in JAMIA (paper), frame this in terms of verification complexity: how many steps, how much domain expertise, and how much working memory are needed to confirm an automated suggestion? When verification complexity is low, double-checking works. When it is high, the effectiveness of double-checking depends on the quality of the path itself. Walking carefully down the wrong path does not reveal that you are heading the wrong way.
A cross-check means establishing an independent reference point before seeing the AI’s answer. You can write down your own estimate, list the parts you think are most error-prone, run through the problem within your own thinking framework, or ask another model the same question and see if you get a different answer. Then you compare the AI output against this reference point. You do not need to redo every detail. You just need a baseline that has not been shaped by the AI’s output. With this baseline, double-checking has a fulcrum: instead of inspecting a pre-drawn path, you are inspecting the difference between your own coordinates and a new input.
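As a concrete sketch of that ordering (nothing from the paper: the model name, question, and use of the `openai` Python client are assumptions for illustration), the following script refuses to show the model's answer until you have committed your own:

```python
# cross_check.py - a minimal sketch: commit your own answer first,
# then fetch the model's answer and compare the two side by side.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model name and default question are placeholders.
from openai import OpenAI


def cross_check(question: str, model: str = "gpt-4o-mini") -> None:
    # Step 1: establish an independent reference point BEFORE any AI output.
    print(question)
    my_answer = input("Your answer (committed before seeing the model's): ")
    my_doubts = input("Which part are you least sure about? ")

    # Step 2: only now ask the model the same question.
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    ai_answer = response.choices[0].message.content

    # Step 3: the comparison is the judgment step. Agreement is weak
    # evidence; disagreement tells you exactly where to dig.
    print(f"\nYours: {my_answer}")
    print(f"Model: {ai_answer}")
    print(f"Check first: {my_doubts}")


if __name__ == "__main__":
    cross_check(
        "A bat and a ball cost $1.10 in total. "
        "The bat costs $1.00 more than the ball. "
        "How much does the ball cost?"
    )
```

The tooling is incidental; what matters is that your answer and your doubts are on record before the model's fluent text can set the entry point.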
A workplace analogy sharpens the distinction. When a good manager reviews a subordinate’s analysis, they typically do not recalculate every number. But if they open the report with no expectations, no prior position, and no risk assessment, read through it, find it coherent, and sign off, that is not a review. That is a ritual. A good manager does something completely different: before opening the report, they already have expectations about the conclusions, know what results would surprise them, know which assumptions are fragile, and have a few questions in mind. Then they sample, probe, and cross-validate against those expectations. This is independent judgment preparation, not post-hoc reading.
Putting independence first does not mean redoing everything. It does not mean going fully manual before consulting AI. It means that by the time you encounter the AI's output, you already have a perspective that was not produced by that output. With this perspective, double-checking has a fulcrum. Without it, double-checking is just inspecting a path someone else drew.
Back to the opening scene. The answer is not to use AI less, and not to check more aggressively, because checking itself may have already degraded. A more productive starting point is to ask yourself a different question:
Am I retracing the path AI has drawn, or am I reading my own map against it?
AI should lower every cost of arriving at the scene of judgment: search, organization, formatting, retrieval, the first draft. But the scene of judgment itself, the step where you decide what to agree with, what to question, and what direction to take, needs to stay with the person. Draw your own map first. Then compare it to the one AI drew. The boundary between double-checking and cross-checking is right here.