You know the feeling. A colleague sends you a summary of a paper they were supposed to read for the meeting. You glance at it and something is off. Not wrong, exactly — the terminology is right, the methods are described accurately enough, the conclusion is there. But you can tell, somehow immediately, that they didn't really read it.
What tipped you off? The summary is technically correct. What it lacks is harder to name: a sense of what the paper was pushing against. What the authors chose not to say. Why this result, and not the dozens of similar results published the same month. The summary has the content of reading without the activity of reading.
Now think about the last time you asked an AI to summarize a research paper for you.
I want to explore that feeling — because I think it is diagnostic, not incidental. This is the third and final post in a series on what happens when you use an LLM to read research for you. The first post dissected the mechanisms. The second weighed the competing evidence. This one takes a different route: I want to map what the model does onto experiences you already have, because the most useful understanding of AI reading comes not from technical analysis but from recognition.
Three analogies. Each illuminating. Each, ultimately, wrong in ways that matter.
Every teacher knows this student. They can repeat back the definition. They can summarize the chapter. If you ask them what the author's main argument was, they'll give you something plausible. But the moment you push — "Okay, but how would you apply that to this case?" — the facade collapses.
This is not stupidity. It's a specific and recognizable failure mode: the student absorbed the surface representation without building the operational understanding that would let them use what they learned. They memorized the map but never visited the territory.
I think the parallel with LLMs is uncomfortably precise.
A study titled Potemkin Understanding demonstrated that language models can correctly explain concepts, fail to apply them in practice, and — most remarkably — recognize that they failed. The model knows the right answer in the abstract. It produces the wrong answer in context. Then, when shown its own failure, it can articulate why it went wrong. This is exactly the student who can tell you what the textbook says the answer should be while getting the problem wrong.
The gap between knowing and doing turns out to be quantifiable. Research on LLMs as decision-making agents — LLMs are Greedy Agents — found that models produce correct rationales 87% of the time but take the correct action only 64% of the time. They can tell you what to do. They frequently don't do it. The student who aces the study guide and fails the exam.
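To make the comparison concrete, here is a minimal sketch of what a "knowing-doing gap" measurement amounts to. The episode data below is made up for illustration, not taken from the study: you score each episode twice, once for whether the model's stated rationale names the correct action, and once for whether it actually takes it.

```python
# Hypothetical episodes: (rationale names correct action, action taken is correct).
# The data is invented for illustration; the study's real numbers were 87% vs 64%.
episodes = [
    (True, True), (True, False), (True, True), (True, True),
    (True, False), (False, False), (True, True), (True, False),
]

# "Knowing": how often the rationale is right.
rationale_acc = sum(r for r, _ in episodes) / len(episodes)
# "Doing": how often the action is right.
action_acc = sum(a for _, a in episodes) / len(episodes)
gap = rationale_acc - action_acc

print(f"rationale accuracy: {rationale_acc:.2f}")  # knowing
print(f"action accuracy:    {action_acc:.2f}")     # doing
print(f"knowing-doing gap:  {gap:.2f}")
```

The gap is not an exotic metric; it is simply two accuracies computed over the same episodes, which is what makes its persistence so striking.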
What makes this more than a cute analogy is the mechanistic explanation. Comprehension Without Competence showed that "instruction and action pathways are geometrically and functionally dissociated" inside the model. The part that understands what should happen and the part that decides what to do are, in a meaningful sense, different systems. This isn't a student who could try harder. It's a student whose ability to describe chemistry and ability to perform chemistry are implemented in separate, poorly connected regions of their brain.
The Explain-Query-Test framework made this visible in the starkest possible way: models fail to answer questions derived from their own explanations. They produce an explanation. You ask a question that the explanation logically entails. They get it wrong. The student didn't just fail to do the reading — they failed to read their own notes.
So when you ask an LLM to read a paper and extract key points, the student analogy says: the model can give you the summary, but it hasn't built the understanding that would let it evaluate whether the summary captures what matters. It got the content without the comprehension.
Here's a different analogy, less flattering but perhaps more mechanistically honest.
Forget the student. Think instead about a very sophisticated keyword extraction system. Not ctrl+F — something much more capable, with deep statistical knowledge of which terms tend to co-occur, which phrases signal important claims, which sentence positions typically contain conclusions. It doesn't read the paper. It identifies the parts of the paper that pattern-match to "important."
This analogy gets support from a surprising place: the model's own architecture. Research on retrieval heads — the internal attention mechanisms that handle long-context factuality — has shown that fewer than 5% of the model's attention heads are responsible for retrieving specific information from the context. The model doesn't read the whole paper in anything like the way you do. A tiny fraction of its computational machinery locates specific tokens, and the rest generates fluent text around them.
The hierarchy of what matters reinforces this. Functional Importance of Reasoning Tokens demonstrated that when models reason through a problem, they preserve symbolic content (numbers, variable names, logical operators) while pruning linguistic scaffolding (connecting phrases, contextual explanations). The model's internal priority system — what it treats as load-bearing and what it treats as disposable — is not a reader's priority system. A reader would consider the author's hedging language and contextual framing essential. The model treats them as noise to be discarded.
Perhaps most telling: research on graph-structured reasoning has shown that shuffling the topology of a graph — scrambling the connections between nodes — has minimal effect on model performance. The model processes the input with a sequential, U-shaped attention pattern regardless of its actual structure. It's not following the argument's structure. It's applying its own fixed processing pattern to whatever it receives.
This is keyword extraction in the deepest sense: the model identifies and extracts the tokens that its statistical training says are important, arranges them in a fluent order, and returns the result. It did not follow the author's argument. It pattern-matched against billions of prior examples of what summaries look like.
When you ask an LLM to read a paper, the keyword extraction analogy says: you're getting a statistically informed selection of the paper's most salient tokens, arranged by a system that knows what summaries should look like. You're not getting a reading.
The third analogy is the most generous and perhaps the most dangerous.
You probably know someone who can skim a paper in three minutes and give you a decent overview. They will get the gist — main finding, roughly what was measured, what the conclusion says. They will miss the carefully worded hedge in the discussion, the devastating footnote on page 12, the reason this paper matters to your specific question. But they give you something useful, and fast.
This is what most people assume the model is doing. And that assumption is exactly the problem.
A study titled Large Language Models Do Not Simulate Human Psychology tested how models respond when moral reasoning scenarios are minimally reworded. Changing "wrongfully" to "rightfully" in a moral scenario — a change that any human reader, even a fast one, would immediately register as reversing the entire moral valence — produced negligible changes in model output. The correlation between model judgments on the original and reworded versions was r = 0.99 (near identical), while humans showed r = 0.54 (appropriately different). The model is not skimming. A skimmer would catch a word that reverses the meaning of a sentence. The model is doing something else entirely — something that looks like reading from the outside but that doesn't track meaning the way even the most cursory human reading does.
The ambiguity evidence is equally stark. We're Afraid Language Models Aren't Modeling Ambiguity found that models achieve only 32% accuracy on disambiguating ambiguous sentences, compared to 90% for humans. But more importantly, models don't recognize ambiguity as a property of the text. They resolve it — silently, confidently, and often incorrectly. A speed reader might miss a subtle ambiguity. But they would recognize a blatant one. They would pause at a sentence that could mean two different things. The model doesn't pause. It picks one interpretation and proceeds with the serene confidence of a system that cannot experience uncertainty about meaning.
When you ask an LLM to read a paper, the speed reader analogy says: you're getting the gist, fast, with some nuance lost. The reality is worse. You're getting output from a system that can fail to notice when a single word reverses the meaning of an entire passage — a failure no human speed reader would make.
Each analogy above is wrong. Not just imprecise — wrong in a way that reveals something genuinely novel about what language models are doing. The student, the keyword extractor, the speed reader: each gives you a handhold for understanding. But the handholds are attached to the wrong wall.
Consider this finding: Beyond Semantics showed that corrupted reasoning traces — chains of thought with errors introduced — sometimes generalize better to out-of-distribution problems than correct traces do.
The model performs better on new types of problems when it studied from wrong notes.
No human analogy survives this. Think about it: the student who reads a garbled textbook does not develop superior problem-solving skills. The keyword extractor fed corrupted input does not produce better extractions. The speed reader who skims an error-filled draft does not achieve deeper understanding. There is no human experience of reading where errors in your source material improve your ability to handle novel situations. This is a property of a system that is not reading — it is doing something for which we do not have a word.
Then there is the format effect. The CoT Encyclopedia systematically tested how different formats of reasoning traces affect model performance across domains. The finding: the format of the reasoning trace has an effect 7.5 times stronger than the domain it's applied to.
Think about what this means for the reading analogy. Imagine if the way you took notes — bullet points versus prose versus diagrams — mattered 7.5 times more than whether you were studying biology or economics. Imagine if your approach to reading a paper on quantum computing transferred almost perfectly to reading a paper on Renaissance art, but switching from bullet-point notes to paragraph notes collapsed your comprehension entirely. No human reader works this way. The content of what we read dominates our understanding of it. For the model, the format dominates the content by nearly an order of magnitude.
And finally, the deepest break. Grosz and Sidner's foundational discourse theory — Attention, Intentions, and the Structure of Discourse — identifies three components that any reader brings to a text: the linguistic structure (the words and sentences), the intentional structure (what the author and reader are each trying to do), and the attentional structure (what should be focal at each point in the discourse).
Language models approximate the linguistic structure well. They partially handle the intentional structure — they can infer what the author was trying to argue, at least at a surface level. But they have no attentional structure. They cannot answer the question "What should I be paying attention to right now?" because they have no mechanism for deciding that anything is more worth attending to than anything else — not based on the discourse, not based on purpose, not based on the evolving state of understanding.
This is not a gap that maps onto human reading failure. Even the worst human reader — the student who didn't do the reading, the skimmer, the person checking their phone every thirty seconds — has an attentional frame. They know, however poorly, that some parts of the text are more relevant to their purpose than others. The model does not know this. It processes every token with whatever attention pattern its architecture dictates, regardless of what the reader (you) actually needs from the text.
The absence of an attentional frame is not a degree of failure. It is a category difference. And it is the reason every human analogy ultimately misleads.
If every analogy breaks, we need a different approach. Not "what does this remind us of?" but "what frames are available, and which is least likely to lead us astray?"
Frame A: Sophisticated search. The model is not reading; it's retrieving. It identifies tokens in the input that match patterns it learned during training, and it generates fluent text that looks like what a reading would produce. This frame is supported by the retrieval heads evidence — the fact that a tiny fraction of the model's attention heads do the actual work of locating information. Under this frame, asking the model to "read a paper" is like asking a search engine to "understand a query." The search engine doesn't understand. It matches patterns. Extremely well, in ways that are useful. But it's search.
Frame B: A competent but uninterested reader. The model has the capability to read well — it just doesn't engage. It could understand the nuance if it tried; it just doesn't try. This frame is supported by the knowing-doing gap: 87% correct rationales, 64% correct actions. The competence is there. The application isn't. Under this frame, the solution is to make the model "try harder" — better prompts, more careful instructions, chain-of-thought reasoning that forces engagement.
Frame C: A new cognitive activity with no human equivalent. The model is doing something that shares surface features with reading but is mechanistically and functionally different in ways that no amount of prompting can resolve. This frame is supported by the corrupted-traces finding (no human equivalent), the format-over-domain effect (no human equivalent), and the missing attentional layer (a category gap, not a degree gap).
I want to argue that Frame B is the most dangerous, because it is the most intuitive and the most wrong.
If you believe the model is a competent but uninterested reader, you believe the gap can be closed by effort — better prompts, more explicit instructions, reasoning scaffolds that force the model to engage. And indeed, these interventions help. Chain-of-thought prompting improves performance. Explicit instructions to "consider counterarguments" produce outputs that mention counterarguments. The model responds to pressure in ways that look exactly like an uninterested reader starting to pay attention.
But the underlying mechanics haven't changed. The model still processes tokens without an attentional frame. It still weights format 7.5 times more than domain. It still fails to notice when a single word reverses the meaning of a sentence. What better prompting does is change the statistical distribution of the output — it makes the model more likely to produce text that resembles careful reading. It does not make the model read carefully. The distinction matters because it determines whether you should trust the output more when you've prompted well (Frame B says yes) or trust it the same amount while finding it more useful (Frame C says yes).
Frame A — sophisticated search — is the safest frame, but it undersells what the model does. The model is producing novel combinations, making connections across passages, generating summaries that contain genuine synthesis. Search doesn't do this. Frame A is conservative enough to prevent misplaced trust, but it can't explain why the model's output is often genuinely useful in ways that search results are not.
I believe Frame C is the most honest. What the model does when it processes a research paper is a new kind of cognitive activity. It shares enough features with reading to be useful — it extracts claims, identifies methods, summarizes findings. But it differs from reading in ways that are not bugs to be fixed: the missing attentional frame, the dominance of format over content, the alien relationship between understanding and performance. These are architectural properties, not failure modes. They won't be resolved by scaling, or by better prompting, or by longer context windows.
You asked the model to read a paper. It gave you something back. The question is not whether it is useful — it often is. The question is what kind of trust it deserves.
Not the trust you would give a colleague who read the paper. Your colleague has an attentional frame shaped by the same professional context you share. They know what matters to you because they know what matters in the field.
Not the trust you would give a search result. The model is doing more than matching — it generates, connects, synthesizes. Its output contains value that pattern-matching alone could not produce.
Something in between. Something we do not have good language for yet.
Trust it the way you would trust a translation from a language you do not speak, done by a translator fluent in grammar who has never visited the country. The words will be right. The sentences will be correct. The meaning — the lived, contextual, embedded-in-a-community-of-practice meaning — might not survive the translation. Not because the translator made an error, but because meaning of that kind was never in the text to begin with. It was in the reader. And the model is not a reader.
When you ask an LLM to read a paper for you, you are using a tool that is genuinely good at extracting linguistic content from text. You are not using a tool that reads. The difference matters most exactly when it's hardest to see — when the output looks so much like what a reader would produce that you forget no reading happened.
The colleague who didn't read the paper? You could tell. The AI that didn't read the paper produces output so fluent, so well-structured, so plausibly insightful that the absence of reading becomes invisible. That invisibility is not a feature. It's the thing you need to watch for.
Use the tool. It's genuinely useful. But know what it is: not a reader, not a search engine, but something new — something that approximates the output of reading without performing the activity of reading. And let that knowledge inform the weight you give to what it tells you.
You'll know you've calibrated correctly when you stop asking "Did the AI understand this paper?" and start asking "What did the AI's processing of this paper give me, and what do I still need to get for myself?"
That second question is harder. It's also the right one.