AI Products & Platforms · Community & Cognition

An AI Tutor for Every Student — Is That Really How AI Lands in Education?

A Silicon Valley engineer notices his eighth-grader is falling behind in math. He opens Khanmigo and has the kid do half an hour of AI tutoring every day: explanations at the kid’s pace, problems based on the kid’s actual mistakes, infinite patience, always available. He thinks this solves the core problem of education. If every kid had an AI tutor like this, we’d be done.

This isn’t just one person’s intuition. Sal Khan’s 2023 TED talk explicitly argues that AI can solve a famous old problem in education. Khan Academy launched Khanmigo around the same time, and within a year its user base grew from 68,000 to more than 700,000 students. Texas’s Alpha School compresses daily academic time to two hours, leaving the rest for projects and interests. Across ed-tech, the dominant narrative of the past two years has been: AI finally lets every student have a one-on-one tutor; the holy grail of education is within reach.

That narrative is compelling because it’s backed by decades of academic research. Below, I’ll introduce the three most important theories behind it. They’ve been cited tens of thousands of times and are required reading in teacher education worldwide. The research findings themselves are solid and haven’t been overturned. But the way people extrapolate from these findings to “therefore AI should do personalized tutoring” may rest on a common misreading. I’ll come back to that. First, the theories themselves.

Three Core Theories Behind the Personalization Narrative

First is Bloom’s Two Sigma. In 1984, Benjamin Bloom published a widely cited study reporting that students who received one-on-one tutoring combined with mastery learning (where a student only advances after mastering the current concept) scored two standard deviations higher than students in traditional classrooms. What does two standard deviations mean? It means the median student in a regular class would, under one-on-one tutoring, land at roughly the 98th percentile. Bloom called this the Two Sigma Problem: one-on-one is this effective, so why can’t mass education deliver it?
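A quick sanity check of that arithmetic, under the simplifying assumption that scores are normally distributed:

```python
# If the median student gains two standard deviations, where do they land?
# Assumes normally distributed scores -- a simplification for illustration.
from statistics import NormalDist

percentile = NormalDist().cdf(2)      # P(Z <= 2) for a standard normal
print(f"{percentile:.1%}")            # 97.7% -> roughly the 98th percentile
print(f"top {1 - percentile:.1%}")    # top 2.3%, i.e. roughly "the top 2 percent"
```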

Second is Vygotsky’s Zone of Proximal Development. In the 1930s, Lev Vygotsky introduced the concept of the Zone of Proximal Development (ZPD): the range between what a student can do alone and what they can do with help. Teaching should happen in that zone. Too easy and students don’t learn; too hard and they give up. Real learning only happens when instruction targets the ZPD. The associated practice is scaffolding, the temporary support teachers provide that’s gradually withdrawn as the student grows. The problem: a class of 40 students has 40 different ZPDs, and a teacher standing at the front can only aim at the median.

Third is Black and Wiliam’s formative assessment. In 1998, Paul Black and Dylan Wiliam published the review Inside the Black Box, synthesizing 250 studies. Their conclusion: systematic formative assessment (diagnostic feedback given during teaching, as opposed to summative end-of-unit tests) produces an effect size of 0.4 to 0.7. That’s large by education-research standards. But good formative assessment requires feedback that is timely, individualized, and diagnostic. In a 40-student class, feedback on homework typically lags by one to three days, and no more than five students per lesson get immediate verbal feedback from the teacher.
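For readers who don’t live in education research: the effect sizes quoted here (and throughout this piece) are Cohen’s d, the difference between group means in units of pooled standard deviation. A minimal sketch with toy numbers, nothing from the studies themselves:

```python
from statistics import mean, stdev

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Effect size: difference of means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    var1, var2 = stdev(treatment) ** 2, stdev(control) ** 2
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Toy scores only, to show the scale of the metric: d comes out to about 0.63.
print(cohens_d([75, 80, 85, 90, 70], [70, 75, 80, 85, 65]))
```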

The common thread across these three theories is easy to see: all three assume teaching has to target individual students to be effective. One-on-one tutoring, per-student ZPDs, individualized feedback. Together they built the core narrative of ed-tech over the past few decades: good teaching equals personalized teaching, and technology’s mission is to make personalization scale. AI is the latest vehicle for that narrative.

The logic looks airtight: theory says individualization works; in practice the 40:1 ratio blocks it; AI lifts the block; problem solved.

The experimental results don’t match.

What the Experiments Actually Show

Over the past two years, a body of serious empirical research has accumulated, and the results surprised most observers.

Stanford SCALE’s 2026 evidence review on AI in K-12 cites a controlled experiment by Bastani and colleagues in Turkey. High schoolers reviewed math in three groups: one used a general-purpose chatbot (could ask anything and get full answers), one used a tutoring-specific AI (hints only, no answers), and one used a traditional textbook. During the AI-assisted practice phase, the general chatbot group performed best, which is unsurprising since the AI was doing the problems for them. But on the closed-book exam, the general chatbot group scored noticeably worse than the textbook group. The tutoring-specific AI group scored the same as the textbook group. Same as, not better than.

Khanmigo is currently the closest thing to the “AI one-on-one tutor” ideal: built on GPT-4, deliberately designed as a Socratic tutor (no answers, only questions that lead students to think), and backed by Khan Academy’s brand and content library. By 2025 it had been deployed in over 350 school districts. But after large-scale deployment, cracks appeared in the feedback. Dallas ISD canceled its contract. The University of Iowa’s teaching and learning office formally recommended against rollout after its pilot. Michigan Virtual’s pilot covered 1,102 students, found usage below once per week, and eventually pivoted to teacher-facing tools.

MIT Media Lab’s 2025 study Your Brain on ChatGPT observed “cognitive debt” accumulating among students who relied on AI writing support over time. A controlled experiment covered by The Hechinger Report found that students with AI assistance produced higher-quality essays but showed no improvement in topic understanding, writing motivation, or subsequent writing ability. The group with no AI and only a checklist reported enjoying writing the most.

Alpha School sounds like the most aggressive success case: students spend only two hours a day on academics and rank in the top 1 percent nationally on standardized tests. But Scott Alexander, after visiting in person, pointed out that Alpha’s core isn’t large-language-model conversation; it’s a beefed-up spaced-repetition platform with mastery-learning checkpoints. The AI’s contribution is in pacing control, not personalized conversation.

Put together, these results point to a conclusion that runs against mainstream expectation: when AI personalized tutoring is done well, the effect is at best on par with traditional methods, not better. When it’s done poorly, students actually regress. That’s in sharp contradiction with the ed-tech industry narrative that “AI solves the holy grail of education.”

The contradiction forces us to stop and ask a more fundamental question: if the effect of personalized attention is being overstated, what’s actually the decisive variable for educational effectiveness?

The Independent Contribution of Personalized Attention Might Be Overstated

Return to the three core theories. Their research findings are solid; decades of empirical work hasn’t overturned them. What has gone wrong is how those findings get interpreted.

Start with Bloom’s Two Sigma. Education Next revisited the claim in 2023 under the title “Two-Sigma Tutoring: Separating Science Fiction from Science Fact.” The core finding of the reassessment: Bloom attributed the two-standard-deviation gain to “one-on-one tutoring,” but the primary mechanism at work is actually mastery learning and formative feedback, not the human connection or individualized attention of the one-on-one setup. In other words, the decisive variable is the “master it before moving on” pacing control, not the fact that someone is paying attention to you specifically. Cognitive tutoring software like ASSISTments and Carnegie Learning’s MATHia partially reproduces Bloom’s effect primarily through pacing control, not by simulating a human relationship.
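What “pacing control” means in software is simple enough to sketch. Here is an illustrative mastery gate in the spirit of mastery learning, not the actual logic of ASSISTments or MATHia; the threshold and names are made up:

```python
MASTERY_THRESHOLD = 0.8  # illustrative: fraction of recent items answered correctly

def next_activity(concepts: list[str], index: int, recent_accuracy: float):
    """Advance only after demonstrating mastery of the current concept."""
    if recent_accuracy < MASTERY_THRESHOLD:
        return ("practice_more", concepts[index])  # don't move on yet
    if index + 1 < len(concepts):
        return ("advance", concepts[index + 1])
    return ("done", None)

print(next_activity(["fractions", "ratios", "percentages"], 0, 0.65))
# ('practice_more', 'fractions') -- the gate, not the conversation, does the work
```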

Next, Hattie’s Visible Learning meta-analysis. John Hattie synthesized over 1,200 meta-analyses covering 195 influences, the largest effect-size ranking in education research. Hidden in there is a number most AI-education discussions ignore: the effect size of class size is only 0.21. What does 0.21 mean? Hattie’s own threshold for a “meaningful” influence is 0.4. Shrinking a class from 40 to 20 students, which gives the teacher far more capacity to attend to individuals, produces gains well below that bar. Meanwhile teaching methods themselves (Direct Instruction at 0.59, Cooperative Learning at 0.4–0.6, Feedback at 0.7+) matter far more than class size.

If “individualized attention” were the decisive variable, shrinking the class should have a large effect. The data says otherwise.

Third piece of evidence: Shanghai and Japan. Shanghai ranked first globally in math, reading, and science on the 2009 and 2012 PISA tests. Shanghai’s class sizes are typically 40–48 students, with higher student-teacher ratios than most OECD countries. Shanghai’s PISA performance isn’t the result of more individualized attention; on the contrary, the core mechanism of Shanghai instruction is collective lesson preparation and lesson research. Every school has subject-based teaching research groups that meet weekly for fixed hours of collective preparation. A team of teachers spends weeks polishing a single lesson plan, taking turns teaching it, observing one another, discussing collectively, and iterating. Japan’s Lesson Study (jugyō kenkyū) has over a century of tradition; its core methodology, kyōzai kenkyū (study of teaching materials), is the systematic investigation of how each specific topic should be taught, where students typically get stuck, and how instruction should adjust.

These three lines of evidence converge on a single judgment: the variable that actually decides educational effectiveness may not be whether each student receives individual attention, but rather how well the lesson itself is designed. The design quality of a good lesson acts on 40 students simultaneously; you can get most of the effect without individualization.

Collective Instruction: An Underrated Research Tradition

Once attention shifts from “individualized attention” to “lesson design quality,” it becomes clear that education research has an entire tradition dedicated to exactly this. The tradition has long been overshadowed by the personalization narrative, but its evidence base is anything but weak.

The strongest empirical evidence comes from the largest instructional comparison experiment in American education history: Project Follow Through. The project ran from 1967 to 1977, covered over 70,000 students from low-income families, and compared nine different teaching methods. Among them, Direct Instruction (DI, designed by Siegfried Engelmann) is a highly structured collective teaching method: teachers deliver carefully designed scripts at a fast pace, with dense questioning, immediate error correction, and whole-class synchronized response. This sounds like the opposite of student-centered individualized education. Yet across three evaluation dimensions (basic skills, cognitive skills, affective development), Direct Instruction ranked first among all nine methods. Even on affective development, the dimension where open education was widely expected to outperform because it respects student autonomy, the data showed it didn’t beat DI.

The result has generated decades of debate. Criticism has centered on methodology, sample selection, and political stance (DI is viewed as too “traditional” for the philosophical leanings of progressive education). The data itself hasn’t been overturned. A reasonable interpretation is that DI wins not because “teacher authoritarianism is better,” but because its lessons are designed with extreme care. Each step of the explanation, each question’s timing, each possible student error and its corresponding correction, is all worked out in advance in the lesson plan. This matches the logic of Shanghai teaching research groups collectively polishing a lesson.

Barak Rosenshine’s 2012 Principles of Instruction does something similar from another angle. He synthesized three research bodies (cognitive science, effective-teacher research, cognitive-tutoring studies) into ten empirical principles. Examples: begin each day by reviewing the previous lesson; present new material in small steps; check understanding across the whole class frequently; require 80 percent accuracy before moving to independent practice; systematically review weekly and monthly. These ten principles assume a one-to-many teaching scenario. In 2019, UK educator Tom Sherrington turned the ten into Rosenshine’s Principles in Action, which has since become standard content in UK primary teacher training.

Johnson and Johnson’s research on Cooperative Learning, starting in the 1970s, also belongs to the collective tradition. In Hattie’s meta-analysis, cooperative learning has an effect size between 0.4 and 0.6, essentially on par with one-on-one tutoring. Cooperative learning’s key isn’t “let students discuss in groups” (which frequently degenerates into a few students working while others free-ride), but strict design elements: positive interdependence (where group members sink or swim together) and individual accountability (where individual contributions are identifiable). Robert Slavin’s Success for All program deployed this approach at scale in hundreds of low-income American schools.

Laid out together, a counterintuitive judgment starts to form. In Hattie’s data, Direct Instruction has an effect size of 0.59 and one-on-one tutoring 0.58–0.60: essentially identical. Class size has an effect size of 0.21, far below teaching method. The impact of teaching method is much larger than whether each student is individually attended to. And collective teaching methods have adoption rates an order of magnitude higher than individualized ones (collective teaching is the default form in all public schools, while individualized methods see extremely low real-world implementation; differentiated instruction, for instance, is actually implemented by only about 33 percent of teachers). When you multiply effect size by adoption rate, the total effect of collective teaching likely far exceeds that of individualized teaching.
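The multiplication is crude, but making it explicit shows the shape of the argument. Treating adoption rate as a population weight is my simplification, not a standard education-research metric:

```python
# Back-of-envelope with the figures quoted above. Adoption-as-weight is a
# rough illustration, not a standard education-research calculation.
collective = 0.59 * 1.00       # Direct Instruction x default-mode adoption
individualized = 0.60 * 0.33   # tutoring-level effect x ~33% real implementation
print(collective, round(individualized, 2))  # 0.59 vs 0.2
```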

The core logic of this tradition can be captured in one sentence: instead of asking one teacher to simultaneously attend to 40 different students (a near-impossible task), have a group of teachers design one lesson well enough that it works on all 40 different students. The bottleneck of the former is the teacher’s cognitive bandwidth; the bottleneck of the latter is the design quality of the lesson. For decades, ed-tech has been trying to break through the former, when the latter may be the larger lever.

The Lever for AI May Not Be on the Student Side

If lesson design quality matters more than individualized attention for educational effectiveness, where AI fits into education needs rethinking.

The core narrative of ed-tech over the past three decades has been “use technology to deliver student-side personalization.” Computer-assisted instruction in the 1990s gave students different practice problems. Adaptive learning platforms in the 2010s gave students different learning paths. AI tutors in the 2020s give students different explanations and dialogues. Each generation of technology pushed harder on the student side, assuming “if I can attend more precisely to each student, education gets better.” The analysis above suggests that assumption’s independent contribution may be overstated.

If the focus of AI shifts from the student side to the lesson design side, what becomes possible?

First, AI can accelerate lesson iteration for teaching research groups. When a Shanghai teaching research group spends weeks polishing one lesson, the main cost is information gathering and comparison analysis: where did students get stuck after this version of the lesson? Would a different questioning sequence work better? What methods have other schools used to teach the same topic, and what were the results? These questions are currently answered by human labor, and the number of lessons a group can polish in one semester is limited. AI can dramatically compress the information-gathering stage: analyze recorded-lesson data to pinpoint exactly where students got stuck, compare the effectiveness of different questioning strategies across similar classes, extract the most effective teaching sequence for specific topics from large case libraries. Japan’s Lesson Study process of kyōzai kenkyū currently runs entirely on human labor; AI could compress months of research to days.
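The simplest version of that information-gathering step is easy to sketch: aggregate per-segment outcomes across recorded lessons and surface where students stall. The data shape and names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical log records from recorded lessons: (segment, student_id, completed_ok)
logs = [
    ("intro", "s1", True), ("intro", "s2", True),
    ("worked_example", "s1", True), ("worked_example", "s2", False),
    ("completing_the_square", "s1", False), ("completing_the_square", "s2", False),
]

def stuck_points(logs, threshold=0.4):
    """Return the lesson segments whose failure rate exceeds the threshold."""
    totals, failures = defaultdict(int), defaultdict(int)
    for segment, _student, ok in logs:
        totals[segment] += 1
        failures[segment] += (not ok)
    return {seg: failures[seg] / totals[seg]
            for seg in totals if failures[seg] / totals[seg] > threshold}

print(stuck_points(logs))  # {'worked_example': 0.5, 'completing_the_square': 1.0}
```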

Second, AI can give teachers real-time classroom quality-control signals. This isn’t about giving individualized feedback to students; it’s about giving teachers aggregate-state signals about the whole class. Pierre Dillenbourg and Luis Prieto at EPFL have spent fifteen years studying exactly this problem under the name orchestration load: while teaching, a teacher has to simultaneously explain, observe, and decide, and cognitive bandwidth is already saturated. Rosenshine’s third principle says “check whole-class understanding frequently,” but in a 40-student class, teachers can only judge by occasional hand-raising and questioning. AI can monitor the whole class’s exercise progress and error patterns in the background, and at the right moment give the teacher a class-level signal: “30 percent of the class didn’t follow that step; now’s a good time to pause and reteach.” This signal goes to the teacher, not to students. It makes the hardest-to-execute items on Rosenshine’s list (frequently check whole-class understanding, ensure 80 percent correctness before advancing) actually feasible.
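A minimal sketch of what such a signal could look like: a threshold over live aggregate results, surfaced to the teacher only when it crosses the line. The 30 percent trigger and all names are illustrative:

```python
def classroom_signal(current_step_results: dict[str, bool],
                     class_size: int,
                     trigger: float = 0.30) -> str | None:
    """Aggregate signal for the teacher, not feedback for any student.

    current_step_results maps student_id -> whether they completed the
    current exercise step correctly.
    """
    struggling = sum(1 for ok in current_step_results.values() if not ok)
    if struggling / class_size >= trigger:
        return (f"{struggling}/{class_size} students missed this step -- "
                f"consider pausing to reteach before moving on")
    return None  # stay silent: don't add to the teacher's orchestration load

results = {f"s{i}": (i % 3 != 0) for i in range(40)}  # toy data: every 3rd student errs
print(classroom_signal(results, class_size=40))
```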

Third, AI can lower the threshold for new teachers to reach master-teacher quality. One reason master teachers teach well is that they’ve accumulated large amounts of trial-and-error experience on the same topics. A master teacher knows the three most common student errors at the “completing the square” step of quadratic equations, knows how to follow up on the first type of error, knows which prerequisite concept to revisit for the second type. This experience currently transfers through master-teacher studios (shadowing, teaching competitions, mentor-apprentice relationships), which have extremely low bandwidth. A master teacher can mentor perhaps a few dozen apprentices in a lifetime. If AI can extract experiential patterns from large volumes of recorded lessons and learning data, such as “after the third step of explanation at this topic, if more than 20 percent of students show error type X, the most effective next step is a two-minute revisit of prerequisite concept Y,” new teachers can receive this kind of judgment assistance in real time, without waiting ten years to accumulate similar intuition.
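One plausible representation of that extracted experience is a rule table a new teacher’s dashboard matches against in real time. Everything here (the rule shape, thresholds, topic names) is a hypothetical illustration of the pattern, not a real system:

```python
from dataclasses import dataclass

@dataclass
class TeachingRule:
    topic: str
    step: int
    error_type: str
    min_error_rate: float  # fire only when at least this fraction of the class errs
    remediation: str       # the follow-up that worked for experienced teachers

# Hypothetical rules of the kind that could be mined from recorded lessons.
RULES = [
    TeachingRule("quadratic_equations", 3, "sign_error_completing_square",
                 0.20, "Two-minute revisit: adding the same constant to both sides"),
    TeachingRule("quadratic_equations", 3, "halving_coefficient_wrong",
                 0.20, "Re-derive (b/2)^2 with one concrete numeric example"),
]

def suggest(topic: str, step: int, error_type: str, error_rate: float):
    """Match the live classroom state against mined rules."""
    for rule in RULES:
        same_situation = (rule.topic, rule.step, rule.error_type) == (topic, step, error_type)
        if same_situation and error_rate >= rule.min_error_rate:
            return rule.remediation
    return None

print(suggest("quadratic_equations", 3, "sign_error_completing_square", 0.25))
```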

Fourth, AI can serve as a supplementary layer for personalization, filling specific gaps. The analysis above isn’t saying student-side personalization has no value. It’s saying its effect as the main act is overstated. Within a framework where collective teaching is the main activity, AI can do something very specific: during the practice phase after the teacher has taught the core, AI can give students who fell behind a revisit of prerequisite concepts, and give students ahead an extension problem. This kind of personalization serves as safety net and extension, not replacement of the classroom’s main activity. Bastani’s Turkey experiment provides a critical design constraint: even for student-side supplementation, the AI must be tutoring-style (hints only, no answers, forced retry); general-purpose conversational AI actively makes students worse.
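In product terms, Bastani’s constraint is a policy layer around the model. A minimal sketch, where `ask_model` stands in for whatever LLM call a product makes and the leak check is deliberately naive:

```python
TUTOR_POLICY = (
    "You are a math tutor. Never state the final answer. "
    "Give one hint at a time, then ask the student to retry."
)

def looks_like_full_answer(reply: str, known_answer: str) -> bool:
    # Deliberately naive leak check; a real product needs something sturdier.
    return known_answer in reply

def tutoring_turn(ask_model, student_message: str, known_answer: str) -> str:
    """One guarded turn: hints only, no answers, forced retry."""
    reply = ask_model(system=TUTOR_POLICY, user=student_message)
    if looks_like_full_answer(reply, known_answer):
        # Swallow the leak and push the student back to the problem.
        return "Close -- try the next step yourself. What do you get?"
    return reply

# ask_model is a hypothetical callable, e.g.
#   def ask_model(system: str, user: str) -> str: ...
```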

These four shifts share a direction: the lever of AI moves from the student side to the lesson design side and the teacher side. Instead of giving every student an AI tutor, help teachers teach better lessons, help teaching research groups design better lessons, help new teachers reach master-teacher quality faster. Student-side personalization still exists, but in a supporting role.

Boundaries of this direction need to be acknowledged honestly. Improving lesson design quality depends on sufficiently rich classroom data (recorded lessons, learning data, questioning records), and the structuring and accumulation of that data takes time. A harder constraint is the evaluation system: if the college entrance exam (in China or equivalents elsewhere) continues to evaluate by breadth of knowledge coverage, the time AI saves teachers will likely flow back into more drilling and more marginal-topic coverage, not into deeper instructional design. Institutions are a harder constraint than technology.

Back to That Silicon Valley Engineer

The engineer who opens Khanmigo for his kid isn’t wrong to have that intuition. He’s just not pulling the largest lever.

His kid attends six lessons a day at school, and the quality of each of those lessons affects learning outcomes far more than half an hour of after-school AI tutoring. If those six lessons themselves were carefully designed by a team of teachers, repeatedly iterated, and validated by data, the kid’s learning efficiency in class would likely beat any form of after-school remediation. Students in 48-student Shanghai classes are good at math not because they receive more individualized attention (they get less), but because each of their lessons is a product that a group of teachers spent weeks polishing.

The biggest value of AI may not be giving every student an AI tutor. It may be helping people make lessons better.

This judgment is still a hypothesis, not a conclusion. Supporting evidence includes the mechanism reassessment of Bloom’s Two Sigma, Hattie’s class-size effect size, the Project Follow Through results, the Shanghai and Japan practices, and the mixed large-scale-deployment outcomes of AI personalized tutoring. Opponents will point out: the Shanghai model has its own institutional soil (collective lesson prep is administratively mandated, high-stakes exams act as unifying pressure) and can’t be simply replicated; Project Follow Through’s results have methodological debates; Bastani’s experiment has a limited sample in a single setting. These objections all stand, but they challenge the strength of the evidence, not the reasonableness of the direction.

Over the past decades, education research has accumulated a great deal of work on what constitutes good teaching, with the individualized tradition and the collective tradition offering different answers. The arrival of AI gives us a chance to reexamine the assumptions of both traditions. If we don’t pause to ask how large the independent contribution of personalized attention really is, AI applications in education will follow the same path ed-tech has walked for thirty years: every new generation of technology pushing student-side personalization, every generation claiming to solve Bloom’s Two Sigma Problem, and every generation hitting the same twin problems at deployment: low adoption and weak effects.

Maybe this time we can take a different path. Start from lesson design quality rather than individual student differences. Let AI help teachers and teaching research groups do what they most need help with, rather than trying to use AI to replace the attention teachers give to individual students. That attention is human work; AI can’t do it and doesn’t need to. What AI can do is make every lesson good enough that 40 different students can all benefit from it.