R3 3.7 July 15, 2025 How Do Tutoring Chatbots Stack Up Against In-Class Active Learning?
Surprising impacts of a custom AI chatbot developed for an introductory physics course
In today’s issue we’re going to take a look at a controlled study involving a tutoring chatbot created by physics instructors at Harvard University. This particular application of generative AI - interactive tutoring - has been referenced by AI optimists time and again since these tools became widely available around 2023. But well before that technological turning point, the prospect of truly responsive, flexible, and knowledgeable tutoring systems had long been a recurring aspiration within educational technology circles.
I think back, for example, to the first iterations of the interactive courses pioneered by the Open Learning Initiative, which presented learners with carefully designed progressions of hints, fine-tuning the challenge level as the lesson went along. Or to the time I spent working with an interactive courseware system pitched as “Netflix for learning,” in that it created custom pathways through course materials based on individual interests and prior knowledge.
These kinds of projects look a lot different today than they did back then, and given how rapidly such tools evolve in a competitive market, that doesn’t mean they were failures. But to my mind, the way that personalized learning systems have come and gone through the years drives home two big points. First is the need to be on guard against the hype that follows any new invention promising to supercharge learning (while conveniently generating a pile of acclaim, money, or both for the inventors). The first few rounds of these systems added some important new options to the ed tech space, to be sure, but they didn’t revolutionize it. Nor did they end up replacing expert human instruction at any significant scale.
The second point that comes to my mind is a less cynical one. It is that if we in ed tech keep coming back to the tutoring concept, time and again, there is likely something important there. The vision of an infinitely patient, always-accessible interface ready to coach students on their own time and on their own terms is an enticing one, and not just as a way to save money on expensive human tutors. For one thing, fulfilling that vision would be a big step toward access. In particular, effective tutoring could help knock down barriers facing students from less-privileged educational backgrounds as they make their way through key gateway courses - the ones that too frequently derail students’ dreams of careers in medicine, technology, education, business and more. (Think Statistics for the Social Sciences, Organic Chemistry, Introduction to Microeconomics and so on; the specifics may vary, but every campus has its own versions, and the failure rates in this kind of class are often staggering.)
I like to think that visions of access and helping students attain their own educational goals are the sort of motivations driving the newest round of interest in digital tutoring. We’re also at a new moment of opportunity given that interactive tutoring meshes exceptionally well with the strengths and affordances of today’s generative AI tools.
I should also mention that the whole issue of custom chatbots is on my mind given that - like many of you - I’m currently in the thick of preparing for the fall semester, and I do plan to offer AI tutoring as an option in my newly revamped Introduction to Psychology course. I’m still working out many of the specifics of how these bots will operate, but much like the authors of this issue’s focus article, I envision that they will reference specific concepts and source materials I provide as they lead students through assigned exercises. It’s one part of my plan to shake up the lecture-driven approach that’s typical for intro psych.
Chatbots like this have gotten pretty easy to create, but even so, I feel a lot more confident having completed this online course on AI for education, presented by Ethan and Lilach Mollick. I’ve been practicing as well by setting up a chatbot for fellow faculty, one designed to capture some specific approaches to teaching thinking skills and to generate teaching materials aligned to particular disciplines. I’ll talk more about that project in a future issue, but if in the meantime you’re interested in an advance look at that resource, shoot me a message and I can set you up.
With that, let’s take a look at what that team of physics instructors was able to accomplish with their own chatbot project.
Citation:
Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: An RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15, 17458.
DOI:
https://doi.org/10.1038/s41598-025-97652-6
Paywall or Open:
Open
Summary:
This study evaluated an AI-powered tutor against a traditional in-class active learning session for teaching introductory physics concepts. In a randomized controlled trial, college students either interacted with the AI tutor - designed using pedagogical principles similar to those of the active learning session - or participated in the in-person lesson. Learning gains, time on task, engagement, and motivation were measured via pre- and post-tests and student surveys. The AI tutor condition produced higher learning gains in less time than the active-learning condition.
Research Questions:
Can an AI tutor designed with active learning principles match or exceed learning gains from an in-class active-learning session?
Will students using the AI tutor demonstrate comparable or greater engagement and motivation?
Does the AI tutor achieve these outcomes more efficiently in terms of time spent?
Sample:
194 students enrolled in a large lower-division physics course at Harvard University.
Method/Design:
Students completed two parallel learning activities on different but related physics topics. One was delivered via a custom-built AI chatbot tutor using GPT-4, and the other was conducted in class with a human instructor using active learning techniques. The AI tutoring sessions were completed individually at home. Students were guided through a set of problem-solving steps without being directly told the answers, and the AI's output was tested beforehand to ensure fidelity to this approach.
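As a concrete illustration of what this kind of setup can look like under the hood, here is a minimal sketch using the OpenAI Python API. To be clear, this is not the authors' code or prompt; the system-prompt wording, function name, and example exchange below are my own assumptions, meant only to show the general pattern of scripting the model to coach students step by step rather than hand over answers.

```python
# Minimal sketch (not the authors' implementation) of a GPT-4-based tutoring bot
# scripted to guide problem solving without revealing final answers.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TUTOR_SYSTEM_PROMPT = """You are a physics tutor guiding a student through an assigned problem.
Work through the problem one step at a time.
Ask the student what they would try next before offering help.
Give hints and targeted feedback, but never state the final answer outright.
Keep each reply brief to avoid overloading the student."""

def tutor_reply(conversation):
    """Return the tutor's next message, given the chat history so far."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": TUTOR_SYSTEM_PROMPT}] + conversation,
    )
    return response.choices[0].message.content

# Example turn: the student asks for the answer; the system prompt steers the
# model toward a guiding question instead.
history = [{"role": "user", "content": "What's the block's final speed? Just tell me."}]
print(tutor_reply(history))
```

The point of a sketch like this is simply that the pedagogy lives in the scripted instructions, which is why the authors' pre-testing of the chatbot's output for fidelity to those instructions matters so much.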
All students participated in both experimental conditions (AI vs. in-class active learning), with the order of conditions randomly assigned in a crossover design. The Force Concept Inventory (FCI) was used in a pre-test/post-test fashion to assess baseline physics knowledge (and to allow comparison with samples at other institutions), as well as to measure learning gains before and after each type of lesson (AI or in-class active learning).
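For readers who want a feel for how pre/post results like these are typically summarized in physics education research, one widely used statistic is Hake's normalized gain - the fraction of the available improvement a student actually achieves between pre-test and post-test. I can't say whether this is the exact metric reported in the paper, so treat the snippet below as general background rather than a description of the authors' analysis.

```python
# Background illustration: Hake's normalized gain, a common summary of FCI-style
# pre/post results. The paper's own gain metric may be defined differently.
def normalized_gain(pre_pct, post_pct):
    """Fraction of the possible improvement achieved, with scores given in percent."""
    return (post_pct - pre_pct) / (100 - pre_pct)

# A student moving from 40% to 70% correct captures half of the available headroom.
print(normalized_gain(40, 70))  # 0.5
```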
Student perceptions were also collected via a survey. Questions included:
Engagement - “I felt engaged [while interacting with the AI] / [while in lecture today].”
Motivation - “I felt motivated when working on a difficult question.”
Enjoyment - “I enjoyed the class session today.”
Growth mindset - “I feel confident that, with enough effort, I could learn difficult physics concepts.”
Key Findings:
Baseline knowledge as measured by FCI pre-test scores was comparable across experimental conditions, and was also comparable to typical scores in physics courses at other institutions.
Pre-test/post-test learning gains were substantially and statistically significantly larger in the AI condition. Overall time spent on the lesson was lower in the AI group, and the relatively wide range (with some students spending substantially more or less time than the median) suggests that students were pacing themselves appropriately, taking advantage of the ability to personalize the time needed to complete the learning. The advantage for AI-based learning held up both for students in the lower range of pre-test scores and for those in the highest range.
Student perceptions also showed an advantage for AI over in-class active learning, with higher ratings for engagement and motivation (but not for growth mindset, which was similar across conditions).
Choice Quote from the Article:
We have found that when students interact with our AI tutor, at home, on their own, they learn significantly more than when they engage with the same content during an in-class active learning lesson, while spending less time on task. This finding underscores the transformative potential of AI tutors in authentic educational settings. In order to realize this potential for improving STEM outcomes, student-AI interactions must be carefully designed to follow research-based best practices.
The extensive pedagogical literature supports a set of best practices that foster students’ learning, applicable to both human instructors and digital learning platforms. Key practices include (i) facilitating active learning… (ii) managing cognitive load… (iii) promoting a growth mindset…(iv) scaffolding content… (v) ensuring accuracy of information and feedback, (vi) delivering such feedback and information in a targeted and timely fashion…, and (vii) allowing for self-pacing…We aimed to design an AI system that conforms to these practices to the fullest extent current technology allows, thus establishing a model for future educational AI applications.
Why it Matters:
This study stands out as one of the first controlled experiments to test the application of generative AI in a realistic college classroom setting. Much of the existing research in this area relies on self-report - asking people how they think tools like ChatGPT affect their learning or cognition - rather than measuring the effects of AI more directly. While the focus on physics instruction may limit generalizability somewhat, it also brings a major advantage: the use of a common, well-validated outcome measure. In this case, the Force Concept Inventory (FCI) provided a clear, interpretable benchmark for assessing what students learned before and after the intervention, and allowed for meaningful comparisons with past studies in the same domain.
The discussion section is also a standout, avoiding simplistic conclusions and explaining exactly how these positive results could be replicated elsewhere. The authors don’t position AI tutoring as a wholesale replacement for classroom-based active learning, nor do they frame the results in binary terms of it works/it doesn’t work. Instead, they focus on how a tool like ChatGPT can be configured to deliver benefits that resemble those seen with well-designed in-class instruction. What emerges is less an argument for AI replacing human instruction and more an endorsement of flipped-classroom models, where students engage in individualized, self-paced preparation outside of class that then enables them to make the most of their in-class time.
Importantly, the AI tutor seemed to benefit a wide range of students, not just those already doing well or those in serious need of help. Both lower- and higher-performing students (as identified by pretest scores) showed similar patterns of greater learning gains with AI. Notably, students who reported struggling to keep up with the pace of class tended to spend more time with the AI tutor, suggesting they were leveraging the tool to meet their individual needs.
Most Relevant For:
STEM faculty, researchers interested in educational applications of generative AI
Limitations, Caveats, and Nagging Questions:
It’s worth noting that this study was conducted with a high-achieving student population. That’s not necessarily a flaw - no single study can represent every learner demographic - but it does raise the question of generalizability. My hunch is that it won’t be the Ivy Leagues of the world leading the charge to augment instruction with AI tutors. Institutions like public universities and community colleges may have more to gain from scalable tools like this one, so replication studies in those settings would be especially valuable. To the authors’ credit, they do report that students’ pretest scores were in line with those from other institutions, which is another point in favor of standardized assessments like the Force Concept Inventory and is encouraging with respect to the generalizability question.
One unavoidable design issue is the confound between learning activity and setting. Students in the AI tutor condition worked at home, while those in the control group received live instruction in class. It’s possible that some of the observed benefits could stem from the setting itself—studying independently at home—rather than the AI component per se. Still, this arguably enhances the real-world relevance of the study, given that most students using AI tutors are likely to do so on their own, outside of class.
Student perception data tracked well with learning gains, which is encouraging (and a bit surprising) given how often research demonstrates a disconnect between student preferences and actual measured effectiveness. One unexpected element was the inclusion of a growth mindset measure. While mindset remains a relevant concept for student motivation and persistence, it wasn’t entirely clear why it would be affected by this particular intervention.
One of the most important takeaways from this study is just how much design matters when it comes to educational chatbots. The chatbot used here is a strong model of best practices—structured, responsive, and explicitly guided in how to engage students. Unfortunately, some of the most valuable implementation details are tucked away in the Supplementary Materials rather than highlighted in the main text. Even so, those materials emphasize the importance of scripting the chatbot to prompt dialogue and reasoning without giving answers, and of providing specific, step-by-step instructions for solving the problems it presents - well worth the few clicks it takes to find and download the supplemental file.
While physics is especially well suited to this structured approach, it’s not hard to imagine how similar principles could apply in other fields, including those outside STEM. There may well be pockets of content across disciplines where correct, expert approaches can be clearly articulated and built into tutoring interactions - features that would raise the value of AI tutoring above and beyond what students could achieve simply by searching or chatting all on their own.
If you liked this article, you might also appreciate:
Agnoli, S., & Rapp, D. N. (2024). Understanding and supporting thinking and learning with generative artificial intelligence. Journal of Applied Research in Memory and Cognition, 13(4), 495–499. https://doi.org/10.1037/mac0000203
Deslauriers, L., McCarty, L. S., Miller, K., Callaghan, K., & Kestin, G. (2019). Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. Proceedings of the National Academy of Sciences of the United States of America. https://doi.org/10.1073/pnas.1821936116
Ellis, A. R., & Slade, E. (2023). A new era of learning: Considerations for ChatGPT as a tool to enhance statistics and data science education. Journal of Statistics and Data Science Education, 1–10. https://doi.org/10.1080/26939169.2023.2223609
Fawaz, M., El-Malti, W., Alreshidi, S. M., & Kavuran, E. (2025). Exploring health sciences students’ perspectives on using generative artificial intelligence in higher education: A qualitative study. Nursing and Health Sciences, 27(1). https://doi.org/10.1111/nhs.70030
Patiño, A., Ramírez-Montoya, M. S., & Ibarra-Vazquez, G. (2023). Trends and research outcomes of technology-based interventions for complex thinking development in higher education: A review of scientific publications. Contemporary Educational Technology, 15(4), ep447. https://doi.org/10.30935/cedtech/13416
Triberti, S., Di Fuccio, R., Scuotto, C., Marsico, E., & Limone, P. (2024). “Better than my professor?” How to develop artificial intelligence tools for higher education. Frontiers in Artificial Intelligence, 7(April), 1–9. https://doi.org/10.3389/frai.2024.1329605
File under:
AI tutoring, chatbots, active learning, STEM education