R3 2.3 February 6, 2024 When Does an AI Grading Bot Seem Fairer than a Human Grader – and Why?
A new study comes to the surprising conclusion that AI-generated grades can be seen as fairer than those handed out by humans.
This issue of R3 looks at a question that’s sure to come up as AI grading tools hit the higher education scene: whether students see AI-generated grades as equitable. Regardless of whether you would ever consider using or recommending such tools, the reasons for the AI fairness advantage highlight several factors that could make traditional grading-by-hand seem fairer as well.
Citation:
Ma, J., Wang, Y., Zhu, J., & Han, T. (2024, February). Grading by AI makes me feel fairer? How different evaluators affect college students’ perception of fairness. Frontiers in Psychology.
DOI:
https://doi.org/10.3389/fpsyg.2024.1221177
Paywall or Open:
Open
Summary:
Undergraduates studying at universities in China were surveyed via an online research platform. Participants read hypothetical scenarios describing different types of assessments (diagnostic, formative, summative) graded either by human teachers or by an AI algorithm, and rated how fair they perceived the resulting grade to be. AI-generated grades were generally rated as fairer. This effect was largely driven by the perception that AI-generated grades were more transparent, and it shrank when additional explanations of the grades were offered.
Research Questions (excerpted from the article):
Do college students perceive a higher level of fairness when an AI algorithm is used as the evaluator compared to traditional teacher evaluation?
How does the use of an AI algorithm affect the fairness perception of college students?
What are the boundary conditions for this influence on fairness perception?
Sample:
Participants were all undergraduate students in China, recruited via an online research platform; sample sizes were 172, 149, and 145 across the three studies. They were pursuing a range of majors and studying at a variety of institutions, including more- and less-selective universities.
Method/Design:
Participants read hypothetical scenarios involving college students being evaluated, an approach modeled on prior studies of workplace performance evaluations. In the scenarios, the students being evaluated were enrolled in an English-language course.
Across the three studies, researchers manipulated several key factors in these scenarios: (1) the type of assessment (diagnostic, formative, or summative), (2) the evaluator (AI algorithm or human teacher), and (3) whether the scenario included an explanation of the grade (explanation provided or not). Assessment type was manipulated by describing the kind of assignment being graded in terms that would be familiar to students, such as an oral proficiency assessment held at the beginning of the school year, a midterm examination, or a final paper.
Perceived fairness was the main outcome measure, gauged via responses on a 7-point scale ranging from “very unfair” to “very fair.” Transparency was also measured via a version of the Information Transparency Scale, with questions such as “The teacher/AI algorithm relies on more useful information,” and “The information relied on by the teacher/AI algorithm is easier to obtain.”
In the design, transparency was treated as a mediator, meaning a middle step in the chain of causality: if you think of the pattern as A leading to B and B leading to C, the mediator is B. The last study treated explanation as a moderator. Moderators change the relationship between two other variables; here, the question is whether the effect of evaluator type on perceived fairness changes depending on whether explanations are present or absent.
(Here is a good page that offers more explanation about mediators and moderators.)
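If the mediator/moderator distinction is still abstract, here is a minimal sketch in Python of how the two tests might look for a design like this one. To be clear, this is not the authors' code or data; the dataset is simulated, and the variable names (evaluator, explanation, transparency, fairness) are simply my labels for the constructs described above.

```python
# Minimal, simulated illustration of mediation and moderation -- not the
# authors' analysis or dataset. All effect sizes below are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300

# Simulated between-subjects factors: evaluator (0 = teacher, 1 = AI algorithm)
# and explanation (0 = absent, 1 = provided).
evaluator = rng.integers(0, 2, n)
explanation = rng.integers(0, 2, n)

# Assumed data-generating story, mirroring the paper's hypotheses: AI evaluation
# raises perceived transparency (A -> B), transparency raises perceived fairness
# (B -> C), and explanations shrink the evaluator effect.
transparency = 4 + 0.8 * evaluator + rng.normal(0, 1, n)
fairness = (2 + 0.6 * transparency
            + 0.5 * evaluator * (1 - explanation)
            + rng.normal(0, 1, n))

df = pd.DataFrame({"evaluator": evaluator, "explanation": explanation,
                   "transparency": transparency, "fairness": fairness})

# Mediation logic: the evaluator should predict transparency, and transparency
# should predict fairness once the evaluator is controlled for.
print(smf.ols("transparency ~ evaluator", data=df).fit().params)
print(smf.ols("fairness ~ evaluator + transparency", data=df).fit().params)

# Moderation logic: the evaluator x explanation interaction term tests whether
# the evaluator effect on fairness differs when explanations are provided.
print(smf.ols("fairness ~ evaluator * explanation", data=df).fit().params)
```

The point of the sketch is just to show where each piece sits: the mediator (transparency) is the "B in the middle" of two regressions, while the moderator (explanation) shows up as an interaction term in a single model.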
Key Findings:
Participants perceived greater fairness for AI evaluations, especially for formative assessment. This effect was significantly mediated by transparency (i.e., clarity about why the grade was assigned): AI evaluation led to greater perceptions of transparency, and thus of fairness in the grade. Explanation was also a significant moderator; the advantage for AI evaluations was far smaller when explanations were provided.
Choice Quote from the Article:
Study 1 found that different evaluators could significantly influence the perception of fairness under three evaluation contexts. Students perceived AI algorithms as fairer evaluators than teachers. Study 2 revealed that information transparency was a mediator, indicating that students perceived higher fairness with AI algorithms due to increased transparency compared with teachers. Study 3 revealed that the explanation of evaluation outcomes moderated the effect of evaluator on students’ perception of fairness. Specifically, when provided with explanations for evaluation results, the effect of evaluator on students’ perception of fairness was lessened.
Why it Matters:
This study serves as a powerful reminder that fairness is a major player in human psychology generally, and is something that we ought to consider early on in the process of designing and redesigning assessments for any course. It’s particularly timely to think about this in the context of ungrading and alternative grading systems, which continue to generate a lot of interest within higher education. I would be willing to bet that fairness comes up as an issue in most alternative grading projects, but it probably does deserve more systematic attention in the research literature. Considering how much of our energy, as instructors, gets taken up with preventing and dealing with student dissatisfaction, and how much of that dissatisfaction traces back to perceived unfairness, this seems like a productive line of reflection for any teacher or instructional designer.
As for the AI angle, I think it’s important to frame this work not so much as a study of the feasibility of grading with AI, but rather as one about the factors that drive student perceptions of AI grading (or really, perceptions of grades from any source). But with AI-related grading options already on the horizon, and surely more to come, this is still an important connection. In particular, I think of the surprising (to me) finding that AI was not automatically assumed to be unpredictable, arbitrary, or opaque, at least not by these undergraduate students. The “black box” characterization has been dominant enough in debates about AI that I have begun to take it for granted as a source of anxiety and hesitance about these tools, but that’s not what this particular study found. It’s also pretty humbling to see that student perceptions of the feedback we faculty give them can be worse, at baseline, than a computer-generated report.
Lastly, the study is yet another reminder of how powerful transparency is as a driver of student perceptions, and by extension, student success. There are many ways to help make the purpose and pathways to success for any assignment clearer to students (for ideas, see this definitive guide). This research suggests that transparency can not only make it easier for students to complete assignments, but also make assessments seem more equitable when the feedback goes out.
Most Relevant For:
Faculty interested in using AI for grading; instructional designers; faculty professional development directors and other leaders responsible for guiding policy and practice relating to AI; staff and leaders involved in student success initiatives
Limitations, Caveats, and Nagging Questions:
This study deals in hypothetical scenarios. These have a long track record in behavioral science, but still, it’s not possible to know whether the same students would react the same way if their actual grades were on the line. I think it would be interesting to follow up this work by looking at students’ level of experience with or knowledge of AI to see whether that is also a moderator. AI algorithms can, in theory, be quite transparent, but are they in practice for students who lack that personal experience?
It is also important not to take this work as a study of the feasibility or quality of AI-generated assessment. As mentioned, these are hypothetical vignettes, not actual computer-generated grades; the key outcome measure wasn’t the quality of the feedback, the concordance between human and AI graders, or longer-term learning impacts such as improvement in the quality of future work. It was student perceptions of the process, and only as they relate to the relatively narrow construct of fairness.
The subject matter, as well, was limited to language proficiency; would the patterns be the same for, say, a scientific write-up, a creative paper, or an oral presentation in a psychology or history course? It’s entirely possible that they would be, but my instinct is that anyone wanting to put these findings into practice should dig into whether they hold up in other subjects.
Lastly, especially for readers based in North America: keep in mind that these findings are situated in a Chinese educational context. Fairness strikes me as a concept that applies broadly across cultures (although please note, I’m not an expert on this particular area of psychology). Still, there’s likely a fair amount of nuance that will play out differently in different cultures and contexts.
If you liked this article, you might also appreciate:
Clark, D., & Talbert, R. (2023). Grading for growth: A guide to alternative grading practices that promote authentic learning and student engagement in higher education. Routledge.
Miller, M.D. (2023). What we talk about when we talk about grades: Framing, intrinsic motivation, and how to keep it all about the learning. Zeal, 1(2). https://zeal.kings.edu/zeal/article/view/24
Lodge, J. M., Yang, S., Furze, L., & Dawson, P. (2023). It’s not like a calculator, so what is the relationship between learners and generative artificial intelligence? Learning: Research and Practice. https://doi.org/10.1080/23735082.2023.2261106
Mollick, E. R., & Mollick, L. (2023). Assigning AI: Seven approaches for students, with prompts. SSRN Electronic Journal, 1–46.
*Olewski, J., & Powell, R. (2018). In the absence of grades. College Composition and Communication, 70(1), 30–56.
Zepeda, C. D., Ortegren, F. R., & Butler, A. C. (2023, June). Learning from feedback in college courses: Student practices, beliefs, and preferences. Applied Cognitive Psychology. https://doi.org/10.1002/acp.4118
File under:
Grading; ungrading; assessment; AI; transparency; feedback
*Many thanks to my colleague Rebecca Campbell for recommending this article - it adds a lot to the discussion of what grades mean to students.