Why using Generative Artificial Intelligence models for grading high-stakes assessments is problematic
Generative artificial intelligence (GenAI) has undoubtedly proven to be a useful tool in education, especially in supporting teachers and students in the teaching and learning process (Baidoo-Anu & Owusu Ansah, 2023). Recognizing the valuable role of GenAI in enhancing learning, many educational institutions, including Conestoga College, have started exploring ways to ensure the safe, responsible, and ethical use of GenAI in teaching and learning. One area that has drawn significant attention from educators, including Conestoga faculty, since the introduction of GenAI is student assessment. A key question about GenAI and assessment is how effective GenAI is at grading student work, especially in high-stakes assessments. This hub post focuses specifically on addressing that question.
Generative Artificial Intelligence and Grading
Grading is one of the most crucial assessment practices, significantly impacting students’ education and lives due to the high-stakes decisions made based on grades (DeLuca et al., 2019). For instance, the grades we assign to students are used to make decisions about their progression within the college and the support they receive for their learning. Therefore, any threat to the reliability and validity of the grades we give our students must be mitigated in the grading process. Given the critical nature of grading, the introduction of GenAI in education, with its promises of supporting grading and providing personalized feedback, generated excitement among many well-intentioned educators eager to support their students’ learning. To determine whether current GenAI models can effectively support grading and provide personalized feedback, we need to understand how these models work.
GenAI models produce outputs based on patterns and probabilities derived from the large datasets they were trained on, not on personal insight, genuine understanding, or lived experience. As a result, they struggle with nuanced qualitative judgments and with deeper understanding of real-life events, and their outputs are only as good as their training data. Given these inherent limitations, using GenAI models to grade student work, particularly in high-stakes assessments, can be problematic and can undermine the reliability and validity of the assigned grades. The following sections explore several reasons why.
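To make the “patterns and probabilities” point concrete, here is a toy, hypothetical sketch (not a real language model, and far simpler than one) of how sampling from a probability distribution over possible next tokens makes a model’s outputs non-deterministic:

```python
import random

# Toy next-token distribution for the blank in "This essay is ___".
# A real GenAI model computes probabilities like these over its entire
# vocabulary and then samples one token; repeated runs can differ.
next_token_probs = {"excellent": 0.40, "good": 0.35, "adequate": 0.25}

def sample_token(probs):
    """Pick one token at random, weighted by its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Asking the "same question" three times may yield three different answers.
for _ in range(3):
    print(sample_token(next_token_probs))
```

Because the judgment is sampled rather than reasoned, nothing in this process anchors it to the actual quality of the work; scaled up, this is part of why identical inputs can receive different evaluations.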
Why using GenAI models for grading high-stakes assessments is problematic
Lack of Nuanced Understanding of Students’ Work
GenAI models generate text output based on patterns and probabilities, not on a true understanding of context or content. High-stakes assessments often require nuanced evaluation, critical thinking, and a deep understanding of subject matter that GenAI models are not able to fully grasp. For example, student work that draws from classroom discussions, specific events, or real-life experiences may be challenging for GenAI to interpret accurately. This is because the model lacks the contextual awareness and depth of understanding required to evaluate such complex and personalized elements effectively.
Inconsistencies in Grading
The grades assigned by GenAI models can be inconsistent. Small changes in input or variations in wording can lead to significantly different grades, which undermines the reliability of the assessment. In a scoping review conducted by Baidoo-Anu and colleagues (in press), we found that several studies reported inconsistencies in the grades assigned to students. For instance, Fuller and Bixby (2024), using GenAI to grade written assignments, found significant variations in grading patterns, feedback justifications, and response formats; their findings showed a 24-point discrepancy, with scores ranging from 74% to 98% for the same assignment. Similarly, Furze (2024) submitted the same Year 9 persuasive writing piece to ChatGPT multiple times, altering only the student’s name each time. The resulting grades varied dramatically, ranging from 78 to 95 out of 100. This discrepancy shows how a single variable can lead to vastly different grades for the same assignment, meaning that even small changes in your prompts can have a large impact on the grades GenAI assigns to your students.
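A simple way to check this for yourself is to resubmit the same piece of work several times and measure the spread of the grades. The sketch below uses illustrative, hypothetical scores (modeled on the 78-to-95 range Furze reported) rather than real grader output:

```python
import statistics

# Hypothetical grades returned by a GenAI grader for the SAME essay,
# resubmitted five times (illustrative numbers only).
scores = [78, 84, 91, 88, 95]

score_range = max(scores) - min(scores)  # spread across runs
score_sd = statistics.stdev(scores)      # run-to-run variability

print(f"Range: {score_range} points")        # prints "Range: 17 points"
print(f"Std. dev.: {score_sd:.1f} points")   # prints "Std. dev.: 6.5 points"
```

A spread this wide for identical work would be unacceptable from a human marker, and it is a quick red flag that the scores are not reliable enough to support high-stakes decisions.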
Limited Feedback Quality
Grading students’ work involves not only assigning grades but also providing meaningful feedback. Current GenAI models are not able to provide the detailed, constructive, and personalized feedback necessary for student improvement and development. In their review, Baidoo-Anu and colleagues found that human educators provide higher-quality feedback than GenAI. Specifically, educators excel at offering personalized feedback, giving clear guidance for improvement, ensuring accuracy, emphasizing critical elements, and maintaining a supportive tone.
Dehumanization of the Grading Process
Grading is not just about assigning scores; it often involves understanding individual student contexts and personal experiences. As educators, we build relationships with our students through regular interaction, which helps us understand our students’ strengths and weaknesses. These regular interactions help us tailor feedback and assessments to each student’s unique needs and progress. In contrast, GenAI’s impersonal nature can undermine these interactions: it focuses primarily on surface-level metrics, producing standardized responses that may not address the specific circumstances of each student. Moreover, as educators we can offer empathetic support and motivational guidance, helping students understand their strengths and areas for improvement. GenAI cannot replicate this emotional and supportive aspect of feedback, which can be critical for student development and motivation.
Bias, Fairness and Ethical Concerns
GenAI models may inadvertently perpetuate biases present in their training data, resulting in unfair grading that reflects these biases rather than an objective assessment of student performance. Furze (2024) explained that because GenAI models are trained on extensive datasets sourced from the internet, which often include societal biases and discriminatory patterns, they can infer student attributes such as race, gender, or socioeconomic background from subtle cues in a student’s writing. This capability raises the risk of perpetuating and amplifying existing biases, potentially disadvantaging certain groups of students. For instance, a student’s use of language, cultural references, or even writing style could lead a GenAI model to make biased judgments, resulting in unfair grading outcomes.
Conclusion
GenAI has been found to be effective in creating diverse assessment tasks that support students’ learning. However, research also suggests that using current GenAI models to grade high-stakes assessments is problematic. GenAI models lack nuanced understanding, leading to inconsistent grading and poor feedback quality; they can misinterpret complex student work and exhibit biases inherited from their training data. Studies show significant grading discrepancies and lower-quality feedback compared to educators. Moreover, GenAI’s impersonal nature dehumanizes the grading process, neglecting individual student contexts and making biased judgments that can result in unfair grading outcomes. Although both GenAI and educators can demonstrate biases, the nature and impact of these biases differ significantly: educators’ biases mostly originate from personal experiences, cultural backgrounds, and societal influences, and educators can recognize and address them, whereas GenAI biases typically stem from the training data or from decisions made by the models’ designers, and the models lack the self-awareness and capability to address them.
References
Baidoo-Anu, D., Asamoah, D., & Raji, M. (In press). Innovative applications of generative AI in classroom assessment: A scoping review. In S. Sabbaghan (Ed.), Navigating generative AI in higher education: Ethical, theoretical and practical perspectives. Edward Elgar Publishing.
DeLuca, C., Cheng, L., & Volante, L. (2019). Grading across Canada: Policies, practices, and perils. https://www.edcan.ca/articles/grading-across-canada/
Furze, L. (2024). Don’t use GenAI to grade student work. Retrieved July 29, 2024, from https://leonfurze.com/2024/05/27/dont-use-genai-to-grade-student-work/comment-page-1/
Furze, L. (2023). The AI iceberg: Understanding ChatGPT. Retrieved July 29, 2024, from https://leonfurze.com/2023/05/18/the-ai-iceberg-understanding-chatgpt/
Fuller, L. P., & Bixby, C. (2024). The theoretical and practical implications of OpenAI system rubric assessment and feedback on higher education written assignments. American Journal of Educational Research, 12(4), 147-158. http://dx.doi.org/10.12691/education-12-4-4