Why using Generative Artificial Intelligence models for grading high-stakes assessments is problematic


Generative artificial intelligence (GenAI) has proven to be a useful tool in education, especially in supporting teachers and students in the teaching and learning process (Baidoo-Anu & Owusu Ansah, 2023). Recognizing the valuable role of GenAI in enhancing learning, many educational institutions, including Conestoga College, have started exploring ways to ensure the safe, responsible, and ethical use of GenAI in teaching and learning. One area that has drawn significant attention from educators, including Conestoga faculty, since the introduction of GenAI is student assessment. A key question about GenAI and assessment is how effective GenAI is at grading student work, especially in high-stakes assessments. This hub post focuses on addressing that question.

Generative Artificial Intelligence and Grading

Grading is one of the most crucial assessment practices, significantly impacting students’ education and lives because of the high-stakes decisions made based on grades (DeLuca et al., 2019). For instance, the grades we assign to students are used to make decisions about their progression within the college and the support they receive for their learning. Therefore, any factor that threatens the reliability and validity of the grades we assign must be mitigated in the grading process. Given the critical nature of grading, the introduction of GenAI in education, with its promises of supporting grading and providing personalized feedback, generated excitement among many well-intentioned educators eager to support their students’ learning. To determine whether current GenAI models can effectively support grading and provide personalized feedback, we need to understand how these models work.

GenAI models produce outputs based on patterns and probabilities derived from the large datasets they have been trained on, rather than from personal insight or comprehension. They have no genuine understanding or personal experience. This means they struggle with nuanced qualitative judgments and with deeper understanding of real-life events, and their outputs are only as good as the data they have been trained on. Given these inherent limitations, using GenAI models to grade student work, particularly in high-stakes assessments, can be problematic and can negatively impact the reliability and validity of the assigned grades. The following sections explore several reasons why using GenAI for grading student work is problematic.
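To make the pattern-and-probability point concrete, the toy sketch below samples a model’s “next word” from a fixed probability table. The vocabulary and probabilities are invented purely for illustration; real models learn distributions over tens of thousands of tokens, but the underlying mechanism of sampling from probabilities rather than comprehending is the same.

```python
import random

# Toy illustration (not a real model): the probabilities below are invented.
# A GenAI model chooses each next word by sampling from a probability
# distribution learned from its training data, not by comprehension.
next_word_probs = {"excellent": 0.35, "adequate": 0.40, "weak": 0.25}

def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs.keys())
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Repeated runs on the exact same "prompt" can yield different words,
# which is one source of the output variability discussed in this post.
rng = random.Random()
samples = [sample_next_word(next_word_probs, rng) for _ in range(10)]
print(samples)
```

Because each output is sampled, two identical requests can produce different completions, which is the root of the grading inconsistencies this post describes.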

Inconsistencies in Grading

The grades assigned by GenAI models can be inconsistent. Small changes in input or variations in wording can lead to significantly different grades, which undermines the reliability of the assessment. In a scoping review conducted by Baidoo-Anu and colleagues (in press), we found that several studies reported inconsistencies in the grades assigned to students. For instance, Fuller and Bixby (2024), who used GenAI to grade written assignments, found significant variations in grading patterns, feedback justifications, and response formats. Their findings showed a 24-point discrepancy in scores, ranging from 74% to 98%, for the same assignment. Similarly, Furze (2024) submitted the same Year 9 persuasive writing piece to ChatGPT multiple times, altering only the student’s name each time. The resulting grades varied dramatically, ranging from 78 to 95 out of 100. This discrepancy shows how changing a single variable can lead to vastly different grades for the same assignment. It also means that even small changes in the prompts you use can have a big impact on the grades GenAI assigns to your students.
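A simple way to check this kind of inconsistency yourself is to submit the same piece of work several times and summarize the spread of the resulting grades. In the sketch below, only the 74% minimum, 98% maximum, and 24-point range come from Fuller and Bixby (2024); the intermediate scores are invented for illustration.

```python
from statistics import mean, stdev

# Illustrative scores for repeated submissions of the SAME assignment.
# Only the endpoints (74 and 98) and the 24-point range reflect the
# Fuller and Bixby (2024) findings; the other values are made up.
scores = [74, 81, 88, 92, 95, 98]

def consistency_report(scores):
    """Summarize how consistent a set of repeated grades is."""
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),  # spread between runs
        "mean": round(mean(scores), 1),
        "stdev": round(stdev(scores), 1),
    }

report = consistency_report(scores)
print(report)
```

A 24-point range on identical work would be unacceptable from a human grader, and the same standard should apply to any automated one.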

Limited Feedback Quality

Grading student work involves not only assigning grades but also providing meaningful feedback to students. Current GenAI models are not able to provide the detailed, constructive, and personalized feedback necessary for student improvement and development. In their review, Baidoo-Anu and colleagues found that human educators provide higher-quality feedback than GenAI. Specifically, educators excel at offering personalized feedback, giving clear guidance for improvement, ensuring accuracy, emphasizing critical elements, and maintaining a supportive tone.

Dehumanization of the Grading Process

Grading is not just about assigning scores; it often involves understanding individual student contexts and personal experiences. As educators, we build relationships with our students through regular interaction, which helps us understand our students’ strengths and weaknesses. These regular interactions help us tailor feedback and assessments to each student’s unique needs and progress. In contrast, GenAI’s impersonal nature can undermine these interactions; it focuses primarily on surface-level metrics, producing standardized responses that may not address the specific circumstances of each student. Moreover, as educators we can offer empathetic support and motivational guidance, helping students understand their strengths and areas for improvement. GenAI cannot replicate this emotional and supportive aspect of feedback, which can be critical for student development and motivation.

Conclusion

GenAI has been found to be effective in creating diverse assessment tasks that support students’ learning. However, research also suggests that using current GenAI models for grading high-stakes assessments is problematic. GenAI models lack nuanced understanding, which leads to inconsistent grading and poor feedback quality. They can misinterpret complex student work and exhibit biases drawn from their training data. Studies show significant grading discrepancies and lower-quality feedback compared to educators. Moreover, GenAI’s impersonal nature dehumanizes the grading process, neglecting individual student contexts and producing unfair grading outcomes. Although both GenAI and educators can demonstrate biases, the nature and impact of these biases differ significantly. Educators’ biases mostly originate from personal experiences, cultural backgrounds, and societal influences, and educators have the ability to recognize and address them. GenAI biases, by contrast, typically come from the training data or from decisions made by the models’ designers, and GenAI lacks the self-awareness and capability to address them.

References

Baidoo-Anu, D., & Owusu Ansah, L. (2023). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI, 7(1), 52-62.

Baidoo-Anu, D., Asamoah, D., & Raji, M. (In press). Innovative applications of generative AI in classroom assessment: A scoping review. In S. Sabbaghan (Ed.), Navigating generative AI in higher education: Ethical, theoretical and practical perspectives. Edward Elgar Publishing.

DeLuca, C., Cheng, L., & Volante, L. (2019). Grading across Canada: Policies, practices, and perils. EdCan Network. https://www.edcan.ca/articles/grading-across-canada/

Furze, L. (2024). Don’t use GenAI to grade student work. Retrieved July 29, 2024, from https://leonfurze.com/2024/05/27/dont-use-genai-to-grade-student-work/comment-page-1/

Furze, L. (2024). The AI iceberg: Understanding ChatGPT. Retrieved July 29, 2024, from https://leonfurze.com/2023/05/18/the-ai-iceberg-understanding-chatgpt/

Fuller, L. P., & Bixby, C. (2024). The theoretical and practical implications of OpenAI system rubric assessment and feedback on higher education written assignments. American Journal of Educational Research, 12(4), 147-158. http://dx.doi.org/10.12691/education-12-4-4

David Baidoo-Anu

David Baidoo-Anu, Ph.D. (Education), brings vast professional experience as a researcher, educator, and assessment specialist, particularly within the contexts of North America (especially Canada and the USA) and Africa. He has previously taught courses such as Educational Statistics, Educational Assessment, Educational Research Methods, Evaluation of Teaching and Learning, Psychological Foundations of Education, and several other education courses. Dr. Baidoo-Anu has also worked as an educational and assessment specialist and consultant with international organizations such as the World Bank, Educational Testing Service (ETS), and others.