
Understanding Multiple Choice Test Item Analysis Report from DataLink


As faculty members, we all strive to create fair, effective, and meaningful assessments. But once a test is administered and the results are in, we’re often left wondering: Did my test truly measure student learning as I intended? If you’ve ever opened your test item analysis report from DataLink and felt overwhelmed by the numbers, statistics, and unfamiliar terms, you’re not alone. Many faculty members find themselves asking: What do these indicators actually mean? How can they help me improve my assessments? While DataLink provides valuable insights into how individual test questions perform, making sense of the data—and more importantly, knowing what to do with it—can feel like a challenge.
This two-part post aims to break down test item analysis in a way that’s simple and practical. The first part will help you understand the key indicators in your DataLink report—what they measure, how to interpret them, and how they can help you refine your test items. The second part will answer the crucial question: What’s next? Once you understand the data, what steps can you take to improve your test questions and ensure your assessments are valid, reliable, and aligned with student learning outcomes?

Test Item Analysis

Test item analysis is a process used to evaluate the effectiveness of individual test questions (items) in an assessment, most commonly multiple-choice questions. Item analysis allows us to observe item characteristics and to improve the quality of the test (Gronlund, 1998). Five key metrics used in item analysis are the Item Discrimination Index, the Item Difficulty Index, the Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20). These measures help educators improve the quality of their multiple-choice questions (MCQs).

Item Discrimination Index (D)

The Item Discrimination Index (D) measures how well a test item differentiates between high-performing and low-performing students. In other words, item discrimination shows how well your question distinguishes between students who understand the course material well and those who do not. A high discrimination index means the question is effective in distinguishing between different levels of student ability. A low or negative discrimination index suggests that the question might be flawed: it might be too easy, too tricky, or even misleading. The Item Discrimination Index (D) is calculated by comparing the proportion of top-scoring students who answered the item correctly with the proportion of low-scoring students who answered it correctly (Christian et al., 2017; Date et al., 2019; Rao et al., 2016).
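
For readers who want to see the arithmetic, here is a minimal sketch of the calculation in Python. The function, the sample data, and the use of top and bottom 27% scoring groups are illustrative assumptions for this post, not DataLink's exact procedure.

```python
# Minimal sketch of how D can be computed from item scores (0 = wrong, 1 = right).
# Assumes the upper and lower groups are the top and bottom 27% of total scorers,
# a common convention; DataLink may use a different grouping.

def discrimination_index(item_correct, total_scores, group_fraction=0.27):
    """D = (proportion correct in upper group) - (proportion correct in lower group)."""
    n = len(total_scores)
    k = max(1, int(n * group_fraction))
    # Rank students by total test score, highest first.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Hypothetical example: 10 students, 1 = answered this item correctly.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
print(round(discrimination_index(item, totals), 2))
```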

Values Interpretation:

D Value | Interpretation | Recommendation
0.40 and above | Excellent discrimination | The item strongly differentiates between high and low performers.
0.30 – 0.39 | Acceptable discrimination | Acceptable and useful in most tests.
0.20 – 0.29 | Moderate discrimination | Consider revising; may not differentiate well.
0.10 – 0.19 | Weak discrimination | Likely ineffective; should be improved or replaced.
Below 0.10 | Poor discrimination | Should be removed or rewritten.
Negative | Misleading or flawed item | Strong students are getting it wrong while weaker students are getting it right; needs revision or removal.

Item Difficulty Index (P)

The Item Difficulty Index (P) refers to how easy or hard a test question is for students. If most students answer a question correctly, it is considered an easy question; if most answer it incorrectly, it is considered a difficult question (Crocker & Algina, 2008; Hambleton et al., 1991). Ideally, a well-balanced test should contain questions of varying difficulty levels to assess both basic understanding and advanced knowledge. The difficulty of a test item is usually expressed as a difficulty index, also known as the p-value, which is calculated as the proportion of students who answer the question correctly. This value ranges from 0 to 1: a value close to 1 (e.g., 0.90) means the question is very easy because 90% of students answered it correctly, while a value close to 0 (e.g., 0.20) means the question is difficult because only 20% of students got it right.
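
Computationally, P is just a proportion of correct responses. The short sketch below uses hypothetical 0/1 item data to show the idea; it is not drawn from a DataLink report.

```python
# Minimal sketch: the difficulty index P is the proportion of students
# who answered the item correctly (responses scored 0 = wrong, 1 = right).

def difficulty_index(item_correct):
    return sum(item_correct) / len(item_correct)

# Hypothetical class of 10 students: 7 answered this item correctly.
item = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(round(difficulty_index(item), 2))  # 0.7 -> moderately easy
```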

Values Interpretation:

P Value | Interpretation | Recommendation
Above 0.80 | Very easy | Consider revising or removing if too many questions are this easy.
0.60 – 0.80 | Moderately easy | Generally acceptable but may not effectively differentiate students.
0.40 – 0.59 | Ideal difficulty range | Best range for assessing student performance effectively.
0.21 – 0.39 | Difficult | May be too challenging; consider revising if too many students struggle.
0.00 – 0.20 | Too difficult | Likely too hard or confusing; check for clarity or unfair difficulty.

Distractor Analysis

Distractor analysis is the process of evaluating how students interact with the incorrect answer choices in a multiple-choice question (Tarrant et al., 2009). The goal is to ensure that the distractors (wrong answer choices) are plausible and challenging enough that students who do not know the correct answer are drawn to them, while students who have mastered the material are more likely to choose the correct answer. If distractors are too obvious, too tricky, or not selected by any students, they do not contribute to the effectiveness of the question. By analyzing how often each answer choice is selected, we can determine whether the distractors are doing their job in assessing student knowledge. Two terms are useful here:

  • Non-Functional Distractor (NFD) – a distractor chosen by fewer than 5% of students, meaning it is not serving its purpose (Tarrant et al., 2009).
  • Functional Distractor – a distractor chosen by a reasonable percentage of students, ideally more often by low-scoring students.

How to Conduct Distractor Analysis

For each multiple-choice question, find out how many students selected each answer option.

Example Data Table:

Answer Option | Total Students | % of Total | High-Scorers | Low-Scorers | Status
A (Correct Answer) | 60 | 60% | 50 | 10 | Correct
B (Distractor 1) | 20 | 20% | 5 | 15 | Functional
C (Distractor 2) | 18 | 18% | 4 | 14 | Functional
D (Distractor 3) | 2 | 2% | 1 | 1 | Non-Functional
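
A simple tally can flag non-functional distractors automatically. The sketch below applies the 5% rule of thumb (Tarrant et al., 2009) to the hypothetical counts from the example table; the option labels and counts are illustrative only.

```python
# Minimal sketch: flag distractors chosen by fewer than 5% of students as non-functional.
# Counts are the hypothetical data from the example table above.

counts = {"A": 60, "B": 20, "C": 18, "D": 2}  # how many students chose each option
correct_option = "A"
total = sum(counts.values())

for option, n in counts.items():
    share = n / total
    if option == correct_option:
        status = "Correct"
    elif share < 0.05:
        status = "Non-Functional"
    else:
        status = "Functional"
    print(f"{option}: {n} students ({share:.0%}) -> {status}")
```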

How to Use Distractor Analysis for Test Improvement

Revise Non-Functional Distractors

  • If a distractor is rarely chosen (<5% of students), it may be too obviously incorrect.
  • Revise it to make it more plausible.
  • Ensure all distractors relate to common misconceptions.

Ensure a Balance in Response Distribution

  • The correct answer should be the most selected.
  • Distractors should attract lower-scoring students without misleading high-performing students.

Avoid “Trick” Questions

  • Distractors should be plausible but clearly incorrect, not designed to “trick” students unfairly.
  • Ensure wording is clear so that students struggling with the content—not the wording—choose the distractor.

Identify Guessing Patterns

  • If responses are evenly distributed across options, students might be guessing, meaning the question may be too difficult or unclear.

Reliability – Kuder-Richardson 20 (KR-20)

The reliability of a test refers to the extent to which the test is likely to produce consistent scores. Kuder-Richardson 20 (KR-20) is a statistical measure used to evaluate the internal consistency of a test. It tells us how reliable a test is by checking whether the test items (questions) work together well to measure the same concept. If a test has a high KR-20 score (close to 1.0), it means that the questions are well-aligned, and students’ performance is consistent. A low KR-20 score (closer to 0), on the other hand, suggests that some questions may not be effectively contributing to the test’s overall purpose—possibly because they are too difficult, too easy, or unrelated to the main concept being assessed. The following general guidelines can help interpret KR-20 scores for classroom exams:
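
For readers who want to see where the number comes from, KR-20 combines the item difficulty values with the variance of the total scores. The sketch below implements the standard KR-20 formula on a small, hypothetical 0/1 score matrix; it illustrates the mechanics rather than reproducing DataLink's computation.

```python
# Minimal sketch of the KR-20 formula:
# KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores),
# where k is the number of items, p_i is the proportion answering item i correctly,
# and q_i = 1 - p_i. Data are hypothetical (rows = students, columns = items, 1 = correct).

def kr20(scores):
    k = len(scores[0])                      # number of items
    n = len(scores)                         # number of students
    totals = [sum(row) for row in scores]   # each student's total score
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(scores), 2))
```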

Values Interpretation:

Reliability | Interpretation
.90 and above | Excellent reliability; at the level of the best standardized tests.
.80 – .90 | Very good for a classroom test.
.70 – .80 | Good for a classroom test; in the range of most classroom tests. There are probably a few items that could be improved.
.60 – .70 | Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items that could be improved.
.50 – .60 | Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
Below .50 | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

Point-Biserial Correlation (r_pb)

The Point-Biserial Correlation (r_pb) measures the relationship between how students perform on a specific test question and their overall test score. It helps determine whether a question is aligned with overall student performance. If a question has a high point-biserial correlation (close to +1.0), this means students who scored high on the overall test are more likely to get the question correct, and students who scored low on the test are more likely to get it wrong. This is a good sign that the question is functioning properly. If the r_pb is near 0, it suggests that the question does not contribute much to differentiating students, meaning strong and weak students are equally likely to get it right. Such a question may not be useful in assessing students’ abilities.

A negative r_pb value indicates a serious issue. It suggests that students who performed well on the overall test were more likely to get the question wrong, while weaker students were more likely to get it right. This often happens when a question is misleading, poorly worded, or has a tricky answer choice that confuses knowledgeable students.
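
As with the other indices, r_pb can be computed directly from the item and total scores. The sketch below applies the standard point-biserial formula to the same hypothetical ten-student data as the discrimination example; some item-analysis reports first subtract the item's own score from the total (a "corrected" correlation), which this sketch does not do.

```python
# Minimal sketch of the point-biserial correlation between one item (0/1)
# and total test scores, using r_pb = ((M1 - M0) / s) * sqrt(p * q),
# where M1 and M0 are the mean totals of students who got the item right/wrong,
# s is the standard deviation of all totals, p is the proportion correct, q = 1 - p.

import math

def point_biserial(item_correct, total_scores):
    n = len(total_scores)
    mean_total = sum(total_scores) / n
    s = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    right = [t for x, t in zip(item_correct, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_correct, total_scores) if x == 0]
    p = len(right) / n
    q = 1 - p
    return ((sum(right) / len(right) - sum(wrong) / len(wrong)) / s) * math.sqrt(p * q)

# Hypothetical data: same ten students as in the discrimination example.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
print(round(point_biserial(item, totals), 2))
```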

Values Interpretation:

r_pb Value | Interpretation | Recommendation
0.40 and above | Excellent item | Strongly correlates with overall test performance; ideal for high-quality tests.
0.30 – 0.39 | Good item | Generally acceptable and contributes well to test reliability.
0.20 – 0.29 | Acceptable item | Acceptable, but may need revision for better differentiation.
0.10 – 0.19 | Weak item | Does not contribute much to the test; consider revising or replacing.
Below 0.10 | Very weak item | Should be removed or rewritten, as it does not distinguish well.
Negative | Flawed item | High-scoring students tend to get it wrong while low-scoring students get it right; needs urgent revision.

Conclusion

Assessing the quality of test items is essential for creating fair and effective assessments that accurately measure student learning. However, no single metric can fully determine the quality of a test. Instead, all the indicators—Item Discrimination Index, Item Difficulty Index, Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20)—should be examined together to gain a comprehensive understanding of how well a test functions. Looking at these measures in isolation can be misleading, as a test may score well in one area while having weaknesses in another. For example, a test with high reliability (KR-20) might still have questions that fail to effectively differentiate between high- and low-performing students (low discrimination index). Considering all these indicators holistically will help you make informed decisions about which multiple-choice test items to revise, replace, or retain. This approach ensures that assessments are not only statistically sound but also aligned with learning objectives and fair to all students.

Part two of this post will discuss practical strategies for enhancing the quality of your multiple-choice questions (MCQs) to ensure they are both reliable and effective in assessing student learning.

References

Christian, D. S., Prajapati, A. C., Rana, B. M., & Dave, V. R. (2017). Evaluation of multiple choice questions using item analysis tool: A study from a medical institute of Ahmedabad, Gujarat. International Journal of Community Medicine and Public Health, 4(6), 1876-1881.

Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Belmont, CA: Wadsworth.

Date, A. P., Borkar, A. S., Badwaik, R. T., Siddiqui, R. A., Shende, T. R., & Dashputra, A. V. (2019). Item analysis as tool to validate multiple choice question bank in pharmacology. International Journal of Basic & Clinical Pharmacology, 8(9), 1999-2003.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Boston, MA: Allyn and Bacon.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Rao, C., Kishan Prasad, H. L., Sajitha, K., Permi, H., & Shetty, J. (2016). Item analysis of multiple choice questions: Assessing an assessment tool in medical students. International Journal of Educational and Psychological Researches, 2(4), 201-204.

Tarrant, M., Ware, J., & Mohammed, A. M. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. BMC Medical Education, 9, 1-8.


David Baidoo-Anu

David Baidoo-Anu, Ph.D. (Education), brings vast professional experience as a researcher, educator, and assessment specialist, particularly within the contexts of North America (especially Canada and the USA) and Africa. He has previously taught courses such as Educational Statistics, Educational Assessment, Educational Research Methods, Evaluation of Teaching and Learning, Psychological Foundations of Education, and several other educational courses. Dr. Baidoo-Anu has also worked as an educational and assessment specialist and consultant with international organizations such as the World Bank, Educational Testing Service (ETS), and others.
