
Understanding the Multiple-Choice Test Item Analysis Report from DataLink
As faculty members, we all strive to create fair, effective, and meaningful assessments. But once a test is administered and the results are in, we’re often left wondering: Did my test truly measure student learning as I intended? If you’ve ever opened your test item analysis report from DataLink and felt overwhelmed by the numbers, statistics, and unfamiliar terms, you’re not alone. Many faculty members find themselves asking: What do these indicators actually mean? How can they help me improve my assessments? While DataLink provides valuable insights into how individual test questions perform, making sense of the data—and more importantly, knowing what to do with it—can feel like a challenge.
This two-part post aims to break down test item analysis in a way that’s simple and practical. The first part will help you understand the key indicators in your DataLink report—what they measure, how to interpret them, and how they can help you refine your test items. The second part will answer the crucial question: What’s next? Once you understand the data, what steps can you take to improve your test questions and ensure your assessments are valid, reliable, and aligned with student learning outcomes?
Test Item Analysis
Test item analysis is a process used to evaluate the effectiveness of individual questions (items) in a multiple-choice assessment. Item analysis allows us to observe item characteristics and to improve the quality of the test (Gronlund, 1998). Five key metrics used in item analysis are the Item Discrimination Index, the Item Difficulty Index, the Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20). These measures help educators improve the quality of their multiple-choice questions (MCQs).
Item Discrimination Index (D)
The Item Discrimination Index (D) measures how well a test item differentiates between high-performing and low-performing students; in other words, it shows how well a question distinguishes students who understand the course material from those who do not. A high discrimination index means the question is effective in distinguishing between different levels of student ability. A low or negative discrimination index suggests that the question might be flawed—for instance, it might be too easy, too tricky, or even misleading. The Item Discrimination Index (D) is calculated by comparing the proportion of top-scoring students who answered the item correctly with the proportion of low-scoring students who answered it correctly (Christian et al., 2017; Date et al., 2019; Rao et al., 2016).
Values Interpretation:
D Value | Interpretation | Recommendation |
0.40 and above | Excellent discrimination | The item strongly differentiates between high and low performers. |
0.30 – 0.39 | Acceptable discrimination | Acceptable and useful in most tests. |
0.20 – 0.29 | Moderate discrimination | Consider revising; may not differentiate well. |
0.10 – 0.19 | Weak discrimination | Likely ineffective; should be improved or replaced. |
Below 0.10 | Poor discrimination | Should be removed or rewritten. |
Negative | The question is misleading or flawed | Something is wrong. The item may be misleading, and strong students are getting it wrong while weaker students are getting it right. Needs revision or removal. |
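For readers who like to see the arithmetic behind D, here is a minimal Python sketch of one common way to compute it, using a hypothetical class and the frequently used upper/lower 27% grouping. This is an illustration only; DataLink may group students and compute D somewhat differently.

```python
# Minimal sketch of a discrimination index (D) calculation.
# Assumes hypothetical (total_score, item_correct) pairs and an
# upper/lower 27% split; not necessarily DataLink's exact method.

def discrimination_index(records, group_fraction=0.27):
    """records: list of (total_test_score, answered_item_correctly) tuples."""
    # Sort students by total test score, highest first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n_group = max(1, int(len(ranked) * group_fraction))

    upper = ranked[:n_group]    # top scorers
    lower = ranked[-n_group:]   # bottom scorers

    p_upper = sum(correct for _, correct in upper) / n_group
    p_lower = sum(correct for _, correct in lower) / n_group
    return p_upper - p_lower    # D = proportion correct (upper) - proportion correct (lower)

# Hypothetical class of 10 students: (total score, 1 = item correct, 0 = incorrect)
students = [(95, 1), (90, 1), (88, 1), (80, 1), (75, 0),
            (70, 1), (65, 0), (60, 0), (55, 0), (40, 0)]
print(round(discrimination_index(students), 2))  # 1.0 for this toy data set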
Item Difficulty Index (P)
The Item Difficulty Index (P) refers to how easy or hard a test question is for students. If most students answer a question correctly, it is considered an easy question. On the other hand, if most students answer it incorrectly, it is considered a difficult question (Crocker & Algina, 2008; Hambleton et al., 1991). Ideally, a well-balanced test should contain questions of varying difficulty levels to assess both basic understanding and advanced knowledge. The difficulty of a test item is usually expressed as a difficulty index, also known as the p-value, which is calculated as the proportion of students who answer the question correctly. This value ranges from 0 to 1:
- A value close to 1 (e.g., 0.90) means the question is very easy because 90% of students answered it correctly.
- A value close to 0 (e.g., 0.20) means the question is difficult because only 20% of students got it right.
Values Interpretation:
P Value | Interpretation | Recommendation |
Above 0.80 | Very easy | Consider revising or removing if too many questions are this easy. |
0.60 – 0.80 | Moderately easy | Generally acceptable but may not effectively differentiate students. |
0.40 – 0.59 | Ideal difficulty range | Best range for assessing student performance effectively. |
0.21 – 0.39 | Difficult | May be too challenging; consider revising if too many students struggle. |
0.00 – 0.20 | Too difficult | Likely too hard or confusing; check for clarity or unfair difficulty. |
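Because the p-value is simply the share of students who answered an item correctly, it is easy to check by hand. The short Python sketch below uses hypothetical responses:

```python
# Minimal sketch of the difficulty index (p-value): the proportion of
# students who answered the item correctly. Data are hypothetical.

def difficulty_index(item_responses):
    """item_responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(item_responses) / len(item_responses)

responses = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # 7 of 10 students correct
print(difficulty_index(responses))            # 0.7 -> moderately easy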
Distractor Analysis
Distractor analysis is the process of evaluating how students interact with the incorrect answer choices in a multiple-choice question (Tarrant et al., 2009). The goal is to ensure that the distractors (wrong answer choices) are plausible and challenging enough that students who do not know the correct answer are drawn to them, while students who have mastered the material are more likely to choose the correct answer. If distractors are too obvious, too tricky, or not selected by any students, they do not contribute to the effectiveness of the question. By analyzing how often each answer choice is selected, we can determine whether the distractors are doing their job in assessing student knowledge.
- Non-Functional Distractor (NFD) – A distractor chosen by fewer than 5% of students, meaning it is not serving its purpose (Tarrant et al., 2009).
- Functional Distractor – A distractor chosen by a reasonable percentage of students, ideally more often by low-scoring students.
How to Conduct Distractor Analysis
For each multiple-choice question, find out how many students selected each answer option.
Example Data Table:
Answer Option | Total Students | % of Total | High-Scorers | Low-Scorers | Status |
A (Correct Answer) | 60 | 60% | 50 | 10 | Correct |
B (Distractor 1) | 20 | 20% | 5 | 15 | Functional |
C (Distractor 2) | 18 | 18% | 4 | 14 | Functional |
D (Distractor 3) | 2 | 2% | 1 | 1 | Non-Functional |
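If you want to reproduce this kind of tally yourself, the short Python sketch below counts how often each option was chosen and flags any distractor selected by fewer than 5% of students as non-functional. The responses are hypothetical and simply mirror the example table above.

```python
# Minimal sketch of a distractor tally. Assumes hypothetical response
# letters for a single item; distractors chosen by < 5% of students are
# flagged as non-functional (Tarrant et al., 2009).

from collections import Counter

def distractor_report(responses, correct_option, nfd_threshold=0.05):
    counts = Counter(responses)
    total = len(responses)
    for option in sorted(counts):
        share = counts[option] / total
        if option == correct_option:
            status = "correct answer"
        elif share < nfd_threshold:
            status = "non-functional distractor"
        else:
            status = "functional distractor"
        print(f"{option}: {counts[option]:3d} ({share:.0%}) - {status}")

# 100 hypothetical responses: 60 A, 20 B, 18 C, 2 D (as in the table above)
responses = ["A"] * 60 + ["B"] * 20 + ["C"] * 18 + ["D"] * 2
distractor_report(responses, correct_option="A")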
How to Use Distractor Analysis for Test Improvement
Revise Non-Functional Distractors
- If a distractor is rarely chosen (<5% of students), it may be too obviously incorrect.
- Revise it to make it more plausible.
- Ensure all distractors relate to common misconceptions.
Ensure a Balance in Response Distribution
- The correct answer should be the most selected.
- Distractors should attract lower-scoring students without misleading high-performing students.
Avoid “Trick” Questions
- Distractors should be plausible but clearly incorrect, not designed to “trick” students unfairly.
- Ensure wording is clear so that students struggling with the content—not the wording—choose the distractor.
Identify Guessing Patterns
- If responses are evenly distributed across options, students might be guessing, meaning the question may be too difficult or unclear.
Reliability – Kuder-Richardson 20 (KR-20)
The reliability of a test refers to the extent to which the test is likely to produce consistent scores. Kuder-Richardson 20 (KR-20) is a statistical measure used to evaluate the internal consistency of a test. It tells us how reliable a test is by checking whether the test items (questions) work together well to measure the same concept. If a test has a high KR-20 score (close to 1.0), it means that the questions are well-aligned, and students’ performance is consistent. A low KR-20 score (closer to 0), on the other hand, suggests that some questions may not be effectively contributing to the test’s overall purpose—possibly because they are too difficult, too easy, or unrelated to the main concept being assessed. The following general guidelines can help interpret KR-20 scores for classroom exams:
Values Interpretations
Reliability | Interpretation |
.90 and above | Excellent reliability; at the level of the best standardized tests |
.80 – .90 | Very good for a classroom test |
.70 – .80 | Good for a classroom test; in the range of most classroom tests. There are probably a few items which could be improved. |
.60 – .70 | Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved. |
.50 – .60 | Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading. |
Below .50 | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision. |
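For those curious about where the number comes from, KR-20 is computed from each item's difficulty (p), its complement (q = 1 - p), the number of items (k), and the variance of students' total scores: KR-20 = (k / (k - 1)) × (1 - Σpq / variance of total scores). The Python sketch below works through the formula on a small, hypothetical 0/1 score matrix; your DataLink report performs this calculation for you.

```python
# Minimal sketch of the KR-20 calculation on a hypothetical 0/1 score
# matrix (rows = students, columns = items).
# KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)

def kr20(score_matrix):
    n_students = len(score_matrix)
    k = len(score_matrix[0])                        # number of items

    totals = [sum(row) for row in score_matrix]     # each student's total score
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students

    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in score_matrix) / n_students  # item difficulty
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical results for 6 students on a 5-item quiz (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
]
print(round(kr20(scores), 2))  # about 0.66 for this toy data set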
Point-Biserial Correlation (rpb)
The Point-Biserial Correlation (rpb) measures the relationship between how students perform on a specific test question and their overall test score. It helps determine whether a question is aligned with overall student performance. If a question has a high point-biserial correlation (close to +1.0), this means students who scored high on the overall test are more likely to get the question correct, and students who scored low on the test are more likely to get it wrong. This is a good sign that the question is functioning properly. If the rpb is near 0, it suggests that the question does not contribute much to differentiating students, meaning strong and weak students are equally likely to get it right. Such a question may not be useful in assessing students’ abilities.
A negative rpb value indicates a serious issue. It suggests that students who performed well on the overall test were more likely to get the question wrong, while weaker students were more likely to get it right. This often happens when a question is misleading, poorly worded, or has a tricky answer choice that confuses knowledgeable students.
Values Interpretation:
rpb Value | Interpretation | Recommendation |
0.40 and above | Excellent item | Strongly correlates with overall test performance. Ideal for high-quality tests. |
0.30 – 0.39 | Good item | Generally acceptable and contributes well to test reliability. |
0.20 – 0.29 | Acceptable item | Acceptable, but may need revision for better differentiation. |
0.10 – 0.19 | Weak item | Does not contribute much to the test; consider revising or replacing. |
Below 0.10 | Very weak item | Should be removed or rewritten, as it does not distinguish well. |
Negative | Item is flawed (high-scoring students are more likely to get it wrong) | Indicates a serious problem—high-scoring students tend to get it wrong while low-scoring students get it right. Needs urgent revision. |
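One common way to compute rpb is as the Pearson correlation between students' 0/1 scores on a single item and their total test scores (some programs correlate against the total score with the item in question removed, and DataLink's exact method may differ). The sketch below uses hypothetical data purely as an illustration.

```python
# Minimal sketch of a point-biserial correlation for one item, computed
# as the Pearson correlation between 0/1 item scores and total test
# scores. Data are hypothetical.

import math

def point_biserial(item_scores, total_scores):
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n

    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in item_scores) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in total_scores) / n)
    return cov / (sd_x * sd_y)

item   = [1, 1, 1, 0, 1, 0, 0, 0]            # 1 = correct on this item
totals = [95, 90, 85, 80, 75, 60, 55, 40]    # overall test scores
print(round(point_biserial(item, totals), 2))  # about 0.77: a strong item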
Assessing the quality of test items is essential for creating fair and effective assessments that accurately measure student learning. However, no single metric can fully determine the quality of a test. Instead, all the indicators—Item Discrimination Index, Item Difficulty Index, Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20)—should be examined together to gain a comprehensive understanding of how well a test functions. Looking at these measures in isolation can be misleading, as a test may score well in one area while having weaknesses in another. For example, a test with high reliability (KR-20) might still have questions that fail to effectively differentiate between high- and low-performing students (low discrimination index). Considering all these indicators holistically will help you make informed decisions about which multiple choice test items to revise, replace, or retain. This approach ensures that assessments are not only statistically sound but also aligned with learning objectives and fair to all students.
Part two of this post will discuss practical strategies for enhancing the quality of your multiple-choice questions (MCQs) to ensure they are both reliable and effective in assessing student learning.
References
Christian, D. S., Prajapati, A. C., Rana, B. M., & Dave, V. R. (2017). Evaluation of multiple choice questions using item analysis tool: a study from a medical institute of Ahmedabad, Gujarat. International Journal of Community Medicine and Public Health, 4(6), 1876-1881.
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Belmont, CA: Wadsworth.
Date, A. P., Borkar, A. S., Badwaik, R. T., Siddiqui, R. A., Shende, T. R., & Dashputra, A. V. (2019). Item analysis as tool to validate multiple choice question bank in pharmacology. International Journal of Basic and Clinical Pharmacology, 8(9), 1999-2003.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Boston, MA: Allyn and Bacon.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Rao, C., Kishan Prasad, H. L., Sajitha, K., Permi, H., & Shetty, J. (2016). Item analysis of multiple choice questions: Assessing an assessment tool in medical students. International Journal Education Psychology Research, 2(4), 201-204.
Tarrant, M., Ware, J., & Mohammed, A. M. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis. BMC Medical Education, 9, 1-8.