
Understanding Multiple Choice Test Item Analysis Report from DataLink


As faculty members, we all strive to create fair, effective, and meaningful assessments. But once a test is administered and the results are in, we’re often left wondering: Did my test truly measure student learning as I intended? If you’ve ever opened your test item analysis report from DataLink and felt overwhelmed by the numbers, statistics, and unfamiliar terms, you’re not alone. Many faculty members find themselves asking: What do these indicators actually mean? How can they help me improve my assessments? While DataLink provides valuable insights into how individual test questions perform, making sense of the data—and more importantly, knowing what to do with it—can feel like a challenge.
This two-part post aims to break down test item analysis in a way that’s simple and practical. The first part will help you understand the key indicators in your DataLink report—what they measure, how to interpret them, and how they can help you refine your test items. The second part will answer the crucial question: What’s next? Once you understand the data, what steps can you take to improve your test questions and ensure your assessments are valid, reliable, and aligned with student learning outcomes?

Test Item Analysis

Test item analysis is a process used to evaluate the effectiveness of individual test questions (items) in an assessment, most commonly multiple-choice questions. Item analysis allows us to observe item characteristics and to improve the quality of the test (Gronlund, 1998). Five key metrics used in item analysis are the Item Discrimination Index, the Item Difficulty Index, the Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20). These measures help educators improve the quality of their multiple-choice questions (MCQs).

Item Discrimination Index (D)

The Item Discrimination Index (D) measures how well a test item differentiates between high-performing and low-performing students. In other words, item discrimination shows how well your question distinguishes between students who understand the course material well and those who do not. A high discrimination index means the question is effective in distinguishing between different levels of student ability. A low or negative discrimination index suggests that the question might be flawed: it might be too easy, too tricky, or even misleading. The Item Discrimination Index (D) is calculated by comparing the proportion of top-scoring students who answered the item correctly with the proportion of low-scoring students who answered it correctly (Christian et al., 2017; Date et al., 2019; Rao et al., 2016).
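
For readers who want to see the arithmetic, here is a minimal sketch of the calculation in Python. The function, the sample data, and the use of top and bottom 27% scoring groups are illustrative assumptions for this post, not DataLink's exact procedure.

```python
# Minimal sketch of how D can be computed from item scores (0 = wrong, 1 = right).
# Assumes the upper and lower groups are the top and bottom 27% of total scorers,
# a common convention; DataLink may use a different grouping.

def discrimination_index(item_correct, total_scores, group_fraction=0.27):
    """D = (proportion correct in upper group) - (proportion correct in lower group)."""
    n = len(total_scores)
    k = max(1, int(n * group_fraction))
    # Rank students by total test score, highest first.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Hypothetical example: 10 students, 1 = answered this item correctly.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
print(round(discrimination_index(item, totals), 2))
```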

Values Interpretation:

D Value | Interpretation | Recommendation
0.40 and above | Excellent discrimination | The item strongly differentiates between high and low performers.
0.30 – 0.39 | Acceptable discrimination | Acceptable and useful in most tests.
0.20 – 0.29 | Moderate discrimination | Consider revising; may not differentiate well.
0.10 – 0.19 | Weak discrimination | Likely ineffective; should be improved or replaced.
Below 0.10 | Poor discrimination | Should be removed or rewritten.
Negative | Misleading or flawed item | Strong students are getting it wrong while weaker students are getting it right; needs revision or removal.

Item Difficulty Index (P)

The Item Difficulty Index (P) refers to how easy or hard a test question is for students. If most students answer a question correctly, it is considered an easy question; if most answer it incorrectly, it is considered a difficult question (Crocker & Algina, 2008; Hambleton et al., 1991). Ideally, a well-balanced test should contain questions of varying difficulty levels to assess both basic understanding and advanced knowledge. The difficulty of a test item is usually expressed as a difficulty index, also known as the p-value, which is calculated as the proportion of students who answer the question correctly. This value ranges from 0 to 1: a value close to 1 (e.g., 0.90) means the question is very easy because 90% of students answered it correctly, while a value close to 0 (e.g., 0.20) means the question is difficult because only 20% of students got it right.
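
Computationally, P is just a proportion of correct responses. The short sketch below uses hypothetical 0/1 item data to show the idea; it is not drawn from a DataLink report.

```python
# Minimal sketch: the difficulty index P is the proportion of students
# who answered the item correctly (responses scored 0 = wrong, 1 = right).

def difficulty_index(item_correct):
    return sum(item_correct) / len(item_correct)

# Hypothetical class of 10 students: 7 answered this item correctly.
item = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(round(difficulty_index(item), 2))  # 0.7 -> moderately easy
```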

Values Interpretation:

P Value | Interpretation | Recommendation
Above 0.80 | Very easy | Consider revising or removing if too many questions are this easy.
0.60 – 0.80 | Moderately easy | Generally acceptable but may not effectively differentiate students.
0.40 – 0.59 | Ideal difficulty range | Best range for assessing student performance effectively.
0.21 – 0.39 | Difficult | May be too challenging; consider revising if too many students struggle.
0.00 – 0.20 | Too difficult | Likely too hard or confusing; check for clarity or unfair difficulty.

Distractor Analysis

Distractor analysis is the process of evaluating how students interact with the incorrect answer choices in a multiple-choice question (Tarrant et al., 2009). The goal is to ensure that the distractors (wrong answer choices) are plausible and challenging enough that students who do not know the correct answer are drawn to them, while students who have mastered the material are more likely to choose the correct answer. If distractors are too obvious, too tricky, or not selected by any students, they do not contribute to the effectiveness of the question. By analyzing how often each answer choice is selected, we can determine whether the distractors are doing their job in assessing student knowledge. Two terms are useful here:

  • Non-Functional Distractor (NFD) – a distractor chosen by fewer than 5% of students, meaning it is not serving its purpose (Tarrant et al., 2009).
  • Functional Distractor – a distractor chosen by a reasonable percentage of students, ideally more often by low-scoring students.

How to Conduct Distractor Analysis

For each multiple-choice question, find out how many students selected each answer option.

Example Data Table:

Answer Option | Total Students | % of Total | High-Scorers | Low-Scorers | Status
A (Correct Answer) | 60 | 60% | 50 | 10 | Correct
B (Distractor 1) | 20 | 20% | 5 | 15 | Functional
C (Distractor 2) | 18 | 18% | 4 | 14 | Functional
D (Distractor 3) | 2 | 2% | 1 | 1 | Non-Functional
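
A simple tally can flag non-functional distractors automatically. The sketch below applies the 5% rule of thumb (Tarrant et al., 2009) to the hypothetical counts from the example table; the option labels and counts are illustrative only.

```python
# Minimal sketch: flag distractors chosen by fewer than 5% of students as non-functional.
# Counts are the hypothetical data from the example table above.

counts = {"A": 60, "B": 20, "C": 18, "D": 2}  # how many students chose each option
correct_option = "A"
total = sum(counts.values())

for option, n in counts.items():
    share = n / total
    if option == correct_option:
        status = "Correct"
    elif share < 0.05:
        status = "Non-Functional"
    else:
        status = "Functional"
    print(f"{option}: {n} students ({share:.0%}) -> {status}")
```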

How to Use Distractor Analysis for Test Improvement

Revise Non-Functional Distractors

  • If a distractor is rarely chosen (<5% of students), it may be too obviously incorrect.
  • Revise it to make it more plausible.
  • Ensure all distractors relate to common misconceptions.

Ensure a Balance in Response Distribution

  • The correct answer should be the most selected.
  • Distractors should attract lower-scoring students without misleading high-performing students.

Avoid “Trick” Questions

  • Distractors should be plausible but clearly incorrect, not designed to “trick” students unfairly.
  • Ensure wording is clear so that students struggling with the content—not the wording—choose the distractor.

Identify Guessing Patterns

  • If responses are evenly distributed across options, students might be guessing, meaning the question may be too difficult or unclear.

Reliability – Kuder-Richardson 20 (KR-20)

The reliability of a test refers to the extent to which the test is likely to produce consistent scores. Kuder-Richardson 20 (KR-20) is a statistical measure used to evaluate the internal consistency of a test. It tells us how reliable a test is by checking whether the test items (questions) work together well to measure the same concept. If a test has a high KR-20 score (close to 1.0), it means that the questions are well-aligned, and students’ performance is consistent. A low KR-20 score (closer to 0), on the other hand, suggests that some questions may not be effectively contributing to the test’s overall purpose—possibly because they are too difficult, too easy, or unrelated to the main concept being assessed. The following general guidelines can help interpret KR-20 scores for classroom exams:
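
For readers who want to see where the number comes from, KR-20 combines the item difficulty values with the variance of the total scores. The sketch below implements the standard KR-20 formula on a small, hypothetical 0/1 score matrix; it illustrates the mechanics rather than reproducing DataLink's computation.

```python
# Minimal sketch of the KR-20 formula:
# KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores),
# where k is the number of items, p_i is the proportion answering item i correctly,
# and q_i = 1 - p_i. Data are hypothetical (rows = students, columns = items, 1 = correct).

def kr20(scores):
    k = len(scores[0])                      # number of items
    n = len(scores)                         # number of students
    totals = [sum(row) for row in scores]   # each student's total score
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(scores), 2))
```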

Values Interpretation:

Reliability | Interpretation
.90 and above | Excellent reliability; at the level of the best standardized tests.
.80 – .90 | Very good for a classroom test.
.70 – .80 | Good for a classroom test; in the range of most classroom tests. There are probably a few items that could be improved.
.60 – .70 | Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items that could be improved.
.50 – .60 | Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
Below .50 | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

Point-Biserial Correlation (r_pb)

The Point-Biserial Correlation (r_pb) measures the relationship between how students perform on a specific test question and their overall test score. It helps determine whether a question is aligned with overall student performance. If a question has a high point-biserial correlation (close to +1.0), this means students who scored high on the overall test are more likely to get the question correct, and students who scored low on the test are more likely to get it wrong. This is a good sign that the question is functioning properly. If the r_pb is near 0, it suggests that the question does not contribute much to differentiating students, meaning strong and weak students are equally likely to get it right. Such a question may not be useful in assessing students’ abilities.

A negative r_pb value indicates a serious issue. It suggests that students who performed well on the overall test were more likely to get the question wrong, while weaker students were more likely to get it right. This often happens when a question is misleading, poorly worded, or has a tricky answer choice that confuses knowledgeable students.
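
As with the other indices, r_pb can be computed directly from the item and total scores. The sketch below applies the standard point-biserial formula to the same hypothetical ten-student data as the discrimination example; some item-analysis reports first subtract the item's own score from the total (a "corrected" correlation), which this sketch does not do.

```python
# Minimal sketch of the point-biserial correlation between one item (0/1)
# and total test scores, using r_pb = ((M1 - M0) / s) * sqrt(p * q),
# where M1 and M0 are the mean totals of students who got the item right/wrong,
# s is the standard deviation of all totals, p is the proportion correct, q = 1 - p.

import math

def point_biserial(item_correct, total_scores):
    n = len(total_scores)
    mean_total = sum(total_scores) / n
    s = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    right = [t for x, t in zip(item_correct, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_correct, total_scores) if x == 0]
    p = len(right) / n
    q = 1 - p
    return ((sum(right) / len(right) - sum(wrong) / len(wrong)) / s) * math.sqrt(p * q)

# Hypothetical data: same ten students as in the discrimination example.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
print(round(point_biserial(item, totals), 2))
```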

Values Interpretation:

r_pb Value | Interpretation | Recommendation
0.40 and above | Excellent item | Strongly correlates with overall test performance; ideal for high-quality tests.
0.30 – 0.39 | Good item | Generally acceptable and contributes well to test reliability.
0.20 – 0.29 | Acceptable item | Acceptable, but may need revision for better differentiation.
0.10 – 0.19 | Weak item | Does not contribute much to the test; consider revising or replacing.
Below 0.10 | Very weak item | Should be removed or rewritten, as it does not distinguish well.
Negative | Flawed item | High-scoring students tend to get it wrong while low-scoring students get it right; needs urgent revision.

Conclusion

Assessing the quality of test items is essential for creating fair and effective assessments that accurately measure student learning. However, no single metric can fully determine the quality of a test. Instead, all the indicators—Item Discrimination Index, Item Difficulty Index, Point-Biserial Correlation, Distractor Analysis, and Kuder-Richardson 20 (KR-20)—should be examined together to gain a comprehensive understanding of how well a test functions. Looking at these measures in isolation can be misleading, as a test may score well in one area while having weaknesses in another. For example, a test with high reliability (KR-20) might still have questions that fail to effectively differentiate between high- and low-performing students (low discrimination index). Considering all these indicators holistically will help you make informed decisions about which multiple-choice test items to revise, replace, or retain. This approach ensures that assessments are not only statistically sound but also aligned with learning objectives and fair to all students.

Part two of this post will discuss practical strategies for enhancing the quality of your multiple-choice questions (MCQs) to ensure they are both reliable and effective in assessing student learning.

References

Christian, D. S., Prajapati, A. C., Rana, B. M., & Dave, V. R. (2017). Evaluation of multiple choice questions using item analysis tool: A study from a medical institute of Ahmedabad, Gujarat. International Journal of Community Medicine and Public Health, 4(6), 1876-1881.

Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Belmont, CA: Wadsworth.

Date, A. P., Borkar, A. S., Badwaik, R. T., Siddiqui, R. A., Shende, T. R., & Dashputra, A. V. (2019). Item analysis as tool to validate multiple choice question bank in pharmacology. International Journal of Basic & Clinical Pharmacology, 8(9), 1999-2003.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Boston, MA: Allyn and Bacon.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Rao, C., Kishan Prasad, H. L., Sajitha, K., Permi, H., & Shetty, J. (2016). Item analysis of multiple choice questions: Assessing an assessment tool in medical students. International Journal of Educational and Psychological Researches, 2(4), 201-204.

Tarrant, M., Ware, J., & Mohammed, A. M. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. BMC Medical Education, 9, 1-8.


David Baidoo-Anu

David Baidoo-Anu, Ph.D. (Education), brings vast professional experience as a researcher, educator, and assessment specialist, particularly within the contexts of North America (especially Canada and the USA) and Africa. He has previously taught courses such as Educational Statistics, Educational Assessment, Educational Research Methods, Evaluation of Teaching and Learning, Psychological Foundations of Education, and several other educational courses. Dr. Baidoo-Anu has also worked as an educational and assessment specialist and consultant with international organizations such as the World Bank, Educational Testing Service (ETS), and others.
