Dr.Jiw: Raw scores vs Criterion-referenced evaluation vs Norm-referenced evaluation

Wednesday, 11 March 2026 (B.E. 2569)

Raw scores vs Criterion-referenced evaluation vs Norm-referenced evaluation

In human performance assessment, the choice between raw scores, criterion-referenced evaluation, and norm-referenced evaluation depends on the instructor's objectives, available resources, and the need for explainability.

Raw Scores

Raw scores are the direct numerical values measured during an assessment before any interpretation or conversion into grades occurs.

Pros: They provide a precise, granular measurement of an individual's performance on a specific task without the potential bias introduced by grading algorithms.
Cons: Raw scores alone are difficult to interpret because they do not tell learners or instructors the actual level of learning competence or which improvements are needed. For example, a raw score of 50/100 could mean excellent performance on a very difficult test or poor performance on an easy one.

Criterion-Referenced Evaluation

This scheme translates performance into absolute rating labels (e.g., Excellent, A) based on a predetermined rubric or fixed standard.

Pros: It ensures that grades reflect a student's mastery of specific content regardless of how their peers perform. It provides a clear, absolute standard that is often easier for stakeholders to understand.
Cons: It is most suitable for examinations that cover all content topics, which typically requires significantly longer exam-taking times and more resources for checking answers. It can be difficult to apply if the assessment is not comprehensive.
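
As a minimal sketch of the fixed-standard idea described above, the snippet below maps a 0-100 score to a letter grade using cut-offs set before the exam. The specific cut-offs (50/60/70/80), the label set, and the rule that a score exactly on a cut-off earns the higher grade are assumptions made for illustration; the function name criterion_grade is hypothetical.

```python
from bisect import bisect_right

# Hypothetical cut-offs chosen only for illustration.
# Each value is the lowest score that earns the next grade up.
CUTOFFS = [50, 60, 70, 80]          # lower bounds of D, C, B, A
LABELS = ["F", "D", "C", "B", "A"]  # always one more label than cut-offs


def criterion_grade(score: float) -> str:
    """Map a 0-100 score to a grade using fixed, predefined cut-offs."""
    # bisect_right counts how many cut-offs are <= score,
    # which is exactly the index of the grade label.
    return LABELS[bisect_right(CUTOFFS, score)]


print(criterion_grade(80))  # A
print(criterion_grade(79))  # B
```

The grade depends only on the individual's score and the rubric, never on how peers performed, which is the defining property of criterion-referenced grading.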

Norm-Referenced Evaluation

This scheme converts scores into relative ranking labels by comparing an individual’s performance to the performance of their peers.

Pros: It is highly efficient for large classes or courses where instructors must meet strict time constraints and save on grading resources. It is the preferred "choice of necessity" when exams cannot comprehensively assess all topics due to limited resources. It inherently reflects the relative quality of performance within a specific group.
Cons: It can be difficult to explain the reasoning behind grade boundaries, leading to disputes between learners and instructors when scores are close but result in different grades. Because it lacks predefined absolute criteria, it is more susceptible to bias and concerns regarding fairness.
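
One common way to make grades relative to the group is to grade on z-scores (distance from the class mean in standard deviations). The sketch below is only one possible norm-referenced scheme, not necessarily the one the author has in mind; the thresholds at -1, 0, and +1 standard deviations, the four labels, and the function name z_score_grades are all assumptions.

```python
from statistics import mean, pstdev


def z_score_grades(scores):
    """Grade each score by its distance from the group mean (in std devs)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation of this group
    grades = []
    for s in scores:
        # If everyone has the same score there is no spread to rank on.
        z = 0.0 if sigma == 0 else (s - mu) / sigma
        if z >= 1.0:
            grades.append("A")
        elif z >= 0.0:
            grades.append("B")
        elif z >= -1.0:
            grades.append("C")
        else:
            grades.append("D")
    return grades


# The same raw score of 50 earns different grades in different groups.
print(z_score_grades([30, 40, 50, 60, 70]))  # ['D', 'C', 'B', 'B', 'A']
print(z_score_grades([50, 60, 70, 80, 90]))  # ['D', 'C', 'B', 'B', 'A']
```

In the first group a raw 50 sits at the class mean and earns a B; in the second it is the lowest score and earns a D. This illustrates both the strength (the grade reflects relative standing) and the weakness (the boundaries shift with the cohort and are harder to justify).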

Note that any labeling or grading implies interpretation, whereas raw scores carry none.

Comparison Summary

Feature	Raw Scores	Criterion-Referenced Grading	Norm-Referenced Grading
Primary Focus	Direct measurement without interpretation	Interpretation as mastery of content	Interpretation as relative ranking
Standards	None	Absolute/Predefined	Relative/Group-based
Best Use Case	Raw ranking, as in the TCAS exam	Certifying competency	Large-scale ranking
Major Drawback	Lack of context	Resource intensive	Hard to justify boundaries

Discussion: Another problem with criterion-referenced grading is that someone may disagree with your criteria. Why must a score be 80 points or higher to earn an A? Why does F have the widest score range, 0-50?

The following points explain why these criteria can be problematic and how they contrast with the alternative methods discussed in the sources:

Fixed Percent Ranges: Criterion-referenced grading typically maps a learning score to a predefined percent range for a specific grade (e.g., 80% for an A). This means the standards are set before the assessment begins and do not change regardless of the actual distribution of student performance.
Lack of Explainable Discrimination: A core difficulty with fixed boundaries is the "explainability" of the grade. In these systems, two students scoring on either side of a threshold (like 79 vs. 80) receive different grades without a data-driven justification for that specific cut-off. The sources suggest that it is difficult for instructors to resolve disputes when learners have adjacent scores but fall into different predefined boundaries.
Arbitrary Nature of Absolute Standards: Because these criteria are absolute, they may not reflect the relative quality of an individual's performance compared to their peers. If an exam is "overly difficult" or "too easy," learners with similar scores might all receive the same grade (for example, C), so the grades cannot accurately differentiate the true learning competence within the group.
Contrast with Data-Driven Gaps: To address the problem of arbitrary cut-offs, the sources propose norm-referenced heuristic methods like the Widest-Gap-First algorithm. Instead of using a predefined number like 80, this method identifies the widest score gaps in the actual data to define boundaries. This provides a "simple and clear-cut justification": a student receives a certain grade because their score is closer to others in that group than to the group above.
Fairness Concerns: When unique grade symbols represent unequal score intervals (such as F covering 0–50 while A covers only 80–100), it can be seen as providing unequal chances for students to receive certain grades. The sources note that "fair" grading should ideally maintain uniform intervals or use widest score gaps to prevent two learners with similar competence from receiving different grades.
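
The idea of placing boundaries at the widest score gaps can be sketched as follows. This is a simplified illustration of the gap-based idea, not the Widest-Gap-First algorithm as specified in the sources: the tie-breaking rule (prefer the gap nearer the top of the ranking), the handling of duplicate scores, and the function name widest_gap_grades are assumptions.

```python
def widest_gap_grades(scores, labels=("A", "B", "C", "D", "F")):
    """Assign labels by cutting the ranked scores at the widest gaps."""
    distinct = sorted(set(scores), reverse=True)
    # k labels need k - 1 boundaries; use fewer if there are few scores.
    n_cuts = min(len(labels), len(distinct)) - 1
    # Gap i lies between distinct[i] and distinct[i + 1].
    gaps = [(distinct[i] - distinct[i + 1], i)
            for i in range(len(distinct) - 1)]
    # Widest first; ties go to the gap nearer the top (an assumption).
    widest = sorted(gaps, key=lambda g: (-g[0], g[1]))[:n_cuts]
    cut_after = {i for _, i in widest}

    grade_of = {}
    level = 0
    for i, s in enumerate(distinct):
        grade_of[s] = labels[level]
        if i in cut_after:
            level += 1  # the next lower score starts the next grade
    return [grade_of[s] for s in scores]


scores = [95, 93, 80, 79, 78, 60, 58, 30]
print(widest_gap_grades(scores, labels=("A", "B", "C", "D")))
# ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D']
```

Here the boundaries fall at the gaps of 13 (93 to 80), 18 (78 to 60), and 28 (58 to 30), so the students scoring 79 and 80 share a grade. Under a fixed 80-point cut-off they would have been separated by a single point.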