In human performance assessment, the choice between raw scores, criterion-referenced evaluation, and norm-referenced evaluation depends on the instructor's objectives, available resources, and the need for explainability.
Raw Scores
Raw scores are the direct numerical values measured during an assessment before any interpretation or conversion into grades occurs.
- Pros: They provide a precise, granular measurement of an individual's performance on a specific task without the potential bias introduced by grading algorithms.
- Cons: Raw scores alone are difficult to interpret because, by themselves, they do not tell learners or instructors the actual level of learning competence achieved or what improvement is needed. For example, a raw score of 50/100 could mean either excellent performance on a very difficult test or poor performance on an easy one.
Criterion-Referenced Evaluation
This scheme translates performance into absolute rating labels (e.g., Excellent, A) based on a predetermined rubric or fixed standard.
- Pros: It ensures that grades reflect a student's mastery of specific content regardless of how their peers perform. It provides a clear, absolute standard that is often easier for stakeholders to understand.
- Cons: It is most suitable for examinations that cover all content topics, which typically demands significantly longer exam times and more resources for checking answers. It can be difficult to apply if the assessment is not comprehensive.
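To make the mechanics concrete, here is a minimal Python sketch of the fixed-threshold mapping just described. The 80/70/60/50 cut-offs are illustrative assumptions, not a standard rubric:

```python
def criterion_referenced_grade(score, thresholds=None):
    """Map a raw score (0-100) to a letter grade using fixed cut-offs.

    The default thresholds are assumed for illustration; any
    predetermined rubric could be substituted.
    """
    if thresholds is None:
        thresholds = [(80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, grade in thresholds:
        if score >= cutoff:
            return grade
    return "F"  # below the lowest cut-off

print(criterion_referenced_grade(80))  # A
print(criterion_referenced_grade(79))  # B -- one point changes the grade
```

Note how a one-point difference (79 vs. 80) flips the grade; this foreshadows the explainability problem discussed later in this section.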
Norm-Referenced Evaluation
This scheme converts scores into relative ranking labels by comparing an individual’s performance to the performance of their peers.
- Pros: It is highly efficient for large classes or courses where instructors must meet strict time constraints and save on grading resources. It is the preferred "choice of necessity" when exams cannot comprehensively assess all topics due to limited resources. It inherently reflects the relative quality of performance within a specific group.
- Cons: It can be difficult to explain the reasoning behind grade boundaries, leading to disputes between learners and instructors when scores are close but result in different grades. Because it lacks predefined absolute criteria, it is more susceptible to bias and concerns regarding fairness.
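As a minimal sketch of one common norm-referenced approach, grades below are assigned by rank within the group. The 10/25/30/25/10 quota split is an assumption for illustration; graders may instead use z-scores or gap-based boundaries:

```python
def norm_referenced_grades(scores, quotas=None):
    """Assign grades by rank within the group.

    `quotas` gives the fraction of the class receiving each grade;
    the default split is an illustrative assumption.
    """
    if quotas is None:
        quotas = [("A", 0.10), ("B", 0.25), ("C", 0.30), ("D", 0.25), ("F", 0.10)]
    ranked = sorted(scores, reverse=True)  # best performance first
    grade_of, boundary, cumulative = {}, 0, 0.0
    for i, (grade, share) in enumerate(quotas):
        cumulative += share
        # the last quota absorbs any rounding remainder
        upper = len(ranked) if i == len(quotas) - 1 else round(cumulative * len(ranked))
        for s in ranked[boundary:upper]:
            grade_of.setdefault(s, grade)  # tied scores keep the better grade
        boundary = upper
    return {s: grade_of[s] for s in scores}

print(norm_referenced_grades([92, 90, 85, 85, 77, 70, 64, 58, 41, 33]))
```

Here a student's grade depends entirely on who else took the exam, which is why justifying the boundaries becomes contentious.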
Labeling or grading, in either scheme, implies interpretation; only raw scores report performance without one.
Comparison Summary
| Feature | Raw Scores | Criterion-Referenced Grading | Norm-Referenced Grading |
|---|---|---|---|
| Primary Focus | Direct measurement without interpretation | Interpretation as mastery of content | Interpretation as relative ranking |
| Standards | None | Absolute/predefined | Relative/group-based |
| Best Use Case | Raw ranking (e.g., the TCAS exam) | Certifying competency | Large-scale ranking |
| Major Drawback | Lack of context | Resource intensive | Hard to justify boundaries |
Another problem with criterion-referenced grading is that stakeholders may not agree with the criterion itself: Why must a score reach 80 or above to earn an A? Why does F cover the widest score range, 0–50?
The following points explain why these criteria can be problematic and how they contrast with the alternative methods discussed in the sources:
- Fixed Percent Ranges: Criterion-referenced grading typically maps a learning score to a predefined percent range for a specific grade (e.g., 80% for an A). This means the standards are set before the assessment begins and do not change regardless of the actual distribution of student performance.
- Lack of Explainable Discrimination: A core difficulty with fixed boundaries is the "explainability" of the grade. In these systems, a student scoring just below a threshold (79 vs. 80) may receive a different grade without any data-driven justification for that specific cut-off. The sources suggest that it is difficult for instructors to resolve disputes when learners' scores are nearly contiguous but fall on opposite sides of a predefined boundary.
- Arbitrary Nature of Absolute Standards: Because these criteria are absolute, they may not reflect the relative quality of an individual's performance compared to their peers. If an exam is "overly difficult" or "too easy," most learners may cluster into the same grade (e.g., everyone receives a C), which fails to differentiate the true learning competence within the group.
- Contrast with Data-Driven Gaps: To address the problem of arbitrary cut-offs, the sources propose norm-referenced heuristic methods such as the Widest-Gap-First algorithm. Instead of using a predefined number like 80, this method identifies the widest score gaps in the actual data and places grade boundaries there (see the sketch after this list). This yields a "simple and clear-cut justification": a student receives a certain grade because their score is closer to others in that group than to the group above.
- Fairness Concerns: When unique grade symbols represent unequal score intervals (such as F covering 0–50 while A covers only 80–100), it can be seen as providing unequal chances for students to receive certain grades. The sources note that "fair" grading should ideally maintain uniform intervals or use widest score gaps to prevent two learners with similar competence from receiving different grades.
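A minimal sketch of the widest-gap idea follows. The grade labels, the number of grades, and the midpoint boundaries are assumptions for illustration; the algorithm referenced in the sources may add constraints such as minimum group sizes:

```python
def widest_gap_first(scores, n_grades=5):
    """Cut grade boundaries at the widest gaps between sorted scores.

    A sketch of the Widest-Gap-First heuristic: boundaries fall where
    the data itself shows the largest separations, so each cut-off has
    a visible, data-driven justification.
    """
    unique = sorted(set(scores))
    # index every gap between consecutive distinct scores, widest first
    gaps = sorted(range(len(unique) - 1),
                  key=lambda i: unique[i + 1] - unique[i],
                  reverse=True)
    # choose the (n_grades - 1) widest gaps and cut at their midpoints
    cuts = sorted((unique[i] + unique[i + 1]) / 2
                  for i in gaps[:n_grades - 1])
    labels = ["F", "D", "C", "B", "A"][-n_grades:]  # low to high
    return {s: labels[sum(s > c for c in cuts)] for s in unique}

print(widest_gap_first([92, 90, 85, 84, 77, 70, 64, 63, 41, 33]))
# {33: 'F', 41: 'D', 63: 'C', 64: 'C', 70: 'C', 77: 'B', 84: 'A', ...}
```

Under this scheme, a disputed boundary can be defended by pointing at the data: the student's score sits closer to their own group than to the group above.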