Toward Best Practice in Cut-Score Determination: Incorporating Uncertainty
A credentialing examination is designed to distinguish those who meet the minimum standard required for safe and effective practice from those who do not. For that reason, standard setting is one of the most consequential steps in the assessment process. It provides a formal basis for determining the score that corresponds to a defined level of minimum competence (Cizek & Bunch, 2006).
A provisional cut score from the standard-setting process, however, should not be interpreted as an exact and error-free threshold. The sources of uncertainty depend in part on the method used. For example, in the Modified Angoff method, uncertainty reflects multiple sources: sampling variability from the selected items and panelists, since a different set of items or panelists could yield a different recommendation, and measurement error in the test itself, because an observed score is an imperfect indicator of true ability (e.g., Mercado et al., 2024).
For these reasons, a single cut score can be misleading when it is presented without appropriate context as the basis for the final operational cut score. A more informative approach is to supplement the cut score with uncertainty information, i.e., quantitative evidence of the plausible range of the cut score and the precision of measurement around it. Such information helps decision-makers evaluate the stability of the point estimate and its implications when making the final operational cut-score decision. This is especially important in credentialing, where decision errors have asymmetric consequences: a false positive may threaten public safety, whereas a false negative may unfairly block a competent candidate from entering or continuing in professional practice.
Modified Angoff With Uncertainty Information
The Modified Angoff method (Angoff, 1984; Hambleton & Pitoniak, 2006) remains one of the most widely used standard-setting approaches in credentialing. This method requires subject matter experts to review each item and estimate the probability that a minimally competent candidate would answer it correctly. These judgments are then aggregated across items and panelists to produce a recommended, or provisional, cut score.
The popularity of Modified Angoff stems from its intuitive link to the concept of minimal competence. Yet, once the final recommendation is reported, the cut score is often treated as if it were exact. That interpretation overlooks the fact that the estimate is based on sampled items, sampled panelists, and imperfect measurement.
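The aggregation step can be sketched in a few lines. The ratings below are hypothetical, and the mean-of-panelist-sums rule shown here is one common aggregation choice; operational programs may trim, weight, or iterate on ratings differently.

```python
# Hypothetical Modified Angoff ratings: rows are panelists, columns are items.
# Each entry is the judged probability that a minimally competent candidate
# answers the item correctly.
ratings = [
    [0.60, 0.45, 0.70, 0.55],
    [0.65, 0.50, 0.60, 0.50],
    [0.55, 0.40, 0.75, 0.60],
]

# A panelist's implied cut is the sum of their item probabilities, i.e., the
# expected raw score of a minimally competent candidate on this item set.
panelist_cuts = [sum(row) for row in ratings]

# The provisional cut score is the mean of the panelist-level cuts.
provisional_cut = sum(panelist_cuts) / len(panelist_cuts)
print(round(provisional_cut, 2))  # → 2.28 on this 4-item illustration
```

The spread of `panelist_cuts` around the mean is also the raw material for the sampling-based uncertainty estimates discussed in this article.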
Hypothetical Modified Angoff Example
Consider a hypothetical 200-item examination. Suppose the Modified Angoff process yields a cut at 55% correct, corresponding to a raw cut score of 110 out of 200 and a cut score of 0.30 logits under the Rasch model.1
Now suppose the standard-setting study yields a margin of error (MOE) of 2%, which corresponds to four raw-score points on a 200-item test. The cut score can therefore be expressed as 110 ± 4, implying a raw-score interval from 106 to 114 at the 68% confidence level. For illustration, suppose this interval corresponds approximately to a Rasch location of 0.30 ± 0.10 logits. This interval does not replace the standard-setting recommendation. Rather, it reminds decision-makers that the reported cut depends in part on sampling-related variability.
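One way such an MOE can be obtained (an assumption here, since programs compute it differently) is as the standard error of the mean across panelist-level cut scores. The panelist values below are hypothetical.

```python
import statistics

# Hypothetical panelist-level Angoff cut scores, in percent correct.
panelist_cuts = [53.0, 56.5, 54.0, 57.5, 55.0, 54.5, 56.0, 53.5]

mean_cut = statistics.mean(panelist_cuts)

# One standard error of the mean, i.e., an MOE at roughly the 68% level.
moe_percent = statistics.stdev(panelist_cuts) / len(panelist_cuts) ** 0.5

# Convert the percent-correct MOE to raw-score points on a 200-item test.
moe_raw = moe_percent / 100 * 200

print(round(mean_cut, 1), round(moe_percent, 2), round(moe_raw, 1))
# → 55.0 0.55 1.1
```

A wider spread of panelist cuts, or a smaller panel, widens the interval accordingly.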
Figure 1. Variability in Standard-Setting Ratings
Suppose, further, that the test’s overall standard error of measurement (SEM) is 0.17 logits, while the conditional standard error of measurement (CSEM) at the cut score of 0.30 is 0.16 logits. Then a one-standard-error band around the cut is 0.30 ± 0.16, yielding bounds of 0.14 and 0.46 logits. This interval is more relevant than the overall SEM because it reflects measurement precision at the provisional decision point. If the cut score is used to classify candidates, local precision at the cut matters more than average precision across the entire score scale.
Figure 2. SEM and CSEM Under the Rasch Model
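Under the Rasch model, item information at ability θ is p(1 − p), test information I(θ) is its sum over items, and CSEM(θ) = 1/√I(θ). A minimal sketch follows; the evenly spaced difficulties are an assumption for illustration, so the resulting CSEM will not match any particular operational test.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def csem(theta, difficulties):
    """Conditional SEM: one over the square root of test information."""
    info = sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical bank: 200 items with difficulties evenly spaced on [-2, 2] logits.
difficulties = [-2.0 + 4.0 * i / 199 for i in range(200)]

print(round(csem(0.30, difficulties), 2))  # → 0.16, the precision at the cut
```

Because information peaks where item difficulties are concentrated, the CSEM grows toward the extremes of the scale, which is why precision at the cut, rather than the test average, should drive classification decisions.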
Hofstee With Uncertainty Information
The Hofstee method (Hofstee, 1983) is often used as a complementary approach to the Modified Angoff because it combines normative and compromise-based considerations. Rather than judging each item individually, panelists provide four global judgments: the minimum acceptable cut score (CMIN), the maximum acceptable cut score (CMAX), the minimum acceptable failure rate (FMIN), and the maximum acceptable failure rate (FMAX).
These values define a line segment in the score-by-failure-rate plane. The Hofstee cut score is obtained by identifying the intersection between that line and the empirical cumulative score distribution, typically represented as an ogive. This procedure is appealing because it constrains the final recommendation to a region that panelists consider acceptable both in terms of score and consequences.
Figure 3. Hofstee Cut Score Determination
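The intersection can be located numerically. Below is a grid-search sketch; the simulated score distribution, the panel-mean judgment values, and the step size are all assumptions for illustration.

```python
import random

def fail_rate(cut, scores):
    """Proportion of candidates scoring below the cut."""
    return sum(s < cut for s in scores) / len(scores)

def hofstee_cut(c_min, c_max, f_min, f_max, scores, step=0.1):
    """Scan cuts from c_min to c_max and return the first point where the
    empirical failure rate meets or exceeds the Hofstee line."""
    cut = c_min
    while cut <= c_max:
        # Height of the Hofstee line at this cut: linear interpolation
        # between (c_min, f_max) and (c_max, f_min).
        line = f_max + (f_min - f_max) * (cut - c_min) / (c_max - c_min)
        if fail_rate(cut, scores) >= line:
            return cut
        cut += step
    return c_max  # the curve and line cross at or beyond c_max

random.seed(0)
# Hypothetical percent-correct scores for 1,000 candidates.
scores = [min(100.0, max(0.0, random.gauss(62, 10))) for _ in range(1000)]

# Hypothetical panel means for the four Hofstee judgments.
cut = hofstee_cut(c_min=50, c_max=60, f_min=0.10, f_max=0.40, scores=scores)
print(round(cut, 1))
```

By construction the result lies between CMIN and CMAX, and the implied failure rate lies between FMIN and FMAX, which is the constraint that makes the method attractive.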
Quantifying Uncertainty for Hofstee Cut Scores
Unlike the Modified Angoff, Hofstee does not readily yield a conventional MOE based on a standard sampling framework. However, uncertainty can still be studied through resampling methods, especially bootstrap procedures (Efron & Tibshirani, 1994).
Because the Hofstee method depends on panelists’ judgments about CMIN, CMAX, FMIN, and FMAX, uncertainty in those inputs can be propagated to the final cut score by repeatedly resampling those judgments with replacement, allowing some to be selected multiple times and others not at all, and recomputing the Hofstee cut. After a large number of replications (e.g., 1,000), the empirical distribution of resulting cut scores can be summarized with percentile-based intervals, such as the 16th and 84th percentiles (a 68% confidence interval). The interval can then be converted to the corresponding raw-score and logit scales.
Figure 4. Bootstrap-Based 68% Confidence Interval for the Hofstee Cut Score
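The resampling scheme can be sketched end to end. Everything below is hypothetical: the candidate score distribution, the six panelists’ judgment quadruples, and the grid search used to locate each bootstrap cut.

```python
import bisect
import random

def hofstee_cut(c_min, c_max, f_min, f_max, sorted_scores, step=0.1):
    """Grid-search the crossing of the Hofstee line and the failure-rate ogive."""
    n = len(sorted_scores)
    cut = c_min
    while cut <= c_max:
        line = f_max + (f_min - f_max) * (cut - c_min) / (c_max - c_min)
        fail = bisect.bisect_left(sorted_scores, cut) / n  # share scoring below cut
        if fail >= line:
            return cut
        cut += step
    return c_max

random.seed(0)
scores = sorted(min(100.0, max(0.0, random.gauss(62, 10))) for _ in range(1000))

# Hypothetical per-panelist judgments: (c_min, c_max, f_min, f_max).
panel = [
    (48, 58, 0.05, 0.35), (52, 62, 0.10, 0.40), (50, 60, 0.10, 0.45),
    (49, 59, 0.08, 0.38), (51, 61, 0.12, 0.42), (50, 62, 0.10, 0.40),
]

boot_cuts = []
for _ in range(1000):  # bootstrap replications
    # Resample panelists with replacement and average their four judgments.
    sample = [random.choice(panel) for _ in panel]
    means = [sum(j[k] for j in sample) / len(sample) for k in range(4)]
    boot_cuts.append(hofstee_cut(means[0], means[1], means[2], means[3], scores))

boot_cuts.sort()
low, high = boot_cuts[159], boot_cuts[839]  # 16th and 84th percentiles
print(low, high)  # 68% bootstrap interval for the Hofstee cut
```

The width of the interval directly reflects how much the panelists disagree about the four global judgments; a more homogeneous panel yields a tighter interval.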
Measurement error can be incorporated into the Hofstee cut score, as described above for the Modified Angoff method.
Determining the Final Operational Cut Score
Uncertainty information should not be viewed as replacing expert judgment. Rather, it should support more informed final decisions.
A practical decision process may involve three layers of interpretation. First, when MOE is available, decision-makers can evaluate how sensitive the cut score is to item and panel sampling. Second, in assessments where public safety is critical, decision-makers may place more weight on the upper bound of an uncertainty interval to reduce the risk of false positives. Third, decision-makers may also consider the lower bound to avoid unduly failing candidates whose true competence may exceed the standard, but whose observed performance is affected by measurement error.
These are not purely statistical choices. They are policy choices informed by the purpose of the assessment. The final cut score should therefore be selected through a combination of content-based judgment, psychometric evidence, and operational considerations.
Implications for Best Practice
Although standard setting in credentialing is intended to produce a single operational cut score, the provisional recommendation should be interpreted together with the uncertainty and contextual evidence that inform the determination of the final standard. In credentialing, pass/fail decisions are influenced not only by expert judgments about minimum competence, but also by uncertainty due to item sampling, panelist variability, and error inherent in the measurement instrument. Presenting only one point estimate risks overstating precision and understating the complexity of defensible decision-making.
Modified Angoff and Hofstee remain valuable and widely used standard-setting methods, but their outputs are better interpreted when accompanied by uncertainty information. For Modified Angoff, MOE can reflect sampling-related variability, and CSEM at the cut provides a more decision-relevant measure than overall SEM. For Hofstee, bootstrap-based intervals offer a practical way to characterize variability in the final cut score arising from panel judgments.
Taken together, these approaches provide a stronger basis for final standard-setting decisions. They allow decision-makers to weigh psychometric uncertainty alongside public protection, candidate fairness, and operational impact. In that sense, best practice in final operational cut-score determination involves interpreting the provisional standard-setting recommendation in light of uncertainty information rather than relying on a single point estimate alone; doing so makes the decision more transparent, more defensible, and ultimately more valid.
1 The Rasch model is a probabilistic measurement model that explains item responses using examinee ability and item difficulty on a common scale.
References
Angoff, W. H. (1984). Scales, norms, and equivalent scores. Educational Testing Service.
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage Publications.
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Chapman and Hall/CRC.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Praeger.
Hofstee, W. K. (1983). The case for compromise in educational selection and grading. In S. B. Anderson & J. S. Helmick (Eds.), On educational testing (pp. 109–127). Jossey-Bass.
Mercado, R., Fitzpatrick, J., Kendallen, S., & Smith, J. (2024). Cut Score Adjustment Background and Best Practices.