Exploring the Psychometric Concept of Reliability

9.20.19

Along with validity and fairness, reliability is a foundational principle of assessment. In explaining the activity of testing to non-testing folks, I have often found it helpful to conceptualize testing as a measurement activity. Under this analogy of measurement, we can define reliability as the *consistency or comparability of repeated measurements of the same or like subjects* (i.e., the focus of the measurement, or the test candidate).

Imagine a measuring device we are all (perhaps painfully) familiar with: a common bathroom scale. Our expectation of that scale is that a consistent number will display on the digital read-out each time it’s stepped on. Unless I were doing something intentional to reduce or increase my weight, I would expect approximately the same number, within a narrow range, to appear. It might fluctuate slightly from day to day, but large swings in the results would indicate that the instrument has become unreliable. Likewise for tests.

Unless there is an identifiable factor that should influence dramatic changes, we would expect comparable results across repeated test administrations. The random variation we might observe in repeated measurements we refer to as *error*, which is the converse concept to reliability. The goal of test developers is to minimize errors in our measurements to optimize their precision.

There are many different methods for evaluating reliability of assessment scores. I shall constrain this discussion to the most common ones. Reliability indices almost always take the form of correlations (a common statistical indicator of the degree of association between two variables), and have an effective range of 0 to 1, with a higher value indicating more precision.

Perhaps the most “direct” method is *test-retest reliability*, the correlation of test scores arising from repeated administrations of the test. Administering the exact same form on two occasions could introduce memory effects as a confounding factor; therefore, parallel forms built to the same blueprint but containing different questions can be administered to the same candidates to circumvent this issue. A related index is the *split-half reliability* index. This method randomly divides the items on a test into two halves, calculates each candidate’s score on both halves, and then calculates the correlation between the two sets of scores.
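To make these two indices concrete, here is a minimal sketch in plain Python. The helper names (`pearson_r`, `split_half_reliability`), the seed, and all of the data are hypothetical and chosen only for illustration; an operational analysis would normally use a statistics package and far more candidates.

```python
import random


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def split_half_reliability(responses, seed=0):
    """Correlate candidates' scores on two randomly chosen halves of the items.

    `responses` is a list of candidates, each a list of item scores.
    """
    n_items = len(responses[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    half_a, half_b = items[: n_items // 2], items[n_items // 2:]
    scores_a = [sum(row[i] for i in half_a) for row in responses]
    scores_b = [sum(row[i] for i in half_b) for row in responses]
    return pearson_r(scores_a, scores_b)


# Hypothetical data: the same five candidates tested on two occasions.
first_sitting = [52, 61, 70, 75, 88]
second_sitting = [55, 59, 72, 78, 85]
test_retest = pearson_r(first_sitting, second_sitting)

# Hypothetical data: six candidates, four items scored 1 (correct) or 0.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
split_half = split_half_reliability(responses)
```

With so few items, the split-half value will swing noticeably depending on which items land in each half, which is one reason a single random split is rarely reported on its own.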

Another category of reliability index measures *internal consistency*, or the degree to which items on a test perform consistently across candidates. The advantage of internal consistency measures is that they can be computed from a single administration of an exam to a group of candidates, making repeated administrations unnecessary. By far the most commonly used internal consistency index is *Cronbach’s coefficient alpha*, the formula for which can be found in psychometric texts. Cronbach’s alpha can be interpreted as the mean of all possible split-half coefficients. It is often conflated with another measure of internal consistency, the *Kuder-Richardson 20* (or KR-20, so named because it was the 20th formula in a jointly published manuscript). While the two are closely related, the KR-20 is a special case of coefficient alpha for dichotomously scored items that are either correct (1) or incorrect (0). Cronbach’s alpha extends to polytomously scored items, which are scored on a rating scale (e.g., 1, 2, 3, 4, 5) or with partial credit (e.g., 0.25, 0.5, 0.75, 1.0).
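As a rough illustration, the sketch below computes coefficient alpha using the standard textbook form, k/(k − 1) × (1 − sum of item variances / variance of total scores), and KR-20 using the proportion-correct form. The function names and the response matrix are hypothetical; the point is only to show that the two indices coincide when items are scored 0/1.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a candidates-by-items matrix of item scores."""
    k = len(responses[0])  # number of items
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(responses):
    """KR-20 for dichotomous (0/1) items: item variance is p * (1 - p)."""
    k = len(responses[0])
    n = len(responses)
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    pq_sum = sum(p_i * (1 - p_i) for p_i in p)
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


# Hypothetical data: six candidates, four items scored 1 (correct) or 0.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)   # about 0.667 for this toy data
kr20_value = kr20(responses)        # same value, since items are 0/1
```

Population variances are used for both the items and the total score here; using sample variances for both would give the same ratio and therefore the same alpha.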

While reliability measures are most frequently reported as correlations, it is also common, particularly in achievement tests, to report the converse of precision, that is, a measure of error. *Standard error of measurement* (SEM) represents how much uncertainty is present in test scores, or how much unexplained variability we would observe with repeated administrations of the test. For instance, a candidate may receive a test score of 1850 with an associated SEM of 50. The first thing to digest is that the 1850 score may or may not be influenced by error in this case. We could then surmise that the candidate’s “true” score (the score free from error) is likely somewhere between 1750 and 1950, a roughly 95% *confidence interval* extending 2 times SEM on either side of the observed score. An advantage of SEM is that it can be reported for every observed score on the scale produced by a test, not just in aggregate. The degree of error may actually vary at different points along the scale, so it may be of interest to look at the error value (i.e., *conditional SEM*) at a specific score, such as the passing standard. Reliability coefficients and SEM are conceptually linked, and existing formulas can translate one to the other.
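The arithmetic in the example above, and the link between SEM and a reliability coefficient, can be sketched as follows. The conversion uses the classical test theory relationship SEM = SD × √(1 − reliability); the scale standard deviation of 125 and reliability of 0.84 are hypothetical values picked so that the SEM comes out at 50, and the function names are illustrative only.

```python
import math


def sem_from_reliability(score_sd, reliability):
    """Classical SEM from the score standard deviation and a reliability coefficient."""
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed_score, sem, multiplier=2.0):
    """Interval of +/- multiplier * SEM around an observed score."""
    return observed_score - multiplier * sem, observed_score + multiplier * sem


# The example from the text: observed score 1850, SEM 50.
low, high = score_band(1850, 50)       # (1750.0, 1950.0)

# Hypothetical scale: SD of 125 and reliability of 0.84 give an SEM of about 50.
sem = sem_from_reliability(125, 0.84)
```

This treats the SEM as a single value for the whole scale; a conditional SEM would require a separate value at each score point.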

Thus far, our discussion has dealt mainly with measures of *score* reliability. Certifying bodies are typically just as interested in reliability of *pass/fail decisions*, referred to as *decision consistency*, or *pass-fail reliability*. These indices are interpreted as the probability of an identical pass/fail decision on repeated administrations of a test. Common pass/fail reliability indices are Livingston, Brennan-Kane, and Subkoviak (see references for more information).
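The interpretation above can be illustrated with the simplest possible case: two actual administrations and a count of how often the pass/fail decision matches. This is only a sketch of the idea, not an implementation of the Livingston, Brennan-Kane, or Subkoviak indices, which estimate decision consistency in other ways. The scores, the cut score of 70, and the rule that a score at or above the cut passes are all assumptions made for the example.

```python
def decision_consistency(scores_a, scores_b, cut_score):
    """Proportion of candidates receiving the same pass/fail decision twice.

    Assumes a candidate passes when the score is at or above `cut_score`.
    """
    decisions_a = [score >= cut_score for score in scores_a]
    decisions_b = [score >= cut_score for score in scores_b]
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)


# Hypothetical scores for six candidates on two parallel forms.
form_a = [52, 61, 70, 75, 88, 66]
form_b = [55, 59, 72, 78, 85, 71]
consistency = decision_consistency(form_a, form_b, cut_score=70)  # 5 of 6 agree
```

Only the last candidate, who scored just below the cut on one form and just above it on the other, receives inconsistent decisions, which is why precision near the passing standard matters so much.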

In this treatment of reliability, we must also not forget assessment formats that involve human evaluators. These judge-mediated assessments introduce a subjective element to the assessment system, so it is important to monitor the consistency of evaluators in their application of a rating scale or scoring rubric. The index used for monitoring judge consistency depends greatly on the design of the rating system, but a few indices of inter-rater reliability are discussed here.

The most basic and least robust of these statistics is a “joint probability of agreement,” which is simply the number of times that two independent judges’ ratings coincide, divided by the total number of paired ratings. *Pearson’s product-moment* (r) and *Spearman’s rank* (ρ) correlation coefficients are both often used as measures of pairwise correlation among raters. *Cohen’s kappa* (*κ*) coefficient of agreement is probably the most widely used global indicator of judge agreement. Specifically, Cohen’s κ represents the proportion of times that judges’ ratings agree, corrected for the frequency of chance agreement.
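A short sketch of the first and last of these statistics may help. It uses the usual definition κ = (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement is derived from each judge’s marginal rating frequencies. The function names and the two judges’ ratings on a 1 to 3 scale are hypothetical.

```python
from collections import Counter


def joint_agreement(ratings_a, ratings_b):
    """Proportion of paired ratings on which two judges agree exactly."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: exact agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = joint_agreement(ratings_a, ratings_b)
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of the judges' marginal proportions per category.
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten performances by two judges on a 1-3 scale.
judge_1 = [1, 2, 3, 3, 2, 1, 2, 3, 1, 2]
judge_2 = [1, 2, 3, 2, 2, 1, 3, 3, 1, 2]

agreement = joint_agreement(judge_1, judge_2)  # 0.8
kappa = cohens_kappa(judge_1, judge_2)         # about 0.70
```

The gap between 0.8 and roughly 0.70 is the chance correction at work: with only three rating categories, two judges would agree about a third of the time even if they rated at random.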

A number of factors can contribute to the reliability of assessment scores. Foremost is the number of items on the test; the longer the test, the more precise the score. However, beyond a certain point, each item added to a test yields a diminishing gain in reliability. Where that point falls depends on characteristics of the test and the candidate population, but it can be better understood and projected through a formula known as the *Spearman-Brown prophecy*.
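The diminishing gain is easy to see by applying the Spearman-Brown formula directly: projected reliability = n·r / (1 + (n − 1)·r), where r is the current reliability and n is the factor by which the test is lengthened with comparable items. The starting reliability of 0.70 below is a hypothetical value, and the function name is illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical test with a current reliability of 0.70.
current = 0.70
doubled = spearman_brown(current, 2)     # about 0.824
tripled = spearman_brown(current, 3)     # 0.875
quadrupled = spearman_brown(current, 4)  # about 0.903

# Each additional "test length" of items buys less than the one before it.
gains = [doubled - current, tripled - doubled, quadrupled - tripled]
```

The projection assumes the added items are of the same quality as the existing ones; poorly performing items will not deliver the projected gain.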

Correlational measures of reliability are also highly influenced by the range of abilities reflected in the testing candidates. Generally, the more homogeneous the sample of candidates, the smaller the range of abilities will be. This is seen frequently in credentialing circles with candidates graduating from training or educational programs with a standardized curriculum. These candidates tend to have a restricted range of test scores, which yields a smaller standard deviation. It is harder to make fine-grained distinctions in such groups. Other factors held equal, a group with a smaller standard deviation will yield lower reliability, and the test will often require comparatively more items to achieve an adequate degree of reliability and decision consistency.
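One way to see the range-restriction effect is through the classical test theory relationship reliability = 1 − SEM² / SD². The sketch below assumes, purely for illustration, that the amount of measurement error (an SEM of 50) is the same for both groups and that only the spread of scores differs; the standard deviations of 125 and 80 are hypothetical.

```python
def reliability_from_sem(score_sd, sem):
    """Classical reliability implied by a score SD and a fixed SEM."""
    return 1.0 - (sem / score_sd) ** 2


# Assumption: the same test, with the same SEM of 50, given to two groups.
broad_group = reliability_from_sem(score_sd=125, sem=50)      # 0.84
restricted_group = reliability_from_sem(score_sd=80, sem=50)  # about 0.61
```

Nothing about the test changed between the two lines; the coefficient dropped only because the more homogeneous group has less score variance for the same amount of error.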

Psychometric opinion on reliability thresholds will vary, and an “adequate degree of reliability” will depend on the candidate pool characteristics, as well as the purpose and “stakes” of the credential. Not all testing experts will agree; however, for high-stakes examinations, a minimum reliability of 0.8 is generally considered desirable.

Reliability is a foundational concept of testing and is a necessary, though not sufficient, condition for test validity. If test scores and the resulting decisions cannot be shown to have adequate precision and to be largely free of error, then we cannot conclude that the test has a high degree of validity.

There are many options for evaluating test reliability. The “right” answer for your organization will depend on the assessment format, the stakes of the decision being made, and the computational power at your disposal.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), *Educational Measurement* (4th ed., pp. 64-110).

American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Reliability/Precision and Errors of Measurement. In *Standards for Educational and Psychological Testing* (pp. 33-41). Washington, DC: AERA.