By Weiyi Cheng, Psychometrician, Professional Testing Corporation (PTC)
Validation is an essential process in developing a psychometrically sound assessment. In psychometric terms, validity refers to how well an instrument measures what it is supposed to measure. The current edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) refers to validity as "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests." Just as we would not use a math test to assess verbal skills, we do not want to use a measure that does not truly measure what it intends to measure.
Validity is not a characteristic of a test. Simply put, a test itself cannot be valid or invalid, but rather one should ask, “How valid is the interpretation I propose for the test?” Scores from a test could be valid for certain inferences, but not for other inferences. The process of validation is to gather different pieces of evidence to provide a scientific basis for interpreting the test scores in a particular way. It is the “process of constructing and evaluating arguments for and against the identified interpretation of test scores and their relevance to the proposed use” (AERA et al., 2014).
Sources of Validity Evidence
Five sources of validity evidence are outlined in the Standards:
- Test content
- Internal structure
- Response processes
- Relations to other variables, one of which can be called a criterion
- Consequences of testing
Test Content
Content validity indicates the degree to which the items on the test adequately represent the range of knowledge and skills the instrument is supposed to measure. That is, the content of the exam should reflect the knowledge, skills, and abilities required for a particular job or role the candidate is seeking to practice.
Conducting a job task analysis or role delineation study is the most common way of validating test content. The purpose of the job analysis is to obtain information on the knowledge, skills and abilities required for a given job. The results from the study are used to develop the content outline and test specifications for the examination.
Internal Structure
This form of evidence, also referred to as construct validity, indicates how well the test measures what it claims to be measuring. The most commonly used method is to investigate the dimensional structure underlying a test. The major approaches for assessing dimensionality include factor analysis methods (techniques that reduce a large number of variables to a smaller number of factors and detect relationships among variables) and methods based on multidimensional item response theory. We do not want the test to miss things it should be assessing (construct underrepresentation) or to assess things outside of the construct, leading to construct-irrelevant variance.
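As a toy illustration of one common first step in a dimensionality check (a sketch, not a full factor analysis), the eigenvalues of the inter-item correlation matrix can be examined: a single dominant eigenvalue is consistent with an essentially unidimensional test. The data below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 500 examinees responding to 6 items driven by one common factor
n, k = 500, 6
factor = rng.normal(size=n)
items = 0.7 * factor[:, None] + 0.5 * rng.normal(size=(n, k))

# Eigenvalues of the inter-item correlation matrix, largest first
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# One eigenvalue far larger than the rest suggests a single
# underlying dimension
print(eigenvalues.round(2))
```

In practice, dedicated exploratory or confirmatory factor analysis software would be used, but the eigenvalue pattern above conveys the core idea.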
Embedded within construct validity are convergent and discriminant validity; correlation tests measuring the relationship between two tests are usually applied. Convergent validity measures the relationship between test scores and other measures designed to measure similar constructs (e.g., a previously validated instrument). For example, a strong correlation between scores from a new test developed to measure management skills and an existing measure of management skills would provide convergent evidence. By contrast, discriminant validity investigates the relationship between test scores and measures of different constructs. Divergent evidence would be demonstrated by finding little or no relationship between a test measuring management skills and a test measuring something else.
Response Processes
This form of evidence can demonstrate that the assessment requires test takers to engage in specific cognitive skills or behavior that are necessary to complete a task. To obtain response process evidence, we need to ask whether the items or tasks on a test are appropriate to elicit or capture the knowledge or skill we seek to assess. For example, to qualify for a driver’s license, passing a written test is far from sufficient. One should pass a performance test that requires each candidate to demonstrate his or her skills while driving. For some professions, performance tests or simulation tests play an important role, since performance skills (e.g., giving an injection, performing surgery) are inadequately assessed with a test containing multiple-choice items. When performance is being evaluated, raters’ judgments of the candidate’s performance may exert some influence on the assessment outcomes, potentially attenuating measurement reliability, which in turn impacts the strength of evidence intended to support validity.
Relations to Other Variables
Criterion-related evidence describes the extent to which the test scores correlate with an external variable (e.g., a well-established measurement or an outcome relevant to the proposed purpose of the test) either at present (concurrent validity) or in the future (predictive validity). The well-established measurement or outcome acts as the criterion against which scores from the new test are assessed. A common measurement of this type of validity is the correlation coefficient between the two measures. For example, the relationship between pre-employment test scores and job performance (e.g., measured via supervisor ratings), quantified by a correlation coefficient, could serve as a measure of how strongly test scores predict future job performance.
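A minimal sketch of how such a validity coefficient might be computed, using simulated pre-employment scores and supervisor ratings (all numbers are illustrative, not real data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated pre-employment test scores for 200 hires
test_scores = rng.normal(loc=100, scale=15, size=200)

# Simulated supervisor ratings that partly depend on the test scores
ratings = 0.5 * (test_scores - 100) / 15 + rng.normal(size=200)

# Pearson correlation as the criterion-related validity coefficient
r = np.corrcoef(test_scores, ratings)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")
```

Here the ratings were generated to depend moderately on the test scores, so the resulting coefficient reflects a moderate predictive relationship.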
Consequences of Testing
This form of evidence describes the extent to which consequences of the use of test scores align with the proposed uses of the instrument. Messick (1989) defined two component parts: 1) Making sure tests are labeled correctly and thoughtfully (i.e., labels should accurately describe what they are testing); 2) Identifying “…potential and actual social consequences of applied testing”. In the case of certification, considerations of social consequences involve protection of the public. For example, allowing an incompetent candidate to pass a certification exam and practice nursing is a risk to public safety. Therefore, standard setting is a critical component of the test development process to make sound decisions. It is important that a test be able to discriminate between candidates who are qualified and not qualified for certification. Passing unqualified candidates will jeopardize the integrity and value of the credential, while failing qualified candidates will also compromise the validity of the credential and discourage candidates from seeking the credential.
Relationship Between Validity and Reliability
Validity and reliability go hand in hand. Reliability refers to the extent to which the test provides scores that are free from the influence of error. An adequate reliability index is a prerequisite for validity. A test that does not produce reliable scores cannot produce valid interpretations. However, adequate reliability does not guarantee validity.
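One way to make this link concrete is the classical attenuation bound from test theory: the observed validity coefficient cannot exceed the square root of the product of the two measures' reliabilities. A brief numeric illustration, with reliability values chosen arbitrarily:

```python
import math

# Hypothetical reliabilities of the test (r_xx) and the criterion (r_yy)
r_xx = 0.81
r_yy = 0.64

# Classical upper bound on the observed validity coefficient:
# r_xy <= sqrt(r_xx * r_yy)
max_validity = math.sqrt(r_xx * r_yy)
print(f"Observed validity cannot exceed {max_validity:.2f}")  # 0.72
```

Even with a perfectly valid construct relationship, unreliable scores cap the validity evidence that can be observed, which is why adequate reliability is a prerequisite but not a guarantee.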
Validation of test interpretations is linked to evidence from the test development process, in which documentation from each step (e.g., job analysis, test blueprint development, item writing, exam review, standard setting, test administration, psychometric analyses) contributes. Validation is an ongoing process that requires a joint effort of test developers and test users to accumulate, interpret, and integrate various sources of conceptual and empirical evidence to support intended test score interpretations.
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: Author.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.