Listening to Test Takers – Pre-Testing New Item Formats and Approaches
By Matthew T. Schultz, PhD, and Joshua I. Stopek, CPA, MBA
The changing nature of accounting and other professions has driven test blueprint development toward an evidence-centered design approach (Huff, 2010; Mislevy, Almond and Lukas, 2003). Testing organizations can gather such evidence about their test from an often under-utilized segment of experts – namely, current and prospective test takers.
Each section of the CPA test covers a focused band of content supported by a detailed blueprint, which is the product of a regularly conducted Practice Analysis. Test content includes assessment of an ability to apply, analyze and evaluate contexts and respond appropriately. A mix of item formats, including multiple-choice items and task-based simulation (TBS) items, is delivered online in proctored testing centers.
The Standards for Educational and Psychological Testing (2014) outline test development, operational delivery, and maintenance processes for programs like these. Validity evidence typically relies on subject matter experts (SMEs), those already licensed and experienced. To learn more about how test content is perceived, testing organizations can expand this definition of SME to include current and prospective test takers. The research methods described below focus on validity approaches that rely on test taker feedback to enhance claims regarding the meaning and interpretation of reported test scores.
Several distinct lines of test taker/test developer research can be undertaken before introducing new content into a test, including think-aloud methods, focus groups, usability testing, and field testing. Below we share some of these lines of research with a focus on psychometrics-orientated research, followed by test development research. Each method is then linked to its role in our test development process, and in each case the providers of this information are a mix of current and prospective test takers (candidates, for whom passing the exam is part of the process of applying for licensure).
Psychometric research focuses on rigorous analysis of test taker responses, both item-based and verbal, to better understand how test takers interact with items. In our case, this has taken two tracks focused on TBSs. First, the think-aloud method (TAM; also referred to as cognitive labs; see Leighton, 2013, 2019) can be used during the test development process to verify that each item measures the intended higher-level thinking skill. Typically, TBSs are associated with aspects of problem-solving such as analysis and evaluation as per Bloom’s taxonomy (see Anderson & Krathwohl, 2001; Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956). According to the Standards (AERA, APA, NCME, 2014), evidence of test-taker response processing is critical for the formation of validity arguments. In particular, TAMs can help identify and measure the underlying response processes test-takers use to generate correct or incorrect responses to tasks. Such evidence is especially important when tasks are designed to measure higher-level thinking, because this thought process is complex and therefore harder to decipher than simpler application skills.
During a TAM, the researcher observes while a test-taker completes a task and asks the test-taker to verbalize his or her thinking. Analysis assesses whether test-takers engage the intended higher-level skills when responding to tasks. In this way, response processing data can supplement other forms of validity evidence and provide assurance that items tap into the desired skills. These qualitative data are typically augmented and analyzed with psychometric performance (difficulty and discrimination) data as well as timing data. Each element is collected during field testing (both ad hoc field testing and the embedding of field test content in operational tests).
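As a rough illustration of the difficulty and discrimination data mentioned above, the sketch below computes two classical test theory indices from dichotomously scored (0/1) field test responses. This is an assumption on our part about which indices are meant: difficulty is taken as the proportion of test takers answering correctly, and discrimination as the correlation between an item score and the total score on the remaining items. The function name `item_statistics` and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """Compute classical difficulty and discrimination for each item.

    responses: list of rows, one per test taker, each a list of 0/1 item scores.
    Returns one dict per item. Discrimination is None when it is undefined
    (the item or the rest score has no variance).
    """
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        # Rest score: total score excluding the item itself, so the item
        # is not correlated with its own contribution to the total.
        rest = [sum(row) - row[i] for row in responses]

        difficulty = mean(item)  # proportion correct
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = None
        else:
            products = [x * y for x, y in zip(item, rest)]
            covariance = mean(products) - mean(item) * mean(rest)
            discrimination = covariance / (sd_item * sd_rest)

        results.append(
            {"item": i, "difficulty": difficulty, "discrimination": discrimination}
        )
    return results


# Hypothetical scored responses: four test takers, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
stats = item_statistics(responses)
for row in stats:
    print(row)
```

In practice, an item with low or negative discrimination would be a natural candidate for follow-up in a think-aloud session, since the verbal data can suggest why stronger test takers are not favored by the item.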
Test Development Research
Other prospective conversations help inform the test development process, including test takers’ interactions with the testing platform itself, item format, and instruction clarity. Using an iterative process of design, discussion, and redesign helps ensure that misconceptions and other errors can be weeded out from a test. This should reduce overall item attrition during field testing/calibration, and ultimately measurement error.
These interactions have markedly different purposes requiring different methodologies. First, to improve overall design, user-experience approaches focus on gaining a “deep understanding of users, what they need…their abilities, and also their limitations” (US Dept. of Health & Human Services, 2016). As with validity claims, more complex user interactions require more thorough vetting to ensure that test and item interfaces do not get in the way of test-takers’ construct-related performances. In tests containing multiple-choice items, this can be readily discerned using standard protocols.
Content designed to assess complex constructs may involve multiple exhibits (e.g. charts, tables, videos) that make the context more realistic, or complex response types (e.g. dashboard facsimiles). Either can increase the risk that test-takers spend more time figuring out how to navigate the test than on performing construct-related tasks. In events focused on minimizing this type of construct-irrelevant variance, where each participant may have a unique perspective and experience, one-on-one sessions similar to TAMs can be adequate. However, it is imperative to guide test takers more thoroughly through multiple different steps (e.g. ‘open [Y] exhibit’; ‘adjust the contrast’) to determine whether and where each interaction may break down. When interactions do break down, changes can be made and tested again to determine the most efficient, usable design prior to implementation. For example, the location of a specific button or its functionality can be reconfigured to reduce confusion and churn.
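One way to locate where an interaction breaks down is to log, for each guided step, whether each participant completed it and how long it took. The sketch below is a minimal illustration under that assumption. The record layout, the function names `step_breakdown` and `weakest_step`, and the sample observations are all hypothetical and do not describe any particular usability tool.

```python
from collections import defaultdict
from statistics import median


def step_breakdown(observations):
    """Summarize guided usability steps.

    observations: iterable of (participant, step, completed, seconds) tuples.
    Returns {step: {"completion_rate": float, "median_seconds": float}}.
    """
    by_step = defaultdict(list)
    for _participant, step, completed, seconds in observations:
        by_step[step].append((completed, seconds))

    summary = {}
    for step, rows in by_step.items():
        completed_count = sum(1 for done, _ in rows if done)
        summary[step] = {
            "completion_rate": completed_count / len(rows),
            "median_seconds": median([secs for _, secs in rows]),
        }
    return summary


def weakest_step(summary):
    """Return the step with the lowest completion rate."""
    return min(summary, key=lambda step: summary[step]["completion_rate"])


# Hypothetical session log: two participants, two guided steps.
observations = [
    ("p1", "open exhibit", True, 12.0),
    ("p2", "open exhibit", True, 20.0),
    ("p1", "adjust contrast", False, 45.0),
    ("p2", "adjust contrast", True, 30.0),
]
summary = step_breakdown(observations)
print(summary)
print("Review first:", weakest_step(summary))
```

Running the same summary after each redesign gives a simple before-and-after comparison for the step that was changed.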
Likewise, looking at individual content in either a TAM or focus group setting will provide significant information. In these sessions, test developers seek to understand how the content pieces of a task contribute to candidate performance. If we consider TAMs to be primarily focused on gathering validity evidence, we can consider focus groups as informing more specifically on the mechanics of test takers’ responses to individual items. Candidates may provide insight into how they interpret specific directions, content, and response expectations.
If items are intended to provide summaries of information or generalizations, communication with candidates informs the test developer as to whether these generalizations are interpreted in the intended way, providing feedback regarding if/how directions and stem wording can be changed to enhance clarity and understandability. While item quality may be determined afterward based on statistics, early interactions with candidates can help find issues in advance, setting the stage for successful field testing and subsequent calibration and operational launch. Answers to questions such as “what about [X] stimuli drove your answer?” or “was this a realistic task?” drive conversations regarding candidate perception of the item and potential future roadblocks. In this approach, consensus can uncover hidden nuances that may have been taken for granted by seasoned SMEs during the process.
To summarize, the methods discussed in this article allow test developers to focus in depth with candidates on the nuance of an item or interface functionality. Performance and usability metrics can augment these results, and, depending on the method, test developers can probe candidates during or after they perform the tasks (either functionality or construct). This information is invaluable. However, these methods can be time-consuming, labor-intensive, and expensive, preventing large-scale data collection and hence potentially providing limited perspectives. Careful planning can help focus on the appropriate level of effort and time to maximize important information. Focus groups can augment the results of one-on-one methods, allowing a wider group of candidates to both contribute and provide feedback regarding the testing process. A well-run focus group can elicit consensus on where content and design issues occur. This provides a feedback loop to the content developers that should result in more successful content development as the process matures.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: Author.
Anderson, L.W., & Krathwohl, D.R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. Allyn and Bacon.
Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H., & Krathwohl, D.R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.
Leighton, J.P. (2013). Item Difficulty and Interviewer Knowledge Effects on the Accuracy and Consistency of Examinee Response Processes in Verbal Reports. Applied Measurement in Education, 26, 136-157.
Leighton, J.P. (2017). Collecting, analyzing and interpreting verbal response process data. In K. Ercikan & J. Pellegrino (Eds.), National Council on Measurement in Education (NCME) Book Series - Validation of Score Meaning in the Next Generation of Assessments. Routledge.
Leighton, J.P. (2019). The risk-return trade-off: Performance assessments and cognitive validation of inferences. British Journal of Educational Psychology: Special Issue on Performance Assessment of Student Learning in Higher Education.
US Department of Health & Human Services. https://www.usability.gov/what-and-why/user-experience.html Accessed 7/22/2016