Published: June 25, 2021
By Amanda A. Wolkowitz, PhD, Alpine Testing Solutions Inc., Brett P. Foley, PhD, Alpine Testing Solutions Inc., Jared Zurn, AIA, CAE, NCARB, Corina Owens, PhD, Alpine Testing Solutions Inc., and Jim Mendes, Adobe
The effect of the number of options listed for a single-answer multiple-choice question is a topic that has been discussed repeatedly for more than 70 years. Multiple researchers have come to a common conclusion: 3-option multiple-choice (MC3) items tend to perform just as well psychometrically as 4-option multiple-choice (MC4) items, if not better. [1-5, 7-11] A few reasons to consider MC3 items include:
- MC3 items may improve the overall content validity of a test [6] because, with shorter items, more content can be covered in the same amount of time.
- While the theoretical probability of guessing the correct answer does increase when switching from an MC4 to an MC3 item, studies have found no practical difference in item performance when comparing MC3 items to MC4 or 5-option multiple-choice (MC5) items. [1, 6, 8]
- MC3 items can be developed more quickly, resulting in less development expense.
In 2020, the National Council of Architectural Registration Boards (NCARB), which develops the Architect Registration Examination® (ARE®), worked with Alpine Testing Solutions Inc. (Alpine) to convert select MC4 items that had at least one poorly performing distractor to MC3 items. The goal was to convert 10-20% of the existing operational (i.e., scored) multiple-choice items and include them as operational items on new forms. (The exam includes other item types that were not changed, such as check-all-that-apply, quantitative fill-in-the-blank, drag-and-drop and hotspots.) The intent was to avoid pretesting the converted items and allow the test publisher to maintain the size of their operational item bank. Including MC3 items that were operational as well as additional MC3 items that were in the pretest blocks of the test helped ensure candidates put forth equal effort on all items.
The high-level steps that were implemented to convert MC4 to MC3 items, estimate the MC3 item parameters, estimate the initial cut scores for the new forms and confirm the findings were as follows:
1. Identify MC4 items with at least one non-functioning distractor (NFD). The criteria* used for the ARE items were:
a. Distractor selected by <5% of candidates who answered the item incorrectly OR
b. Distractor with a positive item-total score correlation.**
If an item had more than one NFD, then the distractor with lower endorsement was selected as the option to remove.*** If the NFDs had equal endorsement, then the distractor with the higher item-total score correlation was selected as the option to remove.
2. Review and approve the selected items for conversion from MC4 to MC3.
Review and approval of the MC3 items used on the operational forms was completed by subject matter experts.
3. Estimate the Rasch item measures for the newly converted MC3 items in two ways:
a. Assuming candidates who selected an NFD for an item would answer the item correctly if it were an MC3 item; and
b. Assuming candidates who selected an NFD for an item would not answer the item correctly if it were an MC3 item.
It was expected that the actual Rasch item measure would likely fall between these two extremes based on the additional assumption that the option selected by candidates who did not select an NFD would be unchanged. This expectation was compared to the actual Rasch item measures estimated using real data after the exam was administered (see Step 5).
4. Assemble the new forms using a pre-equating model with the following constraints:
a. Only use the items that had small differences in the Rasch measures found in Steps 3a/3b.
b. Estimate the cut score in two ways:
i. Based on the Rasch measure in Step 3a
ii. Based on the Rasch measure in Step 3b
c. Use the Rasch measures in Step 4b to estimate the corresponding raw cut scores for the forms twice (once for the low estimates and once for the high estimates). Ensure the forms are built so that the two estimated cut scores are:
i. Within 0.15 raw score points of each other, and
ii. Within 0.25 of the same integer cut score.
5. NCARB released two of the four forms in each division of the ARE in December 2020. Scores were delayed due to other changes taking place with the ARE that required a psychometric evaluation before release. This allowed the operational MC3 item parameters previously estimated from the data and the initial cut scores estimated in Step 4 to be compared to newly calculated Rasch item measures based on live MC3 data. Using these results, the initial cut scores were also verified.
6. Following the success of Step 5, the final two forms for each division were released in February 2021 without delayed scoring, using the cut scores estimated in Step 4. The cut scores were confirmed as soon as sufficient data was available.
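Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calibration procedure used for the ARE. It assumes candidate ability estimates (`thetas`) are held fixed from an earlier calibration, so that the Rasch difficulty of an item is the value at which the expected number of correct responses equals the recoded number correct. It also assumes the raw cut score is the expected raw score at a cut point on the ability scale (`theta_cut`). All function and parameter names are hypothetical, and treating "within 0.25 of the same integer" as the distance from each estimate to the integer nearest their midpoint is one possible reading of Step 4c.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a correct response (ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def difficulty_given_abilities(thetas, n_correct, lo=-10.0, hi=10.0, tol=1e-8):
    """Solve sum_i P(theta_i, b) = n_correct for b by bisection.

    Assumption: abilities are fixed (anchored), so only the count of correct
    responses is needed. Solutions outside [lo, hi] are clipped to the bounds.
    """
    if not 0 < n_correct < len(thetas):
        raise ValueError("n_correct must be strictly between 0 and N")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(rasch_p(t, mid) for t in thetas)
        if expected > n_correct:
            lo = mid  # item looks too easy at mid, so the difficulty is higher
        else:
            hi = mid
    return (lo + hi) / 2.0


def mc3_difficulty_bounds(thetas, responses, key, nfd_option):
    """Return (easy, hard) difficulty estimates for a converted item.

    responses holds the option each candidate chose on the MC4 item.
    """
    n_key = sum(r == key for r in responses)
    n_nfd = sum(r == nfd_option for r in responses)
    b_easy = difficulty_given_abilities(thetas, n_key + n_nfd)  # Step 3a
    b_hard = difficulty_given_abilities(thetas, n_key)          # Step 3b
    return b_easy, b_hard


def raw_cut(theta_cut, difficulties):
    """Expected raw score at theta_cut (test characteristic curve)."""
    return sum(rasch_p(theta_cut, b) for b in difficulties)


def cuts_agree(raw_a, raw_b, max_gap=0.15, max_dist=0.25):
    """Check the two Step 4c constraints on a pair of raw cut score estimates."""
    target = round((raw_a + raw_b) / 2.0)
    return (abs(raw_a - raw_b) <= max_gap
            and abs(raw_a - target) <= max_dist
            and abs(raw_b - target) <= max_dist)
```

In this sketch, `raw_cut` would be computed once with the Step 3a measures and once with the Step 3b measures for the converted items on a candidate form, and the form would be kept only if `cuts_agree` returns `True` for the two results.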
The conversion/estimation method was successful:
- For all 24 exam forms, the effective (rounded) initial cut score estimate equaled the cut score estimate obtained after calibrating the MC3 items with live data.
- Ninety-three percent of the final cut scores based on live data were within 0.15 raw score points of the initially estimated cut scores. The greatest difference between the initially estimated and confirmed cut scores was 0.37 raw score points.
- Of the 176 converted MC3 items, 88% were within 0.50 logits of the initially estimated Rasch measure ranges. While 0.50 logits is an arbitrary threshold for comparison, it does indicate that the Rasch measures estimated for most items through the method described above were near the observed Rasch measures.****
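The comparison in the last result can be illustrated with a short Python sketch. It assumes that "within 0.50 logits of the estimated range" means the observed measure lies no more than 0.50 logits outside the interval spanned by the Step 3a and Step 3b estimates; that reading, the function names, and the example values are assumptions made for illustration and are not data from the ARE.

```python
def distance_to_range(observed, low, high):
    """Distance in logits from an observed measure to the interval [low, high]."""
    if low > high:
        low, high = high, low
    if observed < low:
        return low - observed
    if observed > high:
        return observed - high
    return 0.0


def share_within(items, tolerance=0.50):
    """Share of items whose observed measure is within tolerance of its range.

    items is a list of (observed, low_estimate, high_estimate) tuples.
    """
    hits = sum(distance_to_range(obs, lo, hi) <= tolerance for obs, lo, hi in items)
    return hits / len(items)


# Made-up values for four hypothetical items.
example = [(0.2, -0.1, 0.3), (1.0, -0.1, 0.3), (0.6, -0.1, 0.3), (-0.5, -0.1, 0.3)]
print(share_within(example))  # 0.75
```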
Candidate reaction to the MC3 item type was positive:
- Because candidates were informed that new MC3 items could be operational, they completed these items with as much intentionality as other items on the forms.
- The test publisher collects candidate feedback on each exam delivery with a post-exam survey as well as via a moderated online community. Candidates have expressed no concerns through either channel over experiencing both MC4 and MC3 items during the same exam administration.
We recommend that other programs planning a similar conversion consider the following:
- When converting items, choose a stringent NFD definition to only convert a small percentage of the total items.
- Only use converted items with small differences between the MC3 Rasch values (as estimated in Steps 3a/3b).
- Only use this process for the initial set of forms involved in the MC3 conversion. On future pre-equated forms, newly written or additional converted MC3 items should follow a traditional pretesting plan.
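To see how the choice of NFD definition affects the number of items flagged, the Step 1 criteria can be sketched in Python. This is a minimal sketch rather than the authors' implementation. It assumes the correlation for a distractor is the Pearson correlation between a 0/1 indicator of choosing that option and the total score, and it treats an undefined correlation (for example, an option nobody chose) as 0.0. The function names, the `min_share` parameter, and the option labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when it is undefined (zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def flag_nfds(choices, totals, key, options="ABCD", min_share=0.05):
    """Return {option: (share_of_incorrect, correlation)} for flagged distractors.

    A distractor is flagged if it was chosen by fewer than min_share of the
    candidates who answered incorrectly, or if choosing it correlates
    positively with the total score.
    """
    n_incorrect = sum(c != key for c in choices)
    flagged = {}
    for opt in options:
        if opt == key:
            continue
        picked = [1 if c == opt else 0 for c in choices]
        share = sum(picked) / n_incorrect if n_incorrect else 0.0
        r = pearson(picked, totals)
        if share < min_share or r > 0:
            flagged[opt] = (share, r)
    return flagged


def option_to_remove(flagged):
    """Pick the flagged option with the lowest endorsement; ties go to the higher correlation."""
    if not flagged:
        return None
    return min(flagged, key=lambda o: (flagged[o][0], -flagged[o][1]))
```

Running `flag_nfds` over an item bank with different `min_share` values gives a quick view of how many items a stricter or looser definition would send for conversion.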
* The criteria established for the ARE were a practical decision based on NCARB’s goal to convert 10-20% of the existing operational MC4 items and on an analysis of the item- and option-level data. For this program, these criteria did a reasonable job of identifying the worst performing distractors and resulted in flagging a reasonable number of items. Other programs should review their program’s goals and data to determine appropriate flagging criteria.
** The item-total score correlation (sometimes referred to as the point-biserial correlation for dichotomously scored items) is the correlation of the correct/incorrect responses for an item with the total scores earned on the exam.
*** In the cases in which there were two NFDs for an item, the item often had one distractor selected by fewer than 5% of candidates who answered the item incorrectly and another distractor that had a positive item-total score correlation. If an item had two distractors meeting the same criterion (both under the 5% threshold or both with a positive correlation), then the item could still be converted; however, a program may choose to retire such an item.
**** The MC3 items selected for the forms were those whose estimated Rasch measures were close to the original MC4 Rasch measures. This allowed the criteria stated in Step 4c to be more easily met. Other than this rule of thumb, we did not specify an operational definition of a “small” difficulty shift due to the MC3 conversion.
1. Baghaei, P. & Amrahi, N. (2011). The effects of the number of options on the psychometric characteristics of multiple choice items. Psychological Test and Assessment Modeling, 53(2), 197-211.
2. Bruno, J. E., & Dirkzwager, A. (1995). Determining the optimal number of alternatives to a multiple-choice test item: An information theoretic perspective. Educational and Psychological Measurement, 55(6), 959-66.
3. Cizek, G. J., Robinson, K. L., & O’Day, D. M. (1998). Nonfunctioning options: A closer look. Educational and Psychological Measurement, 58(4), 605-11.
4. Dehnad, A., Nasser, H., & Hosseini, A. F. (2014). A comparison between three- and four-option multiple choice questions. Procedia – Social and Behavioral Sciences, 98, 398-403. Retrieved from www.sciencedirect.com
5. Delgado, A. R., & Prieto, G. (1998). Further evidence favoring three-option items in multiple-choice tests. European Journal of Psychological Assessment, 14(3), 197-201.
6. Haladyna, T. M. & Rodriguez, M. C. (2013). Developing and Validating Test Items. Routledge: New York, NY.
7. Mackey, P. & Konold, T. R. (2015). What is the optimal number of distractors in exam items? Case Study. Institute for Credentialing Excellence.
8. Rodriguez, M. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.
9. Rogausch, A., Hofer, R., & Krebs, R. (2010). Rarely selected distractors in high stakes medical multiple-choice examinations and their recognition by item authors: a simulation and survey. BMC Medical Education, 10(85), 1-9. https://doi.org/10.1186/1472-6920-10-85
10. Tarrant, A. & Ware, J. (2010). A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Education Today, 30(6), 539-543.
11. Vegada, B., Shukla, A., Khilnani, A., Charan, J., & Desai, C. (2016). Comparison between three option, four option and five option multiple choice question tests for quality parameters: A randomized study. Indian Journal of Pharmacology, 48, 571-5. Retrieved from http://www.ijp-online.com/text.asp?2016/48/5/571/190757