Credentialing Insights

A Simple Method to Detect Score Similarity and Practical Implications for Its Use

By Amanda Wolkowitz, PhD, Russell Smith, PhD

1.20.23

There are numerous methods for detecting aberrant test taking behavior that can be applied by select users within the testing industry. One such behavior is collusion, and a method psychometricians and other analysts within the credentialing industry use to detect it is a statistic known as a score similarity index (SSI).

SSI purports to identify pairs of examinees who have an unusually high number of identical correct/incorrect scores on a set of items. There are multiple methods that estimate the probability that two examinees colluded, such as Wollack’s¹ ω, the generalized binomial test (GBT)² and a new application of residual correlations of persons known as B3.³,⁴ While these methods are useful, they are computationally complex, require specialized software or take a very long time (hours if not days) to run on large sets of data (e.g., pairwise comparisons of 10,000 examinees on a 60-item form).

For users not in the credentialing field, such as classroom teachers, some of these methods may be difficult to implement because they lack the software or are unfamiliar with how to run a method and interpret its results. The purpose of this article is two-fold: 1) present an approximation SSI (aSSI) method that fills this gap because it is easily implemented, straightforward to interpret and produces results comparable to other known SSI methods (i.e., Wollack’s ω, GBT and B3) and 2) provide guidance regarding the policies and procedures that should be in place when implementing such a method.

Wollack’s ω, GBT and B3 are all estimation methods aimed at getting as close to “true” SSI as possible. In other words, while no SSI method will be perfect, all of these statistics do a decent job of estimating the probability that two examinees share a given number of identical correct/incorrect scored responses on a set of items and tend to have Type I (false positive) and Type II (false negative) error rates acceptable to testing programs. For that reason, these true SSI methods aim at flagging pairs of examinees with a high probability of collusion. The aSSI method has the same end goal, but aims at approximating the results from the GBT method.

Why approximate the results of an estimation method? According to Welsh mathematician and philosopher Bertrand Russell,⁵ “the behavior of large bodies can be calculated with a quite sufficient approximation to the truth.” He continues, “Although this may seem a paradox, all exact science is dominated by the idea of approximation.” Thus, it is reasonable to try to approximate true SSI via a simpler method that can reach a wider audience than more complicated methods.

Brief Explanation of True SSI Methods

Wollack’s ω and GBT require the use of item response theory (IRT) to compute the expected agreement between two examinees. Then, a z-statistic is computed comparing the difference between the observed and expected agreement. In brief, the difference between these methods is that ω estimates the expected agreement by summing the probabilities that the copier’s response (0,1) matches the observed source’s response given the ability of the copier and the item’s IRT parameters. The GBT method estimates the expected agreement by summing the joint probabilities of matching scores (0,1) between two examinees given the ability of each of the examinees and the item’s IRT parameters. For both methods, Bock’s nominal model is typically used to estimate the person and item parameters. However, given that SSI uses dichotomously scored items, research presented in this study estimates the results of ω and GBT using both Bock’s nominal model (collapsed to a 0/1 model) as well as the Rasch model. Both ω and GBT may be estimated using Zopluoglu’s “CopyDetect”⁶ R package.
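The two expected-agreement sums described above can be sketched in a few lines of Python. This is only an illustration under the Rasch model, not the CopyDetect implementation: the function names, abilities and item difficulties are hypothetical, and in practice the person and item parameters would first have to be estimated with IRT software.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct score for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_agreement_omega(theta_copier, source_scores, difficulties):
    """Sum over items of P(copier's 0/1 score equals the source's observed score)."""
    total = 0.0
    for x_source, b in zip(source_scores, difficulties):
        p = rasch_p(theta_copier, b)
        total += p if x_source == 1 else 1.0 - p
    return total

def expected_agreement_gbt(theta_1, theta_2, difficulties):
    """Sum over items of the joint probability that both score 1 or both score 0."""
    total = 0.0
    for b in difficulties:
        p1, p2 = rasch_p(theta_1, b), rasch_p(theta_2, b)
        total += p1 * p2 + (1.0 - p1) * (1.0 - p2)
    return total

# Hypothetical three-item example, for illustration only.
difficulties = [-1.0, 0.0, 1.0]
source_scores = [1, 1, 0]
print(expected_agreement_omega(0.0, source_scores, difficulties))
print(expected_agreement_gbt(0.0, 0.0, difficulties))
```

The ω sum depends on which examinee is treated as the copier, whereas the GBT sum is symmetric in the two examinees.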

B3 also applies the Rasch model, but instead of a z-statistic, this method computes the correlation of the residuals for two examinees. B3 is much like Yen’s Q3 statistic,⁷ which uses item residual correlation values to help identify high item interdependence. B3 is the same statistic but focuses on the examinee residual correlations instead of item correlations. A high B3 value indicates two examinees’ scores are not independent; in other words, the score patterns are more similar than one would expect by chance. Unlike ω and GBT, in which a small value (e.g., < 0.01) would lead to flagging a pair of examinees, high values of B3 suggest aberrant behavior. Winsteps⁸ readily produces person-residual correlations.
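As a rough illustration of the idea behind B3, the sketch below correlates two examinees' residuals (observed score minus Rasch-expected score) across items. It assumes raw residuals in the style of Q3 and already-estimated abilities and difficulties; Winsteps may compute its person-residual correlations differently, and all names and values here are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct score for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def person_residual_correlation(scores_1, theta_1, scores_2, theta_2, difficulties):
    """Correlate the item-level residuals (observed - expected) of two examinees."""
    res_1 = [x - rasch_p(theta_1, b) for x, b in zip(scores_1, difficulties)]
    res_2 = [x - rasch_p(theta_2, b) for x, b in zip(scores_2, difficulties)]
    return pearson(res_1, res_2)

# Hypothetical five-item example with illustrative parameters.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
scores_a = [1, 1, 0, 1, 0]
scores_b = [1, 1, 0, 1, 1]
print(person_residual_correlation(scores_a, 0.3, scores_b, 0.6, difficulties))
```

Two examinees with identical score patterns and identical abilities would produce a correlation of 1, the most extreme case of non-independence.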

Explanation of aSSI

The aSSI method requires neither IRT nor a sophisticated program and is less computationally intensive than ω, GBT and B3. Like ω and GBT, aSSI is a z-score between Examinee 1 and Examinee 2:

$$z = \frac{M - E^{*}_{12}}{\sqrt{n\,p\,q}} \qquad \text{(Equation 1)}$$

where $M$ is the count of observed score matches, $n$ is the number of items, $p = E^{*}_{12}/n$ and $q = 1 - p$. $E^{*}_{12}$ is the adjusted expected value of the number of observed matches and is computed as follows:

(Equation 2) where s_i is the proportion-correct score for person i and b is an adjustment to the magnitude of the correction. Based on recommendations by Smith,⁹,¹⁰ this value is set at 12.5% for this study.

While the denominator of Equation 1 has the same look as that of GBT, the expected value is an approximation of the expected value calculated in the true GBT method. In Equation 2, the first half of the equation (left addend) estimates the independent probability of both examinees scoring their observed number of correct scores and the probability of both examinees scoring their observed number of incorrect scores. The second half of the equation is an adjustment value for differences in the variability of the difficulty of the items on the exams. If the two examinees had identical scores, then the adjustment would be n∙b or 0.125n. On the other extreme, if one examinee had a perfect score and the other examinee scored 0 points, then the adjustment would equal 0. Thus, the adjustment value varies from 0 to 0.125n.
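The pieces described above can be sketched in Python. This is an illustration, not the authors' code. It assumes Equation 1 has the usual binomial z form, (M - E*) / sqrt(n·p·q), which is what the definitions of M, n, p and q suggest. It computes only the left addend of Equation 2 and leaves the adjustment to the caller, because the exact adjustment formula is not restated here. The function names and the numbers in the usage lines are hypothetical.

```python
import math

def independent_expected_matches(n_items, s1, s2):
    """Left addend of Equation 2: expected matches if the two examinees,
    with proportion-correct scores s1 and s2, responded independently."""
    return n_items * (s1 * s2 + (1.0 - s1) * (1.0 - s2))

def assi_z(matches, n_items, expected_adjusted):
    """Assumed form of Equation 1: binomial z-score of the observed matches
    against the adjusted expected number of matches."""
    p = expected_adjusted / n_items
    q = 1.0 - p
    return (matches - expected_adjusted) / math.sqrt(n_items * p * q)

# Hypothetical case: 100 items, both examinees score 70%, 85 observed matches.
# With identical scores the text gives the adjustment as n * b, here 100 * 0.125.
n_items, b = 100, 0.125
expected = independent_expected_matches(n_items, 0.70, 0.70) + n_items * b
print(round(expected, 2), round(assi_z(85, n_items, expected), 3))
```

Because the adjustment only ever adds to the expected number of matches, it lowers the z-score relative to the unadjusted independence assumption, which makes flagging more conservative.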

Like the true methods, the aSSI method rests on assumptions. It assumes the data are approximately normally distributed, the items are independent and the items are dichotomously scored. The method is fairly robust to violations of these assumptions, and violations tend to make the results more conservative (i.e., a lower Type I error rate).

Example Computation of aSSI

Consider Examinee 1 and Examinee 2, who have the same correct/incorrect scores on 51 items of a 66-item exam. Examinee 1’s raw score is 38/66 and Examinee 2’s raw score is 35/66. Based on these values, the aSSI z-score is 2.6911.

For a normal distribution, a z-score of 2.6911 corresponds to an upper-tail probability of < 0.50%. Based on their percent correct scores, the expected number of matched scores is approximately 40 items. Thus, a match on 51 items is unlikely to occur by chance, i.e., < 0.50% chance.
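Parts of this example can be checked with a short Python sketch that uses only the standard library. It computes the independent (unadjusted) expected matches, the largest possible adjustment at b = 12.5% and the upper-tail probability of the reported z-score. It does not recompute the z-score itself, since that would require the exact adjustment term; the variable names are illustrative.

```python
import math

n_items = 66
s1, s2 = 38 / 66, 35 / 66          # proportion-correct scores from the example
reported_z = 2.6911                # z-score reported in the example

# Expected matches if the two examinees responded independently.
independent = n_items * (s1 * s2 + (1 - s1) * (1 - s2))

# The adjustment ranges from 0 to b * n, so the adjusted expected value
# must lie between `independent` and `independent + max_adjustment`.
max_adjustment = 0.125 * n_items

# Upper-tail probability of a standard normal at the reported z-score.
upper_tail = 0.5 * math.erfc(reported_z / math.sqrt(2))

print(round(independent, 2), max_adjustment, round(upper_tail, 4))
```

The reported expected value of about 40 matches falls inside the range implied by these bounds (roughly 33.3 to 41.6), and the tail probability is about 0.36%, consistent with the stated < 0.50%.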

Method

Previous work by Smith⁹,¹⁰ demonstrated through a simulation study that aSSI results are comparable to those found by GBT. Therefore, the purpose of this study was to use real datasets to determine the comparability of aSSI to ω (using both Bock’s nominal model and the Rasch model), GBT (using both Bock’s nominal model and the Rasch model) and B3.

Three real datasets with known security issues were used for the comparability study:

Exam A: This exam contained 66 items, 416 examinees; content was found on a brain dump site
Exam B: This exam contained 60 items, 1992 examinees; content was found on a brain dump site, where some of the items were mis-keyed
Exam C: This exam contained 66 items, 1109 examinees; other security analyses, such as score by time and scored versus unscored analyses, indicated security concerns

A correlation matrix was used to compare the strength of the relationship between the values used to detect collusion for all possible pairs of examinees given each method. For example, the probability of collusion for each pair of examinees based on ω (using Bock’s model) was correlated with the corresponding probability of collusion for each pair based on the aSSI method.*
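A minimal Python sketch of this kind of comparison follows. It enumerates all examinee pairs and correlates the per-pair values from two methods. The examinee IDs and the probabilities are made up for illustration and are not taken from the study's datasets; the pair count for a 416-examinee exam is included to show the scale involved.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Four hypothetical examinees give six unordered pairs.
pairs = list(combinations(["E1", "E2", "E3", "E4"], 2))

# Hypothetical per-pair probabilities from two methods, in the same pair order.
method_a = [0.40, 0.02, 0.75, 0.10, 0.55, 0.001]
method_b = [0.35, 0.03, 0.80, 0.12, 0.50, 0.002]

print(len(pairs), round(pearson(method_a, method_b), 3))

# Number of pairs that must be compared for an exam with 416 examinees.
print(math.comb(416, 2))
```

Repeating the correlation for every pair of methods fills in the full correlation matrix.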

Results

The results indicated that aSSI produced results comparable to those of ω, GBT and B3. Tables 1-3 show the correlations among the methods. The main findings of this comparison include:

The strongest correlations among all three exams were between GBT-Rasch and aSSI
The weakest correlations tended to involve B3
aSSI had strong positive correlations (> 0.900) with both the ω-Rasch and GBT-Rasch methods

Table 1. Correlation of Flagged Pairs Using Different SSI Methods – Exam A

ω: Wollack’s omega; GBT: generalized binomial test; B3: person residual correlations; aSSI: approximation score similarity index

Table 2. Correlation of Flagged Pairs Using Different SSI Methods – Exam B

ω: Wollack’s omega; GBT: generalized binomial test; B3: person residual correlations; aSSI: approximation score similarity index

Table 3. Correlation of Flagged Pairs Using Different SSI Methods – Exam C

ω: Wollack’s omega; GBT: generalized binomial test; B3: person residual correlations; aSSI: approximation score similarity index

Discussion

Identifying pairs of examinees who have colluded is a problem in the credentialing field as well as fields outside of credentialing (e.g., classroom assessments). As professionals in the field, part of our responsibility is to reach out and educate those not in the testing field to ensure we provide the community with the resources and tools they need to help develop and validate results from their own assessments.

Many of the current methods that identify whether collusion has likely occurred involve statistical software packages that are neither easily accessible nor simple to implement for those outside of the assessment field. Thus, their use is effectively restricted to a certain pool of users. The aSSI method presented in this article is one any individual could compute with a calculator and a normal distribution table or with spreadsheet software. The results are directly interpretable as a probability, e.g., if the probability is < 0.01, then there is a strong possibility some form of collusion has occurred.

While thought needs to be given to the exact flagging threshold one applies (e.g., 0.001, 0.01), the method and results are accessible and understandable to a much wider audience than ω, GBT and B3.
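To make the threshold choice concrete, the Python sketch below converts a one-tailed probability threshold into the matching z cutoff. It also shows how many pairs would be expected to cross that threshold by chance alone, under the simplifying assumptions that nobody colluded and that the statistic is perfectly calibrated. The function names are hypothetical, and the 416-examinee count is used purely as an illustration of scale.

```python
import math
from statistics import NormalDist

def z_cutoff(alpha):
    """One-tailed z cutoff whose upper-tail probability equals alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)

def expected_chance_flags(n_examinees, alpha):
    """Expected number of flagged pairs if every pair had probability alpha
    of being flagged by chance (no collusion, well-calibrated statistic)."""
    return math.comb(n_examinees, 2) * alpha

for alpha in (0.01, 0.001):
    print(alpha, round(z_cutoff(alpha), 3), round(expected_chance_flags(416, alpha), 1))
```

With many pairs compared at once, a looser threshold can produce a sizable number of chance flags, which is one reason to treat a flag as a prompt for further investigation rather than as proof.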

Practical Policy and Implementation Considerations

Score similarity indices, such as aSSI, are just one method to detect potential collusion among one or many pairs of examinees. Data forensic techniques, such as aSSI, may detect unusually similar responses that should be investigated further, but one statistical method alone does not provide unequivocal and actionable evidence.¹¹ Multiple methods should be employed to detect potential collusion and evaluate its impact on examination outcomes.¹²

Jacobs, Judish and Murphy¹³ discuss some general ways in which organizations may approach ethics violations (e.g., cheating, gaining pre-knowledge) in a legally defensible way. Based on their work, guidance from the AERA, APA, & NCME Standards,¹⁴ and recent work of others (e.g., Thompson, Weinstein and Schoenig;¹⁵ Twing, Keen, Canto and Friess;¹⁶ O’Leary and Owens¹⁷), several themes emerge related to taking action based on data forensics:

Programs/schools should have a policy in place that examinees/students must agree to that indicates their exam results may be monitored for aberrant behavior as well as any actions that may be taken. The policy could be published on such documents as a candidate agreement form or class syllabus.
Programs/schools should have a procedure in place for implementing a security policy fairly and consistently.
Programs/schools should make the policies and procedures related to violations of an ethics code (e.g., cheating) transparent.
Programs/schools should gather multiple sources of evidence and/or suspicious results from data forensics analyses before taking action.

On this last point, the results from aSSI alone are likely insufficient to take action (e.g., canceling an exam score). However, grounds to take action become stronger if aSSI results show highly unusual behavior and a proctor observed collusion during the administration.

Conclusion

The results in this paper and those by Smith⁹,¹⁰ suggest aSSI sufficiently approximates SSI and the simplicity of the method does not compromise the method’s effectiveness. In this paper, aSSI strongly correlates with the ω and GBT methods (applying both Bock and Rasch models) as well as B3. As such, the results of this study indicate aSSI is a reasonable method to apply in order to provide a layer of evidence that collusion has or has not likely occurred on an assessment. This method can be easily applied by both members and non-members of the credentialing industry. With the ease of this method, any individual within the extended testing industry can readily apply this data forensic method and couple it with other evidence (e.g., statistical or non-statistical) to make a stronger case for taking action against one or more examinees in accordance with an established and consistently applied set of policies and procedures.

*The results did not compare the number of flagged individuals because the flagging criteria may differ slightly for each method and the goal was not to identify the “best” method, but only the comparability of the methods.

**The negative correlations with B3 are expected as B3 leverages residual correlations with no assumed underlying distribution and, therefore, no probability values. Higher B3 values indicate collusion. The other statistics in these tables include assumed underlying distributions resulting in estimated probabilities.

References

1. Wollack, J. A. (1996). Detection of answer copying using item response theory. Dissertation Abstracts International, 57/05, 2015.
2. van der Linden, W. J., & Sotaridona, L. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31(3), 283-304.
3. Foley, B. P. (2019). Collusion Detection Using an Extension of Yen’s Q3 Statistic. Presented at the 8th Annual Conference on Test Security. Miami, FL.
4. Smith, R. W. (2019). Comparing B3 to Answer Similarity Index for Detecting Collusion. Presented at the 8th Annual Conference on Test Security. Miami, FL.
5. Russell, B. (1954). The Scientific Outlook. Third impression. Great Britain: Unwin Brothers, Ltd. Available online at https://ia801606.us.archive.org/7/items/in.ernet.dli.2015.499767/2015.499767.the-scientific_text.pdf
6. Zopluoglu, C. (2018). CopyDetect: Computing response similarity indices for multiple-choice tests (R Package Version 1.3). https://cran.r-project.org/web/packages/CopyDetect/index.html
7. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
8. Linacre, J. M. (2022). Winsteps® Rasch measurement computer program (Version 5.2.3). Portland, Oregon: Winsteps.com
9. Smith, R. W. (2021, October 6-7). A Practical Approximation of Response Similarity. Conference on Test Security, online.
10. Smith, R. W. (2022, April). Approximation answer and response similarity analyses: A practical approach [Paper presentation]. Annual meeting of the National Council on Measurement in Education (NCME), San Diego, CA.
11. Foster, D., & Mulkey, J. (2019). Practical test security for professional credentialing programs. In J. Henderson (Ed.), The ICE Handbook (3rd ed., pp. 415-446).
12. Hurtz, G. M., & Weiner, J. A. (2019). Analysis of test-taker profiles across a suite of statistical indices for detecting the presence and impact of cheating. Journal of Applied Testing Technology, 20(1). Available online at https://www.jattjournal.com/index.php/atp/article/view/140828
13. Jacobs, J., Judish, J., & Murphy, D. C. (2019). Certification law. In J. Henderson (Ed.), The ICE Handbook (3rd ed., pp. 45-70).
14. AERA, APA, & NCME. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
15. Thompson, C., Weinstein, M., & Schoenig, R. (2022). Opening keynote on Ogletree vs. Cleveland State University. Presented at the 2022 Conference on Test Security. Princeton, NJ.
16. Twing, J. S., Keen, J. M., Canto, P., & Friess, B. (2022). Lessons learned from federal litigation of cheating involving a test preparation company: Security is a pre-requisite for validity. Presented at the 2022 Conference on Test Security. Princeton, NJ.
17. O’Leary, L., & Owens, C. (2022). Simplifying Security: Deciphering Data Forensics into Accessible Actions. Presented at the 2022 Institute for Credentialing Excellence Conference. Savannah, GA.