Test scores may no longer have their intended interpretation when a test administration or preparation has been compromised by cheating. Several types of cheating can threaten a testing program. Item harvesting occurs when a concerted attempt is made to steal test questions. Examinees can do this by memorizing exam content, recording it with a camera, or transcribing it. The goal is to share content with other examinees, often for profit. Pre-knowledge occurs when examinees have knowledge of specific test questions prior to taking the test, namely questions that had been harvested by other examinees. Collusion occurs when two or more examinees attempt to work together to complete an examination. For example, one examinee might copy answers from the person sitting next to them, or examinees might share answers during the test by text messaging or some type of signaling. Finally, proxy testing occurs when an examinee has another person take their test.
There are many different data forensics methods that can be used to help detect cheating. Details regarding the array of methods can be found in Cizek and Wollack (2017). One method is collusion detection, which assesses response similarity between pairs of examinees. The simplest methods use descriptive statistics to summarize the number of responses or errors in common. For example, the Responses in Common index (RIC) is the count of questions for which two examinees have the same response. More complicated methods rely on estimating the probability that response similarity occurs due to chance. Two such methods that can be used within the Classical Test Theory (CTT) framework are Frary, Tideman, and Watts’ (1977) g2 statistic and Wesolowsky’s (2000) Z statistic; Wollack’s (2003) Omega and van der Linden and Sotaridona’s (2006) Generalized Binomial Test (GBT) indices can be used with Item Response Theory (IRT).
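The two kinds of index described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the code is not taken from SIFT or CopyDetect. The first two functions are simple counts of responses and errors in common. The third computes the chance probability of observing at least a given number of matches when each item has its own match probability, which is the general idea behind a generalized binomial approach. The per-item match probabilities are assumed to be supplied by the analyst (for example, from a fitted IRT model); estimating them is outside this sketch.

```python
from typing import Sequence


def responses_in_common(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the items on which two examinees gave the same response (RIC)."""
    if len(a) != len(b):
        raise ValueError("response vectors must cover the same items")
    return sum(1 for x, y in zip(a, b) if x == y)


def errors_in_common(a: Sequence[str], b: Sequence[str], key: Sequence[str]) -> int:
    """Count the items on which both examinees chose the same incorrect option."""
    if not (len(a) == len(b) == len(key)):
        raise ValueError("responses and key must cover the same items")
    return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)


def match_tail_probability(match_probs: Sequence[float], observed: int) -> float:
    """P(number of matches >= observed), assuming items match independently
    with the (analyst-supplied) probabilities in match_probs."""
    dist = [1.0]  # dist[k] = probability of exactly k matches so far
    for p in match_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this item does not match
            nxt[k + 1] += mass * p       # this item matches
        dist = nxt
    return sum(dist[max(observed, 0):])


# Toy example with a five-item test (illustrative data only).
key = "ABCDA"
examinee_1 = "ABCDB"
examinee_2 = "ABDDB"
print(responses_in_common(examinee_1, examinee_2))    # 4
print(errors_in_common(examinee_1, examinee_2, key))  # 1
print(match_tail_probability([0.5] * 4, 3))           # 0.3125
```

A small tail probability suggests that the observed similarity is unlikely to have arisen by chance under the assumed match probabilities. It is not, by itself, evidence of cheating.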
Many data forensics analyses rely on psychometricians writing their own code to implement them because software for this purpose is scarce. However, collusion indices can be obtained from SIFT (Assessment Systems Corporation, ASC) or CopyDetect (Zopluoglu, 2016). SIFT is software developed by ASC for research and consulting purposes. It computes many common collusion indices and can aggregate results by grouping variables such as testing location or classroom. CopyDetect is a package in the open-source R statistical programming language (R Core Team, 2013) that computes several IRT-based collusion indices. One drawback of CopyDetect is that it only processes one pair of examinees at a time, though a loop can be added to process larger batches of data. Because of R’s open-source nature, the quality control of R packages may be limited, so they should be used with some degree of caution.
This article presents an example of how to find suspicious groups of examinees using some collusion indices provided by SIFT and CopyDetect. The investigation used data from one form of a predominantly multiple-choice, 200-item examination testing knowledge of pharmacy. The dataset contained roughly 500 examinees from 12 different testing locations (labelled Locations A through L). Test results from Locations G and L were manipulated to simulate collusion and pre-knowledge. Location G’s exam data was modified to mimic a situation where examinees sitting adjacent to one another engaged in answer sharing/copying. In this location, 30 of the 52 examinees were simulated to be cheaters. Among the simulated cheaters, 10 were specified as sources with 1–3 examinees copying answers from a given source (for a total of 20 copiers). For each copier, the answers to 40–70% of the test questions were changed to match the answers of the designated source. Pre-knowledge was simulated in Location L by specifying 40 of the 53 examinees to answer the same 50 questions correctly. The objective of this example was to see how well the simulated cheating could be detected.
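The kind of manipulation described above can be sketched in Python as follows. This is an illustration only, not the procedure actually used to prepare the dataset: the function names are hypothetical, and the choices to draw the copying proportion uniformly between 40% and 70% and to pick the copied items at random are assumptions that the description above does not specify.

```python
import random
from typing import List, Optional, Sequence, Tuple


def simulate_copying(
    source: Sequence[str],
    copier: Sequence[str],
    low: float = 0.40,
    high: float = 0.70,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Return a modified copy of the copier's responses in which a random
    40-70% of items are overwritten with the source's responses."""
    if len(source) != len(copier):
        raise ValueError("source and copier must cover the same items")
    if rng is None:
        rng = random.Random()
    n_items = len(source)
    proportion = rng.uniform(low, high)          # assumed: uniform draw
    n_copied = round(proportion * n_items)
    copied_items = rng.sample(range(n_items), n_copied)  # assumed: random items
    result = list(copier)
    for i in copied_items:
        result[i] = source[i]
    return result, sorted(copied_items)


def simulate_preknowledge(
    responses: Sequence[str], key: Sequence[str], leaked_items: Sequence[int]
) -> List[str]:
    """Return a modified copy of the responses in which every leaked item
    is answered correctly."""
    result = list(responses)
    for i in leaked_items:
        result[i] = key[i]
    return result
```

Applying `simulate_preknowledge` with the same `leaked_items` to every simulated cheater in a location reproduces the "same 50 questions answered correctly" pattern, while `simulate_copying` would be applied once per copier with that copier's designated source.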
Results from three CTT-based collusion indices computed by SIFT are presented in Figure 1 as the percentage of all examinee pair comparisons within each location that were flagged for possible collusion. For example, among all examinee pairs in Location A (which had 14 examinees), 6.5% were flagged with the RIC index. These statistics vary in their degree of power to detect collusion and they don’t adhere to nominal Type I error rates, but nevertheless cases with outlying flag proportions might indicate cheating. The RIC and Z indices appear more sensitive than g2, flagging a higher percentage of examinee pairs. Location C has the highest incidence of RIC flags and Location D has the highest incidence of Z flags, but they’re not outliers. However, Location L is a noticeable outlier in g2 flagging.
Figure 1. Percentage of examinee pairs within each location flagged by CTT-based collusion indices.
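A pair-level summary like the one in Figure 1 can be computed from a list of flagged pairs, as in the minimal Python sketch below. The function name and input layout are hypothetical and do not reflect SIFT's output format. The sketch treats pairs as unordered, which is an assumption; some indices distinguish source from copier, in which case ordered pairs would be counted instead.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def pair_flag_rates(
    location_of: Dict[Hashable, str],
    flagged_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[str, float]:
    """Percentage of within-location examinee pairs that were flagged.

    location_of maps each examinee ID to a testing location;
    flagged_pairs lists the (unordered) pairs flagged by an index.
    """
    members = defaultdict(list)
    for examinee, location in location_of.items():
        members[location].append(examinee)

    flagged = {frozenset(pair) for pair in flagged_pairs}

    rates = {}
    for location, ids in members.items():
        pairs = [frozenset(p) for p in combinations(ids, 2)]
        if not pairs:  # a location with one examinee has no pairs
            continue
        n_flagged = sum(1 for p in pairs if p in flagged)
        rates[location] = 100.0 * n_flagged / len(pairs)
    return rates
```

The number of unordered pairs grows quickly with location size: a location with 14 examinees has 91 pairs, and one with 52 examinees has 1,326.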
Another way to summarize the flagging is to compute the percentage of examinees that were flagged for collusion with at least one other examinee. This might do a better job identifying cases where examinees who sit adjacent to one another engage in collusion. These flag results are presented in Figure 2, which shows that Locations L and G have the highest percentages of examinees with at least one g2 flag.
Figure 2. Percentage of examinees flagged by g2 for collusion with at least one other examinee within the same location.
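The examinee-level summary can be sketched in the same way. The function below is a hypothetical illustration that counts, for each location, the share of examinees who appear in at least one flagged pair with another examinee from the same location.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def examinee_flag_rates(
    location_of: Dict[Hashable, str],
    flagged_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[str, float]:
    """Percentage of examinees in each location with at least one
    within-location collusion flag."""
    flagged_examinees = set()
    for a, b in flagged_pairs:
        # Only pairs from the same location count toward that location.
        if location_of[a] == location_of[b]:
            flagged_examinees.update((a, b))

    totals = Counter(location_of.values())
    hits = Counter(location_of[e] for e in flagged_examinees)
    return {loc: 100.0 * hits[loc] / n for loc, n in totals.items()}
```

Some simple arithmetic shows why this view can reveal localized copying that the pair-level view hides. Location G has 52 examinees and therefore 1,326 unordered pairs. If only the 20 simulated source-copier pairs were flagged, the pair-level rate would be about 1.5% (20 of 1,326), whereas the examinee-level rate would be about 58% (30 of 52).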
Next, results from two IRT-based indices computed by CopyDetect can be found in Figures 3 and 4. Figure 3 presents the percentage of examinee pairs that were flagged, and Location L (which has already started to look suspicious) is a definite outlier for both Omega and GBT flagging. Location C has a somewhat elevated flag rate for Omega, but it is not very compelling. A different picture emerges when looking at the proportion of examinees that had at least one collusion flag (Figure 4). Here, the results for Location C are more compelling, and again Location G has a high flag rate in addition to Location L. At least 69% of examinees had at least one collusion flag in these three locations.
Figure 3. Percentage of examinee pairs within each location flagged by IRT-based collusion indices.
Figure 4. Percentage of examinees flagged by IRT-based indices for collusion with at least one other examinee within the same location.
How well was the simulated cheating detected? Test locations G and L were among the most suspicious, and these were the locations in which cheating was simulated. Location G—where collusion was simulated—did not appear very abnormal until the proportion of examinees with at least one collusion flag was examined. Pre-knowledge was simulated in Location L, which was a noticeable outlier for g2 flagging and even more of an outlier for the IRT-based collusion indices from CopyDetect. The results from Location L illustrate how indices designed to detect answer copying can also identify pre-knowledge because pre-knowledge is just an instance where examinees are copying some part of a test key.
Data forensics can be a valuable tool to help identify test security breaches and irregular testing behavior, and it can be used to trigger a test-security investigation. After an examinee or group of examinees is flagged, the first question to ask is “does it make sense that cheating could have occurred based on what else is known about the data and testing environment?” For example, two examinees flagged for collusion might be found to have navigated through the exam similarly, which starts to build a rational argument for further investigation. The next step would be to gather more supporting evidence by reaching out to the proctor or test administration vendor to obtain a video recording of the exam session, incident reports, seating charts, etc. The case for cheating would be strengthened if, for example, it was found that the two flagged examinees sat next to each other during the exam. Further action to protect the exam program and stakeholders might be warranted after sufficient and compelling evidence is gathered to support the argument that cheating occurred. In the array of evidence gathered during an investigation, data forensics can provide objectivity and reliability that candidate and witness statements commonly lack.
 Special thanks to Andrew Dedes, who helped prepare the dataset.
 SIFT also computes some IRT-based indices, including Omega.
Cizek, G.J., & Wollack, J.A. (2017). Handbook of quantitative methods for detecting cheating on tests. New York, NY: Routledge.
Frary, R.B., Tideman, T.N., & Watts, T.M. (1977). Indices of cheating on multiple-choice tests. Journal of Educational Statistics, 2, 235-256.
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
van der Linden, W.J., & Sotaridona, L.S. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31, 283-304.
Wesolowsky, G. (2000). Detecting excessive similarity in answers on multiple choice exams. Journal of Applied Statistics, 27(7), 909-921.
Wollack, J.A. (2003). Comparison of answer copying indices with real data. Journal of Educational Measurement, 40, 189-205.
Zopluoglu, C. (2016). CopyDetect: Computing Statistical Indices to Detect Answer Copying on Multiple-Choice Tests. R package version 1.2.