Big Data Speaks Up for Test Security

Five primary sources of data help us measure the effects of test security weaknesses, threats and vulnerabilities. This article focuses on one: data gathered from exam administrations.

Tuesday, May 3, 2016

In 1973, while discussing the computational difficulties of detecting answer copying on exams, William H. Angoff wrote: “However, even a brief consideration of this solution (i.e., constructing theoretical statistical distributions of identical responses between random test takers) makes it clear that the complexities … are far too great to make it practical.” A lot has happened in the more than 40 years since. The power and capacity of today’s computers have made “impractical” solutions possible. In this era of big data, smart people using computers are replacing yesterday’s imagining with today’s reality (e.g., digitization of human knowledge found in libraries, sequencing the human genome, etc.).

By applying computational models to databases containing observations and measurements, we can answer critical questions, such as how to improve test security. The statistician W. Edwards Deming, who achieved fame by helping Japanese industry achieve superb quality after World War II, stressed the importance of using data to improve systems and quality. Deming characterized the improvement process using the PDSA (plan, do, study, act) cycle. Every part of this cycle depends on being able to measure outcomes and make positive change. Thus, analysis of data is vital for strengthening test security and ensuring that test scores have their intended meaning.

Five primary sources of data help us measure the effects of test security weaknesses, threats and vulnerabilities. These include:

Data gathered from exam administrations (e.g., data forensics)
Tips received from concerned individuals (e.g., tip lines)
Information retrieved from searching the Internet (e.g., web monitoring)
Reports written by proctors and invigilators (e.g., incident reports)
Observations where tests are administered (e.g., site monitoring and inspection)

Credentialing programs should use and cultivate every source of data possible for improving test security. The remainder of this article focuses on the first source of data.

Using Data Obtained from Exam Administrations

Test result data gathered during test administration arguably provide the richest source for learning about test security threats and weaknesses, because test security threats are most prevalent during exam administration. More non-trusted individuals (some of whom are eager to steal and cheat) are exposed to the questions during exam administration than at any other time. During exam administration, cheaters seek to exploit every weakness and vulnerability in security procedures. Because exam administration is the point of attack for most thieves and cheaters, it is absolutely critical that test result data are analyzed to detect potential test security violations. A proper study of these data is necessary to improve test security using Deming’s PDSA improvement cycle.

While all data must be studied, it would be shortsighted not to recognize that some data provide more information than others about test security threats (i.e., theft and cheating). Test theft, for example, can be accomplished using tiny cameras that record the testing session with no human interaction. Thus, there may be no evidence that the test taker did anything wrong. While detection of theft is difficult, certain types of cheating can be found easily using test result data.

Most cheating techniques fall into two categories: collusion between test takers and pre-knowledge of disclosed exam questions. Collusion occurs when examinees share answers during the testing session (e.g., using cellphones) or breaks (e.g., in restrooms), or when surrogate test takers (i.e., proxies) are employed. It can also happen when teachers or proctors coach test takers during the exam. A variation of collusion called tampering occurs when an exam administrator changes answers for test takers during or after the testing session. Pre-knowledge of exam content generally occurs when secure test questions are disclosed on the Internet, by an instructor or by rogue review courses.

Statistics that can detect collusion and pre-knowledge provide the best information for measuring test security threats and gathering evidence of potential test security violations. These statistics are: similarity or answer-copying statistics (collusion), erasure or answer-change statistics (collusion or tampering), unusual score changes from prior tests (pre-knowledge), inconsistent use of time in responding to questions (pre-knowledge) and inconsistent sub-scores within a test, such as performance differences between old and new questions (pre-knowledge). Of course, additional program-specific statistics may be used if the requisite data are available. One such statistic is analysis of test session time of day to detect when tests are taken after normal business hours.
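To make the first of these statistics concrete, the following is a minimal sketch of a naive similarity screen. It is an illustration only and is not the method used by the author or by any particular program. The function names, the data layout (a dictionary mapping a test-taker ID to a string of selected options) and the z-score threshold are all assumptions made for the example. It counts identical responses for every pair of test takers and flags pairs whose count is unusually high compared with all other pairs. Operational similarity statistics are more careful; for instance, they typically account for item difficulty and for agreement on incorrect answers.

```python
from itertools import combinations
from statistics import mean, pstdev


def count_matches(a, b):
    """Number of items on which two response strings agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def flag_similar_pairs(responses, z_threshold=3.0):
    """Flag pairs whose identical-response count is unusually high.

    responses: dict mapping a (hypothetical) test-taker ID to a string of
    selected options, one character per item.
    Returns a list of (pair, match_count, z_score) tuples.
    """
    # Identical-response count for every pair of test takers.
    counts = {
        (p, q): count_matches(responses[p], responses[q])
        for p, q in combinations(sorted(responses), 2)
    }
    # Empirical baseline: mean and spread of counts across all pairs.
    mu = mean(counts.values())
    sigma = pstdev(counts.values())
    if sigma == 0:
        return []  # every pair looks the same, so nothing stands out
    return [
        (pair, n, (n - mu) / sigma)
        for pair, n in counts.items()
        if (n - mu) / sigma >= z_threshold
    ]
```

A flagged pair is a lead for further review, not proof of collusion. The empirical baseline here also assumes that most pairs did not collude, which would need to be checked against the program's own data.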

Applying Data Analysis in Each Phase of the Test Security Process

Once a program collects and processes the above statistics and others obtained from the testing session, the resulting data can and should be used throughout the test security process. A continual test security process involves four key phases, as shown in Figure 1.

Figure 1: Four Key Phases of Continued Test Security Process

The use of data is now illustrated for each of the test security phases.

Protection. While merely gathering data cannot prevent a person from cheating, the data can be used to prevent a cheater from being awarded an illicit test score. For example, if there is time to analyze data between the exam administration and the score reporting, scores may be withheld for those test results that cannot be certified as being valid. This action protects the exam.

Detection. The above statistics should be used to detect potential test security issues. Monitoring data to detect test security threats is critical. It is similar to installing a security system in a home or business and monitoring to detect break-ins. This is the role most commonly associated with data forensics.

Response. Reliable data are critical for responding to a potential test security breach. The cheater’s goal is to game the test score. By its very nature, this activity represents a statistical attack on the test. This is similar to accounting fraud, which is a numerical attack on account balances. The statistics can be used to confirm or disconfirm the presence of such an activity. Upon confirmation, these data document the test security breach so that it can be analyzed and prosecuted.

Improvement. Test security may be improved by studying the results from the prior three phases. It may also be improved by studying the prevalence of potential test security issues and tracking them through time. In other words, these statistics provide quality control measures that can be used to manage test security threats by revising the test security process.
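As a rough sketch of tracking prevalence through time, the snippet below computes the share of flagged test results in each administration period and lists the periods whose rate exceeds a baseline. The function names, the (period, flagged) record layout, the baseline and the tolerance are hypothetical choices for illustration; a real program would define them from its own history and policies.

```python
from collections import defaultdict


def flag_rate_by_period(results):
    """Share of flagged results per period.

    results: iterable of (period, flagged) tuples, where period is any
    sortable label (e.g., "2016-01") and flagged is a bool.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for period, is_flagged in results:
        totals[period] += 1
        if is_flagged:
            flagged[period] += 1
    return {p: flagged[p] / totals[p] for p in sorted(totals)}


def periods_above_baseline(rates, baseline, tolerance=0.02):
    """Periods whose flag rate exceeds the baseline by more than tolerance."""
    return [p for p, rate in rates.items() if rate > baseline + tolerance]
```

Used this way, the flag rate acts as a simple quality control measure: a period that rises above the baseline suggests that the security procedures in force during that period deserve a closer look.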

Thus, data forensics (i.e., analysis of test result data for the purposes of improving test security) has a broader reach than mere detection. These data are invaluable for responding to potential breaches, learning from test security risks, planning and implementing improved security countermeasures, and even protecting exams. The same observations apply to other data sources, such as web monitoring and incident reports.

Using Big Data with PDSA to Improve Test Security

PDSA is absolutely necessary, but space limitations allow only general examples of how it should be applied.

Plan. Threat assessment is critical to test security planning. By identifying threats, vulnerabilities and weaknesses, program administrators become informed of the data that must be collected, how they should be collected and what analyses will need to be performed. It is vital to plan how the data forensics analyses will be used after they are performed (see remarks by Dr. Greg Cizek).

Do. After test security plans are created along with ways to observe, measure and model threats, test security measures and procedures should be implemented. Proctors need to be trained. Tests should be administered and scored. In short, the credentialing program proceeds forward.

Study. The data that were identified in the planning stage and collected in the implementation stage must be analyzed. These data can inform us about test security hypotheses (e.g., which threats are most prevalent). If the correct data are gathered, they can help program managers understand when tests should be republished, which test security messages to test takers are most effective and how training may be improved.

Act. Given the results of the data analysis, corrective and proactive actions may be taken. These actions will likely lead to modified test security procedures and improved ways to gather data. Also, these actions may generate a test security response such as investigations, enforcement or monitoring.

Measurement professionals can learn from the National Transportation Safety Board (NTSB). It has made air travel in the United States the safest form of travel by consistent adherence to planning, doing, studying and acting. It thoroughly investigates every accident involving air travel until it arrives at the root cause, and then it studies the data in order to implement improvements. Similarly, we need to learn from test security breaches. We need to quantify them, study them and institute corrective measures. A simple phrase resounds with truth: “You can’t manage what you don’t measure. Conversely, when you measure, you can manage.” From a test security perspective, the most important things you can measure are loss, threats (how losses arise) and risks (the potential for loss). By using data to measure these, you are empowered to improve test security and test score validity.

Dennis Maynes is a chief scientist at Caveon Test Security. He has pioneered several methods for the statistical detection of potential test fraud, including the use of clusters to detect cheat rings and the use of embedded verification tests to detect brain dump users. He has conducted more than 450 data forensics projects for more than 50 organizations, including 11 state departments of education, 10 medical programs and 12 information technology certification programs. Maynes holds a master’s degree in statistics from Brigham Young University.
