
Optimizing LOFT Test Assembly: Strategies for Exposure Control and Form Diversity

In credentialing examinations, maintaining test integrity and providing a secure and equitable assessment experience is paramount. Traditional fixed test forms pose significant security risks due to potential item exposure, compromising the validity and fairness of credentialing decisions. Linear-on-the-fly testing (LOFT), an innovative form assembly method, addresses these challenges by dynamically generating individualized test forms from a carefully curated item bank, ensuring each examinee receives a unique yet psychometrically equivalent assessment (van der Linden, 2005).

The LOFT approach offers several compelling benefits critical to credentialing exams. By continuously varying test content, LOFT substantially reduces the risk of item exposure, thereby safeguarding test security. It simultaneously enhances fairness by ensuring that each examinee faces an equivalent assessment challenge, regardless of test timing and frequency. Moreover, optimized LOFT designs improve overall item bank efficiency by maximizing item utilization and minimizing redundancy, essential for maintaining the long-term sustainability of the item pool.  

Figure 1 presents a sequential workflow for a LOFT engine that assembles parallel test forms. The process begins with a master item pool, from which a subset of items is selected to form an active item pool, ensuring alignment with predefined content requirements. A mixed integer programming (MIP) model then seeks an optimal solution by balancing multiple objectives and constraints. If the proposed form meets all requirements, it is accepted. An item usage tracker updates the status of included items based on usage thresholds and domain constraints before proceeding. If any requirements are unmet, a new set of items is sampled to create a new active pool and the process repeats.

Figure 1: Sequential Workflow of the LOFT Engine for Parallel Test Form Assembly
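To make the workflow concrete, the sketch below traces the same loop in Python. The item pool, blueprint, multiplier, and threshold values are illustrative assumptions, and a simple random draw stands in for the MIP optimization step; the sketch is meant only to show the control flow of Figure 1, not a production engine.

```python
import random
from collections import Counter

# Illustrative master item pool: each item has an id, a content domain, and a usage count.
MASTER_POOL = [{"id": i, "domain": f"D{i % 4 + 1}", "uses": 0} for i in range(400)]
BLUEPRINT = {"D1": 7, "D2": 6, "D3": 6, "D4": 6}   # items required per domain (25-item form)
POOL_MULTIPLIER = 4                                 # active pool = 4x form length
USAGE_THRESHOLD = 30                                # global exposure cap per item (assumed)

def sample_active_pool(master, multiplier, blueprint, threshold):
    """Randomly draw an active pool, excluding items that exceed the usage threshold."""
    eligible = [it for it in master if it["uses"] < threshold]
    pool = []
    for domain, n_needed in blueprint.items():
        candidates = [it for it in eligible if it["domain"] == domain]
        pool.extend(random.sample(candidates, min(len(candidates), n_needed * multiplier)))
    return pool

def assemble_form(active_pool, blueprint):
    """Stand-in for the MIP step: pick the required number of items per domain."""
    form = []
    for domain, n_needed in blueprint.items():
        candidates = [it for it in active_pool if it["domain"] == domain]
        if len(candidates) < n_needed:
            return None                             # requirements unmet; triggers resampling
        form.extend(random.sample(candidates, n_needed))
    return form

def build_next_form():
    """One pass of the sequential LOFT workflow sketched in Figure 1."""
    while True:
        active_pool = sample_active_pool(MASTER_POOL, POOL_MULTIPLIER, BLUEPRINT, USAGE_THRESHOLD)
        form = assemble_form(active_pool, BLUEPRINT)
        if form is not None:                        # form meets all requirements: accept it
            for item in form:                       # item usage tracker update
                item["uses"] += 1
            return form
        # otherwise sample a new active pool and repeat

form = build_next_form()
print(Counter(item["domain"] for item in form))     # items per domain in the accepted form
```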

This article explores three strategies for optimizing LOFT test assembly. Specifically, it details methods for managing item exposure, enhancing form uniqueness, and improving item pool utilization. These strategies collectively support credentialing organizations in delivering assessments that are secure, fair, psychometrically robust, and operationally sustainable, ultimately protecting the integrity and validity of the credentialing process.

Controlling Item Exposure

Unlike automated, simultaneous form assembly approaches — better suited for assembling a limited set of fixed forms — a LOFT engine sequentially assembles test forms either as examinations are administered or as a preassembled collection of randomized assignments (Luecht and Sireci, 2011). The LOFT engine integrates an item usage tracker that monitors the frequency of item selection independently of the MIP form assembly model (see van der Linden, 2005, for more detail). Items that exceed a predefined usage threshold are excluded from the active item pool for subsequent form assemblies (Way, 1998). Determining the optimal usage threshold is inherently context-sensitive and depends on the depth and breadth of the item pool. Consequently, it is crucial to establish two types of thresholds: global thresholds that apply to the master pool and domain-specific thresholds that apply to individual content areas. In domains that have fewer items than their target representation in the test blueprint, overexposure becomes a likely challenge. While some level of reuse is unavoidable, implementing domain-specific thresholds offers a targeted approach to distributing item usage more evenly. These thresholds are essential when item availability is limited relative to testing demand, thereby minimizing overexposure and maintaining content balance.
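As a simple illustration of this dual-threshold idea, the sketch below tracks item usage with both a global cap and optional domain-specific caps. The class name, threshold values, and item identifiers are assumptions made for the example, not part of any published engine.

```python
from collections import defaultdict

class ItemUsageTracker:
    """Tracks how often each item has been selected and filters the active pool.

    Items are excluded once they reach the global threshold or the threshold
    defined for their content domain, whichever is stricter.
    """

    def __init__(self, global_threshold, domain_thresholds=None):
        self.global_threshold = global_threshold
        self.domain_thresholds = domain_thresholds or {}
        self.usage = defaultdict(int)               # item_id -> times administered

    def record_form(self, form):
        """Update usage counts after a form is accepted."""
        for item_id, _domain in form:
            self.usage[item_id] += 1

    def is_eligible(self, item_id, domain):
        """An item stays eligible only while it is under both thresholds."""
        cap = self.domain_thresholds.get(domain, self.global_threshold)
        return self.usage[item_id] < min(cap, self.global_threshold)

    def filter_pool(self, pool):
        """Return the items still eligible for the next active pool."""
        return [(item_id, domain) for item_id, domain in pool
                if self.is_eligible(item_id, domain)]

# Example: a stricter cap for a thin domain relative to the global cap.
tracker = ItemUsageTracker(global_threshold=50, domain_thresholds={"Ethics": 20})
tracker.record_form([("ITM001", "Ethics"), ("ITM002", "Pharmacology")])
print(tracker.is_eligible("ITM001", "Ethics"))      # True until 20 uses are recorded
```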

Between-Form Similarity

Minimizing between-form similarity is essential to mitigate potential test security risks. A practical approach is to create the active pool for each assembly sequence through random selection of items (Nunnally and Bernstein, 1994). A multiplier is applied to the form length to determine both the overall size of the active pool and the number of items drawn from each domain. For example, an active pool containing 100 items, four times the length of the test form, is generated at each assembly iteration. This relatively straightforward dynamic randomized active pool approach effectively controls between-form similarity while also reducing item exposure (Chen, Chang and Wu, 2012).
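The sketch below illustrates this idea under assumed pool sizes and blueprint counts: a fresh active pool is drawn with a 4x multiplier at each iteration, a random draw again stands in for the MIP selection step, and the average pairwise overlap between assembled forms provides a rough check on between-form similarity.

```python
import random
from itertools import combinations

random.seed(7)

# Hypothetical master pool: 400 item ids spread evenly over four domains.
MASTER_POOL = {f"D{d}": [f"D{d}-{i:03d}" for i in range(100)] for d in range(1, 5)}
BLUEPRINT = {"D1": 7, "D2": 6, "D3": 6, "D4": 6}    # 25-item form
MULTIPLIER = 4                                      # active pool = 4x form length

def draw_active_pool(master, blueprint, multiplier):
    """Randomly draw multiplier x the blueprint count from each domain."""
    return {d: random.sample(master[d], blueprint[d] * multiplier) for d in blueprint}

def overlap_rate(form_a, form_b):
    """Proportion of items shared by two forms (a simple similarity index)."""
    return len(form_a & form_b) / len(form_a)

# Stand-in for full assembly: select the blueprint count from each freshly drawn pool.
forms = []
for _ in range(10):
    pool = draw_active_pool(MASTER_POOL, BLUEPRINT, MULTIPLIER)
    forms.append({item for d in BLUEPRINT for item in random.sample(pool[d], BLUEPRINT[d])})

rates = [overlap_rate(a, b) for a, b in combinations(forms, 2)]
print(f"average between-form overlap: {sum(rates) / len(rates):.2%}")
```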

Maximizing Item Pool Utilization

Even after successfully modeling objectives and constraints, the LOFT engine tends to repeatedly select items whose difficulties align with the specified evaluation points, thereby maximizing the target test information. Since an item's information function peaks when the item's difficulty matches the evaluation point, the LOFT engine naturally favors these items.

Two strategies can be considered to address the issue of the LOFT engine repeatedly selecting items with similar difficulty levels. Lowering the target test information can encourage the selection of items with a broader range of difficulties (Belov and Armstrong, 2009). Under the Rasch model, since the maximum information for a single item is 0.25[i], the maximum attainable test information for a 100-item test is 25 (De Ayala, 2013, p. 102). By setting the target information at the evaluation points below this ceiling, the engine is nudged toward choosing items that vary in difficulty. However, the Test Information Function (TIF) is inversely related to the conditional standard error of measurement, so lower information at the cut score can reduce classification precision in criterion-referenced testing (Ali and van Rijn, 2016).
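The short calculation below illustrates the arithmetic behind this trade-off under the Rasch model. The cut score, the off-target difficulty, and the 80% relaxation factor are arbitrary values chosen for the example.

```python
import math

def rasch_info(theta, b):
    """Item information under the Rasch model: P(1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

cut_score = 0.0                                     # evaluation point (e.g., the cut score)
n_items = 100

# An item matched exactly to the cut score contributes the maximum of 0.25 ...
print(rasch_info(cut_score, b=0.0))                 # 0.25
# ... while a moderately off-target item still contributes meaningful information.
print(round(rasch_info(cut_score, b=1.0), 3))       # ~0.197

max_tif = 0.25 * n_items                            # theoretical ceiling: 25 for 100 items
relaxed_target = 0.8 * max_tif                      # e.g., aim for 20 instead of 25
print(max_tif, relaxed_target)
```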

Categorizing items by difficulty and imposing specific constraints on each category can diversify item selection, ultimately enhancing item bank utilization (Chen, Chang and Wu, 2012; Lim and Han, 2024). The number of categories and the allocation of items can be determined from the distribution of item difficulties in the pool. Specifying a range of values for the number of items to be selected from each difficulty category yields a more feasible model than specifying fixed counts. For example, if item difficulties in the pool follow a normal distribution, the difficulty range can be divided into seven categories centered at -2.5, -1.5, -0.5, 0, 0.5, 1.5 and 2.5. Item allocation can be based on the normalized values of the standard normal density at these centers. For a 100-item test, this yields approximately 1-3 items each from the categories centered at -2.5 and 2.5, 7-9 each from -1.5 and 1.5, 23-25 each from -0.5 and 0.5, and 27-29 from 0, ensuring a balanced difficulty distribution aligned with the item pool. Yet adding difficulty constraints increases MIP model complexity, which can hinder feasibility and slow processing. Thus, it is crucial to balance category granularity with realistic constraints to ensure both efficiency and feasibility.
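A minimal MIP sketch of such category constraints is shown below, using the open-source PuLP package as one possible solver interface. The simulated item difficulties, the category bounds, and the objective of maximizing information at an assumed cut score of 0 are illustrative choices, not a prescribed configuration. In an operational model, these bounds would sit alongside the content and exposure constraints described earlier.

```python
import math
import random
import pulp

random.seed(1)

# Illustrative pool: 400 items with roughly normal Rasch difficulties.
difficulties = {i: random.gauss(0.0, 1.0) for i in range(400)}
centers = [-2.5, -1.5, -0.5, 0.0, 0.5, 1.5, 2.5]
# Min/max items per category for a 100-item form (the ranges discussed above).
bounds = {-2.5: (1, 3), -1.5: (7, 9), -0.5: (23, 25), 0.0: (27, 29),
          0.5: (23, 25), 1.5: (7, 9), 2.5: (1, 3)}
cut_score, form_length = 0.0, 100

def category(b):
    """Assign an item to its nearest difficulty-category center."""
    return min(centers, key=lambda c: abs(b - c))

def rasch_info(theta, b):
    """Item information under the Rasch model at evaluation point theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

model = pulp.LpProblem("loft_form", pulp.LpMaximize)
x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in difficulties}

# Objective: maximize test information at the cut score.
model += pulp.lpSum(rasch_info(cut_score, b) * x[i] for i, b in difficulties.items())
# Form length constraint.
model += pulp.lpSum(x.values()) == form_length
# Difficulty-category constraints keep the selection from collapsing onto b near 0.
for c in centers:
    members = [x[i] for i, b in difficulties.items() if category(b) == c]
    lo, hi = bounds[c]
    model += pulp.lpSum(members) >= lo
    model += pulp.lpSum(members) <= hi

model.solve(pulp.PULP_CBC_CMD(msg=False))           # bundled CBC solver, quiet mode
chosen = [b for i, b in difficulties.items() if x[i].value() and x[i].value() > 0.5]
print(len(chosen), round(sum(rasch_info(cut_score, b) for b in chosen), 2))
```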

Developing a LOFT Engine Is Achievable

With targeted strategies, robust psychometric support and the appropriate modeling tools, developing a LOFT engine is entirely achievable. By integrating established psychometric principles into the assembly process, the engine can dynamically assemble test forms that ensure measurement precision and balanced content. This approach is especially valuable in the context of credentialing exams, where fairness, security, and comparability across test forms are critical. A well-designed LOFT system not only strengthens test security but also supports defensible pass/fail decisions, which is key to maintaining the integrity and credibility of credentialing programs.


References

Ali, U. S. and van Rijn, P. W. (2016). An Evaluation of Different Statistical Targets for Assembling Parallel Forms in Item Response Theory. Applied Psychological Measurement, 40(3), 163-179.

Belov, D. I. and Armstrong, R. D. (2009). Direct and Inverse Problems of Item Pool Design for Computerized Adaptive Testing. Educational and Psychological Measurement, 69(4), 533-547.

Chen, P. H., Chang, H. H. and Wu, H. (2012). Item Selection for the Development of Parallel Forms From an IRT-based Seed Test Using a Sampling and Classification Approach. Educational and Psychological Measurement, 72(6), 933-953.

De Ayala, R. J. (2013). The theory and practice of item response theory. Guilford Publications.

Lim, H. and Han, K. C. T. (2024). An Automated Item Pool Assembly Framework for Maximizing Item Utilization for CAT. Educational Measurement: Issues and Practice, 43(1), 39-51.

Luecht, R. M. and Sireci, S. G. (2011). A Review of Models for Computer-Based Testing. Research Report 2011-12. College Board.

Nunnally, J. C. and Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

van der Linden, W. J. (2005). Linear Models for Optimal Test Design. Springer Science+Business Media.

Way, W. D. (1998). Protecting the Integrity of Computerized Testing Item Pools. Educational Measurement: Issues and Practice, 17(4), 17-27.


[i] Under the Rasch model, the information provided by item j is calculated as Pj(1 - Pj), where Pj is the probability of a correct response. The maximum information an item can provide is 0.25, calculated as 0.5 × (1 - 0.5), which occurs precisely when the examinee's ability matches the item's difficulty, yielding a probability of 0.5.

