Computerized Adaptive Testing: A Primer
In a sentence
A comprehensive primer on how to build, calibrate, maintain, and validate computerized adaptive tests (CAT) that tailor item selection to each examinee while preserving measurement quality, fairness, and security.
Computerized Adaptive Testing: A Primer assembles three decades of psychometric research into a single accessible guide to the design and operation of CAT systems. Using a unifying hypothetical example, the Gedanken Computerized Adaptive Test (GCAT), the authors walk readers through every component required to field a viable adaptive test: system hardware and software design, item-pool construction and review, item response theory (IRT) and item calibration, testing algorithms for item selection and stopping, scaling and equating, reliability and measurement precision, and validity. The second edition adds hard-won operational lessons about examinee access, item-pool security (including the famous Kaplan/GRE exploit), and the economics of computerized testing. Technical where it must be and richly illustrated with examples, the book is the standard reference for anyone designing, evaluating, or critiquing adaptive measurement systems, and it candidly weighs when computerization is, and is not, worth the cost.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A factor model expressing how design levers (item pool quality, IRT model fit, item selection algorithm, exposure control, content balancing) and contextual conditions (administration medium, continuous-testing demand) produce psychological and behavioral states during testing (examinee engagement, proficiency estimation) and ultimately outcome metrics (measurement precision, score comparability, validity of inferences, and test security).
Item Pool Quality (design lever)
The degree to which the pool of candidate items is well written, sensitivity-reviewed, pretested, covers a wide range of difficulty for each content area, and conforms to the assumptions of the psychometric model used for calibration and scoring.
IRT Model Fit / Unidimensionality (design lever)
The extent to which the chosen item response theory model adequately describes examinee responses, including conformity to assumptions of unidimensionality, local (conditional) independence, item fungibility, and absence of differential item functioning.
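As a concrete illustration, and assuming a two-parameter logistic (2PL) model (the summary does not say which IRT model the book adopts), the response function and the local-independence assumption can be written as below. The symbols a_i (discrimination), b_i (difficulty), u_i (scored response, 0 or 1), and θ (proficiency) are notation chosen here for illustration.

```latex
P_i(\theta) = \Pr(u_i = 1 \mid \theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}

\Pr(u_1, \dots, u_n \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\,\bigl[1 - P_i(\theta)\bigr]^{1 - u_i}
```

Unidimensionality corresponds to θ being a single scalar, and local independence is what lets the joint probability factor into the product on the second line.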
Item Selection Algorithm (design lever)
The set of rules determining which item is presented first, which item follows each response, and when testing stops, typically maximizing information or expected posterior precision near the current proficiency estimate subject to content and exposure constraints.
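A minimal sketch of maximum-information selection with a simple stopping rule may make this concrete. It assumes a 2PL model and a pool stored as (a, b) pairs; the function names, the pool values, and the default thresholds are hypothetical, and content and exposure constraints are deliberately left out.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item at proficiency theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, items, administered):
    """Return the index of the most informative unused item, or None."""
    candidates = [i for i in range(len(items)) if i not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda i: item_information(theta_hat, *items[i]))

def should_stop(standard_error, n_administered, se_target=0.3, max_items=30):
    """Stop once the estimate is precise enough or the length cap is hit."""
    return standard_error <= se_target or n_administered >= max_items

# Hypothetical pool of (a, b) pairs: discrimination, difficulty.
pool = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
first = select_next_item(0.1, pool, administered=set())
second = select_next_item(0.1, pool, administered={first})
```

With equal discriminations, the item whose difficulty lies closest to the current estimate is chosen first; an operational algorithm would filter the candidate list by content and exposure rules before taking the maximum.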
Content Balancing Constraints (design lever)
Constraints imposed on item selection so that each examinee receives a comparable mixture of items across content subdomains and informal contexts, preserving comparability of scores and content validity despite individualized item administration.
Item Exposure Control (design lever)
Mechanisms that limit how frequently any item is administered, either proportionally or in absolute number, to flatten the item usage distribution and reduce the chance that memorized items materially inflate examinee scores.
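One simple way to picture a proportional limit is a maximum-exposure-rate cap: an item stays eligible only while its observed administration rate is below a chosen ceiling. The sketch below is an illustration under that assumption, not a method attributed to the book; the function name and the 0.2 ceiling are hypothetical.

```python
def eligible_items(admin_counts, n_examinees, max_rate=0.2):
    """Return indices of items whose exposure rate is below max_rate.

    admin_counts[i] is how many examinees have seen item i so far.
    """
    if n_examinees == 0:
        # No one has tested yet, so every item is still eligible.
        return list(range(len(admin_counts)))
    return [
        i for i, count in enumerate(admin_counts)
        if count / n_examinees < max_rate
    ]

# Hypothetical usage: three items after 100 examinees.
allowed = eligible_items([30, 5, 20], n_examinees=100, max_rate=0.2)
```

The selection step would then choose only among the returned indices, which flattens usage at the cost of sometimes skipping the most informative item.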
Administration Medium (Computer vs Paper) (contextual condition)
The contextual condition of whether items are presented and responded to via computer versus paper-and-pencil, which can change the cognitive task, difficulty, timing, and response process for some items, producing medium (mode) effects.
Continuous Testing Demand (contextual condition)
The economically and operationally imposed requirement that computerized tests be administered continuously over time rather than at a few fixed mass-administration dates, shaping examinee access, item exposure, and security challenges.
Proficiency Estimation Accuracy (psychological state)
The accuracy and stability with which an examinee's latent proficiency is estimated during and at the end of an adaptive test, reflected in the width of the posterior distribution or the standard error of the proficiency estimate.
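To show how the posterior width becomes a standard error, here is a sketch of an expected a posteriori (EAP) estimate computed on a grid. It assumes a 2PL model and a standard normal prior, neither of which is specified in the summary; the function names and grid settings are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, grid_min=-4.0, grid_max=4.0, n_points=81):
    """Return (posterior mean, posterior SD) of theta on an evenly spaced grid.

    responses: list of 0/1 scores; items: matching list of (a, b) pairs.
    """
    step = (grid_max - grid_min) / (n_points - 1)
    grid = [grid_min + k * step for k in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            w *= p if u == 1 else (1.0 - p)  # likelihood under local independence
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Before any responses the posterior is just the prior.
prior_mean, prior_sd = eap_estimate([], [])
# Two correct answers on hypothetical items shift the estimate upward.
post_mean, post_sd = eap_estimate([1, 1], [(1.0, 0.0), (1.0, 0.5)])
```

The posterior standard deviation shrinks as responses accumulate, which is the quantity a precision-based stopping rule would monitor.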
Examinee Engagement and Comfort (psychological state)
The examinee's behavioral and affective state during testing—being challenged but not discouraged, avoiding boredom from too-easy items or frustration from too-hard items, and freedom from undue test or computer anxiety.
Measurement Precision / Information (outcome metric)
The amount of test information (inverse of error variance) achieved across the proficiency range, indicating how precisely the test measures each examinee and how efficiently it uses items.
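The relationship between information and error variance can be written out explicitly. The expressions below assume a 2PL model with discrimination a_i and difficulty b_i, notation introduced here for illustration rather than taken from the summary; S denotes the set of items administered to the examinee.

```latex
P_i(\theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}

I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr]

I(\theta) = \sum_{i \in S} I_i(\theta), \qquad
\operatorname{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{I(\theta)}}
```

Because item information peaks where θ is close to b_i, administering items matched to the examinee's proficiency raises I(θ) faster per item, which is the source of the efficiency gain an adaptive test offers.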
Score Comparability / Equating Quality (outcome metric)
The degree to which scores from different adaptive forms, and from CAT versus legacy paper-and-pencil tests, can be placed on a common scale and used interchangeably, satisfying same-construct, equity, population invariance, and symmetry conditions.
Validity of Score Inferences (outcome metric)
The appropriateness, meaningfulness, and usefulness of the specific inferences (selection, placement, prediction) made from test scores, encompassing content, construct, and criterion-related validity and freedom from threats such as differential item functioning.
Test Security (outcome metric)
The degree to which the item pool is protected against compromise, theft, and preknowledge that would allow examinees to inflate scores without commensurate proficiency, threatening score validity.
How they connect
- item pool quality → predicts measurement precision
- item selection algorithm → predicts proficiency estimation accuracy
- proficiency estimation accuracy → predicts measurement precision
- item selection algorithm → mediates measurement precision
- content balancing → moderates validity of inferences
- IRT model fit → moderates validity of inferences
- exposure control → predicts test security
- test security → predicts validity of inferences
- continuous testing demand − influences test security
- administration medium − moderates score comparability
- item pool quality → predicts score comparability
- item selection algorithm → predicts examinee engagement
- measurement precision → predicts validity of inferences
- continuous testing demand − influences examinee engagement
The story
The reader
A psychometrician, test developer, or measurement professional who wants to build, evaluate, or responsibly deploy a computerized adaptive test.
External problem
Conventional paper-and-pencil tests waste examinee time with inappropriately easy or hard items and cannot individualize measurement, while building a CAT requires solving many interlocking statistical, operational, and security problems.
Internal problem
The reader feels overwhelmed by a scattered, highly technical literature and uncertain whether they can field a defensible adaptive test that is fair, accurate, and economically sensible.
Philosophical problem
Tests should measure people as efficiently, fairly, and validly as possible; giving everyone the same items, or spending their effort on inappropriate ones, is wasteful and inequitable.
The plan
- Understand the history and rationale of adaptive testing and IRT.
- Design the system: hardware, software, and human factors for administration.
- Construct, review, pretest, and calibrate a high-quality item pool.
- Implement testing algorithms for starting, continuing, and stopping, with exposure control and content balancing.
- Scale and equate scores to make them comparable across forms and to legacy tests.
- Evaluate reliability, measurement precision, and validity.
- Confront operational realities of access, security, and cost before deciding to computerize.
Success
- An adaptive test that measures each examinee accurately in about half the time of a fixed test.
- Scores that are comparable across forms and equatable to legacy instruments.
- A defensible, fair, secure testing program whose inferences are valid for their intended use.
- Confidence to know when computerization adds value and when it does not.
At stake
- A poorly constructed CAT with flawed or insecure items that compromises fairness and validity.
- Score inflation through item theft and uncontrolled exposure.
- Wasted money computerizing a test that gains nothing from it.
- Legal and ethical challenges over incomparable scores or differential validity.
Chapter by chapter
ch01 Introduction and History
This chapter traces the history of mental testing, particularly in military and educational contexts, and shows how the evolution of testing methods bears on future assessments.
ch02 System Design and Operation
This chapter explores the critical elements of system design and operation, highlighting common issues that arise during implementation and strategies for effective management.
ch03 Item Pools
This chapter presents a structured approach to creating effective item pools for assessments, emphasizing the importance of dimensionality and systematic development.
Related in the literature
The measurement literature behind this signal — sourced, so you can defend it.
“Once tests are administered on a computer, describe some testing options that would be open that are currently not available. _______________ 1 There is a distinction between sets of items for which cumulative and noncumulative (or unfolding) response models are appropriate.…”
— computerized_adaptive_testing (match 69%)
“A., & Gade, P. A. (1983). An application of computerized adaptive testing in Army recruiting. Journal of Computer-Based Instruction . 10 , 37–89. Sands, W. A., Waters, B, & McBride, J. R. (Eds.). (1997). Computerized adaptive testing: From inquiry to operation . Washington, DC:…”
— computerized_adaptive_testing (match 68%)
“Title: Computerized Adaptive Testing Author: Wainer, Howard; Dorans, Neil J.; Flaugher, Ronald; Green, Bert F.; Mislevy, Robert J. ASIN: B000SHL2Z2 ISBN: 9781135660819 Computerized Adaptive Testing A Primer Second Edition Computerized Adaptive…”
— computerized_adaptive_testing (match 67%)
Resources: computerized_adaptive_testing