Abstract
A series of 8 tests was administered to university students over 4 weeks for program assessment purposes. The stakes of these tests were low for students; they received course points based on test completion, not test performance. Tests were administered in a counterbalanced order across 2 administrations. Response time effort, a measure of the proportion of items on which solution behavior rather than rapid-guessing behavior was used, was higher when a test was administered in the 1st week. Test scores were also higher. Differences between Week 1 and Week 4 test scores decreased when the test was scored with an effort-moderated model that took into account whether the student used solution or rapid-guessing behavior. Differences decreased further when students who exhibited rapid-guessing behavior on 5 or more of the 30 items were filtered from the data set.
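The response time effort (RTE) measure described above can be sketched as a simple computation: classify each item response as solution behavior or rapid guessing by comparing its response time to an item-level time threshold, then take the proportion of solution-behavior responses. The thresholds and response times below are invented for illustration; the study's actual items and threshold-setting procedure are not reproduced here.

```python
# Hypothetical sketch of response time effort (RTE): the proportion of a
# test-taker's item responses whose response time meets or exceeds an
# assumed per-item threshold separating rapid guessing from solution
# behavior. All numbers here are illustrative, not from the study.

def response_time_effort(response_times, thresholds):
    """Return RTE: fraction of items answered with solution behavior."""
    if len(response_times) != len(thresholds):
        raise ValueError("one threshold per item is required")
    solution = sum(1 for rt, th in zip(response_times, thresholds)
                   if rt >= th)
    return solution / len(response_times)

# Example: 6 items; times below the threshold count as rapid guesses.
times = [12.0, 1.5, 20.0, 2.0, 15.0, 9.0]   # seconds per item
cutoffs = [3.0] * 6                          # assumed 3-second thresholds
print(response_time_effort(times, cutoffs))  # 4 of 6 items -> 0.666...
```

An RTE near 1.0 indicates the examinee engaged with nearly every item; low RTE flags the disengaged examinees that the filtering analysis in the abstract removes.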
Notes
1. The item discriminations were quite low for this test, particularly in the 1st year, and there were only six common items. These conditions can lead to less accurate equating. Norcini, Shea, and Lipman (1994) found greater equating error with less-discriminating items (mean corrected item-total correlation = .09, slightly higher than for Test D in this study). In their review of research on equating, Cook and Petersen (1987) concluded that, in some cases, six to eight items have been found adequate for an anchor test, but generally more items produce more accurate equating. Given the small number of anchor items on Test D and their low discriminations, the simultaneous calibration results were compared with separate calibrations equated through the mean and sigma method, Haebara's characteristic curve method, and Stocking and Lord's characteristic curve method. The four equating methods led to noticeably different results, which called into question the accuracy of any of the results.
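The mean and sigma method mentioned in the note can be illustrated with a short sketch: linking constants A (slope) and B (intercept) are computed from the means and standard deviations of the common-item difficulty (b) estimates in the two separate calibrations, and item parameters on the old scale are then transformed onto the new one. The six anchor-item values below are invented placeholders, not the Test D estimates.

```python
# Illustrative mean/sigma linking of two separate IRT calibrations using
# common-item difficulty (b) estimates. Values are hypothetical.
from statistics import mean, pstdev

def mean_sigma_constants(b_old, b_new):
    """Slope A and intercept B that put the old scale onto the new scale."""
    A = pstdev(b_new) / pstdev(b_old)
    B = mean(b_new) - A * mean(b_old)
    return A, B

def transform(a, b, A, B):
    """Rescale 2PL item parameters: b* = A*b + B, a* = a/A."""
    return a / A, A * b + B

# Six hypothetical anchor items (matching the six common items in the note).
b_form_x = [-1.0, -0.5, 0.0, 0.3, 0.8, 1.4]  # old calibration
b_form_y = [-0.8, -0.2, 0.1, 0.5, 1.0, 1.6]  # new calibration
A, B = mean_sigma_constants(b_form_x, b_form_y)
print(f"A = {A:.3f}, B = {B:.3f}")
```

The characteristic curve methods (Haebara; Stocking and Lord) instead choose A and B to minimize differences between item or test characteristic curves rather than matching only the first two moments of the b distribution, which is why the four methods can diverge when anchor items are few and poorly discriminating.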