Scaling Item Difficulty Estimates from Nonequivalent Groups


Placing item statistics on a common scale allows items piloted with groups of test takers of differing ability to be combined into a test with known characteristics. This project examined three approaches to scaling item difficulty estimates: two based on classical test theory and one based on the Rasch model. A simulation study compared the scaled difficulties with the true item difficulties. Although bias and error diminished as sample size increased, all three approaches were found to be extremely accurate at every sample size tested. The approaches differed slightly in their sensitivity to variations in the test-taker ability distributions.