This study aims to: (1) revise the criterion used in the Robust Z method for detecting item parameter drift (IPD), (2) identify the strengths and weaknesses of the modified Robust Z method, and (3) investigate the effect of IPD on examinees’ classification consistency using empirical data. The study used two types of data. The simulated data consisted of the responses of 20,000 students to 40 dichotomous items, generated by varying six factors: (1) ability distribution, (2) difference in ability between groups, (3) type of drift, (4) magnitude of drift, (5) anchor test length, and (6) number of drifting items. The empirical data consisted of the responses of 4,187,444 students who took the UN SD/MI 2011, administered in 41 test forms covering Indonesian language, mathematics, and science. The modified Robust Z method was used to detect IPD, and the IRT true score equating method was used to analyze classification consistency. The results show that: (1) the criterion of a 0.5-point raw score difference in the test characteristic curve (TCC) leads to 100% consistency in passing classification, (2) the modified Robust Z method accurately detects b- and ab-drift when the anchor test length is at least 25%, and (3) IPD in the empirical data affected the passing status of more than 2,000 students.
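As a rough illustration of the statistic that the study modifies, the sketch below computes the conventional Robust Z for a set of anchor items: each item's parameter difference between two administrations is centered on the median difference and scaled by 0.74 times the interquartile range. It is not the modified criterion proposed in this study. The function names, the example b-parameter values, and the 1.645 cutoff are assumptions chosen for illustration, and the parameters are assumed to be already on a common scale.

```python
from statistics import median, quantiles


def robust_z(old_params, new_params):
    """Return the conventional Robust Z for each anchor item.

    z_i = (d_i - median(d)) / (0.74 * IQR(d)), where d_i is the difference
    between the new and old parameter estimate of item i.
    """
    diffs = [new - old for old, new in zip(old_params, new_params)]
    # Quartiles of the differences; "inclusive" treats the data as the full set.
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR of the differences is zero; Robust Z is undefined")
    med = median(diffs)
    return [(d - med) / (0.74 * iqr) for d in diffs]


def flag_drift(old_params, new_params, cutoff=1.645):
    """Return indices of items whose |Robust Z| exceeds the cutoff.

    The default cutoff is an assumption for this sketch, not a value
    taken from the study.
    """
    z_values = robust_z(old_params, new_params)
    return [i for i, z in enumerate(z_values) if abs(z) > cutoff]


# Hypothetical b-parameters for eight anchor items on two occasions.
old_b = [-1.2, -0.8, -0.3, 0.0, 0.4, 0.9, 1.3, 1.8]
new_b = [-1.4, -0.9, -0.4, 0.0, 0.4, 1.0, 1.5, 3.3]
print(flag_drift(old_b, new_b))
```

Because the median and interquartile range are resistant to outliers, a single strongly drifting item does little to distort the scale against which the other items are judged.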