[Rasch] RMT Article
irothnie at usyd.edu.au
Mon Sep 9 12:14:42 EST 2013
This is a timely discussion for me: it bears on a problem I have just run into, and I wonder whether others have experienced it too.
I used MFORMS to equate 2 MCQ tests (dichotomously scored) with common items.
The tests are given to first-time takers only (no repeaters). I should note that the items are kept securely and never released to test takers.
Usually I compare the average p-value of the common items across the two tests, and then the average p-value of the items unique to each test, to get a feel for whether there is a 'test' difficulty difference or a 'sample' ability difference.
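That heuristic can be sketched in a few lines. This is a minimal illustration with tiny hypothetical response matrices (rows = persons, columns = items, 1 = correct); the item indices and data are invented for the example, not taken from the actual tests.

```python
# Hedged sketch of the common-vs-unique p-value comparison described above.
# All data and column indices are hypothetical.
import numpy as np

form_a = np.array([[1, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 1, 1]])
form_b = np.array([[1, 1, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0]])

common = [0, 1]   # assumed common-item columns on both forms
unique = [2, 3]   # assumed form-specific columns

p_common_a = form_a[:, common].mean()   # proportion correct, common items, form A
p_common_b = form_b[:, common].mean()
p_unique_a = form_a[:, unique].mean()
p_unique_b = form_b[:, unique].mean()

# A gap on the common items (same items, different samples) points to a
# sample-ability difference; a gap on the unique items while the common
# items agree points to a form-difficulty difference.
```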
For my current example there was no difference in the average p-value of the common items and very little difference in the average p-value of the unique items.
The MFORMS result was striking in that it produced test characteristic curves showing a difference in expected raw score at ability 0 logits of around 9%!
An anchored run of the second test also showed it to be more difficult, in the frame of reference of test A, by about half a logit (which seems a bit small compared to the MFORMS result). The differences in expected scores from MFORMS seem to be exacerbated at the lower end of the score spectrum - is this a clue to odd behaviour by either sample?
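For readers unfamiliar with common-item equating, the half-logit shift above is the kind of quantity a linking constant captures: the mean difference in difficulty of the common items across the two separate calibrations. A minimal sketch, using hypothetical logit difficulties (not the actual calibrations from this analysis):

```python
# Hedged sketch of a common-item linking constant: the mean shift in the
# difficulty of the common items between two separately calibrated forms.
# The difficulty values below are hypothetical.
import statistics

b_common_form_a = [-1.2, -0.4, 0.3, 1.1]   # common items calibrated on form A
b_common_form_b = [-0.7, 0.1, 0.8, 1.6]    # same items calibrated on form B

link_constant = statistics.mean(b_b - b_a
                                for b_a, b_b in zip(b_common_form_a, b_common_form_b))
# A positive constant means form B's frame of reference sits higher:
# measures from form B must be shifted by this amount to be compared
# with measures from form A.
```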
Given the comments below, I have just re-examined the unexplained variance in the first contrast for each test, and there is a secondary dimension with the strength of about 3-4 items on each 120-item test.
I split the first test into two subtests, with the items loading on contrast 1 assigned to the first and those on contrast 2 to the second. Crossplotting the person measures, I find about 9% of people outside the confidence-interval bands, and an empirical slope of about 2. This seems to contradict the disattenuated correlation of 0.97 reported on the Winsteps Excel output worksheet, which in turn differs from my manual calculation of 0.70 [person measure biserial correlation / SQRT(reliability test 1 * reliability test 2)].
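The manual calculation in brackets is the standard Spearman disattenuation formula: the observed correlation divided by the square root of the product of the two reliabilities. A minimal sketch, with placeholder inputs rather than the actual figures from this analysis:

```python
# Spearman's disattenuation formula, as used in the manual check above.
# The correlation and reliabilities passed in below are hypothetical
# placeholders, not the values from the actual analysis.
import math

def disattenuated_r(r_observed: float, rel_1: float, rel_2: float) -> float:
    """Correct an observed correlation for measurement error in both tests."""
    return r_observed / math.sqrt(rel_1 * rel_2)

r_true = disattenuated_r(0.60, 0.85, 0.85)  # hypothetical inputs
```

Note that with perfectly reliable tests (both reliabilities 1.0) the formula leaves the observed correlation unchanged, which is one quick sanity check on a manual calculation.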
So, does this mean that I have tests with clear second dimensions (8.5% of people outside the CI bands on the crossplot of person measures), and that MFORMS test-equating strategies therefore give rise to spurious linking constants?
Any thoughts on what I think is a perplexing outcome would be much appreciated.
From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of rsmith
Sent: Saturday, September 07, 2013 6:23 AM
To: rasch at acer.edu.au
Subject: [Rasch] RMT Article
I read with interest the recent note in RMT V27, N2, by Royal and Raddatz that contained a cautionary tale about equating test forms for certification and licensure exams. By the end of the note I was troubled by what I feel is a common misunderstanding about the properties of Rasch measurement.
Their tale begins with a test administration and the investigation of item quality and functioning before attempting to equate the current form to a previously established standard.
It is widely known that the properties of item invariance that allow equating in the Rasch model hold if, and only if, the data fit the Rasch model. Fit of the data to the model should therefore be investigated in the initial stage of equating. The authors state, "Preliminary item analyses reveal the items appear to be sound and functioning." One assumes that fit of the data to the model was confirmed in this process, though it is not explicitly stated. As the story continues, we find that the equating solution in fact does not hold across the various subgroups represented in the analysis, and the calibration sample is subsequently altered to produce a different and more logical equating solution.
This suggests that the estimates of item difficulty were not freed from the distributional properties of the sample; hence the data cannot fit a Rasch model. One would hope that it is not necessary to get to the very end of the equating process before discovering that the estimates of item difficulty are not invariant and that the link constant developed for equating is unacceptable.
What, then, is the cause of the problem? Without independent confirmation, I would suggest that the fit statistics used in the preliminary analysis lacked the power to detect violations of this type of first-time vs. repeater invariance. This is easily corrected with the use of the between-group item fit statistic available in Winsteps. It will not solve the problem of lack of fit to the Rasch model, but it will let you know there is a problem before you get too far into the equating process. Developing an item bank that measures both types of examinees fairly is an entirely different issue, and one that should be addressed. The lack of item invariance across subgroups is a classic definition of item bias.
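The idea behind a between-group item fit check can be illustrated in a few lines. This is a rough, hedged approximation of the concept (compare each subgroup's observed score on an item with its Rasch expectation, and accumulate squared standardized group residuals); it is not the exact computation Winsteps performs.

```python
# Illustrative between-group item fit check for one dichotomous item.
# This is a conceptual sketch, not the Winsteps implementation.
import math

def rasch_p(theta: float, b: float) -> float:
    """Rasch probability of success for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def between_group_chi_sq(groups, b):
    """groups: list of (abilities, scores) pairs, one per subgroup
    (e.g. first-time takers vs. repeaters); b: the item's difficulty.
    Returns a chi-square-like sum of squared standardized group residuals;
    a large value flags group-dependent item functioning."""
    total = 0.0
    for abilities, scores in groups:
        expected = sum(rasch_p(t, b) for t in abilities)
        variance = sum(rasch_p(t, b) * (1.0 - rasch_p(t, b)) for t in abilities)
        z = (sum(scores) - expected) / math.sqrt(variance)
        total += z * z
    return total
```

The point of aggregating residuals by group, rather than overall, is power: a misfit that is positive in one subgroup and negative in another can cancel out in an overall item fit statistic while being obvious in the between-group version.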
Richard M. Smith, Editor
Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311, USA
Rasch mailing list
Rasch at acer.edu.au