[Rasch] MFRM and Disordered Scales

Stone, Gregory gstone at UTNet.UToledo.Edu
Mon Feb 18 04:04:29 EST 2008

Any help on the following would be greatly appreciated.

We have been delivering oral examinations for quite some time now and assessing performance using MFRM (many-facet Rasch measurement) with three facets: candidates, items (cases), and judges.  Each judge rates each candidate on each case across five distinct qualities using a four-point rating scale that currently runs from "Superior" through "Basic" and "Problematic" to "Unacceptable" (4-3-2-1).  Over the past several years we have changed the wording of the rating scale multiple times, drawing on discussions with the judges who use it, so that the terms are reasonable to them and can be applied consistently.  Unfortunately, we're still having a problem - an oddly consistent problem.

The rating scale in theory should step from greatest to least (4-3-2-1) on the rating form.  The judges chose the greatest-to-least arrangement as the easier one to use.  We have tried least to greatest as well, with no difference.  In any event, we are now consistently seeing a disordered rating scale running 3-4-2-1.  If this were seen in only one or two exams, it might simply be unique to those exams.  However, the pattern is seen practically across the board.

The natural initial suggestion was that perhaps rating categories 3 and 4 should be collapsed; indeed, they are not separated by the desired logit difference, although they are not identical either.  However, collapsing the "upper" rating categories pretty much destroys the effectiveness of the exam.  Judges ARE selecting all four rating categories, although, as anyone familiar with testing likely knows, the lower two categories are used much less frequently than the upper two.  The disorder is so consistent across the test and across the judges that they appear to be reading the scale as 3-4-2-1, and the statistical performance of items, candidates, and judges is exceptionally strong.  It would not appear reasonable to increase the number of rating scale categories, as we are already having difficulties with only four.

The biggest problem with this rating scale issue comes when additional facets are added.  When facets beyond the basic three are included, they are dysfunctional and uninterpretable because of the disordered rating scale categories.  In essence, any added facet that is also supposed to run from least to greatest, or vice versa, comes out completely scattered and apparently random.  That randomness does not reflect the data, as a simple review of the raw data shows: if we manually take the average of the added facet for each of the four rating scale categories, the order is clear.
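
For readers who want to reproduce the manual check described above, here is a minimal Python sketch. It assumes the raw data can be reduced to (rating category, added-facet value) pairs; the function names and the sample records are hypothetical and made up purely for illustration, not taken from our exams.

```python
from collections import defaultdict


def mean_by_category(records):
    """Return {rating category: mean of the added-facet value}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rating, value in records:
        totals[rating] += value
        counts[rating] += 1
    return {cat: totals[cat] / counts[cat] for cat in sorted(counts)}


def is_strictly_increasing(means):
    """True if the category means rise with every step up the scale."""
    ordered = [means[cat] for cat in sorted(means)]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


# Made-up (rating, added-facet value) pairs, for illustration only.
records = [
    (1, 2.0), (1, 3.0),
    (2, 4.0), (2, 5.0),
    (3, 6.0), (3, 7.0),
    (4, 8.0), (4, 9.0),
]

means = mean_by_category(records)
print(means)
print(is_strictly_increasing(means))
```

If the raw means are ordered in this way while the MFRM output for the same facet looks scattered, that would be consistent with the discrepancy described above. The sketch only summarizes raw data and does not fit any Rasch model.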

Any suggestions as to why this is happening and more importantly, what can be done?  


Gregory E. Stone, Ph.D., M.A.

Assistant Professor of Research and Measurement
The Judith Herb College of Education
The University of Toledo, Mailstop #914
Toledo, OH 43606   419-530-7224

Editorial Board, Journal of Applied Measurement     www.jampress.org

Board of Directors, American Board for Certification of Teacher Excellence     www.abcte.org

For information about the Research and Measurement Programs at The University of Toledo and careers in psychometrics, statistics and evaluation, email gregory.stone at utoledo.edu.
