[Rasch] MFRM and Disordered Scales
Lang, William Steve
WSLang at tempest.coedu.usf.edu
Thu Feb 21 22:48:50 EST 2008
Judy wrote this to me: "I think the best solution is to screen the raters using a 'humanitarianism' scale. If they are humanistic, don't let them rate. If that is not possible, use three points." (Judy R. Wilkerson, Ph.D., Associate Professor, Assessment and Research, College of Education, Florida Gulf Coast University)
I agree. We've done a lot of performance/rubric scoring and have noticed a subgroup of "observers" who see almost every judgment as dichotomous: perfect or unacceptable. We've altered rubrics and run training (which helped), but a subset of judges remains a problem. We've also had some success with allowing multiple "tries", telling judges up front that the first try may not be successful, and then calibrating the first observation. We reported on some of that in JAM.
Recently, on some high-inference observations, we've scaled the middle category of the 4 or 5 points on the rubric as the expected target, and that seems to help. We've used traditional taxonomies (like Bloom's/Krathwohl's) and examples to help judges frame their responses, with the expectation that there is a clear difference between taxonomic steps. We're analyzing pilot data now and will report at IOMW/AERA. It seems to be working, in that we're getting better-ordered categories, but we still have some editing and revising to do.
We also decided that changing the category labels didn't help.
We've almost concluded that some people are "flawed judges" if they can't help but engage in "humanistic rater error", but it's hard to fire the clinical supervisor of the health program or director of teacher education! We're not sure if you are describing the same issue, but that may be the case.
University of South Florida St. Petersburg
From: rasch-bounces at acer.edu.au on behalf of Stone, Gregory
Sent: Sun 2/17/2008 12:04 PM
To: Listserve, Rasch
Subject: [Rasch] MFRM and Disordered Scales
Any help on the following would be greatly appreciated.
We've been delivering oral examinations for quite some time now and assessing performance using MFRM with candidates, items (cases), and judges. Each judge rates each candidate on each case across five distinct qualities using a four-point rating scale that currently runs from "Superior to Basic to Problematic to Unacceptable" (4-3-2-1). Over the past several years we have changed the rating scale semantics multiple times, using information from discussions with the judges who use the scale to define the terms that are most reasonable for them, so that they can use them consistently. Unfortunately, we're still having a problem - an oddly consistent problem.
The rating scale in theory should step from greatest to least (4-3-2-1) on the rating form. The greatest to least arrangement was determined by the judges as being easier. We have tried least to greatest as well, with no difference. In any event, we are now consistently seeing a disordered rating scale running 3-4-2-1. If this were seen with one or two exams, it might simply be unique to those exams. However, this pattern is seen practically across the board.
The natural initial suggestion was that perhaps rating categories 3 and 4 should be collapsed, and indeed they are not as far apart in logits as we would like, but they are not identical either. However, collapsing the "upper" rating categories pretty much destroys the effectiveness of the exam. Judges ARE selecting all four rating categories, although, as anyone familiar with testing likely knows, the lower two categories are used much less frequently than the upper two. The disorder is so consistent across the test and across the judges that they appear to be reading the scale as such (3-4-2-1), and the statistical performance of items, candidates, and judges is exceptionally strong. It would not appear reasonable to increase the number of rating scale categories, as we are already having difficulties with only four.
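For readers who want to see what a disordered scale can look like numerically, here is a minimal sketch. The post does not say whether the disorder appears in the Andrich thresholds or in the category average measures; the sketch assumes the former purely for illustration. The threshold values, the function names, and the 0-3 category coding (standing in for ratings 1-4) are all hypothetical and are not taken from the exam data.

```python
import math

def category_probabilities(measure, thresholds):
    """Andrich rating scale model: probabilities of categories 0..len(thresholds).

    `measure` is the combined logit (e.g. candidate minus case minus judge);
    `thresholds` are the step thresholds between adjacent categories.
    """
    cumulative = [0.0]  # the bottom category has an empty sum
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (measure - tau))
    peak = max(cumulative)  # subtract the max for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=241):
    """Return the categories that are most probable somewhere on a measure grid."""
    modal = set()
    for i in range(steps):
        measure = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(measure, thresholds)
        modal.add(probs.index(max(probs)))
    return sorted(modal)

# Hypothetical thresholds, invented for illustration only.
ordered = [-2.0, 0.0, 2.0]
disordered = [-2.0, 1.0, 0.0]  # the top two thresholds are reversed

print(modal_categories(ordered))
print(modal_categories(disordered))
```

With the reversed pair of thresholds, the category between them is never the most probable one anywhere on the grid, even though it still has non-zero probability and can still be observed in the data.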
The biggest problem with this rating scale issue comes when additional facets are added. When facets beyond the basic three are included, they are dysfunctional and uninterpretable because of the disordered rating scale categories. In essence, any added facet that is also supposed to run from least to greatest, or vice versa, comes out completely scattered and random. That apparent randomness is not real, as a simple review of the raw data shows. Indeed, if we manually take the average of the added facet for each of the four rating scale categories, the order is clear.
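The manual check described above can be sketched as follows. This is only an illustration: the record layout (a rating paired with a numeric value for the added facet), the function names, and the data are all made up, and nothing here reproduces the actual exam data or any MFRM software output.

```python
from collections import defaultdict

def mean_by_category(records):
    """Average the added-facet value within each rating category.

    `records` is an iterable of (rating, facet_value) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rating, value in records:
        sums[rating] += value
        counts[rating] += 1
    return {rating: sums[rating] / counts[rating] for rating in sorted(sums)}

def is_strictly_increasing(means):
    """True if the category means rise with every step up the rating scale."""
    values = [means[rating] for rating in sorted(means)]
    return all(a < b for a, b in zip(values, values[1:]))

# Made-up (rating, facet_value) pairs, for illustration only.
records = [
    (1, 2.0), (1, 3.0),
    (2, 4.0), (2, 5.0),
    (3, 6.0), (3, 7.0),
    (4, 8.0), (4, 9.0), (4, 10.0),
]

means = mean_by_category(records)
print(means)
print(is_strictly_increasing(means))
```

If the raw means are ordered like this while the modelled facet looks scattered, that points to how the rating scale is being estimated rather than to the raw data.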
Any suggestions as to why this is happening and more importantly, what can be done?
Gregory E. Stone, Ph.D., M.A.
Assistant Professor of Research and Measurement
The Judith Herb College of Education
The University of Toledo, Mailstop #914
Toledo, OH 43606 419-530-7224
Editorial Board, Journal of Applied Measurement www.jampress.org
Board of Directors, American Board for Certification of Teacher Excellence www.abcte.org
For information about the Research and Measurement Programs at The University of Toledo and careers in psychometrics, statistics and evaluation, email gregory.stone at utoledo.edu.