FW: [Rasch] MFRM and Disordered Scales

Wilkerson, Dr. Judy jwilkers at fgcu.edu
Tue Feb 26 13:25:37 EST 2008


A bit delayed, but here is my response.  I hope it helps.
 
Judy
 
Judy R. Wilkerson, Ph.D.
Associate Professor, Assessment and Research
College of Education
Florida Gulf Coast University

"Florida has a very broad public records law. As a result, any written communication created or received by Florida Gulf Coast University employees is subject to disclosure to the public and the media, upon request, unless otherwise exempt. Under Florida law, e-mail addresses are public records. If you do not want your email address released in response to a public records request, do not send electronic mail to this entity. Instead, contact this office by phone or in writing." 

________________________________

From: Stephanou, Andrew [mailto:Stephanou at acer.edu.au]
Sent: Sun 2/24/2008 4:30 PM
To: liasonas at lycos.com; WSLang at tempest.coedu.usf.edu
Cc: gstone at UTNet.UToledo.Edu; Wilkerson, Dr. Judy; Stephanou, Andrew
Subject: FW: [Rasch] MFRM and Disordered Scales


Not distributed to the Rasch list because it originated from a non-subscribed email address

________________________________

From: Wilkerson, Dr. Judy [mailto:jwilkers at fgcu.edu] 
Sent: Sunday, 24 February 2008 3:48 PM
To: iasonas lambrianou; Lang, William Steve
Cc: Stone, Gregory; rasch
Subject: RE: [Rasch] MFRM and Disordered Scales


Hi all.  Since my friend and colleague Steve has dragged me into this conversation with a quick quote not meant for worldwide distribution, I suspect I had best explain and then offer my advice on how to resolve this problem.  Know at the start that Steve is the "numbers" person in our duo and I am the "words" person.  While we both do a bit of each, we both have our preferred responsibility and expertise.
 
My "humanitarianism" comment reflects my location in a school of education where there is a general pattern of high ratings. We work in a profession where the grading scale is roughly 90% A's, 9% B's, and 1% C's, with an almost untraceable number of grades below that. While I exaggerate a bit for effect, the message is that grades are inflated. (Maybe you do better in Toledo.) I know one dean (with a PhD in measurement) who tried to circumvent the rating problem in a four-point scale by requiring faculty to justify top scores, but that didn't work either. If that is the case where you are, Greg, you are unlikely to find the words that fit the rubric, and you will probably continue to have 3/4 ordering problems. Lose the 4s, and you may then have an equivalent 2/3 problem -- even if I did suggest it. A simple single-word descriptor, no matter how carefully selected, may remain problematic. Faculty love little more than arguing ad nauseam over a word. So here is my advice.
 
First and foremost, determine what you need and what you can use for decision-making. If you need a four-point scale for a specific reason, then it is important to build it. If not, you could reduce it to three points based on that utility rationale.
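
To make the "reduce it to three points" option concrete, here is a minimal sketch in Python. It is purely illustrative: the category codes and the choice to merge the top two categories are assumptions, not a recommendation for your particular data.

# Collapse a 4-point scale to 3 points by recoding the raw ratings.
# Here the top two categories are merged (4 -> 3); merging a different
# pair may make more sense depending on how your categories behave.
recode = {1: 1, 2: 2, 3: 3, 4: 3}

ratings = [4, 3, 3, 2, 4, 1, 3]            # hypothetical observed ratings
collapsed = [recode[r] for r in ratings]   # -> [3, 3, 3, 2, 3, 1, 3]
print(collapsed)

You would then rerun the analysis on the collapsed ratings and compare the category structure before deciding which version to keep.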
 
Second, regardless of the number of categories, I have found that the better they are defined, the better the data and the better the statistical result. Definition, though, is not a simple matter. It is one thing to name categories; it is another to describe them in a way that makes consistent use and the expected ordering likely.
 
When Steve and I write or present to a non-measurement crowd, we help them best when we encourage them to visualize the points on the scale. Corny as it may seem, we tell them to close their eyes and visualize the teacher performing the skill at various levels. If you (and they) can't see the point on the scale clearly, I suspect you will not resolve this problem. It is often hard to differentiate clearly among more than two or three scale points. That is why we typically go for three. With two, you get little improvement data (important in our world) because you get virtually no variability. With three, variability improves somewhat. With four points, you get into messes like the one you describe.
 
I prefaced this note with the comment that I am the "words" person in our duo. And that is where I will end. My advice is to use the data you have accumulated from the exams as a starting point for checking the feasibility of a four-point scale and for sorting the attributes of each category into distinct, clearly written descriptions (a.k.a. scoring guidelines or rubrics). If you can come up with a taxonomy to frame your work, that, too, helps. In our current work, which we will present next month at IOMW and AERA (Rasch SIG), we will describe a battery of dispositions assessments we are developing in teacher education. Affect is tough, but we have found that if we use a standards-based definition of the construct (the ten INTASC Principles) and then visualize and describe the teacher along the five levels of the Bloom/Krathwohl affective taxonomy, we can score with consistency and make meaningful, useful decisions. Even we were surprised at the ordering of the five points when we analyzed the results.
 
The rubrics are very long and very detailed now (roughly 100-150 words each).  They have just been refined to reflect the results of our field test.  But when all is said and done, the better we can describe what we are looking for -- and that means showing the progression from one level to the next -- the more likely we are to have judges who will rate consistently.  These rubrics are a royal pain in the butt to write, but in the end, they may save you lots and lots of aggravation -- reducing the kind of frustration you are experiencing now as well as reducing the amount of time and energy you need to expend in training.  Steve likes to tease me by calling me the "rubric queen" because I write rubrics faster than the eye can see.  But this six-point scale (five levels of the Krathwohl Taxonomy plus what we call "unaware") is taking me about two hours per INTASC Principle.  As the number of points on the scale increases, I feel like the work increases exponentially.
 
In summary, there's no substitute in my book for spending whatever time it takes to fully describe what you want the raters to "see" (or in your case hear) at each scale point, and the progression from one point to the next needs to be equal for each adjacent pair. The interval scale exists in both words and numbers.
 
If this sounds like something you would like to explore, let me know, and I will send you a sample or we can whip it out in NYC.  
 
Best of luck,
Judy
 
Judy R. Wilkerson, Ph.D.
Associate Professor, Assessment and Research
College of Education
Florida Gulf Coast University

________________________________

From: iasonas lambrianou [mailto:liasonas at lycos.com]
Sent: Thu 2/21/2008 7:43 AM
To: Lang, William Steve
Cc: Stone, Gregory; Listserve, Rasch; Wilkerson, Dr. Judy
Subject: RE: [Rasch] MFRM and Disordered Scales


Dear all, 
I have been working for Testing Services for many years, and my understanding is that it is not possible to convince/train/encourage/force everybody to rate properly. I always have raters with inferior performance/accuracy compared to the others. However, if there is a disordered-scale problem, maybe the problem is not the raters but the way the rating scale is applied. Using only three categories may not be meaningful, since you will basically lose 25% of the information. I suggest more training: try common rating sessions where raters rate the same people and then spend enough time discussing among themselves why they rated the way they did. If those not rating as they should cannot change their minds and follow the 'rule', then consider adjusting their ratings statistically. I don't know if I understood your case well, but all of us working in the testing sector suffer from similar problems. 
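
Just to illustrate what a statistical adjustment could look like in the simplest possible case, here is a rough Python sketch. The rater names and ratings are made up, and a facets analysis would of course handle severity far more appropriately than this crude mean shift; it is only a toy illustration of the idea.

# Crude illustration: shift each rater's ratings so every rater ends up
# with the same mean, removing a constant leniency/severity difference.
ratings_by_rater = {
    "rater_a": [4, 4, 3, 4],   # lenient rater (hypothetical)
    "rater_b": [2, 3, 2, 3],   # severe rater (hypothetical)
}

all_ratings = [r for rs in ratings_by_rater.values() for r in rs]
grand_mean = sum(all_ratings) / len(all_ratings)

adjusted = {}
for rater, rs in ratings_by_rater.items():
    rater_mean = sum(rs) / len(rs)
    adjusted[rater] = [round(r - rater_mean + grand_mean, 2) for r in rs]

print(adjusted)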

Jason

---------[ Received Mail Content ]----------
Subject : RE: [Rasch] MFRM and Disordered Scales
Date : Thu, 21 Feb 2008 06:48:50 -0500
From : "Lang, William Steve" <WSLang at tempest.coedu.usf.edu>
To : "Stone, Gregory" <gstone at UTNet.UToledo.Edu>, "Listserve, Rasch" <rasch at acer.edu.au>
Cc : jwilkers at fgcu.edu

Greg,

Judy wrote this to me, "I think the best solution is to screen the raters using a 'humanitarianism' scale. If they are humanistic, don't let them rate. If that is not possible, use three points." (Judy R. Wilkerson, Ph.D., Associate Professor, Assessment and Research, College of Education, Florida Gulf Coast University).

I agree. We've done lots of performance/rubric scoring and noticed a subgroup of "observers" who see almost every judgment as dichotomous: perfect or unacceptable. We've altered rubrics and had training (which helped), but still see a subset of judges as a problem. We've also had some success with multiple "tries," setting the expectation for judges that the first try may not be successful, and then calibrating the first observation. We reported on some of that in JAM.

Recently, we've scaled the middle category of a 4- or 5-point rubric as the expected target on some high-inference observations, and that seems to help. We've used traditional taxonomies (like Bloom/Krathwohl) and examples to help judges frame the responses, with the expectation that there is a clear difference between taxonomic steps. We're analyzing pilot data now and will report at IOMW/AERA; it seems to be working, as we're getting better-ordered categories, but we still have some editing and revising to do.

We also decided that changing the category labels didn't help.

We've almost concluded that some people are "flawed judges" if they can't help but engage in "humanistic rater error", but it's hard to fire the clinical supervisor of the health program or the director of teacher education! We're not sure if you are describing the same issue, but that may be the case.

Steve Lang
University of South Florida St. Petersburg

-----Original Message-----
From: rasch-bounces at acer.edu.au on behalf of Stone, Gregory
Sent: Sun 2/17/2008 12:04 PM
To: Listserve, Rasch
Subject: [Rasch] MFRM and Disordered Scales

Any help on the following would be greatly appreciated.

We've been delivering oral examinations for quite some time now and assessing performance using MFRM with candidates, items (cases), and judges. Each judge rates each candidate on each case across five distinct qualities using a four-point rating scale that currently runs from "Superior to Basic to Problematic to Unacceptable" (4-3-2-1). Over the past several years we have changed the rating scale semantics multiple times, using information obtained from discussions with the judges who use the rating scale, in order to define terms that are most reasonable for them, such that they would be able to use them consistently. Unfortunately, we're still having a problem - an oddly consistent problem.

The rating scale in theory should step from greatest to least (4-3-2-1) on the rating form. The greatest-to-least arrangement was determined by the judges as being easier. We have tried least to greatest as well, with no difference. In any event, we are now consistently seeing a disordered rating scale running 3-4-2-1. If this were seen with one or two exams, it might simply be unique to those exams. However, this pattern is seen practically across the board.

The natural initial suggestion was that perhaps rating categories 3 and 4 should be collapsed, and indeed they are not the desired logit difference apart, though they are not identical. However, collapsing the "upper" rating categories pretty much destroys the effectiveness of the exam. Judges ARE selecting all four rating categories, although as anyone familiar with testing likely knows, the lower two categories are used much less frequently than the upper two. The disorder is so consistent across the test and across the judges that they appear to be reading the scale that way (3-4-2-1), and the statistical performances of items, candidates and judges are exceptionally strong. It would not appear reasonable to increase the number of rating scale categories, as we are already having difficulties with only four.

The biggest problem with this rating scale issue comes when additional facets are added. When facets beyond the basic three are included, they are dysfunctional and uninterpretable because of the disordered rating scale categories. In essence, any added facet that is also supposed to run from least to greatest, or vice versa, is completely scattered and random. The apparent randomness is not real, as a simple review of the raw data indicates. Indeed, if we take the average of the added facet for each of the four rating scale categories manually, the order is clear.
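
That manual check can be scripted in a few lines. A minimal Python sketch follows; the category/measure pairs are hypothetical, not real exam data.

# Average a numeric facet value within each observed rating category and
# inspect whether the means increase with the category.
from collections import defaultdict

# (rating category, value of the added facet) -- hypothetical observations
obs = [(4, 2.1), (3, 1.8), (3, 1.5), (2, 0.4), (1, -0.9), (4, 2.4), (2, 0.1)]

sums = defaultdict(float)
counts = defaultdict(int)
for category, value in obs:
    sums[category] += value
    counts[category] += 1

for category in sorted(sums):
    print(category, round(sums[category] / counts[category], 2))
# If these means rise monotonically from category 1 to 4, the raw data are
# ordered even though the modelled category structure comes out 3-4-2-1.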
	
	
	
Any suggestions as to why this is happening and, more importantly, what can be done?

Cheers.

Gregory E. Stone, Ph.D., M.A.
Assistant Professor of Research and Measurement
The Judith Herb College of Education
The University of Toledo, Mailstop #914
Toledo, OH 43606 419-530-7224

Editorial Board, Journal of Applied Measurement www.jampress.org
Board of Directors, American Board for Certification of Teacher Excellence www.abcte.org

For information about the Research and Measurement Programs at The University of Toledo and careers in psychometrics, statistics and evaluation, email gregory.stone at utoledo.edu.
