[Rasch] Rasch scaling
Agustin Tristan
ici_kalt at yahoo.com
Sat Jun 28 00:33:27 EST 2014
Hi Steve: It is very common in many countries to propose a scale from 0-10, 0-20, 0-100 and so forth. In other cases a scale from 200-800 is used, as in the SAT and other tests.
It depends on the meaning of the variable involved in the project and on the traditional meaning of the values in the scale.
In the first case, say 0-100, the mean value of 50 is "bad", a passing value is normally supposed to be 60 or 70, good results are 80 or 90 and, as you say, 100 is great. In the second case, 500 is the score for the "mean expected competency", which is not necessarily "bad" because it is the mean of the variable; a "bad" result is probably 400 or 350.
If we design a test targeted at the mean expected knowledge or the mean expected competency and report it on a 0-100 scale, then we should expect 50, not 100, to be the score for the mean of the population (the same as getting 500 on the 200-800 scale), but the traditional interpretation makes it unacceptable to report those expected values of 50.
So what we have been doing for many years is to define a completely different scale, where the expected mean competency has an unusual value. For instance, in our certification tests we set the mean expected competency at 100 points, similar to what you say, but the scale goes from 70 to 130 (we do not cut above 100; the scale continues above and below the expected value). Of course, you must explain to people that 100 is not the maximum value but the mean expected value. A person can perform better than that and get 110, 125 and so forth; a person below the mean level will have 80 or 92 or even less, depending on the Rasch measure. You can see a sample of our report using this scale on page 156 of this reference:
Certification of Teaching Competencies at the High-School Level in Mexico: An Evaluation Model for a Social Challenge
The International Journal of Educational and Psychological Assessment. January 2012, Vol. 9(2)
The cutoff point for approval may be at 105 or 95 or another value, depending on the criterion of the judges (council or other authorities), compared to the curriculum or other reference.
People will take some time to get used to this kind of scale, but it avoids the misunderstandings of the traditional interpretations.
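A scale such as the 70-to-130 one described above is a linear transformation of the Rasch measure. As a minimal sketch, assuming (hypothetically, since the message does not state the limits in logits) that -3 and +3 logits around the mean expected competency map to 70 and 130, the slope and intercept follow from the two anchor points. The function names are illustrative only.

```python
def linear_scale(logit_lo, score_lo, logit_hi, score_hi):
    """Return (slope, intercept) of the line through two (logit, score) anchors."""
    slope = (score_hi - score_lo) / (logit_hi - logit_lo)
    intercept = score_lo - slope * logit_lo
    return slope, intercept


def to_scale(measure, slope, intercept):
    """Convert a Rasch measure in logits to a reported scale score."""
    return slope * measure + intercept


# Assumed anchors (hypothetical): -3 logits -> 70 points, +3 logits -> 130 points.
slope, intercept = linear_scale(-3.0, 70.0, 3.0, 130.0)

# The mean expected competency (0 logits here) lands on 100, and the scale
# is not cut at 100: measures above the expectation give scores above 100.
for measure in (-2.0, 0.0, 1.0, 2.5):
    print(measure, to_scale(measure, slope, intercept))
```

With these assumed anchors one logit is worth 10 scale points; other limits would simply give a different slope.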
In addition we have used a scale that may be adjusted for different grades, going from 70 to 130 for first grade, 85 to 145 for second grade and so forth. The values of the limits can be defined using Winsteps or another transformation. To build this kind of "longitudinal scale" the test must have some anchor items shared among grades to give continuity to the scale. In this case, let's imagine that a student from 8th grade should have 210 points as the mean expected competency. What happens if this student gets 153 points? We can say that the performance is equivalent to that of a 3rd grade student. Conversely, if a 3rd grade student gets 200 points, we can identify a performance above the expected value for 7th grade. These points depend on the measure obtained using the Rasch model.
Hope this helps.
Regards
Agustin
INSTITUTO DE EVALUACION E INGENIERIA AVANZADA.
Ave. Cordillera Occidental No. 635
Colonia Lomas 4ª San Luis Potosí, San Luis Potosí.
C.P. 78216 MEXICO
(52) (444) 8 25 50 76 / (52) (444) 8 25 50 77 / (52) (444) 8 25 50 78
Página Web (en español): http://www.ieia.com.mx
On Thursday, June 26, 2014 5:48 PM, "Stephanou, Andrew" <Andrew.Stephanou at acer.edu.au> wrote:
“… and easy to interpret”? Really?
All of these statistical transformations that are based either on sample means and standard deviations, or some perception of scores, are sample dependent and the number of scale score points that correspond to one logit can be anything, making it difficult to make sense of growth expressed in scale scores.
I always use a linear transformation where one logit is equal to ten scale score points and I set the mean item difficulty so that I don’t get negative values anywhere. Usually I set it to 120. This transformation is not sample dependent; it allows me to easily interpret growth in terms of logits and it allows me to do all the things you are trying to do with complicated transformations. Whatever we do, we have to give instructions on what the scale scores we report mean and relate them to a qualitative description of levels on the scale. I find it unacceptable that people don’t know, as it often happens, how many points of their scale scores make one logit with their sample dependent and complicated statistical transformations.
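The transformation described above can be sketched in a few lines. This is only an illustration, assuming the item difficulties are centred at 0 logits before rescaling; the names `scale_score` and `growth_in_logits` are hypothetical and not taken from any package.

```python
POINTS_PER_LOGIT = 10.0   # one logit equals ten scale score points
ITEM_MEAN_SCORE = 120.0   # scale score assigned to the mean item difficulty


def scale_score(measure_logits, item_mean_logits=0.0):
    """Linear transformation of a Rasch measure; no sample statistics involved."""
    return ITEM_MEAN_SCORE + POINTS_PER_LOGIT * (measure_logits - item_mean_logits)


def growth_in_logits(score_before, score_after):
    """Because the slope is fixed, growth in points converts straight back to logits."""
    return (score_after - score_before) / POINTS_PER_LOGIT


before = scale_score(-1.5)   # a person 1.5 logits below the mean item difficulty
after = scale_score(0.5)     # the same person later, 0.5 logits above it
print(before, after, growth_in_logits(before, after))
```

Since neither constant depends on the person sample, the same 20-point gain always means 2 logits, whichever group was tested.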
Andrew Stephanou
From:rasch-bounces On Behalf Of Steve Kramer
Sent: Friday, 27 June 2014 6:18 AM
To: rasch
Subject: Re: [Rasch] Rasch scaling
Yes: I see that we need to spend more time on test blueprint and construct validity to make sure “what is included in the exam” really tracks to what we value. But my question was a bit more technical:
As a scale for reporting, have others used a scale where 100=goal theta; scale score is determined by 100-10*#logits below goal theta? I’m proposing this to be a better form of scaling than setting a particular mean or sd, since scale scores are then meaningful and easy to interpret, while still closely approximating people’s comfort level of “60 is failing, 80 is kind-of ok, 100 is great”.
From:rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Agustin Tristan
Sent: Thursday, June 26, 2014 3:12 PM
To: rasch at acer.edu.au
Cc: John Baker
Subject: Re: [Rasch] Rasch scaling
Hello Steve:
In Winsteps you can define the value to scale the logit and the translation for the center of the scale. With this you can avoid values above 100 or below 0 if you need. Other manipulations are possible but they depend on your own considerations about the students' achievements and what is reasonable as a minimum of performance.
But when you say that 100 really does mean "You learned everything we asked of you", it is not completely true. In fact, if a person answers 100% of the items (or 98% or whatever you are considering), you can say "You responded correctly to what is included in the exam", because a course includes many more things that are not contained in the test. If you want to say that this "measure" corresponds to what the student should learn, then you must demonstrate that the test matches a well defined domain of knowledge and tasks to be performed by the student (you can use the test blueprint) and that your test has a satisfactory number of items that are a representative sample of those topics (this is also part of the test blueprint). The cutoff point to pass the exam is another thing to define and there are several approaches for that using the Rasch model, and whatever you decide could be reasonable or not, but at this moment you must give a better foundation to your design and the scale.
Hope this helps.
Regards
Agustin
Web page (in English): http://www.ieesa-kalt.com/English/Frames_sp_pro.html
On Thursday, June 26, 2014 12:22 PM, Steve Kramer <skramer1958 at verizon.net> wrote:
I would like the group's opinion on a method I just finished using to translate Rasch "Thetas" to more easily interpretable scale scores. I think the method is exciting and perhaps useful to others. At first, I expected it to be a common approach--but I was using WinSteps, and I didn't find any rescaling procedures much like it in the WinSteps manual. So maybe I have an approach that, while kind-of obvious and quite useful, is nonetheless new.
My group is working with an Egyptian STEM school. We needed to administer and report scores on six "concept" tests, one each on Biology, Chemistry, Geology, Physics, "Pure" math, and Applied Math/Mechanics. On some of the tests, the items were such that good students at the school "ought" to be able to answer every question correctly. On other tests, the items were such that maybe good students "ought" to be able to answer 35 of 40 questions correctly, with an additional 5 extra-challenging items that only top students might answer. We originally wanted to use Rasch for two main reasons: so that we could use item diagnostics on a pilot to identify/fix problematic items and so that next year when we change the test we can use "anchor items" to put next year's tests on the same scale as this year's. But, once we output the Rasch theta scores, we decided to use those, instead of number-correct, as the basis for the final scores we reported out.
Here is what we did and why.
The School and Egyptian Ministry folks grade students on a percentile scale out of 100, with 60% as failing. On some of these tests, the raw percentiles showed "too many" kids failing, and we were asked to consider rescaling all tests to a mean of 80, sd of 10, with our choice of using number-correct or Rasch Theta as the basis for the scaling. BUT the Tests of Concepts were intended to be criterion-referenced (do the kids know what is expected of them?) not norm-referenced (how does everyone rank?) and the mean-80 sd-10 throws out any meaningful criteria.
So here is what we did instead.
Step 1: For each test, set a "Goal". This goal is the number-correct we are really trying to get all the STEM school kids to achieve if the curriculum we are using works as intended. On some tests, this would be all items correct (e.g., 41 out of 41). On other tests it might be the 35 non-super-challenging questions (e.g. 35 out of 40).
Step 2: Give students who achieved one less than the ideal score a "scale score" of 100.
We chose "one less than" the ideal score because a "perfect score" can't be assigned a Theta without extra assumptions beyond the normal Rasch model. Thus, a score of 100 would be given to a student who correctly answered 40 out of 41 (in the case where the ideal is a perfect score of 41) or 34 out of 40 (in the case where the ideal is a score of 35 out of 40). The latter was a judgment call, since a good Theta could often be computed for 35 out of 40--but in practice the "ideal" scores tended to be for the very top students, with a fairly wide sd of the Rasch Theta, so we decided to give kids the benefit of the doubt and set 100-score at one less than the "ideal".
Step 3: For each student, compute the scale score as follows. First, compute "Theta of this student minus Theta of students who achieved scale score of 100", then multiply by ten, and add to 100. This is the student's scale score.
Step 4: Round to the nearest whole number, and truncate scores above 100 to 100. (I would have liked to allow scores above 100 to indicate students who showed skill above the "goal" but our clients wouldn't have it. The option of allowing scores above 100 next year, when we put next year's test on the same scale as this year's, is still on the table.)
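Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the code actually used: `theta_at_100` stands for the Theta of the raw score one less than the Goal (read from the Rasch score table), the function name is hypothetical, and ties at .5 are rounded up because the message does not say how they were handled.

```python
import math


def scale_score_from_goal(theta, theta_at_100, points_per_logit=10, cap=100):
    """Scale score = 100 + 10 * (student Theta - Theta at scale score 100),
    rounded to the nearest whole number and truncated at the cap."""
    raw = 100 + points_per_logit * (theta - theta_at_100)
    rounded = math.floor(raw + 0.5)   # round half up (assumed tie rule)
    return min(cap, rounded)


# Hypothetical Thetas, with the 100-point reference at 3.0 logits.
theta_at_100 = 3.0
for theta in (4.2, 3.0, 1.2, -1.0):
    print(theta, scale_score_from_goal(theta, theta_at_100))
```

Passing `cap=float("inf")` would keep scores above 100, which is the option mentioned for next year.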
The great thing about this approach is that it creates a scale where 100 really does mean "You learned everything we asked of you", 60 really does mean "You failed to learn even close to what we wanted" and the difference between 60 and 100 means the same across all tests, instead of depending on the vagaries of particular test construction or the standard deviation of a particular group of students. Here is how we described the scale to our clients:
"Across all tests, we strongly recommend using the recommended Scale Score. Even in cases where raw scores were acceptable, the Scale Score is a more reliable score to interpret across the tests. The Scale Score was created using the estimated ability of each student produced out of the Rasch analysis. Called a Theta, Rasch analysis uses the difficulty of each test item to produce a more accurate “ability level” for each student than a simple raw score. We then scaled the Thetas to a Scale Score, where 100 was set to be the Highest Observable Scale Score, or HOSS. A score of 100 is earned by students who scored at the top of the group, showing excellent conceptual understanding of the concepts covered by an exam. The Scale Score is a simple linear transformation of the Thetas. Each scale score point is equivalent to one tenth of a Rasch “logit”. Students who score below a 60 on this scale are more than 4 logits below students who score 100, which reduces the odds of their correctly answering any given question of the test by a factor of just over 50. (Technically, the factor is e^4.) This means for example that a student with a score of 60 would have only a 47% chance of correctly answering a question that a person with a score of 100 has a 98% chance of answering correctly, and only a 2% chance of correctly answering a question that a student with a scale score of 100 has a 50% chance of answering correctly. We believe this makes a 60 a justifiable cut score for those who fail."
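The figures quoted in that description can be checked from the dichotomous Rasch model, in which the log-odds of a correct answer is Theta minus item difficulty. The short sketch below uses hypothetical helper names and only the standard library.

```python
import math


def logit(p):
    """Log-odds corresponding to a probability of success."""
    return math.log(p / (1.0 - p))


def rasch_probability(log_odds):
    """Probability of success for a given value of Theta minus item difficulty."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def p_after_shift(p_reference, logits_below):
    """Success probability for a student `logits_below` logits under a
    reference student who succeeds with probability `p_reference`."""
    return rasch_probability(logit(p_reference) - logits_below)


# A scale score of 60 is 40 points, that is 4 logits, below 100.
odds_factor = math.exp(4)
print(round(odds_factor, 1))              # factor by which the odds shrink
print(round(p_after_shift(0.98, 4), 2))   # item the 100-scorer answers 98% of the time
print(round(p_after_shift(0.50, 4), 2))   # item the 100-scorer answers 50% of the time
```

The three printed values correspond to "just over 50", 47% and 2% in the text.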
Have others used this approach?
Is it as good as I think it is?
Steve Kramer
The 21st Century Partnership for STEM Education
________________________________________
Rasch mailing list
email: Rasch at acer.edu.au
web: https://mailinglist.acer.edu.au/mailman/options/rasch/ici_kalt%40yahoo.com