# [Rasch] Rasch scaling

Steve Kramer skramer1958 at verizon.net
Fri Jun 27 03:21:55 EST 2014

```
I would like the group's opinion on a method I just finished using to
translate Rasch "Thetas" to more easily interpretable scale scores.  I think
the method is exciting and perhaps useful to others. At first, I expected it
to be a common approach--but I was using WinSteps, and I didn't find any
rescaling procedures much like it in the WinSteps manual.  So maybe I have
an approach that, while kind-of obvious and quite useful, is nonetheless
new.

My group is working with an Egyptian STEM school.  We needed to administer
and report scores on six "concept" tests, one each on Biology, Chemistry,
Geology, Physics, "Pure" math, and Applied Math/Mechanics. On some of the
tests, the items were such that good students at the school "ought" to be
able to answer every question correctly.  On other tests, the items were
such that maybe good students "ought" to be able to answer 35 of 40
questions correctly, with an additional 5 extra-challenging items that only
top students might answer.  We originally wanted to use Rasch for two main
reasons: so that we could use item diagnostics on a pilot to identify/fix
problematic items and so that next year when we change the test we can use
"anchor items" to put next year's tests on the same scale as this year's.
But, once we output the Rasch theta scores, we decided to use those, instead
of number-correct, as the basis for the final scores we reported out.  Here
is what we did and why.

The School and Egyptian Ministry folks grade students on a percentage scale
out of 100, with 60% as failing.  On some of these tests, the raw
percentages showed "too many" kids failing, and we were asked to consider
rescaling all tests to a mean of 80, sd of 10, with our choice of using
number-correct or Rasch Theta as the basis for the scaling.  BUT the Tests
of Concepts were intended to be criterion-referenced (do the kids know what
is expected of them?) not norm-referenced (how does everyone rank?) and the
mean-80 sd-10 throws out any meaningful criteria.

So here is what we did instead.

Step 1:  For each test, set a "Goal".  This goal is the number-correct we
are really trying to get all the STEM school kids to achieve if the
curriculum we are using works as intended.  On some tests, this would be
all items correct (e.g., 41 out of 41).  On other tests it might be the 35
non-super-challenging questions (e.g. 35 out of 40).

Step 2:  Give students who achieved one less than the ideal score a "scale
score" of 100.

We chose "one less than" the ideal score because a "perfect score" can't be
assigned a Theta without extra assumptions beyond the normal Rasch model.
Thus, a score of 100 would be given to a student who correctly answered 40
out of 41 (in the case where the ideal is a perfect score of 41) or 34 out
of 40 (in the case where the ideal is a score of 35 out of 40).  The latter
was a judgment call, since a good Theta could often be computed for 35 out
of 40--but in practice the "ideal" scores tended to be for the very top
students, with a fairly wide sd of the Rasch Theta, so we decided to give
kids the benefit of the doubt and set 100-score at one less than the
"ideal".

Step 3: For each student, compute the scale score as follows.  First,
compute "Theta of this student minus Theta of students who achieved a scale
score of 100", then multiply by ten and add 100.  This is the student's
scale score.

Step 4: Round to the nearest whole number, and truncate scores above 100 to
100.  (I would have liked to allow scores above 100 to indicate students who
showed skill above the "goal" but our clients wouldn't have it.  The option
of allowing scores above 100 next year, when we put next year's test on the
same scale as this year's, is still on the table.)

On this scale, 100 really does mean "You learned everything we asked of
you", 60 really does
mean "You failed to learn even close to what we wanted" and the difference
between 60 and 100 means the same across all tests, instead of depending on
the vagaries of particular test construction or the standard deviation of a
particular group of students.  Here is how we described the scale to our
clients:

"Across all tests, we strongly recommend using the Scale Score.
Even in cases where raw scores were acceptable, the Scale Score is a more
reliable score to interpret across the tests. The Scale Score was created
using the estimated ability of each student produced by the Rasch
analysis.  This estimate is called a Theta.  Rasch analysis uses the
difficulty of each test item to produce a more accurate "ability level" for
each student than a simple raw score.  We then scaled the Thetas to a Scale
Score, where 100 was set to be the Highest Observable Scale Score, or HOSS.
A score of 100 is
earned by students who scored at the top of the group, showing excellent
conceptual understanding of the concepts covered by an exam.  The Scale
Score is a simple linear transformation of the Thetas.  Each scale score
point is equivalent to one tenth of a Rasch "logit".  Students who score
below a 60 on this scale are more than 4 logits below students who score
100, which reduces the odds of their correctly answering any given question
of the test by a factor of just over 50. (Technically, the factor is e^4.)
This means for example that a student with a score of 60 would have only a
47% chance of correctly answering a question that a person with a score of
100 has a 98% chance of answering correctly, and only a 2% chance of
correctly answering a question that a student with a scale score of 100 has
a 50% chance of answering correctly. We believe this makes a 60 a
justifiable cut score for those who fail."

Have others used this approach?

Is it as good as I think it is?

Steve Kramer

The 21st Century Partnership for STEM Education

```
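
The rescaling in Steps 3 and 4, and the probability figures quoted in the client description, can be checked with a short sketch. This is only an illustration of the arithmetic described in the post, not the author's actual procedure or anything produced by WinSteps. The function names, the example Theta values, and the choice of half-up rounding are assumptions made for the example.

```python
import math


def scale_score(theta, theta_at_100, points_per_logit=10.0, cap=100):
    """Steps 3 and 4: linear rescale of a Rasch Theta, rounded and capped.

    theta_at_100 is the Theta of students one item short of the goal score.
    """
    raw = 100.0 + points_per_logit * (theta - theta_at_100)
    # The post only says "round to the nearest whole number";
    # rounding halves upward is an assumption of this sketch.
    rounded = math.floor(raw + 0.5)
    return min(rounded, cap)


def shifted_probability(p_reference, logit_gap):
    """Chance of success for a student `logit_gap` logits below a reference
    student whose chance of success on the same item is `p_reference`.

    Under the Rasch model, a gap of g logits divides the odds by e**g.
    """
    odds = p_reference / (1.0 - p_reference)
    lower_odds = odds / math.exp(logit_gap)
    return lower_odds / (1.0 + lower_odds)


if __name__ == "__main__":
    # Hypothetical Thetas: the anchor for a scale score of 100 is 3.0 logits.
    print(scale_score(1.0, 3.0))   # 80
    print(scale_score(3.5, 3.0))   # 100 (105 before the cap)
    print(scale_score(-1.0, 3.0))  # 60, i.e. 4 logits below the anchor

    # A scale score of 60 is 4 logits below 100.
    print(f"{math.exp(4):.1f}")                   # 54.6
    print(f"{shifted_probability(0.98, 4):.0%}")  # 47%
    print(f"{shifted_probability(0.50, 4):.0%}")  # 2%
```

With ten scale points per logit, the 40-point gap between 60 and 100 is 4 logits, so the odds ratio is e^4 (about 54.6). That reproduces the "just over 50", 47%, and 2% figures in the client description.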