# [Rasch] DIF, Uni-dimensionality, and testlets

Steve Kramer
Thu Oct 2 22:17:31 EST 2014

```
Dear List-serve,
I participate in these conversations mainly as a student, trying to learn
from more expert Rasch modelers.  Dr. Kreiner's response, below, to my
earlier query raises two points I only partially understand, and I would
appreciate help.

First: I had always viewed DIF as a special case of uni-dimensionality
violation.  That is the way I have explained it to others.  Am I wrong? How

Second: I had been taught that testlets are a big violation of Rasch
modeling assumptions because they create local dependence.  In the email
below, Dr. Kreiner notes that PISA uses testlets.  So: when are testlets
"not too bad", i.e., when is Rasch a close-enough approximation of reality
that it can be used despite the presence of testlets? Is there a way to tell
or a rule of thumb to use? (Honestly, I'd prefer the rule-of-thumb because
I'm often too pressed for time to do an extra set of complex analyses.) This
is a practical question for me, because I have to amend some tests that
both want to use a Rasch model and use testlets.

Steve Kramer
The 21st Century Partnership for STEM Education

From: Svend Kreiner
Sent: Monday, September 29, 2014 5:04 AM
Hi All,

So it is back to PISA. Again.

It is correct that I and Karl Bang Christensen show that the Rasch model
does not fit PISA’s reading items, but the problem as we see it does not
have anything to do with multidimensionality. The fundamental
problem is that PISA disregards DIF relative to Country and disregards the
fact that the testlet structure in PISA creates local dependence among
items. The DIF problem means that country means are confounded, and the
local dependence means that the estimates of the variance of the person
parameter are inflated, and both of these problems influence the ranks of
the countries. Whether or not this is important is another matter. We
therefore tried to assess the robustness of the ranks in several ways
including comparison of PISA’s ranks with the ranks from a model that takes
DIF and local dependence into account. Our reading of the results is that
the model errors do matter. I suggest that you read our paper and draw your
own conclusions. If you have other ideas about how to test the robustness we
will be very happy to hear them and promise to try them out and send you the
results.

Up to last summer, PISA has never admitted in print that there is DIF
relative to Country. The most interesting thing that came out of the TES
discussion last year is therefore the letter by Andreas Schleicher that you
can find at http://www.tes.co.uk/article.aspx?storycode=6345213 where he
admits that “it is nonsense” to assume “that there should be no variability
in performance on individual question between students in different
countries”. That is, that it is nonsense to assume that there is no DIF
relative to Country.
You can read my response at
http://www.tes.co.uk/article.aspx?storycode=6360708. Schleicher’s statement
is nothing less than remarkable, because it is the first time that PISA
admits that PISA has a DIF problem. It is also remarkable, because it
follows that he must regard PISA’s Rasch model  as a nonsense model. Let me
quote PISA’s technical reports to put his remarks in perspective. In these
reports, you can read statements like “The interpretation of a scale can be
severely biased by unstable item characteristics from one country to the
next” (page 86 in the report from 2006) and “consistency of item parameter
estimates across countries was of particular interest” (page 147 in the same
report). These and similar statements can be found in all the technical
reports, so we have to assume that they mean it. Schleicher’s remarks can
therefore only mean one of the following things. He either disagrees with
the way ACER has analysed the data, he has not read the technical reports,
or he does not understand what he has read. Whatever the reason, the point
he makes is important. To use his own words, it requires little
consideration to realise that the idea of no DIF relative to Country is
nonsense.
He then goes on to claim that “PISA has convincingly and conclusively shown
that the design of the tests and the scaling model used to score them lead
to robust measures of country performance that are not affected by the
composition of the item pool” and that “the results of these analyses are
documented in the PISA technical reports”. This, of course, is a complete
falsehood and typical of the way PISA argues when they have to defend
themselves. There is nothing whatsoever about assessment of robustness of
PISA ranks in the technical reports. Please check yourself, if you do not
believe me. The only thing I know of from their hands is a report on science
items in PISA 2006. Karl and I have commented on it in our Psychometrika
paper. The report (Adams R, Berezner A. & Jakubowski, M. Analysis of PISA
2006 preferred items ranking using the  percent-correct method) can be
It was published after our discussion back in March 2010. During this
discussion, Ray Adams did not refer to this or similar reports and I am sure
that no such reports existed at that time, because he could have closed our
discussion back then before it started to get serious.
Anyway, you do not have to believe me. Download the things yourself and see
what you can find. And remember to ask PISA about the precise references
every time they claim to have reported something.

And finally, we do appreciate the benefit of the doubt, but actually want
more than this. Please read our Psychometrika paper and let us know, if you
find some errors. We will be happy to acknowledge and correct them. It is
perhaps too much to ask, but it would also be nice if somebody would read
PISA's technical reports. Until now, I know of nobody besides me and Karl
who has done so, but it would be much more difficult for PISA to avoid the
discussion if people systematically checked their claims. And the only way
to do it is by reading their reports.

Best

Svend

From: Borsboom, Denny
Sent: Monday, September 29, 2014 9:48 AM
Hi all

isn't it somewhat strange that one should buy a commercial book to assess
the details on dimensionality assessment of one of the largest psychometric
attacks on the tax payer's wallet in the history of mankind?

In general, the lack of publicly available data and documentation on the
psychometric procedures followed is worrying. I'd suggest that if PISA wants
to gain back some street credibility in scientific circles, they simply have
to make public the data, analysis code, and decision procedures used to
arrive at their results. By this I don't mean the woolly stuff that's in the
tech reports I've seen, but simply datasets with analysis code that
reproduce the analyses used to assess e.g. dimensionality and DIF.

As long as PISA cannot offer such documentation, we should give Kreiner the
benefit of the doubt, simply because, in contrast to PISA's, his work is in
fact sufficiently detailed to pass the minimum scientific thresholds of
reproducibility and transparency.

Best
Denny

From: Adam Wyse
Sent: Monday, September 29, 2014 3:23 AM
Hi Steve,

In the article "Linking PISA competencies over three cycles - results from
Germany", Claus H. Carstensen (p. 204) explains how the uni-dimensionality
of PISA items is assessed. His description may be of value to the
conversation.

Carstensen, C. H. (2013). Linking PISA competencies over three cycles -
results from Germany. In M. Prenzel, M. Kobarg, K. Schöps & S. Rönnebeck
(Eds.), Research on PISA: research outcomes of the PISA research conference
2009 (pp. 199-214): Springer.

Respectfully,

On Mon, Sep 29, 2014 at 8:51 AM, Steve Kramer
<skramer1958 at verizon.net> wrote:
There is a lot of noise in the TES article, but one potentially legitimate
problem: Kreiner claims that the PISA questions violate the Rasch
uni-dimensionality assumption to such a degree that a Rasch model can't be
used, or at least can't be used with sufficient precision to rank countries
meaningfully.  He says he tested this by trying out legitimate subsets of
questions, and investigating whether the differing subsets predicted
rankings that were similar to one another, and they didn't.
Specifically, "Canada could have finished anywhere between second and 25th
and Japan between eighth and 40th."
Ray Adams responded by saying that PISA accounts for this problem, noting,
"We have always shown things like range of possible ranks, standard errors
and so on. We've also reported the effects of item selection."

In fact the 2012 country report on Japan looks nothing like "between 8th and
40th".  In 2012 Japan was ranked between 1st and 3rd in all subjects.
Kreiner may have been using data from a different year, but I can't imagine
that the RANGE of possible ranks shrank from 32 (40-8) down to 2 (3-1).  I
see only two possibilities:  either Kreiner's methodology was absolutely
lousy, or else the PISA "range of ranks" was computed ASSUMING
uni-dimensionality and did not adequately CHECK for uni-dimensionality.

Ray, do you have any technical articles you can reference explaining how
PISA either designed test items for uni-dimensionality or else checked that
uni-dimensionality was an adequate model for creating country ranks? Also,
are there any tech reports on how PISA determined each country's potential
range of ranks?   Finally, I'd appreciate any tech reports in which PISA
investigated how choosing differing subsets of items affected country
ranking, or else articles (not necessarily PISA) explaining why a procedure
like Kreiner's sub-setting is not a legitimate test.    I suspect that
Kreiner's claims are simply based on invalid methodology, but I'd like to be
able to verify that suspicion.

Steve Kramer
The 21st Century Partnership for STEM Education

Yes, Jason.

Are PISA, TIMSS and similar studies really intended to advance education
or to advance political agendas? If the answer is "advance political
agendas" then the most politically-acceptable statistical methodologies are
the ones to choose. If the answer is "advance education" then everyone,
including the politicians, should be working towards discovering and using
the most effective statistical methodologies.

According to http://www.rasch.org/software.htm
there are now seven Rasch-related R modules. They are free. Wonderful! But
there are 5,889 R modules. We will need more Rasch R modules before we make
a noticeable impact.

Mike L.

On 9/28/2014 21:21 PM, liasonas wrote:
> Mike, this is a great idea. But can the policy makers and the
> politicians allow us (the academics) to spoil their new toys ( the
> international studies)? The politicians use us (the academics) to
> produce data and reports which then the politicians use to carry out
> their little in-fightings and political debates.
> We cannot afford to anger them, because we need their money and
> support. Maybe we need to train them on how to use our data most
> appropriately and sensibly. PISA and TIMSS tables, for example, can be
> useful, but they are not the equivalent of The Bible.
> Having said all that, we need to thank Margaret and the other
> researchers for providing the methodological tools and packages (have
> you all had a glance at the TAM package on the R platform?). But we
> also need to thank Paul for sowing the seeds of doubt, because this is
> the only way for science to prosper.
> Jason

