[Rasch] PISA critique in TES

Paul Barrett paul at pbarrett.net
Sun Sep 28 16:00:35 EST 2014

Margaret, all very well explained, but of no consequence in the face of these two paragraphs from the TES story:

“But Kreiner responded with a new paper this summer that broke the 2006 reading questions down in the same groupings used by Pisa. It did not include the eight questions that the OECD admits were “dodgy” for some countries, but it still found huge variations in countries’ rankings depending on which groups of questions were used.

In addition to the UK and Denmark variations already mentioned, the different questions meant that Canada could have finished anywhere between second and 25th and Japan between eighth and 40th. It is, Kreiner says, more evidence that the Rasch model is not suitable for Pisa and that “the best we can say about Pisa rankings is that they are useless”.”

The issue which seems to remain unresolved is the ability of multiple imputation to recover accurate population estimates for a country’s student reading ability where presumably none of the students in a particular country have undertaken a test for reading ability.


Surely someone has done the hard yards on this, using PISA data - other than Kreiner? This is not about whether multiple imputation works or not - that is a done deal. What is of concern is whether Kreiner’s analyses of actual PISA data (and the subsequent country rankings) are fundamentally correct.


I don’t understand why the necessary evidence to show Kreiner is wrong has not been forthcoming from those responsible for the methodology within PISA; and I mean evidence of such clarity that it would have rendered the TES article as ‘not even worth writing’. 


Regards .. Paul


Chief Research Scientist



W: www.cognadev.com

W: www.pbarrett.net

E: paul at pbarrett.net

M: +64-(0)21-415625


From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Margaret Wu
Sent: Sunday, September 28, 2014 3:08 PM
To: rasch at acer.edu.au
Subject: Re: [Rasch] PISA critique in TES


I think there appears to be some misunderstanding about imputation. Imputation never creates new data. Of course we cannot invent data. Here is a simple example about imputation.

Suppose we have a data set with people’s height and weight measures, with some data records missing either height or weight measures. We carry out a regression with only complete data records (respondents with both height and weight measures). We obtain estimates of regression coefficients from this analysis (Analysis A). For the data records with missing responses, we can impute a value in the following way. Suppose person n has a weight of 65 kg, but his/her height measure is missing. We look up the regression model from Analysis A, and look at the distribution of heights of people with a weight of 65 kg. This (conditional) distribution represents the likely heights of people with a weight of 65 kg. We randomly draw an observation from this conditional distribution and produce an imputed height for person n.

If we now re-analyse the data with the imputed height included, we first observe that the regression estimates should not change, since we imputed from the regression model obtained in Analysis A. However, the standard errors for the regression coefficients will now be smaller than those from Analysis A, because our second analysis assumes the imputed values are actually observed (so we have more data than actually observed).

To make sure that we don’t have increased precision when imputed data are added, we make multiple imputations. For each imputed data set, we carry out a regression analysis. The results from these regressions will vary, because the imputed values are not the same each time, since we impute from a distribution. (Note that we do not use the mean of the conditional distribution as our imputed value. Instead we do a random draw.) The variations between the multiple regression runs will reflect the uncertainty introduced by the imputations. We then have a formula for combining the multiple regression runs to add the uncertainty back into our regression parameters, so that the data sets with imputed values will produce just the same estimates and standard errors as for Analysis A.
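The height/weight example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (fit_line, multiply_impute) are hypothetical, the combining formula is assumed to be Rubin's rules (total variance = within-imputation variance + (1 + 1/m) x between-imputation variance), and the regression coefficients are held fixed at their Analysis A values when drawing, whereas a fully proper multiple imputation would also draw the coefficients themselves.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x.

    Returns intercept, slope, residual SD and the squared standard
    error of the slope.
    """
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    resid_var = rss / (n - 2)
    return a, b, resid_var ** 0.5, resid_var / sxx


def multiply_impute(weights, heights, m=5, seed=0):
    """Impute missing heights (None) m times; pool the slope with Rubin's rules.

    Hypothetical helper for illustration; requires m >= 2.
    """
    rng = random.Random(seed)
    complete = [(w, h) for w, h in zip(weights, heights) if h is not None]
    # "Analysis A": regression on the complete records only.
    a, b, sd, _ = fit_line(*zip(*complete))

    slopes, variances = [], []
    for _ in range(m):
        # Random draw from the conditional distribution of height given
        # weight, not the conditional mean.
        filled = [h if h is not None else rng.gauss(a + b * w, sd)
                  for w, h in zip(weights, heights)]
        _, slope, _, var = fit_line(weights, filled)
        slopes.append(slope)
        variances.append(var)

    pooled = statistics.fmean(slopes)          # pooled point estimate
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(slopes)      # variation across imputations
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return pooled, total ** 0.5
```

Calling multiply_impute(weights, heights_with_gaps, m=5) on a list of weights and a list of heights containing None for the missing records returns the pooled slope and its standard error; the between-imputation term is what puts the uncertainty back.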


So you may ask why bother to do imputation if the results of the imputed data sets produce the same results as the complete data set. Sometimes we have many variables of interest. If we have lots of missing values among different variables, and we do list-wise deletion of records, we may throw away a lot of records. Using imputation, we can have complete data sets for carrying out many statistical analyses, using all the data we have collected, and at the same time take into account that some data are missing.


There has been a lot of literature on imputation. See Rubin, Little, Graham, Schafer,…


Using plausible values (PV) is one method of imputation. I should mention that the methodology of PV was not invented by PISA. The work of Bock, Mislevy, among others, in the 1980s has greatly contributed to large-scale assessment methodologies. I think one misconception of people who are not familiar with Bayesian IRT is that you need to estimate individuals well before you can measure the population parameters well. Actually, in Bayesian IRT (or the MML method), individual person abilities are not parameters in the model to be estimated. Bayesian IRT models overcome a lot of issues relating to forming population estimates from individual ability estimates which contain measurement errors. A lot of work has been done in showing how the (so-called) PISA model works well for large-scale surveys. Of course, for PISA, there are always issues with real-life applications of mathematical models, but these issues are not those raised in the articles recently mentioned such as in this thread of discussion. It appears that there are “suspicions”, but there is a lack of a full understanding of the actual models, so some explanations about the methodologies may help.
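As a rough illustration of what drawing plausible values means, here is a minimal Python sketch. It assumes a simple Rasch model with known item difficulties and a normal prior for ability, evaluated on a grid; the function names and parameters (draw_plausible_values, prior_mean, prior_sd) are hypothetical and this is not the operational PISA procedure, which conditions the prior on background and other-domain information.

```python
import math
import random


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def draw_plausible_values(responses, difficulties, prior_mean=0.0,
                          prior_sd=1.0, n_draws=5, seed=0):
    """Draw ability values from a student's posterior on a grid.

    responses: list of 0/1 item scores (may be empty).
    difficulties: item difficulties matching responses.
    prior_mean, prior_sd: assumed normal prior; in practice this would be
    a conditional distribution given other information on the student.
    """
    rng = random.Random(seed)
    # Grid spanning four prior SDs either side of the prior mean.
    grid = [prior_mean + prior_sd * (-4 + 8 * i / 400) for i in range(401)]
    weights = []
    for theta in grid:
        log_post = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for x, b in zip(responses, difficulties):
            p = rasch_prob(theta, b)
            log_post += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_post))
    # Random draws, not the posterior mean.
    return rng.choices(grid, weights=weights, k=n_draws)
```

With an empty responses list the likelihood drops out and the draws come from the prior alone, which is the situation discussed in this thread for students who answered no reading items: everything then depends on how well the conditioning information predicts reading ability.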




From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Paul Barrett
Sent: Sunday, 28 September 2014 7:46 AM
To: rasch at acer.edu.au
Subject: Re: [Rasch] PISA critique in TES


I came across this paragraph in an ‘explanatory’ article in:


What is this site?

This site is produced by the Winton programme for the public understanding of risk based in the Statistical Laboratory in the University of Cambridge. 


The article is at:



The paragraph which caught my interest was:

A simple Rasch model (PISA Technical Report, Chapter 9) is assumed, and five values for each student are generated at random from the 'posterior' distribution given the information available on that student. So for the half of students in 2006 who did not answer any reading questions, five 'plausible' reading scores are generated on the basis of their responses on other subjects.


Look at that last sentence, and the bit I’ve underlined. Ordinarily, as a scientist rather than a statistician, I’d burst out laughing at such an idiotic research design which ended up with this state of affairs as a purposeful ‘design’ feature.


But maybe the “imputation” prediction model really does work as claimed on such data? My laughing is foolish after all. I’m not interested in simple Monte Carlo expositions, but in what happens with real data, messy sampling, and items which don’t all fit the Rasch model.


And, indeed, empirical evidence using such PISA data appears to exist (Svend Kreiner), detailing how accurately the plausible value procedure estimates the population scores for students who answer no items at all on reading ability, from scores on other variables. The result seems to indicate that it is substantively inaccurate.


The validity or otherwise of such ‘plausible values’ claims is a matter for empirically determined, quantified predictive accuracy, where actual observational data are used: doing exactly what PISA does on, say, several groups of students who have undertaken several ‘same-item’ tests on two or more attributes, then re-estimating the population parameters for each group based upon half of that group not answering any questions for a particular attribute. This is not rocket science. I’m assuming it has been done and published, and replicated by independent research groups? (Anyone have a reference or two?)


If this is so, it is puzzling how Kreiner’s analyses could have revealed contrasting results.


Regards .. Paul




From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Adams, Ray
Sent: Saturday, September 27, 2014 2:08 PM
To: rasch
Subject: Re: [Rasch] PISA critique in TES




We have always shown things like range of possible ranks, standard errors and so on. We've also reported the effects of item selection and all data collected is publicly accessible for others to scrutinise.


Morrison dismisses all latent variable models, Kreiner says throw away everything that doesn't fit Rasch perfectly, and Goldstein says we throw away too much.


Oh, and the comments about plausible values are just statistical naïveté. Believe them and NAEP would have to be scrapped, as would any statistical methods that use Monte Carlo estimation, the theory of which was regarded as sound last time I looked.


I love this criticism; it shows PISA is important. Putting energy into criticising it is good, I just wish genuine problems were uncovered and addressed. Goldstein does best on that front. I too would love longitudinal components and finer-grained analyses of subsets of items.




On 27 Sep 2014, at 11:13 am, Mike Linacre <mike at winsteps.com> wrote:

Thanks, T.

That article, and the comments following it, suggest to me that PISA results should be reported as box-and-whisker plots, not rankings. Then every country could choose to be at the top of its own whisker ....

Or perhaps PISA already do this??

Mike Linacre

On 9/27/2014 10:18 AM, Bond, Trevor wrote:



