[Rasch] Reverse Digit Span Item Scoring Question

Fidelman, Carolyn Carolyn.Fidelman at ed.gov
Sat Dec 15 02:43:58 EST 2012

Dr. Hess and others,

As always this group comes through for people like me. I am so grateful for all your ideas!!!

Let me clarify that the ECLS-K:2011, like its predecessors ECLS-K:1998 and ECLS-B, is a true longitudinal study. See http://nces.ed.gov/ecls/ for all the details. Basically, we start with a nationally representative sample of about 22K in Fall Kindergarten and, due to attrition, may have 13K by 5th grade. Various flavors of weights and replicates are created to correct for sampling, coverage, and nonresponse error, so that users can conduct complex survey analyses for their own purposes and still have confidence in the data's representativeness. The survey data are eventually made available in Public Use File (PUF) form and in Restricted Use File (RUF) form, which will include item-level data (available now for the first two studies and, at some point in the future, for the study started in 2011).

Several people have suggested just going with the simple dichotomous analysis. That makes sense, and I guess I would always try that first. Occam's razor. But we have done Yen's Q3 analysis on other test data and found format effects causing local item dependence (LID), sometimes with quite strong values (.5, etc.), so it is an issue that we want to be conscious of as a source of error in parameter estimation and to correct for if needed. Since this whole test repeatedly uses a very specific format, I am suspicious and want to investigate. To answer some of your questions:

The examinees don't literally get a Number Correct score equal to the number of items they got correct. Instead, they get credit only up to the number correct in the category they reached with success. So for this example


I believe the "Number Correct" scoring is 11 (and not 12) from what I read in the user's manual.

Also, yes, the user's manual has special directions for younger children, and I have observed younger children being administered this test. They respond to it differently, and the FACETS analysis, as you suggest, might help to account for this. It makes you wonder whether these results should really be linked instead of concurrently calibrated since, essentially, it is a slightly different test at the younger ages, with a different administration protocol. Since the growth curve in all areas is pretty steep at the ages (5-10) this study covers, I would go with your suggestion to specify age as a facet.

So just like you, the more I looked at this the more complications there seem to be and the less I am willing to only consider a straight-up, dichotomous Rasch analysis. Dichotomous/polytomous? Hierarchical by set length? Age group facet? Multidimensional by administration protocol? I will try some different approaches and see what comes up. Looks like my work is cut out for me! I will let you all know what happens.

Thanks again to all who offered their ideas!!!


Carolyn G. Fidelman, Ph.D.
Early Childhood, International & Crosscutting Studies, NCES
Rm 9035, 1990 K St. NW | 202-502-7312 | carolyn.fidelman at ed.gov

From: rasch-bounces at acer.edu.au On Behalf Of Robert Hess
Sent: Thursday, December 13, 2012 9:12 PM
To: rasch at acer.edu.au
Subject: Re: [Rasch] Reverse Digit Span Item Scoring Question

Since it wasn't clearly stated, I am not sure exactly what you mean by a different Rasch model. Are you concerned whether he employed a dichotomous or polytomous model? Are you concerned whether or not he employed a multilevel (such as a hierarchical) model? The real question here is: do students receive credit for getting each event correct, or do they receive credit only if they get 3 of 5 events correct? And is this effect somehow related to age? If it is the former, then your option #1 is a correct choice. On the other hand, if it is the latter, then option #2 is the correct choice (which, based on discussions I had with Dick more than 20 years ago, and if memory serves me correctly, was the strategy he employed).

But I have a caveat to offer: instead of using WinSteps, why not employ Facets? The Facets model was not available when Dick did his original scorings. I do know from discussions with him that he tended to employ a model similar to your clustered dichotomous perspective. However, the last time I saw him, which was quite a long time ago, we did discuss possible applications of Facets in his scoring.

In this fashion your items can be one facet, with a second facet being the age of the child. This facet would have a different level for each age (5, 6, 7, 8, 9, 10, and 11), and doing this would allow you to discern whether the age level of the child has any impact. My "gut feeling" is that it will, with the difference being between ages 5 through 8 versus ages 9 through 11 (with a possible overlap on both groupings by the age 8 group). And, by employing Facets, you can make your trial length (5 v 4) a third facet. You could even add a fourth facet: the number of successful trials needed to obtain credit (3 or 4 or 5). You may actually be able to run a more complete analysis by combining ages into clusters (5-8 v 9-11), or even into 3 groupings (5-7, 8-9, 10-11), or some other combination. But you'll never know this unless you run a Facets model. Otherwise the only factor you are going to find is whether or not kids can repeat backwards varying sets of numbers, which may or may not be independent of their age group. And you could even tweak the model a bit and obtain a hierarchical analysis. I had better stop now; otherwise I'm liable to end up making the analytical model more complex than the task.
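
One way to write down the kind of many-facet model described above is sketched here. The symbols are assumptions introduced only for illustration (they do not come from the Facets documentation or from the WJ materials): theta_n is the ability of child n, delta_i the difficulty of item i, alpha_g the effect of age level g, and lambda_l the effect of trial length l (5 v 4).

```latex
% Dichotomous many-facet Rasch model with hypothetical facet labels:
% person n, item i, age level g, trial length l.
\ln\!\left(\frac{P_{nigl}(X=1)}{P_{nigl}(X=0)}\right)
  = \theta_n - \delta_i - \alpha_g - \lambda_l
```

Because each child belongs to only one age level at a given wave, the age facet is nested with persons, so an analysis along these lines would probably need some constraint (for example, anchoring or centering one of the nested facets) before the age effects can be interpreted.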

This age grouping effect would be consistent with some research done by John Meyers in the early 1990s. His dissertation examined the influence of approximation techniques and he found as a serendipitous effect a clear developmental (as in Piaget) influence rather than an instructional influence.

Am I correct in assuming that your design is really Cross-Ages rather than truly longitudinal? By this I mean that you're not really going to track the same children over a seven-year span; you're going to question children at each age level. Am I correct in this?

In any case, good luck with your study.

Robert Hess
Emeritus Professor of Educational Measurement & Evaluation
Arizona State University

From: rasch-bounces at acer.edu.au On Behalf Of Schulz, Matt
Sent: Thursday, December 13, 2012 10:50 AM
To: rasch at acer.edu.au
Subject: Re: [Rasch] Reverse Digit Span Item Scoring Question

I think it's reasonable to assume the items are locally independent, so I would go for scoring method #1. The other two methods lose information concerning, e.g., item fit and person fit.

From: rasch-bounces at acer.edu.au On Behalf Of Fidelman, Carolyn
Sent: Thursday, December 13, 2012 9:32 AM
To: rasch at acer.edu.au
Subject: [Rasch] Reverse Digit Span Item Scoring Question

Hi All,

We are considering creating our own scale for a reverse digit span task that is to be administered to children ages 5-11 longitudinally and I am having trouble understanding which Rasch model to use or how to structure the data.

What is done in such a test is that a child is first asked to repeat back two numbers that they hear on an audio recording, but in reverse order. They get five such items and, if they get three of them right, they go on to the next level. Each level adds another number to repeat back, all the way up to 9. Most adults can do about 7. Performance on this task is correlated with general intelligence and is predictive of a number of achievement-related outcomes. The task is modeled on Numbers Reversed, which is Test 7 in the Woodcock Johnson COG III battery of cognitive assessments.

My hope was that the WJ documentation would tell us which Rasch model was used with this "trials"-type data. But it doesn't really go into it, perhaps for proprietary reasons? It says that the original calibration program was devised by Woodcock in approximately 1970. Much has been done since that time to develop models that fit various data types. It is not clear what the Woodcock procedure is exactly, or how it differs from or resembles the standard one-parameter Rasch, rating scale, or partial credit models. I am concerned about this because we need to decide whether the structure of the data is to be discrete or clustered, and to determine whether it is also a problem that the data are hierarchical, with performances nested within type.
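
For reference, the three standard models named above can be written in a common notation. The notation is an assumption made here for illustration and says nothing about what the WJ procedure actually does: theta_n is the ability of child n, delta_i the difficulty of item i, delta_ik the k-th step difficulty of item i, tau_k a step parameter shared across items, and m_i the maximum score on item i.

```latex
% Dichotomous Rasch model (each trial scored 0/1)
P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

% Partial credit model (each cluster scored 0..m_i);
% the empty sum for x = 0 is defined as 0
P(X_{ni}=x) =
  \frac{\exp\sum_{k=1}^{x}(\theta_n - \delta_{ik})}
       {\sum_{h=0}^{m_i}\exp\sum_{k=1}^{h}(\theta_n - \delta_{ik})},
  \qquad x = 0, 1, \dots, m_i

% Rating scale model: the partial credit model with a shared step structure
\delta_{ik} = \delta_i + \tau_k
```

Since the rating scale form assumes every item has the same number of score categories, clusters of 5 trials and clusters of 4 trials would presumably need either the partial credit form or separate rating scale structures for the two cluster sizes.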

One question I have for this test is "What is an item?"  Is an item a single trial or the  5-trial (4-trial later on) cluster? How is that scored for the purposes of score calibration? Is it a 0,1 for each trial or is it a scale from 0-5 (0-4 later on) for each cluster? Would we lose too much information going to a polytomous scoring?  The administration protocol is that if the child gets three in a cluster such as the two-numbers type, they move on to the next greater number type. If they only get two in a cluster, they are considered to have achieved that span but not more and no further items are presented.
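
The relationship between trial-level and cluster-level scoring can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster layout, the function names, and the sample record are hypothetical, the "8" code for trials not presented follows the example record in this message, and the pass mark of 3 follows the administration protocol described above.

```python
from typing import List, Optional

NOT_PRESENTED = "8"  # code for trials that were never administered


def cluster_scores(record: str, cluster_sizes: List[int]) -> List[Optional[int]]:
    """Sum the 0/1 trial scores within each cluster (polytomous scoring).

    Returns None for a cluster in which no trial was presented.
    """
    if len(record) != sum(cluster_sizes):
        raise ValueError("record length does not match the cluster layout")
    scores: List[Optional[int]] = []
    pos = 0
    for size in cluster_sizes:
        trials = record[pos:pos + size]
        pos += size
        presented = [t for t in trials if t != NOT_PRESENTED]
        if any(t not in "01" for t in presented):
            raise ValueError("unexpected response code in record")
        scores.append(sum(int(t) for t in presented) if presented else None)
    return scores


def cluster_passed(scores: List[Optional[int]], pass_mark: int = 3) -> List[Optional[int]]:
    """Collapse each cluster score to 1 (pass) or 0 (fail); None stays None."""
    return [None if s is None else int(s >= pass_mark) for s in scores]


# Hypothetical layout: two 5-trial clusters followed by two 4-trial clusters.
layout = [5, 5, 4, 4]
record = "11111" + "11011" + "0100" + "8888"

poly = cluster_scores(record, layout)
print(poly)                  # one score per cluster
print(cluster_passed(poly))  # one pass/fail flag per cluster
```

Working from the trial-level record keeps all three codings available, so the choice among them can be made at analysis time rather than at data entry.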

Item type/names by number of digits to recall:


Example item scorings:

1. One record of a discrete, dichotomously scored raw score file:

1111111110111000088888888888888888*

*8 = not presented

2. One record of a clustered, dichotomously scored file:


3. One record of a clustered, polytomously scored file:


What we have seen in other ECLS tests is that when we violate the assumption of local item independence, through either a prompt or format effect, item parameters can be inflated and standard errors artificially low or unstable. A possible further violation in this case is that of not taking into account the extra variability between levels of data in a hierarchy, in addition to those within item sets.
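
One common check for this kind of dependence is Yen's Q3, the correlation across persons between the residuals of two items after the Rasch-expected scores are removed. The sketch below is a minimal pure-Python illustration, not the procedure used on the ECLS data: the function names are hypothetical, the ability and difficulty values are assumed to have been estimated elsewhere, and responses coded None (not presented) are dropped pairwise.

```python
import math
from typing import List, Optional, Sequence


def rasch_p(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns NaN if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def q3(responses: List[List[Optional[int]]], thetas: List[float],
       difficulties: List[float], i: int, j: int) -> float:
    """Q3 for items i and j: correlation of (observed - expected) residuals."""
    res_i, res_j = [], []
    for row, theta in zip(responses, thetas):
        if row[i] is None or row[j] is None:
            continue  # skip persons who were not presented both items
        res_i.append(row[i] - rasch_p(theta, difficulties[i]))
        res_j.append(row[j] - rasch_p(theta, difficulties[j]))
    return pearson(res_i, res_j)


# Toy data: four persons of equal ability, four items of equal difficulty.
thetas = [0.0, 0.0, 0.0, 0.0]
difficulties = [0.0, 0.0, 0.0, 0.0]
responses = [
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
print(q3(responses, thetas, difficulties, 0, 1))  # items 0 and 1 move together
print(q3(responses, thetas, difficulties, 0, 2))  # items 0 and 2 are unrelated
```

With real data the values are computed from estimated parameters, and under local independence Q3 tends to sit slightly below zero rather than exactly at zero, so large positive values within a cluster are the ones to look at.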

Does anyone here know informally or through some publication source(1), what exact Rasch model the WJ folks used for this test? Or, regardless of that, what do you think one would do here? I am not comfortable with just treating it as in example 1 above.  Thanks for any help you can offer!


Carolyn G. Fidelman, Ph.D.
Early Childhood, International & Crosscutting Studies, NCES
Rm 9035, 1990 K St. NW | 202-502-7312 | carolyn.fidelman at ed.gov

(1) I consulted the following:

Jaffe, L. E. (2009). Development, interpretation, and application of the W score and the relative proficiency index (Woodcock-Johnson III Assessment Service Bulletin No. 11). Rolling Meadows, IL: Riverside Publishing.


McGrew, K., Schrank, F., & Woodcock, R. (2007). Woodcock-Johnson Normative Update: Technical Manual. Rolling Meadows, IL: Riverside Publishing.
