[Rasch] counts to scale conversion

Jack Stenner jstenner at lexile.com
Wed Jan 25 08:25:14 EST 2006

            A construct theory is the story we tell about what it means to move up and down the scale for a variable of interest (e.g., temperature, reading ability, short-term memory). Why is it, for example, that items are ordered as they are on the item map? The story evolves as knowledge regarding the construct increases. We call both the process and the product of this evolutionary unfolding "construct definition" (Stenner, Smith and Burdick, 1983). Advanced stages of construct definition are characterized by calibration equations (or specification equations) that operationalize and formalize a construct theory. These equations make point predictions about item behavior or item-ensemble distributions. The more closely theoretical calibrations coincide with empirical item difficulties, the more useful the construct theory and the more interesting the story.

            Twenty-five years of experience in developing the Lexile Framework for Reading enable us to distinguish five stages in our thinking. Each subsequent stage can be characterized by an increasingly sophisticated use of substantive theory. Evidence that a construct theory and its associated technologies have reached a given stage or level can be found in the artifacts, instruments, and social networks that are realized at each level.


Level 1

            At this stage there is no explicit theory as to why items are ordered as they are on the item map. Data are used to estimate both person measures and item difficulties. As in other actuarial sciences, empirically determined probabilities are of paramount importance. When data are found to fit a Rasch model, relative differences among persons are independent of which items or occasions of measurement are used to make the measures. Location indeterminacy abounds: each instrument/scale pairing for a specified construct has a uniquely determined "zero". At Level 1, instruments don't share a common "zero", i.e., a common location parameter. A familiar artifact of this stage is the scale annotated with empirical item difficulties (Artifact 1). Most educational and psychological instruments in use today are Level 1 technologies.
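The location indeterminacy described above can be made concrete with the dichotomous Rasch model itself: shifting every person measure and item difficulty by the same constant leaves every response probability, and hence every relative difference, unchanged. A minimal Python sketch (the ability, difficulty, and shift values are invented for illustration):

```python
import math

def rasch_p(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Shifting person and item locations by the same constant leaves the
# probability untouched: the scale's "zero" is arbitrary (Level 1).
shift = 2.5
p_original = rasch_p(1.0, -0.5)
p_shifted = rasch_p(1.0 + shift, -0.5 + shift)
assert abs(p_original - p_shifted) < 1e-12
```

This is why two Level 1 instruments for the same construct can each fit the model well and still disagree about where "zero" lies.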


Level 2

            A construct theory can be formalized in a specification equation used to explain variation in item difficulties. If what causes variation in item difficulties can be reduced to an equation, then a vital piece of construct validity evidence has been secured. We argue elsewhere that the single most compelling piece of evidence for an instrument's construct validity is a specification equation that can account for a high proportion of observed variance in item difficulties (Stenner, Smith and Burdick, 1983). Without such an equation, only weak correlational evidence can be marshaled for claims that "we know what we are measuring" and "we know how to build an indefinitely large number of theoretically parallel instruments that measure the same construct with the same precision of measurement."

Note that the causal status of a specification equation is tested by experimentally manipulating the variables in the equation and checking whether the expected changes in item difficulty are, in fact, observed. Stone (2002) performed just such an experimental confirmation of the specification equation for the Knox Cube Test - Revised (KCT-R): he designed new items to fill holes in the item map and found that the theoretical predictions coincided closely with observed difficulties. Can we imagine a more convincing demonstration that we know what we are measuring than when the construct theory and its associated specification equation accord well with experiments (Stenner & Smith, 1982; Stenner & Stone, 2003)?

Similar demonstrations have now been realized for hearing vocabulary (Stenner, Smith and Burdick, 1983), reading (Stenner and Wright, 2002), quantitative reasoning (Enright and Sheehan, 2002), and abstract reasoning (Embretson, 1998). Artifacts that signal Level 2 use of theory are specification equations, RMSEs from regressions of observed item difficulties on theory, and evidence for causal status based on experimental manipulation of item design features (Artifact 2).


Level 3

            The next stage in the evolving use of theory involves applying the specification equation to enrich scale annotations; we move beyond using empirical item difficulties as annotations. One example of this use of the specification equation is the measurement of text readability in the Lexile Framework for Reading. In this application a book or magazine article is conceptualized as a test made up of as many "imagined" items as there are paragraphs in the book. The specification equation is then used to generate theoretical calibrations for each paragraph, which then stand in for empirical item difficulties (Stone, Wright & Stenner, 1999).

For instance, the text measure for a book is the Lexile reader measure needed to produce a sum of the modeled probabilities of correct answers over paragraphs, qua items, equal to a relative raw score of 75%. We can imagine a thought experiment in which every paragraph (say 900) in a Harry Potter novel is turned into a reading test item. Each of the 900 items is then administered to 1000 targeted readers and empirical item difficulties are computed from a hugely complex connected data collection effort.  The text measure for the Harry Potter novel (880L) is the amount of reading ability needed to get a raw score of 675/900 items correct, or a relative raw score of 75% (Artifact 3). 

The specification equation is used in place of the tremendously complicated and expensive realization of the thought experiment for every book we want to measure. The machinery described above can also be applied to text collections (book bags or briefcases) to enable scale annotation with real-world text demands (college, workplace, etc.).
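The 75% convention can be sketched numerically: given theoretical calibrations for every paragraph-as-item, the text measure is the reader location at which the expected relative raw score equals 0.75. A toy Python version working on the logit scale (Lexile units are a linear rescaling of logits; the paragraph calibrations below are invented):

```python
import math

def rasch_p(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def text_measure(paragraph_calibrations, target=0.75):
    """Reader location (logits) at which the expected relative raw score
    over the paragraphs-as-items equals the target (75% comprehension).
    The expected score is monotone in ability, so bisection suffices."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        expected = sum(rasch_p(mid, d) for d in paragraph_calibrations)
        if expected / len(paragraph_calibrations) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Invented theoretical calibrations for a handful of paragraphs.
calibrations = [-0.4, 0.0, 0.3, 0.7, 1.1]
measure = text_measure(calibrations)
```

For a single item the measure reduces to that item's calibration plus log(3), since logit(0.75) = log(3).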

            Artifacts of a Level 3 use of theory include construct maps (Artifact 3) that annotate the reading scale with texts that, thanks to theory, can be imagined to be tests with theoretically derived item calibrations.


Level 4

            In biochemistry, when a substance is successfully synthesized from amino acids and other building blocks, the structure of the purified entity is commonly considered to be understood. That is, when the action of a natural substance can be matched by that of a synthetic counterpart, we argue that we understand the structure of the natural substance. Analogously, we argue that when a clone of an instrument can be built and the clone produces measures indistinguishable from those produced by the original instrument, we can claim to understand the construct under study. What is unambiguously cumulative in the history of science is not data, text, or theory but rather the gradual refinement of instrumentation (Ackerman, 1985).

            In a Level 4 use of construct theory there is enough confidence in the theory and associated specification equation that a theoretical calibration takes the place of an empirical item difficulty for every item in the instrument or item bank. There are now numerous reading tests (e.g., Scholastic Reading Inventory - Interactive and the Pearson PASeries Reading Test) that use only theoretical calibrations. Evidence abounds that the reader measures produced by these theoretically calibrated instruments are indistinguishable from measures made using the more familiar empirically scaled instruments (Artifact 4). At Level 4, instruments developed by different laboratories and corporations share a common scale, and the number of unique metrics for measuring the same construct (e.g., reading ability) diminishes.


Level 5


            Level 5 use of theory builds on Level 4 to handle the case in which theory provides not individual item calibrations but rather a distribution of "potential" item calibrations. Again, the Lexile Framework has been used to build reading tests incorporating this more advanced use of theory. Imagine a Time magazine article that is 1500 words in length. Imagine a software program that can generate a large number of "cloze" items (see Artifact 5) for this article. A sample from this collection is served up to the reader when she chooses to read the article. As she reads, she chooses words to fill in the blanks (missing words) distributed throughout the article. How can counts correct from such an experience produce Lexile reader measures, when it is impossible to effect a one-to-one correspondence between a reader's response to an item and a theoretical calibration specific to that particular item? The answer is that the theory provides a distribution of possible item calibrations (specifically, a mean and standard deviation), and a particular count correct is converted into a Lexile reader measure by integrating over the theoretical distribution (Artifact 6).
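The integration step can be sketched as follows, assuming (as the passage says) that theory supplies only a mean and standard deviation of potential item calibrations. This toy Python version assumes a normal distribution of calibrations, integrates the Rasch probability over it by simple quadrature, and inverts the resulting expected-score curve by bisection; all numbers are illustrative and the result is on the logit scale rather than in Lexile units:

```python
import math

def rasch_p(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_p(ability, mu, sigma, n=801, width=6.0):
    """E[P(correct)] when the item calibration is treated as a draw from
    Normal(mu, sigma): integrate the Rasch probability over the
    theoretical distribution of potential calibrations."""
    total, norm = 0.0, 0.0
    for i in range(n):
        d = mu - width * sigma + 2.0 * width * sigma * i / (n - 1)
        w = math.exp(-0.5 * ((d - mu) / sigma) ** 2)  # unnormalized density
        total += w * rasch_p(ability, d)
        norm += w
    return total / norm

def measure_from_count(correct, attempted, mu, sigma):
    """Convert a count correct into a measure (logits) by inverting the
    expected-score curve with bisection."""
    target = correct / attempted
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_p(mid, mu, sigma) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As sigma shrinks toward zero this collapses to the ordinary one-item-one-calibration conversion; a wider sigma flattens the expected-score curve, spreading the measures implied by a given count correct.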


In Conclusion:

            "There is nothing so practical as a good theory." (Kurt Lewin)


Artifacts available on request. Jack




Ackerman, R. J. (1985). Data, Instruments, and Theory. Princeton, NJ: Princeton University Press.

Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.

Enright, M. K., & Sheehan, K. M. (2002). Modeling the difficulty of quantitative reasoning items: Implications for item generation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item Generation for Test Development. Hillsdale, NJ: Lawrence Erlbaum Associates.

Stenner, A. J., Burdick, H., Sanford, E., & Burdick, D. How accurate are Lexile text measures? Manuscript accepted, Journal of Applied Measurement.

Stenner, A. J., & Smith, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.

Stenner, A. J., Smith, M., & Burdick, D. (1983). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.

Stenner, A. J., & Stone, M. H. (2003). Item specifications vs. item banking. Transactions of the Rasch SIG, 17(3), 929-930.

Stenner, A. J., & Wright, B. D. (2002). Readability, reading ability, and comprehension. Paper presented at the Association of Test Publishers Hall of Fame Induction for Benjamin D. Wright, San Diego. Reprinted in Wright, B. D., & Stone, M. H. (2004), Making Measures. Chicago: Phaneron Press.

Stone, M. H. (2002). Knox Cube Test - Revised. Itasca, IL: Stoelting.

Stone, M. H., Wright, B. D., & Stenner, A. J. (1999). Mapping variables. Journal of Outcome Measurement, 3(4), 308-322.


From: William Fisher (External) 
Sent: Monday, January 23, 2006 10:44 AM
To: Agustin Tristan; Tim Pelton; Rasch at acer.edu.au
Cc: Jack Stenner
Subject: RE: [Rasch] counts to scale conversion

To regard the model as an ideal disconnected from reality is to forget that the data may be derived from questions that are irrelevant, poorly formulated, or off-construct for some other reason, or that some respondents may not belong to the intended population.


Why try to describe data that are not reproducible and replicable? How well do we understand a construct when the only data we can produce are not theoretically tractable, and so remain tied to particular questions and respondents? It seems pretty cynical to me to do research with the sole aim of applying fancy statistics to data, publishing articles, and advancing one's own career, while deliberately limiting the potential for generalizing results past one's own local samples of persons and items by choosing models and methods that do not push toward the highest possible level of generality.


I vote for 3. Strong construct theory is not automatically implied by strong measurement theory. Being able to predict item difficulties when the items have been previously calibrated is great, but the real goal is to be able to predict their calibrations on the basis of their theoretical properties, in the manner of Lexiles or Commons' stage scoring system. 


When we have this, then we're getting somewhere. After all, imagine how different our economic lives would be if rulers, weight scales, thermometers, clocks, volt meters, and the resistance properties of every meter of every type of electrical cable all had to be calibrated individually on data, instead of en masse, by theory....  Theoretical predictability is the mark of a real science, where we understand a variable to the point that we can recognize it for what it is in any amount when we see it. 


After all, don't we say that a basic mark of knowing what we're talking about is being able to put it in our own words? Shouldn't any valid articulation of a construct be a viable medium for measuring in a universally uniform reference standard metric?


Jack Stenner has recently done some work describing several stages of this kind (more than three) in the development of measurable constructs.... Maybe we can get him to weigh in....


William P. Fisher, Jr., Ph.D.
WFisher at avatar-intl.com 


From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Agustin Tristan
Sent: Monday, January 23, 2006 10:26 AM
To: Tim Pelton; Rasch at acer.edu.au
Subject: RE: [Rasch] counts to scale conversion


Hi Tim... it would be nice to have further votes for the three options and see how well this example fits people's opinions on this listserv... or how well our opinions fit a model...

Thank you.


Tim Pelton <tpelton at uvic.ca> wrote:

	What a great example - and starting point for a discussion...
	My vote is for option 2.
	I think that the phrase in option 3 "...telling us (and the crickets too)..." demonstrates quite nicely the limitations of blindly applying an 'ideal' model. Is it reasonable to favor an elegant theoretical model that deviates substantially in its predictions from the observed data when our lack of understanding of related factors means that we cannot effectively explain such deviations? Is it not more appropriate to choose a pragmatic model (balancing simplicity and accuracy) as an intermediate model to help us establish a control or baseline, which may then be used to support our search for other factors?
	- >===== Original Message From Agustin Tristan =====
	>Hi! I'm trying to follow this topic concerning crickets and scales.
	>In abstract: which is better?
	>1) The simple linear model, even if it doesn't fit (the linear model for the crickets' case).
	>2) Any model that permits us to fit the data (the exp(something) looks to be like that).
	>3) A theoretical model telling us (and the crickets too) how crickets should adjust the frequency of the noise they produce according to temperature... especially if this theoretical model is exp(something), because it looks more interesting or impressive.
	>I like Nature and its relationship with math, and for me it was interesting to learn that crickets may use the exponential (even if they don't care about the mathematical formulation), just as the seeds in sunflowers grow exponentially from their center, or snails grow their shells, or ivy plants grow in a helical 3D curve, or soil slopes (in soil mechanics) become unstable and fail along a logarithmic spiral, or the growth of populations follows a logistic model, and so forth... I can also recognize that I prefer objective items that behave as the Rasch model predicts, but... I cannot decide in all those cases which is better among (1), (2), and (3)...
	>Regards
	>Agustin Tristan
	>Rense wrote:
	>All this illustrates that if we want to stay in business as test gurus then we'd better forget about meaningful item hierarchies, sample independence, additive measures, and other such niceties. Rather, using methods whose results need to be recalibrated for boys, girls, old, young, ... whatever ... - and adding a few things like "log(exp(something ...))" to our tech reports - should greatly help with job security. :)
	>Rense Lange
	>Rense Lange
	>-----Original Message-----
	>From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au]On
	>Behalf Of Trevor Bond
	>Sent: Saturday, January 21, 2006 5:43 PM
	>To: Rasch listserve
	>Subject: [Rasch] counts to scale conversion
	>For all budding scale constructors, this is a hoot:
	>check the lovely graphs
	>Trevor G BOND Ph D
	>Professor and Head of Dept
	>Educational Psychology, Counselling & Learning Needs
	>D2-2F-01A EPCL Dept.
	>Hong Kong Institute of Education
	>10 Lo Ping Rd, Tai Po
	>New Territories HONG KONG
	>Voice: (852) 2948 8473
	>Fax: (852) 2948 7983
	>Rasch mailing list
	>Rasch at acer.edu.au
	Tim Pelton
	Assistant Professor
	Department of Curriculum and Instruction
	University of Victoria
	PO Box 3010 STN CSC
	Victoria BC V8W 3N4
	Phone: (250)721-7803
	Fax (250) 721-7598


Mariano Jiménez 1830 A 

Col. Balcones del Valle 

78280, San Luis Potosí, S.L.P. México 

TEL (52) 44-4820 37 88, 44-4820 04 31 

FAX (52) 44-4815 48 48 

web page (in Spanish AND ENGLISH): http://www.ieesa-kalt.com


