[Rasch] counts to scale conversion
jstenner at lexile.com
Wed Jan 25 08:25:14 EST 2006
A construct theory is the story we tell about what it means to move up and down the scale for a variable of interest (e.g., temperature, reading ability, short-term memory). Why is it, for example, that items are ordered as they are on the item map? The story evolves as knowledge regarding the construct increases. We call both the process and the product of this evolutionary unfolding "construct definition" (Stenner, Smith and Burdick, 1983). Advanced stages of construct definition are characterized by calibration equations (or specification equations) that operationalize and formalize a construct theory. These equations make point predictions about item behavior or item-ensemble distributions. The more closely theoretical calibrations coincide with empirical item difficulties, the more useful the construct theory and the more interesting the story.
Twenty-five years of experience in developing the Lexile Framework for Reading enable us to distinguish five stages in our thinking. Each subsequent stage can be characterized by an increasingly sophisticated use of substantive theory. Evidence that a construct theory and its associated technologies have reached a given stage or level can be found in the artifacts, instruments, and social networks that are realized at each level.
At the first stage (Level 1) there is no explicit theory as to why items are ordered as they are on the item map. Data are used to estimate both person measures and item difficulties. Just as with other actuarial sciences, empirically determined probabilities are of paramount importance. When data are found to fit a Rasch model, relative differences among persons are independent of which items or occasions of measurement are used to make the measures. Location indeterminacy abounds: each instrument/scale pairing for a specified construct has a uniquely determined "zero". At Level 1, instruments don't share a common "zero", i.e., a location parameter. A familiar artifact of this stage is the scale annotated with empirical item difficulties (Artifact 1). Most educational and psychological instruments in use today are Level 1 technologies.
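The separation property mentioned here can be sketched numerically. The following is a minimal illustration (not from the original post) of the dichotomous Rasch model: the log-odds difference between two persons is the same whatever item is used, which is why relative differences among persons do not depend on the items chosen.

```python
import math

def rasch_p(theta, delta):
    """Probability that a person of ability theta answers an item of
    difficulty delta correctly, under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Two persons, one arbitrary item (all values in logits, invented here).
theta1, theta2, delta = 1.2, 0.4, -0.3

def log_odds(p):
    return math.log(p / (1.0 - p))

lo1 = log_odds(rasch_p(theta1, delta))
lo2 = log_odds(rasch_p(theta2, delta))

# The log-odds difference recovers theta1 - theta2 regardless of delta.
print(lo1 - lo2)
```

Repeating the last step with any other value of `delta` yields the same difference, 0.8 logits, which is the sample- and item-independence the post describes.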
A construct theory can be formalized in a specification equation used to explain variation in item difficulties. If what causes variation in item difficulties can be reduced to an equation, then a vital piece of construct validity evidence has been secured. We argue elsewhere that the single most compelling piece of evidence for an instrument's construct validity is a specification equation that can account for a high proportion of observed variance in item difficulties (Stenner, Smith and Burdick, 1983). Without such evidence only very weak correlational evidence can be marshaled for claims that "we know what we are measuring" and "we know how to build an indefinitely large number of theoretically parallel instruments that measure the same construct with the same precision of measurement."
Note that the causal status of a specification equation is tested by experimentally manipulating the variables in the equation and checking to see whether the expected changes in item difficulty are, in fact, observed. Stone (2002) performed just such an experimental confirmation of the specification equation for the Knox Cube Test - Revised (KCT-R): he designed new items to fill in holes in the item map and found that the theoretical predictions coincided closely with observed difficulties. Can we imagine a more convincing demonstration that we know what we are measuring than when the construct theory and its associated specification equation accord well with experiments (Stenner & Smith, 1982; Stenner & Stone, 2003)?
Similar demonstrations have now been realized for hearing vocabulary (Stenner, Smith and Burdick, 1983), reading (Stenner and Wright, 2002), quantitative reasoning (Enright and Sheehan, 2002), and abstract reasoning (Embretson, 1998). Artifacts that signal Level 2 use of theory are specification equations, RMSEs from regressions of observed item difficulties on theory, and evidence for causal status based on experimental manipulation of item design features (Artifact 2).
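A Level 2 check of this kind can be sketched as a regression of observed item difficulties on a theory-derived item feature, with the RMSE as the summary statistic. The feature values and difficulties below are invented for illustration; a real specification equation would use substantively motivated item features.

```python
import math

# Hypothetical items: one theory-derived feature value per item, plus the
# observed Rasch difficulty (in logits) for each item.
feature = [2.1, 3.4, 1.2, 4.0, 2.8]
observed = [-0.6, 0.5, -1.3, 1.1, 0.1]

# Ordinary least squares for a one-predictor specification equation
# (intercept + slope * feature), via the normal equations.
n = len(observed)
sx, sy = sum(feature), sum(observed)
sxx = sum(x * x for x in feature)
sxy = sum(x * y for x, y in zip(feature, observed))
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Theoretical calibrations and the RMSE of theory against observation.
predicted = [intercept + slope * x for x in feature]
rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
print(slope, intercept, rmse)
```

The smaller the RMSE (relative to the spread of the difficulties), the more of the observed variance the specification equation accounts for.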
The next stage in the evolving use of theory involves applying the specification equation to enrich scale annotations; we move beyond using empirical item difficulties as annotations. One example of this use of the specification equation is the measurement of text readability in the Lexile Framework for Reading. In this application a book or magazine article is conceptualized as a test made up of as many "imagined" items as there are paragraphs in the book. The specification equation is then used to generate theoretical calibrations for each paragraph, which then stand in for empirical item difficulties (Stone, Wright & Stenner, 1999).
For instance, the text measure for a book is the Lexile reader measure needed to produce a sum of the modeled probabilities of correct answers over paragraphs, qua items, equal to a relative raw score of 75%. We can imagine a thought experiment in which every paragraph (say there are 900) in a Harry Potter novel is turned into a reading test item. Each of the 900 items is then administered to 1000 targeted readers and empirical item difficulties are computed from a hugely complex connected data collection effort. The text measure for the Harry Potter novel (880L) is the amount of reading ability needed to get a raw score of 675/900 items correct, i.e., a relative raw score of 75% (Artifact 3).
The specification equation is used in place of the tremendously complicated and expensive realization of the thought experiment for every book we want to measure. The machinery described above can also be applied to text collections (book bags or briefcases) to enable scale annotation with real world text demands (college, workplace, etc.).
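The thought experiment above can be sketched numerically: given theoretical calibrations for each paragraph, the text measure is the ability (in logits) at which the expected relative raw score equals 75%. The paragraph calibrations below are hypothetical, and the linear conversion from logits to Lexile units is not shown.

```python
import math

def expected_score(theta, difficulties):
    """Expected raw score: sum of Rasch success probabilities over items."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties)

def text_measure(difficulties, target=0.75):
    """Ability at which the expected relative raw score equals `target`,
    found by bisection (expected_score is increasing in theta)."""
    goal = target * len(difficulties)
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < goal:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical theoretical calibrations (logits) for five paragraphs.
paras = [-0.5, 0.0, 0.3, 0.8, 1.2]
theta = text_measure(paras)
print(theta)
```

With theory supplying the calibrations, this computation replaces the enormous data collection the thought experiment would otherwise require.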
Artifacts of a Level 3 use of theory include construct maps (Artifact 3) that annotate the reading scale with texts that, thanks to theory, can be imagined to be tests with theoretically derived item calibrations.
In biochemistry, when a substance is successfully synthesized using amino acids and other building blocks, the structure of the purified entity is then commonly considered to be understood. That is, when the action of a natural substance can be matched by that of a synthetic counterpart, we argue that we understand the structure of the natural substance. Analogously, we argue that when a clone for an instrument can be built and the clone produces measures indistinguishable from those produced by the original instrument, then we can claim we understand the construct under study. What is unambiguously cumulative in the history of science is not data, text, or theory, but rather the gradual refinement of instrumentation (Ackerman, 1985).
In a Level 4 use of construct theory there is enough confidence in the theory and associated specification equation that a theoretical calibration takes the place of an empirical item difficulty for every item in the instrument or item bank. There are now numerous reading tests (e.g., Scholastic Reading Inventory - Interactive and the Pearson PASeries Reading Test) that use only theoretical calibrations. Evidence abounds that the reader measures produced by these theoretically calibrated instruments are indistinguishable from measures made using the more familiar empirically scaled instruments (Artifact 4). At Level 4, instruments developed by different laboratories and corporations share a common scale, and the number of unique metrics for measuring the same construct (e.g., reading ability) diminishes.
Level 5 use of theory builds on Level 4 to handle the case in which theory provides not individual item calibrations but rather a distribution of "potential" item calibrations. Again, the Lexile Framework has been used to build reading tests incorporating this more advanced use of theory. Imagine a Time magazine article that is 1500 words in length. Imagine a software program that can generate a large number of "cloze" items (see Artifact 5) for this article. A sample from this collection is served up to the reader when she chooses to read this article. As she reads, she chooses words to fill in the blanks (missing words) distributed throughout the article. How can counts correct from such an experience produce Lexile reader measures, when it is impossible to effect a one-to-one correspondence between a reader's response to an item and a theoretical calibration specific to that particular item? The answer is that the theory provides a distribution of possible item calibrations (specifically, a mean and standard deviation), and a particular count correct is converted into a Lexile reader measure by integrating over the theoretical distribution (Artifact 6).
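A sketch of this conversion, under the assumption that the item calibrations are normally distributed: the expected proportion correct at a given ability is obtained by integrating the Rasch success probability over the calibration distribution, and the observed count correct is then inverted through that curve. All numbers are illustrative, and the logit-to-Lexile transformation is omitted.

```python
import math

def expected_p(theta, mu, sigma, n_grid=2001, width=6.0):
    """E[1/(1+exp(-(theta-D)))] for D ~ Normal(mu, sigma), approximated
    by weighting the Rasch probability with the normal density on a grid
    spanning mu +/- width*sigma."""
    total, norm = 0.0, 0.0
    for i in range(n_grid):
        d = mu - width * sigma + 2 * width * sigma * i / (n_grid - 1)
        w = math.exp(-0.5 * ((d - mu) / sigma) ** 2)
        total += w * (1.0 / (1.0 + math.exp(-(theta - d))))
        norm += w
    return total / norm

def measure_from_count(count, n_items, mu, sigma):
    """Ability (logits) whose expected proportion correct, integrated over
    the calibration distribution, matches the observed count correct.
    Found by bisection, since expected_p is increasing in theta."""
    target = count / n_items
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_p(mid, mu, sigma) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# A reader answers 15 of 20 cloze items correctly; the theory says the
# items' calibrations are distributed Normal(0.0, 0.5) in logits.
theta = measure_from_count(15, 20, mu=0.0, sigma=0.5)
print(theta)
```

No individual item needs its own calibration; only the mean and standard deviation of the calibration distribution enter the conversion.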
"There is nothing so practical as a good theory" Kurt Lewin
Artifacts available on request. Jack
Ackerman, R.J. (1985). Data, Instruments, and Theory. Princeton, NJ: Princeton University Press.
Embretson, S.E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380-396.
Enright, M.K. & Sheehan, K.M. (2002). Modeling the difficulty of quantitative reasoning items: Implications for item generation. In S.H. Irvine & P.C. Kyllonen (Eds.), Item Generation for Test Development. Hillsdale, NJ: Lawrence Erlbaum Associates.
Stenner, A.J., Burdick, H., Sanford, E., & Burdick, D. How accurate are Lexile text measures? Manuscript accepted, Journal of Applied Measurement.
Stenner, A.J., & Smith, M. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.
Stenner, A.J., Smith, M., & Burdick, D. (1983). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.
Stenner, A.J., & Stone, M.H. (2003). Item specifications vs. item banking. Transactions of the Rasch SIG, 17(3), 929-930.
Stenner, A.J., & Wright, B.D. (2002). Readability, reading ability, and comprehension. Paper presented at the Association of Test Publishers Hall of Fame Induction for Benjamin D. Wright, San Diego. In Wright, B.D. & Stone, M.H. (2004), Making Measures. Chicago: Phaneron Press.
Stone, M.H. (2002). Knox Cube Test - Revised. Itasca, IL: Stoelting.
Stone, M.H., Wright, B.D., & Stenner, A.J. (1999). Mapping variables. Journal of Outcome Measurement, 3(4), 308-322.
From: William Fisher (External)
Sent: Monday, January 23, 2006 10:44 AM
To: Agustin Tristan; Tim Pelton; Rasch at acer.edu.au
Cc: Jack Stenner
Subject: RE: [Rasch] counts to scale conversion
To regard the model as an ideal disconnected from reality is to forget that the data may be derived from questions that are irrelevant, poorly formulated, or otherwise off-construct, or that some respondents may not belong to the intended population.
Why try to describe data that are not reproducible and replicable? How well do we understand a construct when the only data we can produce are not theoretically tractable, and so remain tied to particular questions and respondents? It seems pretty cynical to me to do research with the sole aim of applying fancy statistics to data, publishing articles, and advancing one's own career, while deliberately limiting one's potential for generalizing results past one's own local samples of persons and items by choosing models and methods that do not push toward the highest possible level of generality.
I vote for 3. Strong construct theory is not automatically implied by strong measurement theory. Being able to predict item difficulties when the items have been previously calibrated is great, but the real goal is to be able to predict their calibrations on the basis of their theoretical properties, in the manner of Lexiles or Commons' stage scoring system.
When we have this, then we're getting somewhere. After all, imagine how different our economic lives would be if rulers, weight scales, thermometers, clocks, volt meters, and the resistance properties of every meter of every type of electrical cable all had to be calibrated individually on data, instead of en masse, by theory.... Theoretical predictability is the mark of a real science, where we understand a variable to the point that we can recognize it for what it is in any amount when we see it.
After all, don't we say that a basic mark of knowing what we're talking about is being able to put it in our own words? Shouldn't any valid articulation of a construct be a viable medium for measuring in a universally uniform reference standard metric?
Jack Stenner has recently done some work describing several more than three stages of this kind in the development of measurable constructs.... Maybe we can get him to weigh in....
William P. Fisher, Jr., Ph.D.
AVATAR INTERNATIONAL INC.
WFisher at avatar-intl.com
From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf Of Agustin Tristan
Sent: Monday, January 23, 2006 10:26 AM
To: Tim Pelton; Rasch at acer.edu.au
Subject: RE: [Rasch] counts to scale conversion
Hi Tim... it could be nice to have further votes for the three options and to see how well this example fits people's opinions on this listserv... or how well our opinions fit a model...
Tim Pelton <tpelton at uvic.ca> wrote:
What a great example - and starting point for a discussion...
My vote is for option 2.
I think that the phrase in option 3, "...telling us (and the crickets too)...", demonstrates quite nicely the limitations of blindly applying an 'ideal' model. Is it reasonable to favor an elegant theoretical model that deviates substantially in its predictions from the observed data when our lack of understanding of related factors means that we cannot effectively explain such deviations? Is it not more appropriate to choose a pragmatic model (balancing simplicity and accuracy) as an intermediate model to help us establish a control or baseline, which may then be used to support our search for other factors?
- >===== Original Message From Agustin Tristan =====
>Hi! I'm trying to follow this topic concerning crickets and scales.
> In abstract: which is better?
> 1) The simple linear model, even if it doesn't fit (the linear model for the
> 2) Any model that permits us to fit the data (the exp(something) looks to be
> 3) A theoretical model telling us (and the crickets too) how crickets should adjust the frequency of the noise they produce according to temperature... especially if this theoretical model is exp(something) because it looks more interesting or impressive.
> I like Nature and its relationship with math, and for me it was interesting to know that crickets may use the exponential (even if they don't care about the mathematical formulation), just as the seeds in sunflowers grow exponentially from their center, or snails grow their shells, or ivy plants grow in a helical 3D curve, or soil slopes (in soil mechanics) become unstable and fail according to a logarithmic spiral, or the growth of populations follows a logistic model, and so forth... I can also recognize that I prefer objective items that behave as the Rasch model does, but... I cannot decide in all those cases which is better between (1), (2), and (3)...
> Agustin Tristan
>All this illustrates that if we want to stay in business as test gurus then we'd better forget about meaningful item hierarchies, sample independence, additive measures, and other such niceties. Rather, using methods whose results need to be recalibrated for boys, girls, old, and young, ... whatever ... - and adding a few things like "log(exp(something ...))" to our tech reports - should greatly help with job security. :)
>From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au]On
>Behalf Of Trevor Bond
>Sent: Saturday, January 21, 2006 5:43 PM
>To: Rasch listserve
>Subject: [Rasch] counts to scale conversion
>For all budding scale constructors, this is a hoot:
>check the lovely graphs
>Trevor G BOND Ph D
>Professor and Head of Dept
>Educational Psychology, Counselling & Learning Needs
>D2-2F-01A EPCL Dept.
>Hong Kong Institute of Education
>10 Lo Ping Rd, Tai Po
>New Territories HONG KONG
>Voice: (852) 2948 8473
>Fax: (852) 2948 7983
>Rasch mailing list
>Rasch at acer.edu.au
>FAMILIA DE PROGRAMAS KALT.
>Mariano Jiménez 1830 A
>Col. Balcones del Valle
>78280, San Luis Potosí, S.L.P. México
>TEL (52) 44-4820 37 88, 44-4820 04 31
>FAX (52) 44-4815 48 48
>web page (in Spanish AND ENGLISH): http://www.ieesa-kalt.com
Department of Curriculum and Instruction
University of Victoria
PO Box 3010 STN CSC
Victoria BC V8W 3N4
Fax (250) 721-7598