[Rasch] Not a Fan of Lexiles?
commons at tiac.net
Wed Oct 31 08:05:29 EST 2007
There is Theo Dawson's measure of difficulty that yields stage of
development for material. It is based on the Model of Hierarchical
Complexity and Skill Theory. It would be interesting to see the correlation
of these measures.
Michael Lamport Commons, Ph.D.
Assistant Clinical Professor
Department of Psychiatry
Harvard Medical School
Beth Israel Deaconess Medical Center
234 Huron Avenue
Cambridge, MA 02138-1328
commons at tiac.net
From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf
Of Andrew Kyngdon
Sent: Tuesday, October 30, 2007 4:54 PM
To: Paul Barrett; rasch at acer.edu.au
Subject: RE: [Rasch] Not a Fan of Lexiles?
A great firebrand post! One major quibble, however, with what you said.
In the Lexile system, measures of the difficulty of written prose text
passages are NOT obtained from either test scores, ratings, simulations or
IRT model analyses of such data.
The Lexile theory argues that the difficulty of written prose text is a
functional composition of two key cognitive variables - syntactic complexity
and semantic rarity. Syntactic complexity is the complexity of the sentence
/ thought unit structure of a written prose text passage. Semantic rarity is
the familiarity of the semantic units (words) which comprise a prose text
passage.
Each is assessed using a proxy variable. For syntactic complexity, it is the
natural logarithm of the mean sentence length (LMSL). Why is this proxy
used? Research in cognition has found that it is a good proxy for the demand
written prose text places upon Short Term Memory (e.g. Crain & Shankweiler,
1988). The longer a sentence is, the harder it is to keep all the
information in a sentence in STM. Thus text passages with longer sentences
are more difficult to comprehend.
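As a rough illustration of the LMSL proxy, here is a minimal Python sketch. The function name and the naive sentence splitter are my own assumptions for the example; nothing here is the Lexile Analyzer's actual tokenization.

```python
import math
import re

def log_mean_sentence_length(text: str) -> float:
    """Natural log of the mean number of words per sentence (LMSL)."""
    # Assumption: a sentence ends at '.', '!' or '?'. Real text needs a
    # more careful splitter (abbreviations, quotations, and so on).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    words_per_sentence = [len(s.split()) for s in sentences]
    mean_length = sum(words_per_sentence) / len(sentences)
    return math.log(mean_length)

short = "The cat sat. The dog ran."
long_ = "The cat sat on the mat while the dog ran around the yard."
print(log_mean_sentence_length(short))  # ln(3), two sentences of 3 words
print(log_mean_sentence_length(long_))  # ln(13), one sentence of 13 words
```

The longer sentence gets the larger LMSL, which is the direction the theory needs.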
For semantic rarity, the proxy variable is the mean natural logarithm of
word frequency (MLWF). Why is this proxy used? Stenner, Smith & Burdick
(1983) investigated over 50 variables, including part of speech, number of
letters, number of syllables, modal grade, word content classification and
various transformations of these variables. They found MLWF was the best
proxy for the likelihood a person has a word in their mental lexicon.
Moreover, reaction time and gaze duration studies have found that people
will look at rarer words for several hundred milliseconds longer than
commonly occurring words, suggesting rarer words are harder to locate and
retrieve from the mental lexicon. Thus text passages containing words which
are infrequently used will be more difficult to comprehend.
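The MLWF proxy can be sketched the same way. The frequency table below is made up for illustration (it is not taken from any real corpus), and the function name is hypothetical.

```python
import math

def mean_log_word_frequency(words, freq_per_million):
    """Mean of ln(word frequency) over the words of a passage (MLWF)."""
    if not words:
        raise ValueError("passage contains no words")
    # A word missing from the table raises KeyError; how unseen words are
    # handled in practice is not covered in this post.
    logs = [math.log(freq_per_million[w.lower()]) for w in words]
    return sum(logs) / len(logs)

# Invented counts per million words, for illustration only.
toy_freq = {"the": 60000.0, "dog": 120.0, "ran": 90.0,
            "ocelot": 0.2, "absconded": 0.1}
common = ["the", "dog", "ran"]
rare = ["the", "ocelot", "absconded"]
print(mean_log_word_frequency(common, toy_freq))
print(mean_log_word_frequency(rare, toy_freq))
```

The passage built from rarer words has the lower MLWF, so it should come out as harder once MLWF enters the difficulty equation with a negative sign.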
The difficulty of a written prose text passage i, Di, is a functional
composition of these proxies such that: Di = (a x LMSL) - (b x MLWF) - c.
This expression, referred to as a construct specification equation, is a 5
variable simple polynomial conjoint structure (Krantz, Luce, Suppes &
Tversky, 1971) where a, b and c are real valued constants. The actual values
of these constants are proprietary; however, they result in Di being a scale
in logit units.
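A small sketch of the specification equation may help. Because the real constants are proprietary, the values of a, b and c below are placeholders I chose only so the code runs; I also assume a and b are positive, which matches the claims that longer sentences and rarer words make text harder.

```python
def text_difficulty(lmsl, mlwf, a, b, c):
    """Di = (a * LMSL) - (b * MLWF) - c.

    The result is in logits only when a, b and c are the real
    (proprietary) constants; the values used below are placeholders.
    """
    return a * lmsl - b * mlwf - c

# Placeholder constants, NOT the Lexile values.
A, B, C = 1.0, 1.0, 0.0

# Longer sentences (higher LMSL) raise difficulty at fixed MLWF.
print(text_difficulty(2.5, 3.0, A, B, C) > text_difficulty(1.5, 3.0, A, B, C))
# More frequent words (higher MLWF) lower difficulty at fixed LMSL.
print(text_difficulty(2.0, 6.0, A, B, C) < text_difficulty(2.0, 2.0, A, B, C))
```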
Notice the complete absence of "run of the mill" psychometry - test scores,
IRT model estimates, fit statistics, ratings, DIF, reliability coefficients
and so on - in the measurement of the difficulty of written prose text
passages.
The biggest weakness of the theory is the assumption of something more than
monotone relationships existing between proxy variables and the relevant
psychological attributes (Krantz & Tversky, 1971). The theory of conjoint
measurement (Krantz, et al, 1971) can be used to address this issue. I have
done this and found the axioms to be satisfied using Karabatsos's (2005)
methods for the probabilistic tests of deterministic axioms.
The Lexile theory argues that the probability of comprehending a written
prose text passage is a non-interactive, additive function of the text's
difficulty and the person's reading ability. A suitable model for this is
the Rasch (1960) model. However, unlike conventional psychometry, we modify
the Rasch model to suit the theory. If a person's ability is equal to the
text difficulty, we do not consider the person has successfully comprehended
a text passage when the response probability is only .5. A constant is
therefore introduced into the Rasch model such that when ability equals
difficulty, the response probability is .75. Note this modification does not
destroy the raw score sufficiency property (or any others) of the Rasch
model.
This modified Rasch model is Pr [Person v comprehends text passage i] =
exp(Bv - Di + k) / [1 + exp(Bv - Di + k)], where Bv is the ability of person v
and k is the aforementioned constant. Test scores enter the Lexile theory in
the estimation of person abilities. These come from tests consisting of
Lexile passage native items (text passages with a sentence "cloze"). Use of
the Rasch model puts the abilities on the same scale as the text difficulty
variable. Both then can use the same affine transformation to become
measurements in the Lexile scale.
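For anyone who wants to check the arithmetic: requiring a response probability of .75 when Bv = Di means exp(k) / [1 + exp(k)] = .75, which gives k = ln 3, roughly 1.0986. That value follows from the stated condition rather than from anything published in this post, so treat it as an inference. A minimal Python sketch, with a hypothetical function name:

```python
import math

# k solves exp(k) / (1 + exp(k)) = 0.75, i.e. exp(k) = 3.
K = math.log(3.0)

def p_comprehend(ability, difficulty, k=K):
    """Rasch response probability with the shift constant k.

    Equivalent to exp(z) / (1 + exp(z)) with z = ability - difficulty + k.
    """
    z = ability - difficulty + k
    return 1.0 / (1.0 + math.exp(-z))

print(p_comprehend(1.2, 1.2))         # ability equals difficulty, shifted model
print(p_comprehend(1.2, 1.2, k=0.0))  # same case under the ordinary Rasch model
```

With k = ln 3 the first call gives .75; with k = 0 the second gives the usual .5. The affine transformation to the Lexile scale is not sketched because its constants are not given here.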
Hence I consider your point:
"...there is no substantive difference between probabilistic (latent variable)
or deterministic (raw score) scale scores. Which, by default, means all of
the Metametrics Lexile and Quantile work could have been achieved using
conventional deterministic/actuarial score algorithms (now there's a
sobering thought).. "
to be incorrect. One cannot arrive at the measurement of text difficulty, as
theorized in the Lexile system, following conventional psychometric
P.S. One serious problem with connectionist models as theories of human
thought processes - no biological equivalent to the backpropagation
algorithm. Does this mean connectionism is merely another "unopened black
box" theory of psychology?
Crain, S. & Shankweiler, D. (1988). Syntactic complexity and reading
acquisition. In A. Davison and G.M. Green (Eds.), Linguistic complexity and
text comprehension: readability issues reconsidered. Hillsdale, N.J.:
Karabatsos, G. (2005). The exchangeable multinomial model as an approach to
testing deterministic axioms of choice and measurement. Journal of
Mathematical Psychology, 49(1), 51-69.
Krantz, D.H.; Luce, R.D.; Suppes, P. & Tversky, A. (1971). Foundations of
Measurement, Vol. I: Additive and polynomial representations. New York:
Krantz, D.H. & Tversky, A. (1971). Conjoint measurement analysis of
composition rules in psychology. Psychological Review, 78(2), 151-169.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment
tests. Copenhagen: Danish Institute for Educational Research.
Stenner, A.J., Smith, M. & Burdick, D. S. (1983) Toward a theory of
construct definition. Journal of Educational Measurement, 20, 305-315.
Andrew Kyngdon, PhD
Senior Research Scientist
1000 Park Forty Plaza Drive
Durham NC 27713 USA
Tel. 1 919 354 3473
Fax. 1 919 547 3401
From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf
Of renselange at earthlink.net
Sent: Tuesday, October 30, 2007 5:28 AM
To: Anthony James; rasch at acer.edu.au
Subject: Re: [Rasch] Fan
My take on this is "so what" as well. Getting "scores" is hardly the main
achievement of Rasch modeling. There simply is little appreciable difference
between raw score sums and the Rasch model, as well as estimates obtained
via one, two, or three-parameter logistic models.
There you have it in a nutshell, why many scientists (vs psychometricians)
concerned with explaining substantive phenomena, sometimes using
questionnaire data, are not concerned with itemmetric models.
Edumetrics is frankly "technician work"; the utility and benefits of
probabilistic modeling of item response data seem to be irrelevant to
explaining or predicting phenomenal outcomes more accurately. Whilst other
benefits might accrue to those primarily concerned with examinations,
licensing, or other such pragmatic ventures, as a scientist, you are
concerned with minimizing substantive error in any predictions you might
make on the basis of theory or even "dustbowl" empiricism.
As you state, Reese, there is no substantive difference between probabilistic
(latent variable) or deterministic (raw score) scale scores. Which, by
default, means all of the Metametrics Lexile and Quantile work could have
been achieved using conventional deterministic/actuarial score algorithms
(now there's a sobering thought)..
Whether or not IRT is a more "efficient" method of going about these
activities seems to be the issue - which is an important one nevertheless.
But what use is "efficiency" when the magnitudes obtained are equivalent to
just summing up item responses? Person reliability can easily be computed
without any fuss using conventional item difficulties and rank order
response discrepancy (between item difficulty ranking and observed item
responses). You don't need a "data model" for this simple heuristic.
Cross-validation sorts out replicability.
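One possible reading of the "rank order response discrepancy" heuristic is a count of Guttman errors: item pairs where a person passes the harder item but fails the easier one. The post does not spell out the actual computation, so the sketch below is an assumption, and the function and variable names are hypothetical.

```python
def guttman_errors(responses, p_values):
    """Count item pairs where a harder item is passed but an easier one failed.

    responses: 0/1 scores for one person, one per item.
    p_values:  classical item difficulties (proportion correct per item).
    """
    # Rank items from easiest (highest proportion correct) to hardest.
    order = sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)
    ranked = [responses[i] for i in order]
    errors = 0
    for easy in range(len(ranked)):
        for hard in range(easy + 1, len(ranked)):
            if ranked[easy] == 0 and ranked[hard] == 1:
                errors += 1
    return errors

p_values = [0.9, 0.7, 0.5, 0.3]
consistent = [1, 1, 0, 0]  # passes the easy items, fails the hard ones
aberrant = [0, 0, 1, 1]    # fails the easy items, passes the hard ones
print(guttman_errors(consistent, p_values))  # 0
print(guttman_errors(aberrant, p_values))    # 4
```

Items with tied p_values keep their input order here, so discrepancies between tied items are still counted; a real implementation would need a rule for ties.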
I'm not so sure IRT is any more efficient at all than a sensible and
intelligent approach to construct assessment, except perhaps where
edumetrics is concerned - but even here, where are the advantages to be
seen, and by whom? Are educational standards improving because of IRT
methods being applied?
As far as I can see, the only reason IRT came into existence was because
Rasch and others were fascinated with test theory, statistical data models,
and fundamental measurement, instead of the actual phenomena of interest,
and capturing these directly without the need for bucketloads of
assumptions or practically incomprehensible algebra and math.
Indeed, what is apparent (to me at least) is that the major source of
explanatory/predictive error is not in the "measurement" metrics per se, but
in the concepts we invoke to explain phenomena, and how we actually try to
measure or assess "magnitudes" on them.
And, I haven't forgotten my reply to Andrew or Moritz(!) - but I'm flat out
on an interrater discrepancy/reliability program right now - for the next 5
days or so - but I couldn't resist replying to this as I've seen the same
"even though there is no difference, this is still wonderful stuff"
statements in a recent paper comparing Ideal Point unfolding models,
dominance IRT, and simple raw score methods ...
Chernyshenko, O.S., Stark, S., Drasgow, F., & Roberts, B.W. (2007)
Constructing personality scales under the assumption of an Ideal Point
response process: toward increasing the flexibility of personality measures.
Psychological Assessment, 19, 1, 88-106.
See page 103, Col 2, 2nd para beginning "Note also that examination of
correlations .. " ... which contains the sentence ... "The use of ideal
point methods, however, is unlikely to yield increases in criterion-related
and then read the 1st para of the discussion ..
Their argument, as far as I can see, is more about the
technicalities/differential pragmatic benefits of test construction ...and
the consequences on particular measurement ranges within a scale.
But, if the criterion prediction remains the same - what is the achieved
real-world or scientific benefit?
I have the feeling they are discovering that the real problem is not with
the technology of assessment, but with the very nature of the constructs we
try to assess.
However, this is not a call to undo or dump IRT - as those like Metametrics,
ACER, and others who have generated impressive measurement and assessment
systems around IRT are testament to its functionality, utility, and
This is merely a statement which says that it is just another way of
"modeling data" and constructing test scores which has proved to have no
tangible advantages over any other "intelligent" method of analyzing data to
make important decisions.
If you think this is unfair, consider the development of the
"perceptron" (The perceptron is a type of artificial neural network invented
in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt ..
Wikipedia). In 50 years look at what this simple "algorithm" for a neural
net has spawned .. an entire new area of psychological science investigation
"connectionist cognitive psychology", a neurophysiological model for
developmental brain networks and brain function, artificial intelligence,
computational biology, and literally physical working models of the
development of human intelligence, stock market trading algorithms,
seriously accurate market research/brand analysis prediction models,
security camera person-feature detection, biometric sensors, handwriting,
and speech detection, etc, etc..
And what has IRT spawned in the same period? A few tests which are almost
indistinguishable from the old tests, hundreds of books and thousands of
papers on test theory and tiny one-shot applications for which CTT would
have done the same job just as well - and was probably doing so, and a few
major-league companies who sell the technology and some of its results to
If I am to do any work as a psychological scientist - I want to be the
inventor of a "perceptron", and not the inventor of yet another "test
theory" - which is why methods which predict or measure no better than
simple "back of the matchbox" algorithms or heuristics are not given the
"time of day" by those like myself who want to make big strides in our
So, in response to: "The main problem here is that non-IRT folks often do not
(or cannot) even conceive of the possibilities afforded by test / form
equating, item / person fit and bias testing. Hence, the advantages of all
this appear rather "academic" (read: useless)." ..
You now have a better idea why some like me really don't bother with IRT at
all anymore - it's virtually irrelevant to substantive progress in
psychological science. And, if you need to make decisions using test scores,
why bother when a simple sum score interpreted using representative
population norms and other relevant and somewhat more obvious item and scale
statistics does "the business" just as well as scores and indices associated
with the "latent variable" work which only a "psychometric wizard" can
What made the neural net so easy to "sell" was that it predicted valuable
and substantive outcomes for which no other model came close. That's why it
was grabbed with open arms by those "non-academics and academics" alike who
looked at the results and thought " I've gotta have this ". Sheesh, even
Shepard, Kruskal, Lingoes, and Guttman's non-metric Multidimensional Scaling
algorithms had a bigger impact in 10 years from their inception than IRT.
Until IRT, unfolding, or any test theory model can produce that "wow" effect
- they will remain "niche" activities for a minority of individuals in a
tiny sub-domain of the social and human sciences.
Regards ... Paul
Paul Barrett, Ph.D.
2622 East 21st Street | Tulsa, OK 74114
Chief Research Scientist
Office | 918.749.0632 Fax | 918.749.0635
pbarrett at hoganassessments.com