[Rasch] Not a Fan of Lexiles?

Andrew Kyngdon akyngdon at lexile.com
Wed Oct 31 07:53:32 EST 2007


Dear Paul,

 

A great firebrand post! One major quibble, however, with what you said.

 

In the Lexile system, measures of the difficulty of written prose text
passages are NOT obtained from test scores, ratings, simulations, or
IRT model analyses of such data.

 

The Lexile theory argues that the difficulty of written prose text is a
functional composition of two key cognitive variables - syntactic
complexity and semantic rarity. Syntactic complexity is the complexity
of the sentence / thought unit structure of a written prose text
passage. Semantic rarity reflects how familiar the semantic units
(words) comprising a prose text passage are.

 

Each is assessed using a proxy variable. For syntactic complexity, it is
the natural logarithm of the mean sentence length (LMSL). Why is this
proxy used? Research in cognition has found that it is a good proxy for
the demand written prose text places upon Short Term Memory (STM) (e.g.
Crain & Shankweiler, 1988). The longer a sentence is, the harder it is
to hold all of its information in STM. Thus text passages with longer
sentences are more difficult to comprehend.
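
To make the proxy concrete, here is a rough sketch in Python of how
LMSL could be computed for a passage. This is purely my own
illustration (the sentence segmentation is deliberately naive), not
MetaMetrics' operational text-processing code:

    import math
    import re

    def lmsl(passage):
        """Natural log of the mean sentence length (in words)."""
        # Naive segmentation on terminal punctuation only.
        sentences = [s for s in re.split(r'[.!?]+', passage) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        return math.log(sum(lengths) / len(lengths))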

 

For semantic rarity, the proxy variable is the mean natural logarithm of
word frequency (MLWF). Why is this proxy used? Stenner, Smith & Burdick
(1983) investigated over 50 variables, including part of speech, number
of letters, number of syllables, modal grade, word content
classification and various transformations of these variables. They
found MLWF was the best proxy for the likelihood a person has a word in
their mental lexicon. Moreover, reaction time and gaze duration studies
have found that people will look at rarer words for several hundred
milliseconds longer than commonly occurring words, suggesting rarer
words are harder to locate and retrieve from the mental lexicon. Thus
text passages containing words which are infrequently used will be more
difficult to comprehend.
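
Again purely as an illustration, a rough Python sketch of MLWF. The
corpus_freq table (word -> frequency in some reference corpus) and the
floor value for unseen words are my own assumptions here, not a
description of the operational Lexile corpus:

    import math
    import re

    def mlwf(passage, corpus_freq):
        """Mean of the natural log frequencies of a passage's words."""
        words = re.findall(r"[a-z']+", passage.lower())
        # Words absent from the reference corpus get a small floor
        # frequency (0.5) so the log is defined - an assumption of mine.
        logs = [math.log(corpus_freq.get(w, 0.5)) for w in words]
        return sum(logs) / len(logs)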

 

The difficulty of a written prose text passage i, Di, is a functional
composition of these proxies such that: Di = (a x LMSL) - (b x MLWF) -
c. This expression, referred to as a construct specification equation,
is a 5 variable simple polynomial conjoint structure (Krantz, Luce,
Suppes & Tversky, 1971) where a, b and c are real valued constants. The
actual values of these constants are proprietary; however, they result
in Di being expressed in logit units.
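
Putting the two proxies together, the specification equation itself is
simple to express; the constants below are placeholders only, since the
operational values are proprietary:

    def text_difficulty(lmsl_value, mlwf_value, a=1.0, b=1.0, c=0.0):
        """Construct specification equation Di = a*LMSL - b*MLWF - c,
        yielding a theoretical text difficulty in logit units. The
        default a, b, c here are NOT MetaMetrics' values."""
        return a * lmsl_value - b * mlwf_value - c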

 

Notice the complete absence of "run of the mill" psychometry - test
scores, IRT model estimates, fit statistics, ratings, DIF, reliability
coefficients and so on - in the measurement of the difficulty of written
prose text passages.

 

The biggest weakness of the theory is that it assumes something stronger
than merely monotone relationships between the proxy variables and the
relevant psychological attributes (Krantz & Tversky, 1971). The theory
of conjoint measurement (Krantz et al., 1971) can be used to address
this issue. I have done this and found the axioms to be satisfied, using
Karabatsos's (2005) methods for probabilistic tests of deterministic
axioms.
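
For anyone curious what such a test involves at its simplest, the
deterministic core of double cancellation on a 3 x 3 arrangement of
difficulties looks like this; Karabatsos's probabilistic machinery,
which is what I actually used, is considerably more involved and is not
reproduced here:

    def double_cancellation(D):
        """D is a 3x3 matrix of difficulties, rows ordered by increasing
        syntactic complexity and columns by increasing semantic rarity.
        Checks one instance of the double cancellation axiom required
        for an additive representation."""
        if D[1][0] >= D[0][1] and D[2][1] >= D[1][2]:
            return D[2][0] >= D[0][2]
        return True  # antecedent fails; this instance is vacuously satisfied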

 

The Lexile theory argues that the probability of comprehending a written
prose text passage is a non-interactive, additive function of the text's
difficulty and the person's reading ability. A suitable model for this
is the Rasch (1960) model. However, unlike conventional psychometry, we
modify the Rasch model to suit the theory. If a person's ability equals
the text difficulty, we do not regard a response probability of only .5
as successful comprehension of the passage. A constant is therefore
introduced into the Rasch model such that when ability equals
difficulty, the response probability is .75. Note this modification does
not destroy the raw score sufficiency property (or any others) of the
Rasch model.
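
(Incidentally, given the logistic form shown below, setting ability
equal to difficulty makes the success probability exp(k) / (1 + exp(k));
setting that equal to .75 and solving gives exp(k) = 3, i.e. k = ln 3,
roughly 1.1 logits.)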

 

This modified Rasch model is Pr [Person v comprehends text passage i] =
exp(Bv - Di + k) / (1 + exp(Bv - Di + k)), where Bv is the ability of
person v and k is the aforementioned constant. Test scores enter the
Lexile theory in the estimation of person abilities. These come from
tests consisting of native Lexile items (text passages with a sentence
cloze). Use of the Rasch model puts the abilities on the same scale as
the text difficulty variable. Both can then use the same affine
transformation to become measurements on the Lexile scale.
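
A compact sketch of the model and the final scaling step follows; k is
implied by the .75 criterion above, but the slope and intercept of the
affine transformation shown are placeholders, not the operational
Lexile scaling constants:

    import math

    K = math.log(3)  # from the .75 criterion: exp(k)/(1+exp(k)) = .75

    def p_comprehend(ability, difficulty):
        """Modified Rasch model: Pr[person v comprehends passage i]."""
        z = ability - difficulty + K
        return math.exp(z) / (1 + math.exp(z))

    def to_lexile(logit, slope=100.0, intercept=500.0):
        """Affine map from logits to Lexile units; the slope and
        intercept here are placeholders, not the operational values."""
        return slope * logit + intercept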

 

Hence I consider your point:

 

"...there is no substantive difference between probabilistic (latent
variable)  or deterministic (raw score) scale scores. Which, by default,
means all of the Metametrics Lexile and Quantile work could have been
achieved using conventional deterministic/actuarial score algorithms
(now there's a sobering thought).. "

 

to be incorrect. One cannot arrive at the measurement of text
difficulty, as theorized in the Lexile system, following conventional
psychometric practice.

 

Cheers,

 

Andrew

 

P.S. One serious problem with connectionist models as theories of human
thought processes is that there is no biological equivalent of the
backpropagation algorithm. Does this mean connectionism is merely
another "unopened black box" theory of psychology?

 

Refs:

 

Crain, S. & Shankweiler, D. (1988). Syntactic complexity and reading
acquisition. In A. Davison and G.M. Green (Eds.), Linguistic complexity
and text comprehension: readability issues reconsidered. Hillsdale,
N.J.: Erlbaum.

 

Karabatsos, G. (2005). The exchangeable multinomial model as an approach
to testing deterministic axioms of choice and measurement. Journal of
Mathematical Psychology, 49(1), 51-69.

 

Krantz, D.H., Luce, R.D., Suppes, P. & Tversky, A. (1971). Foundations
of Measurement, Vol. I: Additive and polynomial representations. New
York: Academic Press.

 

Krantz, D.H. & Tversky, A. (1971). Conjoint measurement analysis of
composition rules in psychology. Psychological Review, 78(2), 151-169.

 

Rasch, G. (1960).  Probabilistic models for some intelligence and
attainment tests. Copenhagen: Danish Institute for Educational Research.

 

Stenner, A.J., Smith, M. & Burdick, D.S. (1983). Toward a theory of
construct definition. Journal of Educational Measurement, 20, 305-315.

 

Andrew Kyngdon, PhD

Senior Research Scientist

MetaMetrics, Inc.

1000 Park Forty Plaza Drive

Durham NC 27713 USA

Tel. 1 919 354 3473

Fax. 1 919 547 3401

 

 

	
________________________________


	From: rasch-bounces at acer.edu.au
[mailto:rasch-bounces at acer.edu.au] On Behalf Of renselange at earthlink.net
	Sent: Tuesday, October 30, 2007 5:28 AM
	To: Anthony James; rasch at acer.edu.au
	Subject: Re: [Rasch] Fan

	My take on this is "so what" as well. Getting "scores" is hardly
the main achievement of Rasch modeling. There simply is little
appreciable difference between raw score sums and the Rasch model, as
well as estimates obtained via one, two, or three-parameter logistic
models. 

Hello Reese

 

There you have it in a nutshell: why many scientists (as opposed to
psychometricians) concerned with explaining substantive phenomena,
sometimes using questionnaire data, are not concerned with itemmetric
models.

 

Edumetrics is frankly "technician work"; the utility and benefits of
probabilistic modeling of item response data seem to be irrelevant to
explaining or predicting phenomenal outcomes more accurately. Whilst
other benefits might accrue to those primarily concerned with
examinations, licensing, or other such pragmatic ventures, as a
scientist you are concerned with minimizing substantive error in any
predictions you might make on the basis of theory or even "dustbowl"
empiricism.

 

As you state Reese, there is no substantive difference between
probabilistic (latent variable)  or deterministic (raw score) scale
scores. Which, by default, means all of the Metametrics Lexile and
Quantile work could have been achieved using conventional
deterministic/actuarial score algorithms (now there's a sobering
thought).. 

 

Whether or not IRT is a more "efficient" method of going about these
activities seems to be the issue - which is an important one
nevertheless.

 

But what use is "efficiency" when the magnitudes obtained are equivalent
to just summing up item responses? Person reliability can easily be
computed without any fuss using conventional item difficulties and rank
order response discrepancy (between item difficulty ranking and observed
item responses). You don't need a "data model" for this simple
heuristic. Cross-validation sorts out replicability.
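
To make that concrete, here is one rough sketch of the sort of
heuristic I have in mind (my own illustration of the idea, not a
validated index): order the items by their conventional difficulties
and count how often a person fails an easier item while passing a
harder one.

    def person_consistency(responses, easier_first):
        """responses: one person's 0/1 item scores. easier_first: item
        indices ordered from easiest to hardest (by conventional
        p-values). Returns 1 minus the proportion of Guttman-style
        discrepancies out of the maximum possible for that raw score."""
        ordered = [responses[i] for i in easier_first]
        n = len(ordered)
        errors = sum(1 for e in range(n) for h in range(e + 1, n)
                     if ordered[e] == 0 and ordered[h] == 1)
        r = sum(ordered)
        max_errors = r * (n - r)
        return 1.0 if max_errors == 0 else 1.0 - errors / max_errors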

 

I'm not so sure IRT is any more efficient at all than a sensible and
intelligent approach to construct assessment, except perhaps where
edumetrics is concerned - but even here, where are the advantages to be
seen, and by whom? Are educational standards improving because of IRT
methods being applied?

 

As far as I can see, the only reason IRT came into existence was because
Rasch and others were fascinated with test theory, statistical data
models, and fundamental measurement, instead of the actual phenomena of
interest, and capturing these directly without the need for bucketloads
of  assumptions or practically incomprehensible algebra and math.

 

Indeed, what is apparent (to me at least) is that the major source of
explanatory/predictive error is not in the "measurement" metrics per se,
but in the concepts we invoke to explain phenomena, and how we actually
try to measure or assess "magnitudes" on them.

 

And, I haven't forgotten my reply to Andrew or Moritz(!) - but I'm flat
out on an interrater discrepancy/reliability program right now - for the
next 5 days or so - but I couldn't resist replying to this as I've seen
the same "even though there is no difference, this is still wonderful
stuff" statements in a recent paper comparing Ideal Point unfolding
models, dominance IRT, and simple raw score methods ...

 

Chernyshenko, O.S., Stark, S., Drasgow, F., & Roberts, B.W. (2007).
Constructing personality scales under the assumption of an Ideal Point
response process: toward increasing the flexibility of personality
measures. Psychological Assessment, 19(1), 88-106.

 

See page 103, Col 2, 2nd para beginning "Note also that examination of
correlations .. " ... which contains the sentence ... "The use of ideal
point methods, however, is unlikely to yield increases in
criterion-related validities"

 

and then read the 1st para of the discussion .. 

 

Their argument, as far as I can see, is more about the
technicalities/differential pragmatic benefits of test construction
... and the consequences for particular measurement ranges within a scale.


 

But, if the criterion prediction remains the same - what is the achieved
real-world or scientific benefit? 

 

I have the feeling they are discovering that the real problem is not
with the technology of assessment, but with the very nature of the
constructs we try to assess.

 

However, this is not a call to undo or dump IRT - those like
MetaMetrics, ACER, and others who have generated impressive measurement
and assessment systems around IRT are testament to its functionality,
utility, and profitability.

 

This is merely a statement which says that it is just another way of
"modeling data" and constructing test scores which has proved to have no
tangible advantages over any other "intelligent" method of analyzing
data to make important decisions.

 

If you think this is unfair, consider the development of the
"perceptron" (The perceptron is a type of artificial neural network
invented in 1957 at the Cornell Aeronautical Laboratory by Frank
Rosenblatt .. Wikipedia). In 50 years look at what this simple
"algorithm" for a neural net has spawned .. an entire new area of
psychological science investigation "connectionist cognitive
psychology", a neurophysiological model for developmental brain networks
and brain function, artificial intelligence, computational biology,
literally physical working models of the development of human
intelligence, stock market trading algorithms, seriously accurate market
research/brand analysis prediction models, security camera
person-feature detection, biometric sensors, handwriting and speech
detection, etc, etc..

 

And what has IRT spawned in the same period? A few tests which are
almost indistinguishable from the old tests, hundreds of books and
thousands of papers on test theory, tiny one-shot applications for
which CTT would have done the same job just as well - and was probably
doing so - and a few major-league companies who sell the technology and
some of its results to others.

 

If I am to do any work as a psychological scientist - I want to be the
inventor of  a "perceptron", and not the inventor of yet another "test
theory" - which is why methods which predict or measure no better than
simple "back of the matchbox" algorithms or heuristics are not given the
"time of day" by those like myself who want to make big strides in our
science. 

 

So, in response to: "The main problem here is that non-IRT folks often
do not (or cannot) even conceive of the possibilities afforded by test /
form equating, item / person fit and bias testing. Hence, the advantages
of all this appear rather "academic" (read: useless)." ..

 

You now have a better idea why some like me really don't bother with IRT
at all anymore - it's virtually irrelevant to substantive progress in
psychological science. And, if you need to make decisions using test
scores, why bother when a simple sum score, interpreted using
representative population norms and other relevant and somewhat more
obvious item and scale statistics, does "the business" just as well as
scores and indices associated with the "latent variable" work which only
a "psychometric wizard" can seemingly convey!

 

What made the neural net so easy to "sell" was that it predicted
valuable and substantive outcomes for which no other model came close.
That's why it was grabbed with open arms by non-academics and academics
alike who looked at the results and thought "I've gotta have this".
Sheesh, even Shepard, Kruskal, Lingoes, and Guttman's non-metric
Multidimensional Scaling algorithms had a bigger impact within 10 years
of their inception than IRT has had.

 

Until IRT, unfolding, or any test theory model can produce that "wow"
effect - they will remain "niche" activities for a minority of
individuals in a tiny sub-domain of the social and human sciences.

 

Regards ... Paul

 

 

 

 

Paul Barrett, Ph.D.

2622 East 21st Street | Tulsa, OK  74114

Chief Research Scientist

Office | 918.749.0632  Fax | 918.749.0635

pbarrett at hoganassessments.com

      

hoganassessments.com <http://www.hoganassessments.com/> 

 

 

 

 

 
