[Rasch] Misfitting Individuals

Tue May 1 14:12:34 EST 2007

```
There is a measure of misfit that corrects for n: it is called a
significance test, and significance tests are included in all of the
software packages.

Every statistic estimated from a sample of data has a sampling
distribution and the variance of that sampling distribution is typically
a function of the sample size.  MNSQ fit statistics are no different.
The variance of the sampling distribution for person MNSQ is a function
of the number of items and the variance of the sampling distribution for
item MNSQ is a function of the number of cases.  It follows that it is
inappropriate to set arbitrary limits on the range of the MNSQ, since,
regardless of the "true" value, the spread of the observed values will
depend on the sample size.  Smaller samples will give rise to a wider
distribution than will larger samples.  Since there are typically far
fewer items than persons, the spread of the item MNSQs (each computed
over many persons) will be less than the spread of the person MNSQs
(each computed over few items), regardless of the "true" fit.

It is a significance test (or confidence interval) that adjusts for this
and tells you whether the observed figure could reasonably be due to
chance or not.

But of course, now we have a problem.  Our null hypothesis is that the
data fit the model-- WHICH IS NEVER TRUE. It follows that a
sufficiently large sample will always lead to rejection of the model.
As an aside, my standard response to anybody who tells me that their
data scale well with the model or that their data fit the model, is to
tell them that their sample size must have been too small.

So what does one do in practice?

1. identify those items (or cases) that have statistically significant
misfit.
2. order those items by the magnitude of the MNSQ fit statistics
3. examine each item in turn from highest to lowest MNSQ and diagnose
the problem and determine remedial action
4. stop when you get sick of it (in the case of person fit), or, in the
case of item fit, stop when you reach the minimum number of items you
must have in your test. This is the pragmatic approach.

________________________________

From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On
Behalf Of Stone, Gregory
Sent: Tuesday, 1 May 2007 8:05 AM
To: Michael Lamport Commons; Twing, Jon; Petroski, Greg
Cc: rasch
Subject: RE: [Rasch] Misfitting Individuals

An excellent idea.  In fact, it would be an excellent addition to
Winsteps, Facets, RUMM, etc. if that were built in to assist in
evaluating performance.  We should all remember however that measurement
and statistics are not equivalent, just as statistics make use of
mathematics but not all mathematical principles work in calculating
statistics.  Regardless of the precision of our measures, it is our
evaluation that brings fruitful meanings to our puzzles.  Whether
standard setting or survey analysis, we cannot give over our human
insight to numbers.  They are our slaves, as it were, not the reverse,
as so often seems to occur.

Cheers.

Gregory E. Stone, Ph.D., M.A.

Assistant Professor of Research and Measurement
The Judith Herb College of Education
The University of Toledo, Mailstop #914
Toledo, OH 43606   419-530-7224

Editorial Board, Journal of Applied Measurement     www.jampress.org

Board of Directors, American Board for Certification of Teacher
Excellence     www.abcte.org

For information about the Research and Measurement Programs at The
University of Toledo and careers in psychometrics, statistics and
evaluation, email gregory.stone at utoledo.edu.

________________________________

From: Michael Lamport Commons [mailto:commons at tiac.net]
Sent: Sun 4/29/2007 1:23 PM
To: Stone, Gregory; 'Twing, Jon'; 'Petroski, Greg'
Cc: rasch at acer.edu.au
Subject: RE: [Rasch] Misfitting Individuals

So make up a measure of misfit that corrects for n.

My Best,

Michael Lamport Commons, Ph.D.
Assistant Clinical Professor
Program in Psychiatry and the Law
Department of Psychiatry
Harvard Medical School
Beth Israel Deaconess Medical Center
234 Huron Avenue
Cambridge, MA 02138-1328

Telephone (617) 497-5270
Facsimile (617) 491-5270
Commons at tiac.net
http://dareassociation.org/

________________________________

From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On
Behalf Of Stone, Gregory
Sent: Sunday, April 29, 2007 8:26 PM
To: Twing, Jon; Petroski, Greg
Cc: rasch at acer.edu.au
Subject: RE: [Rasch] Misfitting Individuals

The one caveat in this description is the "rules of thumb".  As noted,
they are as variable as the citations they are found among.  Having
used MNSQ since I began, I was recently made amply aware of the
problems associated with them.  Rules of thumb are OK for a few hundred,
but become less than useful as the sample size increases.  Indeed, the
MNSQ values tend towards 1.0.

The example I spoke of was a major statewide testing program I reviewed.
Tens of thousands took the examination, yet the vendor chose the
relatively arbitrary path of the "rule of thumb" when determining
whether items misfit.  Of course they didn't.  The larger the sample,
the more the distribution regresses to the mean.  In short, test 10,000
people and all the items look great.  Use 10,000 items to score a group
of people and all the people look perfect as well.  Since both the MNSQ
and Z-STD are sample dependent, as Mike Linacre notes in the
instructions with Winsteps, we need to use them wisely.  Conversely, as
the samples increase, items tend to demonstrate misfit (>2.0) more often
when using Z-STD.

So the question becomes: do you want to risk overlooking dysfunctional
items (or persons) by using a predetermined MNSQ range that does not
reflect the sample characteristics, or would you rather overestimate
the number of misfits?  It is a Type I/II error question really.
Personally, I tend to err on the side of caution.  If you do wish to use
MNSQ then it is more reasonable to calculate the precise error for the
sample being used.  For my 10,000 people, for instance, the range of
good fit may look more like .96-1.04 - a much different
kettle of fish.

Ultimately, it is an evaluative question with statistical aids
(including the unmentioned Pt. Biserial) to assist us.  There are no
purely algorithmic solutions that will take our judgment out of the mix.

Good luck!

Gregory E. Stone, Ph.D., M.A.

________________________________

From: rasch-bounces at acer.edu.au on behalf of Twing, Jon
Sent: Sat 4/28/2007 2:24 PM
To: Petroski, Greg
Cc: rasch at acer.edu.au
Subject: RE: [Rasch] Misfitting Individuals

Greg:

This is often more art than science.  Here is what we sometimes do:

1.)     Use Rasch Person Fit to identify "anomalies" in the testing
experience (this could be pure guessing, cheating or other unusual
student engagements).

2.)     Since most of our work requires a student score, we will score
them but we might choose to drop them from the calibration.

3.)     In the "old days" we might have included an asterisk indicating
a peculiar response string, but we have not done this in the last 10
years or so.

4.)     Typically we use Mean Square Fit, INFIT and OUTFIT when
diagnosing person anomalies.  We typically use the values dictated for
items and apply them to persons.

5.)     Below are the criteria I have collected over the years.

Item INFIT and OUTFIT should be between 0.60 and 1.40 (Bond & Fox, 2001;
Linacre & Wright, 1999)

Item INFIT between 0.70 and 1.30 (Bode, Heinemann, & Semik, 2000; Bogner,
Corrigan, Bode & Heinemann, 2000).

Mean-square fit statistics are defined such that the model-specified
uniform value of randomness is 1.0.  Values greater than 1.5 (more than
50% unexplained randomness) are problematic. (Wright and Panchapakesan,
1969; Linacre, 1999).

Hope this helps.  Good luck.

-Jon

**************************************************************

Jon S. Twing, Ph.D.

Executive Vice President, Test & Measurement Services

Pearson Educational Measurement

2510 N. Dodge Street, P.O. Box 30, Mailstop 165

Iowa City, Iowa  52245-9945

Phone: 319-339-6407

Fax: 319-339-6477

Cell: 319-331-6547

Jon.S.Twing at Pearson.com

http://www.pearsonsolutions.com/testmeasure/index.htm

**************************************************************

________________________________

From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On
Behalf Of Petroski, Greg
Sent: Friday, April 27, 2007 1:29 PM
To: rasch at acer.edu.au
Subject: [Rasch] Misfitting Individuals

I have a few questions centering on person-fit.  The need to understand
the cause of aberrant response patterns is obvious. But in applications,
i.e. not in the test development phase, what does one do with misfitting
person-responses?

Score them anyway?

Exclude them from the reporting?  This could be a very unpopular
solution in some applications.

Are rules of thumb for person INFIT and OUTFIT the same as when judging
item fit?  In which case one might not report scores for individuals
with INFIT or OUTFIT exceeding certain limits.  Is this done?

Gregory F. Petroski, Ph.D.

Dept. of Health Management and Informatics &
Office of Medical Research/Biostatistics
137 Hadley Hall (DC 018)
University of Missouri - Columbia,
Columbia, Mo.  65212

```
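
The point made in the thread, that the spread of MNSQ values depends on the sample size even when the data fit the model, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method of any of the packages mentioned: the function names are hypothetical, the data are generated from a dichotomous Rasch model with one item of difficulty 0 and abilities drawn from N(0, 1), and the outfit mean-square is computed from the true parameters rather than estimated ones. The `review_order` helper sketches steps 1 and 2 of the pragmatic procedure, assuming each item is a dict with `mnsq` and `zstd` keys and that a cut of 2.0 on the standardized statistic marks significance.

```python
import math
import random
import statistics


def outfit_mnsq(n_persons, difficulty, rng):
    """Outfit mean-square for one simulated item, using the TRUE parameters
    (a simplification: real software works from estimated measures)."""
    total = 0.0
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)                         # simulated person ability
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))   # Rasch success probability
        x = 1 if rng.random() < p else 0                    # simulated response
        total += (x - p) ** 2 / (p * (1.0 - p))             # squared standardized residual
    return total / n_persons


def mnsq_spread(n_persons, replications=300, seed=1):
    """Mean, SD and approximate central 95% range of the outfit MNSQ
    across replications of data that fit the model."""
    rng = random.Random(seed)
    values = sorted(outfit_mnsq(n_persons, 0.0, rng) for _ in range(replications))
    lo = values[int(0.025 * replications)]
    hi = values[int(0.975 * replications) - 1]
    return statistics.mean(values), statistics.stdev(values), (lo, hi)


def review_order(items, z_cut=2.0):
    """Keep items whose standardized fit exceeds the cut in absolute value,
    ordered by MNSQ from highest to lowest."""
    flagged = [it for it in items if abs(it["zstd"]) > z_cut]
    return sorted(flagged, key=lambda it: it["mnsq"], reverse=True)


if __name__ == "__main__":
    for n in (50, 500, 5000):
        mean, sd, (lo, hi) = mnsq_spread(n)
        print(f"n={n:5d}  mean={mean:.3f}  sd={sd:.3f}  range=({lo:.2f}, {hi:.2f})")
```

Because the MNSQ here is an average of independent terms, its standard deviation shrinks roughly in proportion to 1/sqrt(n), so a fixed range such as 0.7 to 1.3 is far looser relative to the sampling distribution for thousands of cases than for a few dozen.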