[Rasch] Test Translation & DIF
Matt Schulz
mschulz at pacificmetrics.com
Fri Jan 16 07:58:38 EST 2009
Here is a discussion I wrote up for a study of effects of translation and
culture on an instrument called the Goal Instability Scale (GIS). It
addresses many issues including the role of DIF. It leads to statements of
hypotheses tied directly to Rasch statistics. This discussion was just a
section of the paper, prior to the methods section. The complete study is:
Casillas, A., Schulz, E. M., Robbins, S., Santos, P. J., & Lee, R. (2006).
Exploring the meaning of motivation across cultures: IRT analysis of the
goal instability scale. Journal of Career Assessment. 14 (2), 1-18.
---------------------------------------------------------
The purpose of this study is to further compare the internal meaning
of English and Portuguese versions of the GIS. Comparisons of internal
meaning are generally based on criteria for measurement invariance in the
literature. Measurement invariance means that items function the same way
across groups or cultures. Criteria for measurement invariance were first
formulated from the perspective of exploratory factor analysis (Hui &
Triandis, 1985). Later, criteria were formulated within the framework of
structural equation modeling (Steenkamp & Baumgartner, 1998). More
recently, criteria within the frameworks of item response theory (Reise,
Widaman, & Pugh, 1993; Raju, Laffitte, & Byrne, 2002; Gerber et al., 2000)
and latent class analysis (Eid, Langeheine, & Diener, 2003) have been put
forward. It is beyond the scope of this study to compare and contrast the
criteria and relative strengths of these various frameworks. Articles by
Reise et al. (1993), Raju et al. (2002), and Eid et al. (2003) include
comparisons of approaches. There seems to be general agreement in this
field that the forms and stringency of measurement invariance with which one
needs to be concerned depend on practice and the goals of the study (e.g.,
Steenkamp & Baumgartner, 1998).
The following section presents the criteria for measurement invariance
used in this study. However, before presenting these criteria, it is
important to note how certain assumptions led us to use an IRT framework and
a particular IRT model for this comparison. We believe it is important to
use a framework that represents key assumptions in the scoring and intended
use of the GIS, and to use a model that allows the assumptions to be
evaluated. First, the practice of obtaining only one measure from the
GIS (goal instability) suggests the assumption of unidimensionality. It is
therefore reasonable to use a unidimensional model to evaluate measurement
invariance. Specifically, we are interested in the fit of the GIS data to a
unidimensional model because that is how the GIS data are treated.
Second, we assume that differences among the GIS items, in terms of
their endorsability by students, define a progression, or descent, into goal
instability that is shared by most persons. This assumption leads first to
the use of an item response theory (IRT) model, and second to the use of
"person fit" statistics that indicate whether a given person's responses to
the GIS items are consistent with the progression evidenced by the
arrangement of items on the latent IRT scale. IRT models generally locate
items on a latent scale that also represents measures of the trait. The
production of person-fit statistics is associated primarily with a subset of
measurement theory and applications within IRT where the arrangement of
items on the latent scale is assumed to have meaning for most, if not all,
individuals. For example, items located at one end of the goal instability
scale may show how any person begins the descent into goal instability.
Items located at the other end of the scale may show the final or most
advanced stages of goal instability in any person. These kinds of
interpretations have implications for how goal instability can be addressed
through counseling or more general preventative interventions.
Third, we assume that the GIS items contribute equally to the
measurement of goal instability. This assumption is implicit in the fact
that the unweighted total score across GIS items is taken as the measure of
goal instability. With this assumption, it is important to use an IRT model
that explicitly incorporates the assumption of equal weighting and to
evaluate the fit of the data to this model. In a structural modeling
framework, one would evaluate the fit of a model in which equal weights were
specified for the items. In an IRT framework, one evaluates the fit of a
model in which the slope parameter in the model is assumed to be a constant
(e.g., 1.0) for all items. It will be seen in the following section that
the use of a model with constant slope (i.e., with no slope parameter) has
certain advantages for assessing other facets of measure invariance as well.
The Rating Scale Model and Criteria for Measurement Invariance
The Rating Scale Model (Andrich, 1978) is a unidimensional item
response theory model for data where all items share a common set of ordered
response categories, such as exists with a Likert scale. A formulation of
the model and an interpretation of its parameters with respect to GIS data,
where items are scored 1 (for strongly agree) to 6 (for strongly disagree)
is:
\ln\left( P_{nij} / P_{nij-1} \right) = b_n - (d_i + t_j), \quad j = 1, 2, \ldots, 4 \qquad (1)
where
tj is a category threshold parameter. A threshold parameter represents the
relative difficulty of choosing category j rather than category j-1 in
response to any item,
Pnij is the probability that person n surmounts exactly j thresholds
on statement i,
Pnij-1 is the probability that person n surmounts exactly j-1
thresholds on statement i,
bn is the goal instability of person n, and
di is the location, or calibration, of item i on the measurement
scale.
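
As an illustration only, the log-odds form of the model can be turned into
category probabilities with a few lines of Python. This is a minimal sketch,
not the estimation code used in the study. The function name, the 0-based
category coding (number of thresholds surmounted), and the parameter values
are hypothetical choices made for the example.

```python
import math

def rsm_category_probs(b, d, thresholds):
    """Probabilities of surmounting exactly 0..m thresholds on one item.

    b: person measure (bn), d: item calibration (di),
    thresholds: list of category thresholds t1..tm, all in logits.
    """
    # The log numerator for category j is the sum of (b - d - t_k) for k <= j;
    # category 0 has a log numerator of 0.
    log_num = [0.0]
    for t in thresholds:
        log_num.append(log_num[-1] + (b - d - t))
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(log_num)
    num = [math.exp(v - peak) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

# Hypothetical values chosen only to illustrate the calculation.
example_thresholds = [-1.0, 0.0, 1.0]
probs = rsm_category_probs(b=0.5, d=0.0, thresholds=example_thresholds)
```

The log of the ratio of adjacent probabilities recovers bn - (di + tj), which
is the additive structure the following paragraphs rely on.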
An important feature of the Rating Scale Model in the present context
is that it does not include a slope parameter. This is an important
difference from the graded response model (Samejima, 1969), which is often
applied to Likert data, and many other IRT models. The slope parameter is
so named because it allows the slope of the item characteristic curve (ICC)
to vary across items. The ICC is the trace line of the expected score on an
item as a function of the trait value, or b. The slope parameter
essentially multiplies the additive combination of other parameters in the
model. For example, ai(bn - (di +tj)) represents the addition of a slope
parameter, ai, to the Rating Scale Model. Models with slope parameters
tend to fit data better, and may be useful in exploratory work, but they do
not correspond to practice when the unweighted total score across items is
used to estimate the underlying trait.
Measure invariance in IRT ultimately implies equivalence of item
parameters across groups (Reise, et al., 1993; Raju, et al., 2002).
Parameters in IRT models typically represent distinct, substantive issues
with regard to measure invariance, so it would be useful to compare model
parameters directly across groups. For example, differences in the Rating
Scale Model threshold parameters, tj, across groups reflect group
differences in how categories of the Likert scale are interpreted and used,
as opposed to differences in how the items are interpreted and used. Group
differences in how items are interpreted and used are represented by
differences between paired item parameters, di, across groups. Thus, in the
Rating Scale Model, the meaning of differences in the tj and di across
groups is separate and clear. This is a consequence of the additive
combination of all model parameters on the right side of the model
formulation (1).
More generally, measure invariance in IRT is not assessed by directly
comparing item parameters. See Raju et al. (2002) for a description of
measure invariance with regard to a more general class of IRT models. IRT
models frequently include parameters, such as a slope or a pseudo-guessing
parameter, such that the combination of parameters in the model is not
completely linear, or additive. Without additivity, the meaning of model
parameters is not separate and clear. For example, differences in item
location parameters (e.g., di) across groups cannot be evaluated
independently of differences in a slope parameter.
It is interesting to note that IRT-based studies of measure invariance
generally use the same methods employed in the study of differential item
functioning, or DIF (Reise, et al., 1993; Raju, et al., 2002). One of the
most popular and powerful methods of assessing DIF, the Mantel-Haenszel
method, is not directly based on IRT, but is mathematically equivalent to
comparing item difficulty parameters in the one item-parameter (a location
or difficulty parameter) Rasch model for dichotomously scored (0 or 1) data
under certain conditions, including the fit of data to the model (Holland &
Thayer, 1988). The Rating Scale Model is a member of the Rasch family of
measurement models (Rasch, 1960; Wright and Masters, 1982).
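
For readers unfamiliar with the Mantel-Haenszel method mentioned above, the
following is a minimal Python sketch of the common odds ratio for one
dichotomous item. It is not part of the GIS study, which used polytomous
data. The function name and the counts are hypothetical. Each stratum is a
group of persons matched on total score.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    strata: iterable of (a, b, c, d) counts, one tuple per matched score level:
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts: both groups behave identically at every score level.
no_dif = mantel_haenszel_odds_ratio([(20, 10, 20, 10), (30, 5, 30, 5)])

# Hypothetical counts: the item is easier for the reference group.
with_dif = mantel_haenszel_odds_ratio([(20, 10, 10, 20)])
log_odds_dif = math.log(with_dif)
```

An odds ratio of 1 (log odds ratio of 0) indicates no DIF. When the data fit
the dichotomous Rasch model, the log odds ratio plays roughly the role of the
difference in item difficulty between the two groups, in logits.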
Fit statistics commonly used in conjunction with the Rating Scale
Model include a weighted and unweighted mean squared residual, which are
referred to as infit and outfit, respectively (Wright & Masters, 1982). The
fit statistics are computed for each person and item (Wright & Masters,
1982). Only the outfit statistic will be used in this study. The outfit
statistic is comparable to a chi square statistic divided by its degrees of
freedom. It has an expected value of 1.00 under the hypothesis that data
fit the model. Fit statistics greater than 1.0 indicate response patterns
having more noise than expected according to the probabilities specified by
the model (e.g., Equation 1). Fit statistics outside the range of 0.6 to
1.5 may indicate practically significant overfit (less than 0.6) or underfit
(greater than 1.5).
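
The outfit calculation described above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the software used in
the study: the function names are hypothetical, responses are coded as the
number of thresholds surmounted (0, 1, 2, ...) rather than 1 to 6, and the
person measure and item parameters are taken as already estimated.

```python
import math

def category_probs(b, d, thresholds):
    """Rating Scale Model probabilities for categories 0..m."""
    log_num = [0.0]
    for t in thresholds:
        log_num.append(log_num[-1] + (b - d - t))
    peak = max(log_num)
    num = [math.exp(v - peak) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

def person_outfit(responses, b, item_calibrations, thresholds):
    """Unweighted mean of squared standardized residuals for one person."""
    z_squared = []
    for x, d in zip(responses, item_calibrations):
        p = category_probs(b, d, thresholds)
        expected = sum(k * pk for k, pk in enumerate(p))
        variance = sum((k - expected) ** 2 * pk for k, pk in enumerate(p))
        z_squared.append((x - expected) ** 2 / variance)
    return sum(z_squared) / len(z_squared)

def classify_fit(outfit, low=0.6, high=1.5):
    """Apply the 0.6 to 1.5 range given in the text."""
    if outfit > high:
        return "underfit"
    if outfit < low:
        return "overfit"
    return "acceptable"

# Hypothetical two-item, two-category case: each response is equally likely,
# so every squared standardized residual equals 1.
example_outfit = person_outfit([0, 1], b=0.0, item_calibrations=[0.0, 0.0],
                               thresholds=[0.0])
```

Item outfit is computed the same way, averaging over persons for a fixed
item instead of over items for a fixed person.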
For some practical purposes, such as counseling, only cases of
underfit may be of concern. The responses of overfitting persons conform
unusually well to the arrangement of items on the GIS scale. The
assumptions of a counselor or intervention program about how goal
instability progresses, as inferred from the order of items on the GIS
scale, would not be invalid for overfitting persons. Likewise, overfitting
items tend to be associated with a higher-than-average correlation between
the item score and person measure. This relationship extends to the weight
or slope the item would have in structural equation models or IRT models
that allow weights, or slopes, to vary. Overfitting items tend to be
associated with greater slope or weight values. These kinds of items are
not generally viewed as problem items in instrument development.
In terms of the Rating Scale Model and associated fit statistics, the
following criteria for measurement invariance are explored in this study.
The term "groups" refers to the English and Portuguese samples taking their
respective language versions of the GIS.
1) The proportion of person fit statistics less than 1.5 should be
reasonably large and comparable across groups. This criterion will be
evaluated informally and readers may judge for themselves what is large and
comparable. The larger the proportion, the more the arrangement of items on
the IRT scale can be used to understand the dynamics of goal instability
within a given student and to deliver effective intervention and counseling
at the individual student level. If the proportion is comparable across
language versions, the GIS can be said to have similar potential for
counseling in both populations. If the order and arrangement of items on
the scale differ in the two populations, individual counseling and
intervention strategies would differ by population.
2) Item calibrations (di) will be the same across groups. To meet this
criterion, paired item calibrations should differ by
no more than 0.3 logits (the scale unit in a Rating Scale Model analysis).
This standard is commonly applied in evaluating measure invariance in
educational testing (insert refs.). The failure of an item to meet this
criterion may be due to non-equivalence in translation or to more
fundamental differences between populations in how the item defines goal
instability. If the GIS measure is invariant in this respect, individual
counseling and intervention strategies would not differ by population.
3) Step calibrations (tj) will be the same across groups. To meet this
criterion, the same 0.3 standard described above will be used. Failure of a
step calibration to meet this criterion would call the translation of
category labels into question or suggest more fundamental cultural
differences in how persons use the Likert-type scale categories.
4) Item fit statistics will be comparable across groups. In addition to
their use in detecting technical flaws in individual items, such as scoring
errors or ambiguous language that may be interpreted differently by
different persons, item fit statistics can indicate a variety of
substantively meaningful patterns in the data such as dependencies among
related items (which leads to overfit), or areas of performance or cognition
that are not as strongly related to the central trait as others (which leads
to underfit). These substantive patterns, as well as any superficial
characteristics of the item which may cause misfit, are part of the meaning
of the variable, and should be the same across groups. Due to the
approximate relations between item fit statistics, SEM item weights, and IRT
item slope parameters, a comparison of item fit statistics across groups is
about as productive and meaningful with regard to measure invariance as
comparing SEM item weights or IRT item slope parameter estimates across
groups.
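
Criteria 1 through 3 reduce to simple checks once fit statistics and
calibrations are available for each group. The Python sketch below shows one
way to apply them. The function names and the numbers are hypothetical, and
centering each group's calibrations at their mean is an assumption made here
to put separately estimated calibrations on a common origin; it is not
described in the text above.

```python
def proportion_below(fit_values, cutoff=1.5):
    """Criterion 1: share of person fit statistics below the cutoff."""
    return sum(1 for f in fit_values if f < cutoff) / len(fit_values)

def centered(values):
    """Shift a set of calibrations so that their mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def flag_differences(calibrations_a, calibrations_b, tolerance=0.3):
    """Criteria 2 and 3: indices of paired calibrations (item or step)
    whose centered values differ by more than the tolerance, in logits."""
    a = centered(calibrations_a)
    b = centered(calibrations_b)
    return [i for i, (x, y) in enumerate(zip(a, b)) if abs(x - y) > tolerance]

# Hypothetical person outfit values for one group.
share_ok = proportion_below([0.8, 1.2, 1.6, 2.0])

# Hypothetical item calibrations for the two language versions.
english = [-1.0, 0.0, 1.0]
portuguese = [-1.0, 0.5, 0.5]
flagged = flag_differences(english, portuguese)
```

With these made-up values, half of the persons fall below the 1.5 cutoff and
the second and third items are flagged for differing by more than 0.3 logits.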
_____
From: rasch-bounces at acer.edu.au [mailto:rasch-bounces at acer.edu.au] On Behalf
Of Anthony James
Sent: Thursday, January 15, 2009 1:22 PM
To: rasch at acer.edu.au
Subject: [Rasch] Test Translation & DIF
Dear all,
I want to study DIF in the context of test translation. That is, whether
items exhibit different levels of difficulty in the original version and
the translated version of a test. How should one go about doing this in
Winsteps?
Two different linguistic groups have taken the original version of a test
and its translation.
Do I need some bilingual test-takers to take both versions to establish a
link to run DIF?
What else can be done, within classical test theory and Rasch measurement,
to investigate translation equivalence and the validity of the translated
version of a test?
Cheers
Anthony