[Rasch] DIF and multiple comparisons - BH method

Roger Graves rgraves at uvic.ca
Wed Feb 8 10:17:28 EST 2006

```The recent discussion of the Benjamini and Hochberg (B&H) procedure as a
method of controlling Type I error in the multiple comparisons involved in
testing for DIF across multiple items has not considered an important issue,
namely, whether or when this is the proper method to use.

It seems to me we have two scenarios where we are testing items for DIF.

Scenario 1: We are analyzing data for an existing test and we have a priori
reasons (theoretical or empirical, such as results from another study) for
suspecting that particular items (or conceivably all items) are biased for some
group. We then test for DIF on just those items we hypothesize to be biased.
This is the same situation as in the usual experimental research, where you
have a theory you are trying to support: you believe the nulls are false, and
you wish to reject the null hypotheses as a way of gathering evidence in
support of your theory. You have made a priori hypotheses that the nulls are
wrong. In this situation you want as much power as possible to reject the
nulls, and you would properly choose to use the B&H
procedure, which provides considerably more power than the Bonferroni
procedure. The B&H procedure does not control the "familywise error rate
(FWER)"; rather, it controls the "false discovery rate (FDR)". The FDR is the
expected proportion of rejected null hypotheses that are falsely rejected. Thus
if one rejected only one null, controlling the FDR at .05 says the probability
that this rejection is false (a Type I error) is .05. If one rejects 20 out of
100 null hypotheses, it says the expected proportion of errors among those 20
is .05, or about 1 of the 20 on average. This is useful information
and for this kind of situation, this kind of Type I error rate control makes
good sense. I would recommend the B&H procedure to my graduate students
for their dissertations where they want to get significant results (maximize
power) but still satisfy those troublesome committee critics that want Type I
error controlled. Importantly, to maximize power we need to minimize the
number of hypotheses. We would not test DIF on all items (and lose a lot of
power) but only test those items where we had reasons to predict DIF, and we
would control the FDR for however many tests we did by using the B&H
procedure.

Scenario 2: We are developing a new test instrument. At least in the usual test
development situation, one does not expect any of the items to be faulty, else
you wouldn't have included them in the first place. You test for DIF in this
situation as a screen to see if you might have missed something and as a way
of gathering construct validity evidence. In this situation you will conduct a
DIF analysis on all items as post hoc tests - you did not expect on an a priori
basis that any of the nulls would be false. Further, and relevant to the current
discussion, in this situation it does not seem to me proper to be controlling the
FDR. Rather, it seems this is a situation where control of FWER is more
appropriate. Here it seems that the question is "if all the nulls are true, as I
expect (or at least the effect sizes are close enough to zero that there is no
DIF of any consequence), what should I do to keep the probability of falsely
finding one or more 'significant' DIFs to .05?"
You could use the Bonferroni procedure, which does control FWER. Since it
[...] not to disprove the null hypothesis'. However, this would not be honest
if you know, as we now do, that the Bonferroni procedure is overly conservative.
An improved procedure for controlling the FWER was described by
Hochberg (1988), based on work by Holm. This procedure follows the same
sequential procedure as the B&H procedure (or vice versa, since the
B&H procedure came later), but uses a different set of criterion p values.

Thus, for both procedures (using Mike Linacre's nomenclature), we rank the p
values for the N items in order, from n=1 for the smallest p value to n=N for
the largest p value.

If one accepts the above reasoning, then we can summarize.

For scenario 1, we would use the B&H (1995) criterion values: for item n of
N items, alpha*(n/N). (For example, for N=5, the series is, for n=1 to 5, .01,
.02, .03, .04, .05.)

For scenario 2, we would use the Hochberg (1988) criterion p values: for item
n of N items, alpha/(N-n+1). (For example, for N=5, the series is, for n=1 to 5,
.01, .0125, .017, .025, .05.)

The two procedures have identical criterion p values for n=1 and n=N, but
differ otherwise, with the B&H values being substantially larger, allowing for
more power.
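
As a rough worked illustration (the p values below are made up for the
example, not taken from any data set), suppose N=5 items and alpha=.05. I am
assuming the usual step-up rule for both procedures: find the largest n whose
p value is at or below its criterion, then reject the nulls for items 1
through n.

  n    p value    B&H criterion    Hochberg criterion
  1     .004          .01               .01
  2     .015          .02               .0125
  3     .028          .03               .017
  4     .060          .04               .025
  5     .300          .05               .05

B&H: the largest n with p at or below its criterion is n=3 (.028 <= .03), so
the nulls for items 1 to 3 are rejected.
Hochberg: only n=1 meets its criterion (.004 <= .01), so only the null for
item 1 is rejected.

With these particular values the B&H series flags three items and the
Hochberg series flags one, which is the power difference described above.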

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of
significance. Biometrika, 75, 800-802.

Roger

Dr. Roger Graves
Dept. of Psychology
University of Victoria
P.O. Box 3050
Victoria, BC
V8W 3P5