[Rasch] DIF and multiple comparisons - BH method

Roger Graves rgraves at uvic.ca
Wed Feb 8 10:17:28 EST 2006

The recent discussion of the Benjamini and Hochberg (B&H) procedure as a 
method of controlling Type I error for the multiple comparisons involved in 
testing for DIF across multiple items has not considered an important issue, 
namely, whether or when this is the proper method to use.

It seems to me we have two scenarios where we are testing items for DIF.

Scenario 1: We are analyzing data for an existing test and we have a priori 
reasons (theoretical or empirical, such as results from another study) for 
suspecting that particular items (or conceivably all items) are biased for some 
group. We then test for DIF on just those items we hypothesize to be biased.
In this case we have the same situation as in the usual experimental research 
setting: one has a theory one is trying to support, one believes the nulls are 
false, and one wishes to reject the null hypotheses as a way of gathering 
evidence for that theory. You have made a priori hypotheses that the nulls are 
wrong. In this situation you want as much power as possible to reject the 
nulls, and you would properly choose the B&H procedure, which provides 
considerably more power than the Bonferroni procedure. The B&H procedure does 
not control the "family-wise error rate (FWER)"; rather, it controls the 
"false discovery rate (FDR)". The FDR is the expected proportion of rejected 
null hypotheses that are falsely rejected. Thus
if one only rejected one null, the FDR says your probability of a false rejection 
(Type I error) for that hypothesis is .05. If one rejects 20 out of 100 null 
hypotheses, the FDR criterion says the expected proportion of errors among 
those 20 is .05; that is, 1 of the 20 is expected to have been falsely 
rejected. This is useful information
and for this kind of situation, this kind of Type I error rate control makes 
good sense. I would recommend the B&H procedure to my graduate students 
for their dissertations where they want to get significant results (maximize 
power) but still satisfy those troublesome committee critics who want Type I 
error controlled. Importantly, to maximize power we need to minimize the 
number of hypotheses. We would not test DIF on all items (and lose a lot of 
power), but only test those items where we had reasons to predict DIF, and we 
would control the FDR for however many tests we ran by using the B&H procedure.
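
As a concrete illustration, the B&H step-up rule in scenario 1 can be 
sketched as follows. The p values are invented, and `bh_reject` is just an 
illustrative helper, not part of any particular software package:

```python
# Benjamini & Hochberg (1995) step-up procedure, controlling the FDR at alpha.
# The p values used below are hypothetical, purely for illustration.

def bh_reject(pvalues, alpha=0.05):
    """Return the (0-based) indices of hypotheses rejected by the B&H procedure."""
    N = len(pvalues)
    # Rank the p values in ascending order, remembering original positions.
    order = sorted(range(N), key=lambda i: pvalues[i])
    # Find the largest rank n (1-based) with p_(n) <= alpha * n / N.
    k = 0
    for n, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * n / N:
            k = n
    # Reject the hypotheses with the k smallest p values.
    return sorted(order[:k])

# Five items hypothesized a priori to show DIF:
print(bh_reject([0.003, 0.021, 0.049, 0.018, 0.30]))  # items 0, 1, and 3 are flagged
```

Here the three smallest p values (.003, .018, .021) fall at ranks whose 
criteria (.01, .02, .03) they satisfy, so those three items are flagged; the 
remaining two (.049 at criterion .04, and .30 at .05) are not.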

Scenario 2: We are developing a new test instrument. At least in the usual test 
development situation, one does not expect any of the items to be faulty; 
otherwise they would not have been included in the first place. You test for 
DIF in this situation as a screen, to see whether you might have missed 
something, and as a way of gathering construct validity evidence. In this 
situation you will conduct a DIF analysis on all items as post hoc tests: you 
did not expect, on an a priori basis, that any of the nulls would be false. 
Further, and relevant to the current
discussion, in this situation it does not seem to me proper to control the 
FDR. Rather, this seems to be a situation where control of the FWER is more 
appropriate. Here the question seems to be: 'If all the nulls are true, as I 
expect (or at least the effect sizes are close enough to zero that there is no 
DIF of any consequence), what should I do to keep the probability of falsely 
finding one or more "significant" DIFs at .05?'
You could use the Bonferroni procedure, which does control the FWER. Since it 
has low power, doing so would work to your advantage in your goal of 'trying 
not to disprove the null hypothesis'. However, this would not be honest when 
we know, as we now do, that Bonferroni is overly conservative. 
An improved procedure for controlling the FWER was described by Hochberg 
(1988), based on work by Holm. This procedure follows the same sequential 
logic as the B&H procedure (or vice versa, since the B&H procedure came 
later), but uses a different set of criterion p values.

Thus, for both procedures (using Mike Linacre's nomenclature), we rank the p 
values for the N items in ascending order, with n=1 the smallest p value and 
n=N the largest. 

If one accepts the above reasoning, then we can summarize.

For scenario 1, we would use the B&H (1995) criterion values: for item n of 
N items, alpha*(n/N). (For example, for N=5, the series is, for n=1 to 5, .01, 
.02, .03, .04, .05.)

For scenario 2, we would use the Hochberg (1988) criterion p values: for item 
n of N items, alpha/(N-n+1). (For example, for N=5, the series is, for n=1 to 
5, .01, .0125, .0167, .025, .05.)

The two procedures have identical criterion p values at n=1 and n=N, but 
differ otherwise, with the B&H values being substantially larger, allowing for 
more power.
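
The two series of criterion values are easy to compute; this small sketch 
(plain Python, nothing package-specific) reproduces the numbers quoted above 
for N=5 at alpha = .05:

```python
# Criterion p values for the two step-up procedures, alpha = .05, N = 5.
# n = 1 indexes the smallest p value, n = N the largest.

alpha, N = 0.05, 5

bh       = [alpha * n / N       for n in range(1, N + 1)]  # Benjamini & Hochberg (1995)
hochberg = [alpha / (N - n + 1) for n in range(1, N + 1)]  # Hochberg (1988)

print([round(p, 4) for p in bh])        # [0.01, 0.02, 0.03, 0.04, 0.05]
print([round(p, 4) for p in hochberg])  # [0.01, 0.0125, 0.0167, 0.025, 0.05]
```

As noted, the endpoints agree (alpha/N at n=1 and alpha at n=N), while the 
B&H criteria in between are uniformly larger.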

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A 
practical and powerful approach to multiple testing. Journal of the Royal 
Statistical Society, Series B, 57, 289-300.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of 
significance. Biometrika, 75, 800-802.


Dr. Roger Graves
Dept. of Psychology
University of Victoria
P.O. Box 3050
Victoria, BC
V8W 3P5
email: rgraves at uvic.ca
