Here is one reference:

Glass, G.V. & Hopkins, K.D. (1995). /Statistical Methods in Education
and Psychology/. Boston: Allyn & Bacon.

Concerning the point-biserial:

Recall that the point-biserial coefficient is not in itself a fit
statistic for the Rasch model. If you want to use it for that purpose,
you have to calculate the expected point-biserial coefficient under the
Rasch model /*in the current study population*/ (where it may be close
to zero, as explained by David) and then test whether the observed
coefficient disagrees with the Rasch model. Such a test may provide
evidence that the observed coefficient is too high or that it is too
low; either result indicates that something is wrong. It will also (I
am sure, because I have tried it) tell the same story as the infits,
outfits and the other item fit statistics that we use.
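
One way to get at the expected coefficient is by simulation. The sketch
below is only an illustration under stated assumptions, not the
procedure of any particular program: it treats the person measures
(thetas) and item difficulties (bs) as known, whereas in practice they
would be estimates; it correlates each item with the score on the
remaining items (an item-rest correlation) so that the item does not
inflate its own coefficient; and it uses a crude standardized difference
as the test. All function names are hypothetical.

```python
import math
import random


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def item_rest_correlations(data):
    """Point-biserial of each item with the total score on the other items."""
    n_items = len(data[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in data]
        rest = [sum(row) - row[i] for row in data]
        result.append(pearson(item, rest))
    return result


def simulate(thetas, bs, rng):
    """One persons-by-items matrix of 0/1 responses drawn from the Rasch model."""
    return [[1 if rng.random() < rasch_prob(t, b) else 0 for b in bs]
            for t in thetas]


def expected_item_rest_correlations(thetas, bs, n_rep=200, seed=1):
    """Monte Carlo mean and SD of each item's coefficient under the model."""
    rng = random.Random(seed)
    draws = [item_rest_correlations(simulate(thetas, bs, rng))
             for _ in range(n_rep)]
    n_items = len(bs)
    means = [sum(d[i] for d in draws) / n_rep for i in range(n_items)]
    sds = [math.sqrt(sum((d[i] - means[i]) ** 2 for d in draws) / (n_rep - 1))
           for i in range(n_items)]
    return means, sds


def standardized_difference(observed, mean, sd):
    """Crude z-type statistic: large positive = too high, large negative = too low."""
    return (observed - mean) / sd


# Illustration with made-up values: 200 persons, 10 items.
rng = random.Random(0)
bs = [-1 + 2 * i / 9 for i in range(10)]
no_spread = [0.0] * 200                          # no person separation
spread = [rng.gauss(0, 1.5) for _ in range(200)]  # well separated persons

means_flat, _ = expected_item_rest_correlations(no_spread, bs, n_rep=30)
means_spread, _ = expected_item_rest_correlations(spread, bs, n_rep=30)
print("expected r, no person spread:", [round(m, 2) for m in means_flat])
print("expected r, spread persons:  ", [round(m, 2) for m in means_spread])
```

The point of running it for two populations is that the same items have
different expected coefficients depending on the person distribution, so
an observed value can only be judged against the expectation for the
population at hand.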



> Forgive the non-Rasch question but given the recent flurry of activity I couldn’t resist:
> I've been searching for a good reference/history/formula/definition of the terms "biserial correlation" and "point-biserial correlation" for quite a while. I am aware that the common item(binary)-total(continuous) Pearson's coefficient is often referred to as a point-biserial but I am more interested in a formal definition or historical reference to these terms (biserial and point-biserial). Does anyone have a suggestion for a seminal reference or good overview for these terms? 
> Thanks much - John
> The usual tests of fit have very little power if there is no person separation and little item separation. In that case a correct and an incorrect response are more or less equally likely, so there is no evidence of unlikely responses (misfit). The connection between traditional test theory and the power of the test of fit is the traditional reliability index (coefficient alpha) or, where some data are missing, the index based on Rasch model estimates (which I call person separation but which can be called Rasch model reliability). I consider that no fit statistics should be reported without this statistic also being reported, along with a comment as to whether it is large enough to give power to detect misfit. The usual threshold of 0.75 and above seems mandatory. In the program RUMM, for example, we interpret this number as evidence of the power of the test of fit, with a colour coding of Excellent, Good, Reasonable, Low, and Too Low.
> Hope this helps
> David
> On Tue, 2012-03-06 at 15:36 -0600, Stuart Luppescu wrote:
>> Are you saying that if I just generated 1's and 0's randomly and tried 
>> to calibrate them they would all fit? Hmmm. I'm going to have to try 
>> that....
> Very interesting, indeed! Of course, you get 0 reliability and point-biserials near 0, but all the fit statistics are very close to 1.0! Mark Moulton gets a beer next time I see him for providing the instructional moment of the day.
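
For anyone who wants to repeat the experiment, here is a minimal sketch
in plain Python. It is not what Winsteps or RUMM do internally: the
calibration is a crude joint maximum likelihood loop written only for
this illustration, coefficient alpha is computed as KR-20, the
point-biserial is taken against the score on the remaining items, and
all names, sample sizes and the seed are arbitrary choices.

```python
import math
import random


def prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def kr20(data):
    """Coefficient alpha for dichotomous items (KR-20)."""
    n, k = len(data), len(data[0])
    totals = [sum(row) for row in data]
    mt = sum(totals) / n
    var_t = sum((t - mt) ** 2 for t in totals) / n
    p = [sum(row[i] for row in data) / n for i in range(k)]
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / var_t)


def item_rest_correlation(data, i):
    """Pearson correlation of item i with the total score on the other items."""
    x = [row[i] for row in data]
    y = [sum(row) - row[i] for row in data]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)


def jmle(data, n_iter=30):
    """Crude joint maximum likelihood estimates: theta per person, b per item."""
    n, k = len(data), len(data[0])
    r = [sum(row) for row in data]                       # person raw scores
    s = [sum(row[i] for row in data) for i in range(k)]  # item raw scores
    theta, b = [0.0] * n, [0.0] * k
    for _ in range(n_iter):
        for v in range(n):
            p = [prob(theta[v], bi) for bi in b]
            step = (r[v] - sum(p)) / sum(q * (1 - q) for q in p)
            theta[v] += max(-1.0, min(1.0, step))        # damped Newton step
        for i in range(k):
            p = [prob(t, b[i]) for t in theta]
            step = (s[i] - sum(p)) / sum(q * (1 - q) for q in p)
            b[i] -= max(-1.0, min(1.0, step))
        mean_b = sum(b) / k
        b = [bi - mean_b for bi in b]                    # centre item difficulties
    return theta, b


def item_fit(data, theta, b):
    """Infit and outfit mean squares for each item."""
    infit, outfit = [], []
    for i, bi in enumerate(b):
        sq = var = z2 = 0.0
        for row, t in zip(data, theta):
            p = prob(t, bi)
            w = p * (1 - p)
            sq += (row[i] - p) ** 2
            var += w
            z2 += (row[i] - p) ** 2 / w
        infit.append(sq / var)
        outfit.append(z2 / len(data))
    return infit, outfit


def coin_flip_data(n_persons=500, n_items=20, seed=2012):
    """Random 0/1 responses; persons with zero or perfect scores are dropped."""
    rng = random.Random(seed)
    rows = [[rng.randint(0, 1) for _ in range(n_items)]
            for _ in range(n_persons)]
    return [row for row in rows if 0 < sum(row) < n_items]


data = coin_flip_data()
theta, b = jmle(data)
infit, outfit = item_fit(data, theta, b)
r = [item_rest_correlation(data, i) for i in range(len(b))]
print("alpha (KR-20):    ", round(kr20(data), 3))
print("item-rest r range:", round(min(r), 3), "to", round(max(r), 3))
print("infit range:      ", round(min(infit), 3), "to", round(max(infit), 3))
print("outfit range:     ", round(min(outfit), 3), "to", round(max(outfit), 3))
```

The mean squares stay near 1.0 because, given a person's raw score,
every item is equally likely to be among the successes, which is just
what the model expects when the items have equal difficulty. The data
carry no information, but nothing in them contradicts the model.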

