A free exact 2 x 2 test
 

Three different tests are often thought of as exact 2 x 2 tests, even though they yield different p-values. They are the Fisher 2 x 2 test, an exact test based on the Pearson chi-square (PCS) statistic, and an exact test based on the likelihood ratio or equivalently on the likelihood-ratio chi-square (LRCS) statistic. I consider the last of these three to be far preferable to the others. Brian Lukoff, a former student of mine, has written a computer program for this test. You can run a single test on his website (http://stats.brianlukoff.com), or you can download Lukoff's program. Running Lukoff's program on your own computer requires a separate program called the .NET Framework, but that is free from Microsoft and a download link is available on Lukoff's site. MS Windows 98 or higher is required. Below I explain why I prefer the Lukoff test to the other two for most purposes.
 

When the Fisher test is exact
 

The Fisher 2 x 2 test is an exact test only in the rare circumstance in which the row and column marginal totals are all fixed - known in advance. For instance, suppose a judge is given 6 cups of coffee, told that three are decaffeinated, and asked to try to identify those three. There are exactly 6!/(3!3!) or 20 ways to divide 6 objects into two groups of 3 each. Thus the significance level for a perfect division would be 1/20 or .05. That is the value the Fisher test would give for this problem. To arrange these data in a 2 x 2 table, let the rows represent the true nature of the coffee and the columns represent the judge's guesses. In this case we know in advance that both row totals will be 3, as will both column totals. That's the situation for which the Fisher test gives the correct answer - when both row totals and both column totals are known in advance.
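The coffee example can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name fisher_one_sided_p and its argument names are illustrative assumptions, not part of Lukoff's program.

```python
from math import comb

def fisher_one_sided_p(a, row1, row2, col1):
    """One-sided Fisher p-value: the probability that the top-left cell is
    at least `a` when all four marginal totals are fixed in advance.

    a    : observed count in the top-left cell
    row1 : first row total
    row2 : second row total
    col1 : first column total
    """
    total = comb(row1 + row2, col1)   # number of equally likely divisions
    largest = min(row1, col1)         # largest possible top-left count
    favorable = sum(comb(row1, k) * comb(row2, col1 - k)
                    for k in range(a, largest + 1))
    return favorable / total

# Coffee example: 3 decaffeinated cups, 3 regular cups, and the judge
# labels 3 cups as decaffeinated and gets all 3 right.
print(comb(6, 3))                      # 20 ways to divide 6 cups into 3 and 3
print(fisher_one_sided_p(3, 3, 3, 3))  # 0.05, i.e. 1/20
```

With only 2 of the 3 decaffeinated cups identified, the same function gives 10/20 or .5, since 10 of the 20 possible divisions do at least that well.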

Far more common is the situation in which we know two group sizes (e.g., treatment and control group sizes) in advance. If the group sizes are the row totals, then we know the row totals. But we typically don't know the column totals in advance - the total number of successes and failures. This is often called the double-binomial problem. It is well known that the Fisher test can be seriously over-conservative for this problem. Tests based on PCS and LRCS have been designed for this problem. We consider them below.
 

Exact double-binomial tests based on PCS and LRCS
 

Let the null hypothesis be that success rates are equal in two groups. Suppose we have some measure of consistency or inconsistency between the null hypothesis and a set of sample data; the best-known such measure is the Pearson chi-square (PCS) statistic. To use PCS as the basis for an exact test, we can compute PCS for every possible outcome. Given that we have observed one of these outcomes in a set of sample data, we can identify all outcomes at least as inconsistent with the null hypothesis as the actual outcome. Call this set of outcomes set P.

We can also determine the specific instance of the null hypothesis (e.g., both success rates are 0.472) most consistent with the sample data. We can then use the binomial distribution (separately within each group) to compute the exact probability of each outcome in set P under this particular hypothesis. The sum of all these probabilities is an exact significance level for the observed outcome. A program for 2 x 2 tests based on this procedure is available in the StatXact statistical package - a quite expensive package.
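The procedure just described can be sketched in Python as follows. This is a rough illustration under several assumptions, not the StatXact or Lukoff implementation: the null value "most consistent with the sample data" is taken to be the pooled success rate, the test is two-sided, a table with no successes or no failures is given PCS = 0, and a small tolerance is used so that tied outcomes count as "at least as inconsistent." The names pcs, binom_pmf, and exact_p are illustrative.

```python
from math import comb

def pcs(x1, n1, x2, n2):
    """Pearson chi-square for x1 successes of n1 and x2 successes of n2."""
    s = x1 + x2              # total successes
    f = n1 + n2 - s          # total failures
    if s == 0 or f == 0:     # degenerate table: treated as PCS = 0 (assumption)
        return 0.0
    n = n1 + n2
    return n * (x1 * (n2 - x2) - x2 * (n1 - x1)) ** 2 / (n1 * n2 * s * f)

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def exact_p(x1, n1, x2, n2, stat=pcs):
    """Sum the null probabilities of every outcome in set P."""
    observed = stat(x1, n1, x2, n2)
    p0 = (x1 + x2) / (n1 + n2)   # pooled rate: assumed "most consistent" null
    total = 0.0
    for a in range(n1 + 1):          # every possible outcome in group 1
        for b in range(n2 + 1):      # every possible outcome in group 2
            # Set P: outcomes at least as inconsistent with the null
            if stat(a, n1, b, n2) >= observed - 1e-9:
                total += binom_pmf(a, n1, p0) * binom_pmf(b, n2, p0)
    return total

# 3 successes of 3 in one group, 0 of 3 in the other
print(exact_p(3, 3, 0, 3))   # 0.03125 (two outcomes, each with probability 1/64)
```

The stat argument is left open so that some other measure of inconsistency can be passed in place of PCS.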

We could use the same procedure just outlined with some other measure of consistency between data and hypothesis. In particular, there is the likelihood-ratio chi-square (LRCS). We could compute exact significance levels using this measure, and they will often differ from the values computed using PCS, even though both are exact values.
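To see why the two measures can lead to different significance levels, it may help to compute both for a pair of outcomes. The sketch below uses the standard formulas, PCS as the sum of (O - E)^2/E and LRCS as twice the sum of O ln(O/E), with empty cells contributing zero; the function names are illustrative assumptions.

```python
from math import log

def cells(x1, n1, x2, n2):
    """Observed and expected counts for the four cells of the 2 x 2 table."""
    n = n1 + n2
    s = x1 + x2              # total successes
    f = n - s                # total failures
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * s / n, n1 * f / n, n2 * s / n, n2 * f / n]
    return observed, expected

def pcs(x1, n1, x2, n2):
    """Pearson chi-square; cells with zero expected count are skipped."""
    obs, exp = cells(x1, n1, x2, n2)
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)

def lrcs(x1, n1, x2, n2):
    """Likelihood-ratio chi-square; cells with zero observed count add 0."""
    obs, exp = cells(x1, n1, x2, n2)
    return 2 * sum(o * log(o / e) for o, e in zip(obs, exp) if o > 0)

# Two possible outcomes with group sizes 50 and 10
table_a = (0, 50, 1, 10)    # 0 of 50 successes versus 1 of 10
table_b = (12, 50, 0, 10)   # 12 of 50 successes versus 0 of 10

for name, t in (("A", table_a), ("B", table_b)):
    print(name, round(pcs(*t), 3), round(lrcs(*t), 3))
```

For these group sizes, outcome A has the larger PCS while outcome B has the larger LRCS. The two measures therefore rank the outcomes differently, so the set of outcomes "at least as inconsistent" as an observed one, and hence the exact significance level, can differ between them.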

How can two tests both yield exact significance levels when those levels often differ between the two tests? A test is exact whenever the probability of finding a significance level of p or smaller, when the null hypothesis is true, exactly equals p. Thus we could easily generate a procedure which calculates a significance level entirely from a random number table. If the true probability of finding a value p or smaller is exactly p, then the procedure is exact, even though the test doesn't even use any real data.
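The random-number "test" can be mimicked in a short simulation. This is only an illustrative sketch: a pseudo-random generator stands in for the random number table, and the function names, seed, and number of trials are arbitrary assumptions.

```python
import random

def random_table_test(rng):
    """A 'test' that ignores the data: p is simply a uniform random number."""
    return rng.random()

def rejection_rate(alpha, trials=100_000, seed=0):
    """Fraction of trials in which the reported p is alpha or smaller."""
    rng = random.Random(seed)
    hits = sum(random_table_test(rng) <= alpha for _ in range(trials))
    return hits / trials

for alpha in (0.01, 0.05, 0.10):
    print(alpha, rejection_rate(alpha))
```

Each printed rate should be close to its alpha, which is all that exactness in this sense requires, even though no data enter the calculation.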

Thus to select a "best" test, we want to apply some criterion other than "exactness." The most common way to do this is to try to design tests with maximum power. Such tests will generally use the data in a reasonable way. We have already mentioned two measures that might be used to construct a test: PCS and LRCS. The latter value is exactly monotonically related to the likelihood ratio itself. The Neyman-Pearson lemma states in effect that tests based on the likelihood ratio generally have excellent power characteristics. That's an argument in favor of LRCS. PCS is well known to be a poor test when used by itself (not merely as a measure in an exact test) when any cell's expected frequency is very low. That suggests that an exact test based on PCS will be a poor test (i.e., have low power) in that situation. As described more fully later, my own analyses confirm that conjecture.

But to say that a test has low power does not fully convey the limitations of that test. If someone uses a low-power test and still gets significance, they are likely to say, "I got significance despite the low power. That makes my finding even more impressive." And that reasoning is correct when the test's low power stems entirely from a small sample size. But consider again the imaginary test in which p is determined entirely from a random number table. Suppose someone were trying to sell you an expensive medicine for a serious disease you had, and the medicine had been shown to be effective with p < .001 using an exact test, but that exact test consisted of choosing p from a random number table. You would presumably have no faith in the medicine unless there were some other strong argument for it.

I would suggest that a PCS-based exact test has this same limitation, though of course not as strongly as my imaginary random test. Pearson himself recognized that PCS had some important limitations, but proposed it due to its computational simplicity in a pre-computer age. There are outcomes (with one very low expected cell frequency) in which PCS is high but the outcome is actually highly consistent with the null hypothesis as measured by, for instance, the exact likelihood ratio. Thus I would suggest that we should reject the PCS-based exact test for the same reason we would reject the exact test based on a random number table. This point seems particularly true when one expected cell frequency is low, but the point always has some relevance.
 

The comparative power of exact 2 x 2 tests based on PCS and LRCS
 

I studied the power of exact tests based on PCS and LRCS in 19,125 different situations. I let N1, the larger of the two sample sizes, range from 10 to 50 in increments of 10. I let N2 range from 10 to N1, also in increments of 10. This gave 15 different sample-size combinations. I also let tp1, the true probability of success in group 1, range from .02 to 1.0 in increments of .02, and let tp2 range from 0 to tp1 - .02 in increments of .02. This gave 50*51/2 or 1275 combinations of tp1 and tp2 for each of the 15 sample-size combinations. This yields altogether 15*1275 or 19,125 combinations of N1, N2, tp1, and tp2. All tests were one-tailed tests at the .05 level, testing for the experimental hypothesis (which was in fact always true) that tp1 > tp2. The power values calculated for each test are exact values calculated from the exact probabilities of outcomes in each possible cell; they are not merely based on a simulation study with some limited number of trials.
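The count of situations, and the idea of computing power exactly rather than by simulation, can be sketched in Python. The rejection rule used below is purely hypothetical, chosen only to show the calculation; it is neither the PCS-based nor the LRCS-based test, and the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def exact_power(rejects, n1, n2, tp1, tp2):
    """Sum the true probabilities of every outcome (x1, x2) the test rejects."""
    return sum(binom_pmf(x1, n1, tp1) * binom_pmf(x2, n2, tp2)
               for x1 in range(n1 + 1)
               for x2 in range(n2 + 1)
               if rejects(x1, x2))

# Sample-size combinations: N1 = 10..50 and N2 = 10..N1, in steps of 10
sample_sizes = [(n1, n2) for n1 in range(10, 51, 10)
                for n2 in range(10, n1 + 1, 10)]

# Rate combinations: tp1 = .02..1.0 and tp2 = 0..tp1 - .02, in steps of .02
rate_pairs = [(i / 50, j / 50) for i in range(1, 51) for j in range(i)]

print(len(sample_sizes), len(rate_pairs),
      len(sample_sizes) * len(rate_pairs))   # 15 1275 19125

# Hypothetical rule (illustration only): reject when x1 exceeds x2 by 5 or more
def toy_rule(x1, x2):
    return x1 - x2 >= 5

print(exact_power(toy_rule, 10, 10, 0.8, 0.2))
```

Because every possible outcome is enumerated, the result carries no simulation error; only the choice of rejection region differs from one test to another.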

In 3,650 of the 19,125 cases, the two tests had identical rejection regions and thus identical power. In 10,858 of the cases, PCS had higher power than LRCS. In the remaining 4,617 cases LRCS had higher power. But that's not the whole story. The power advantage of PCS over LRCS never exceeded 0.1238, while the power advantage of LRCS over PCS ranged up to 0.7304. That maximum occurred when N1 = 50, N2 = 10, tp1 = 0.28, and tp2 = 0, where the power values were 0.8663 for LRCS and 0.1359 for PCS. In general, large power advantages of LRCS over PCS occurred when the two sample sizes were very unequal, and tp1 and tp2 were both well on the same side of 0.5. This circumstance of course produces a very small expected frequency in one cell - the very circumstance Pearson warned against for using PCS as a simple chi-square test. The largest power advantages of PCS occurred when N1 and N2 were equal or nearly equal.

If power were the only consideration, the question would be whether we want to use a test which is often slightly better but sometimes dramatically worse. However, I consider these findings to be only of tangential interest. As mentioned above, I feel there are strong logical grounds for preferring the LRCS-based test.