Significance Tests: Their Logic and Early History
Dissertation, Stanford University (1981)
Abstract
Significance tests are the mainstay of much experimental analysis. They formalize reasoning of the following sort: assuming hypothesis h, evidence e is improbable; e is observed; therefore, reject h. Most philosophical work on induction concerns either a simpler form of reasoning, such as Reichenbach's straight rule, or a more powerful form, such as Neyman-Pearson confidence intervals, Bayesian posterior densities, or Fisher's fiducial probabilities. By focusing on this simple, common method of inductive inference, many of the subtleties of the problem of scientific induction become apparent.

Three features of the logic of significance tests are isolated. The test statistic must single out the correct aspect of the evidence for inference. The stringency measure must correctly formalize the improbability of the evidence. Composite hypotheses pose additional problems, since they do not stipulate exact probabilities for all outcomes.

Until Karl Pearson's 1895 paper on skew frequency curves, statisticians chiefly used the Normal curve. With many different frequency curves available, the question of goodness of fit became crucial. Pearson proposed the Chi-Squared statistic as a measure of fit. He justified it by its relation to correlation, a category which replaced causation within Pearson's positivist philosophy. In fact, Chi-Squared works well only when correlation models adequately describe the phenomena of concern.

Levels of significance measure test stringency as the probability of the observed value of the test statistic falling in the tails of the test statistic's density. This practice stems from the theory of errors of observation. Probable error, defined in terms of tail areas under the Normal density, measures the precision of a series of observations. Early significance tests using the Normal density measured stringency in multiples of the probable error.
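The tail-area logic described above can be sketched in a few lines of Python. This is an illustrative sketch, not material from the dissertation: it computes Pearson's goodness-of-fit statistic for invented die-rolling data, measures stringency as the upper tail area of the chi-squared density (via a standard series for the regularized incomplete gamma function), and rejects when that area falls below a conventional level.

```python
import math

def chi_squared_stat(observed, expected):
    # Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E over categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def regularized_lower_gamma(a, x, terms=200):
    # Series expansion P(a, x) = x^a e^{-x} / Gamma(a) * sum_n x^n / (a(a+1)...(a+n)).
    total, term = 0.0, 1.0 / a
    for n in range(terms):
        total += term
        term *= x / (a + n + 1)
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_tail_area(x, df):
    # Upper tail area P(X >= x) for a chi-squared variable with df degrees of freedom.
    return 1.0 - regularized_lower_gamma(df / 2.0, x / 2.0)

# Hypothetical data: a die rolled 60 times; h = "the die is fair".
observed = [5, 8, 9, 8, 10, 20]
expected = [10] * 6
stat = chi_squared_stat(observed, expected)   # large when the fit is poor
p = chi2_tail_area(stat, df=5)                # stringency measured as a tail area
reject = p < 0.05                             # rejection iff the statistic lies in the tails
```

The rejection rule makes concrete the practice the abstract questions: improbability of the evidence is identified with the area beyond the observed value, not with the probability of the observed value itself.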
Nowadays many densities are used in significance testing, but stringency is still measured by tail areas: rejection occurs if and only if the test statistic takes a value in the tails. No completely satisfactory analysis of this practice now exists.

In 1904 Pearson extended Chi-Squared to test the composite hypothesis of statistical independence. This extension yielded inferences that conflicted with those based on other tests of independence. In 1922, by means of his new concept of degrees of freedom, R. A. Fisher proposed a solution. Degrees of freedom measure the informativeness of an hypothesis. From 1922 onward, significance tests test not only the putative truth of an hypothesis but also its informativeness. Fisher's solution violates a rule of implication: if h implies i, then evidence sufficient to reject i is sufficient to reject h. This rule is widely endorsed by philosophers; indeed, Hempel calls it a condition of adequacy for any theory of confirmation. But if h implies i, h is more informative than i. Consequently, if we test for informativeness, then even when h implies i, evidence sufficient to reject i need not be sufficient to reject the more informative h. Since the introduction of degrees of freedom, significance tests have checked both informativeness and truth. This examination of significance testing reveals aspects of induction missed by analyses from first principles.
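Fisher's degrees-of-freedom correction for the independence test can be illustrated with a short sketch, again not taken from the dissertation; the 2x2 table is invented. Because the composite hypothesis of independence leaves the marginal probabilities unspecified, they are estimated from the data, and each estimated parameter costs a degree of freedom: an r-by-c table has (r - 1)(c - 1) degrees of freedom rather than the rc - 1 Pearson's 1904 extension effectively assumed.

```python
def independence_test(table):
    # table: list of rows of observed counts in an r-by-c contingency table.
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    # Expected counts under independence, with marginals estimated from the data:
    # E[i][j] = (row total i) * (column total j) / grand total.
    stat = sum(
        (table[i][j] - rows[i] * cols[j] / total) ** 2 / (rows[i] * cols[j] / total)
        for i in range(len(rows))
        for j in range(len(cols))
    )
    # Fisher's 1922 correction: fitting the marginals reduces the degrees of
    # freedom from rc - 1 to (r - 1)(c - 1); for a 2x2 table, from 3 to 1.
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df

stat, df = independence_test([[10, 20], [20, 10]])
```

The reduced df shifts the reference density against which the statistic's tail area is computed, which is how the correction changes the resulting inferences.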