R A Fisher: Statistical Methods Introduction
The science of statistics is essentially a branch of Applied Mathematics, and may be regarded as mathematics applied to observational data. As in other mathematical studies, the same formula is equally relevant to widely different groups of subject-matter. Consequently the unity of the different applications has usually been overlooked, the more naturally because the development of the underlying mathematical theory has been much neglected. We shall therefore consider the subject-matter of statistics under three different aspects, and then show in more mathematical language that the same types of problems arise in every case. Statistics may be regarded as (i) the study of populations, (ii) as the study of variation, (iii) as the study of methods of the reduction of data.
The original meaning of the word "statistics" suggests that it was the study of populations of human beings living in political union. The methods developed, however, have nothing to do with the political unity of the group, and are not confined to populations of men or of social insects. Indeed, since no observational record can completely specify a human being, the populations studied are always to some extent abstractions. If we have records of the stature of 10,000 recruits, it is rather the population of statures than the population of recruits that is open to study. Nevertheless, in a real sense, statistics is the study of populations, or aggregates of individuals, rather than of individuals. Scientific theories which involve the properties of large aggregates of individuals, and not necessarily the properties of the individuals themselves, such as the Kinetic Theory of Gases, the Theory of Natural Selection, or the chemical Theory of Mass Action, are essentially statistical arguments, and are liable to misinterpretation as soon as the statistical nature of the argument is lost sight of. In Quantum Theory this is now clearly recognised. Statistical methods are essential to social studies, and it is principally by the aid of such methods that these studies may be raised to the rank of sciences. This particular dependence of social studies upon statistical methods has led to the unfortunate misapprehension that statistics is to be regarded as a branch of economics, whereas in truth methods adequate to the treatment of economic data, in so far as these exist, have only been developed in the study of biology and the other sciences.
The idea of a population is to be applied not only to living, or even to material, individuals. If an observation, such as a simple measurement, be repeated indefinitely, the aggregate of the results is a population of measurements. Such populations are the particular field of study of the Theory of Errors, one of the oldest and most fruitful lines of statistical investigation. Just as a single observation may be regarded as an individual, and its repetition as generating a population, so the entire result of an extensive experiment may be regarded as but one of a population of such experiments. The salutary habit of repeating important experiments, or of carrying out original observations in replicate, shows a tacit appreciation of the fact that the object of our study is not the individual result, but the population of possibilities of which we do our best to make our experiments representative. The calculation of means and standard errors shows a deliberate attempt to learn something about that population.
The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of any one individual, together with the number in the group. The populations which are the object of statistical study always display variation in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors. For, until comparatively recent times, the vast majority of workers in this field appear to have had no other aim than to ascertain aggregate, or average, values. The variation itself was not an object of study, but was recognised rather as a troublesome circumstance which detracted from the value of the average. The error curve of the mean of a normal sample has been familiar for a century, but that of the standard deviation was the object of researches up to 1915. Yet, from the modern point of view, the study of the causes of variation of any variable phenomenon, from the yield of wheat to the intellect of man, should be begun by the examination and measurement of the variation which presents itself.
The study of variation leads immediately to the concept of a frequency distribution. Frequency distributions are of various kinds; the number of classes in which the population is distributed may be finite or infinite; again, in the case of quantitative variates, the intervals which separate the classes may be finite or infinitesimal. In the simplest possible case, in which there are only two classes, such as male and female births, the distribution is simply specified by the proportion in which these occur, as for example by the statement that 51 per cent of the births are of males and 49 per cent of females. In other cases the variation may be discontinuous, but the number of classes indefinite, as with the number of children born to different married couples; the frequency distribution would then show the frequency with which 0, 1, 2 . . . children were recorded, the number of classes being sufficient to include the largest family in the record. The variable quantity, such as the number of children, is called the variate, and the frequency distribution specifies how frequently the variate takes each of its possible values. In the third group of cases, the variate, such as human stature, may take any intermediate value within its range of variation; the variate is then said to vary continuously, and the frequency distribution may be expressed by stating, as a mathematical function of the variate, either (i) the proportion of the population for which the variate is less than any given value, or (ii) by the mathematical device of differentiating this function, the (infinitesimal) proportion of the population for which the variate falls within any infinitesimal element of its range.
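The three kinds of frequency distribution just described may be illustrated by a minimal sketch in Python. The function names and the small data sets are hypothetical, introduced only for illustration; for the continuous case the sketch gives form (i), the proportion of a finite sample lying below a given value, which only approximates the function belonging to an infinite population.

```python
from bisect import bisect_left
from collections import Counter


def two_class_proportions(labels):
    """Proportion of the record falling in each class (e.g. male/female)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def discrete_frequencies(values):
    """Frequency of 0, 1, 2, ... up to the largest value in the record."""
    counts = Counter(values)
    return [counts.get(k, 0) for k in range(max(values) + 1)]


def proportion_below(values):
    """Return a function giving the proportion of the sample less than x."""
    ordered = sorted(values)
    total = len(ordered)

    def cdf(x):
        # bisect_left counts the observations strictly less than x
        return bisect_left(ordered, x) / total

    return cdf


# Hypothetical records, chosen only to make the output easy to check.
births = ["M"] * 51 + ["F"] * 49
family_sizes = [0, 1, 1, 2, 2, 2, 3, 5]
statures = [64.0, 66.5, 67.0, 68.5, 70.0]

print(two_class_proportions(births))
print(discrete_frequencies(family_sizes))
print(proportion_below(statures)(67.0))
```

Note that the discrete tabulation reserves a class for a family of 4 although none was recorded, the number of classes being fixed by the largest family in the record.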
The idea of a frequency distribution is applicable either to populations which are finite in number, or to infinite populations, but it is more usefully and more simply applied to the latter. A finite population can only be divided in certain limited ratios, and cannot in any case exhibit continuous variation. Moreover, in most cases only an infinite population can exhibit accurately, and in their true proportion, the whole of the possibilities arising from the causes actually at work, and which we wish to study. The actual observations can only be a sample of such possibilities. With an infinite population the frequency distribution specifies the fractions of the population assigned to the several classes; we may have (i) a finite number of fractions adding up to unity as in the Mendelian frequency distributions, or (ii) an infinite series of finite fractions adding up to unity, or (iii) a mathematical function expressing the fraction of the total in each of the infinitesimal elements in which the range of the variate may be divided. The last possibility may be represented by a frequency curve; the values of the variate are set out along a horizontal axis, the fraction of the total population, within any limits of the variate, being represented by the area of the curve standing on the corresponding length of the axis. It should be noted that the familiar concept of the frequency curve is only applicable to an infinite population with a continuous variate.
The study of variation has led not merely to measurement of the amount of variation present, but to the study of the qualitative problems of the type, or form, of the variation. Especially important is the study of the simultaneous variation of two or more variates. This study, arising principally out of the work of Galton and Pearson, is generally known under the name of Correlation, or, more descriptively, as Covariation.
The third aspect under which we shall regard the scope of statistics is introduced by the practical need to reduce the bulk of any given body of data. Any investigator who has carried out methodical and extensive observations will probably be familiar with the oppressive necessity of reducing his results to a more convenient bulk. No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data. We want to be able to express all the relevant information contained in the mass by means of comparatively few numerical values. This is a purely practical need which the science of statistics is able to some extent to meet. In some cases at any rate it is possible to give the whole of the relevant information by means of one or a few values. In all cases, perhaps, it is possible to reduce to a simple numerical form the main issues which the investigator has in view, in so far as the data are competent to throw light on such issues. The number of independent facts supplied by the data is usually far greater than the number of facts sought, and in consequence much of the information supplied by any body of actual data is irrelevant. It is the object of the statistical processes employed in the reduction of data to exclude this irrelevant information, and to isolate the whole of the relevant information contained in the data.
The discrimination between the irrelevant information and that which is relevant is performed as follows. Even in the simplest cases the values (or sets of values) before us are interpreted as a random sample of a hypothetical infinite population of such values as might have arisen in the same circumstances. The distribution of this population will be capable of some kind of mathematical specification, involving a certain number, usually few, of parameters, or "constants" entering into the mathematical formula. These parameters are the characters of the population. If we could know the exact values of the parameters, we should know all (and more than) any sample from the population could tell us. We cannot in fact know the parameters exactly, but we can make estimates of their values, which will be more or less inexact. These estimates, which are termed statistics, are of course calculated from the observations. If we can find a mathematical form for the population which adequately represents the data, and then calculate from the data the best possible estimates of the required parameters, then it would seem that there is little, or nothing, more that the data can tell us; we shall have extracted from it all the available relevant information.
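The distinction between parameters and statistics may be made concrete by a short sketch in Python. It assumes, purely for illustration, a normal population of statures with mean 68 and standard deviation 2.5; these values, the seed, and the variable names are hypothetical and are not taken from the text. In practice the parameters are unknown, and only the statistics in the lower half of the sketch are available.

```python
import random
import statistics

# Parameters of the hypothetical infinite population (assumed values).
TRUE_MEAN = 68.0
TRUE_SD = 2.5

# A random sample of 1000 values; the seed is fixed so the run is repeatable.
rng = random.Random(1925)
sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(1000)]

# Statistics: estimates of the parameters, calculated from the observations.
mean_estimate = statistics.mean(sample)
sd_estimate = statistics.stdev(sample)

# Standard error of the mean, itself estimated from the sample.
standard_error = sd_estimate / len(sample) ** 0.5

print(f"mean: parameter {TRUE_MEAN}, estimate {mean_estimate:.3f}")
print(f"s.d.: parameter {TRUE_SD}, estimate {sd_estimate:.3f}")
print(f"standard error of the mean: {standard_error:.3f}")
```

The thousand observations are here reduced to two statistics and a standard error, which is the kind of reduction the paragraph above describes, provided the normal specification is adequate.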
The value of such estimates as we can make is enormously increased if we can calculate the magnitude and nature of the errors to which they are subject. If we can rely upon the specification adopted, this presents the purely mathematical problem of deducing from the nature of the population what will be the behaviour of each of the possible statistics which can be calculated. This type of problem, with which until recent years comparatively little progress had been made, is the basis of the tests of significance by which we can examine whether or not the data are in harmony with any suggested hypothesis. In particular, it is necessary to test the adequacy of the hypothetical specification of the population upon which the method of reduction was based.
The problems which arise in the reduction of data may thus conveniently be divided into three types:
(i) Problems of Specification, which arise in the choice of the mathematical form of the population.
(ii) When a specification has been obtained, problems of Estimation arise. These involve the choice among the methods of calculating, from our sample, statistics fit to estimate the unknown parameters of the population.
(iii) Problems of Distribution include the mathematical deduction of the exact nature of the distributions in random samples of our estimates of the parameters, and of other statistics designed to test the validity of our specification (tests of Goodness of Fit).
The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available. The author has attempted a fuller examination of the logic of planned experimentation in his book, The Design of Experiments.
The deduction of inferences respecting samples, from assumptions respecting the populations from which they are drawn, shows us the position in Statistics of the classical Theory of Probability. For a given population we may calculate the probability with which any given sample will occur, and if we can solve the purely mathematical problem presented, we can calculate the probability of occurrence of any given statistic calculated from such a sample. The problems of distribution may in fact be regarded as applications and extensions of the theory of probability. Three of the distributions with which we shall be concerned, Bernoulli's binomial distribution, Laplace's normal distribution, and Poisson's series, were developed by writers on probability. For many years, extending over a century and a half, attempts were made to extend the domain of the idea of probability to the deduction of inferences respecting populations from assumptions (or observations) respecting samples. Such inferences are usually distinguished under the heading of inverse Probability, and have at times gained wide acceptance. This is not the place to enter into the subtleties of a prolonged controversy; it will be sufficient in this general outline of the scope of Statistical Science to reaffirm my personal conviction, which I have sustained elsewhere, that the theory of inverse probability is founded upon an error, and must be wholly rejected. Inferences respecting populations, from which known samples have been drawn, cannot by this method be expressed in terms of probability, except in the trivial case when the population is itself a sample of a superpopulation the specification of which is known with accuracy.
The probabilities established by those tests of significance, which we shall later designate by t and z, are, however, entirely distinct from statements of inverse probability, and are free from the objections which apply to these latter. Their interpretation as probability statements respecting populations constitutes an application unknown to the classical writers on probability. To distinguish such statements as to the probability of causes from the earlier attempts now discarded, they are known as statements of Fiducial Probability.
The rejection of the theory of inverse probability was for a time wrongly taken to imply that we cannot draw, from knowledge of a sample, inferences respecting the corresponding population. Such a view would entirely deny validity to all experimental science. What has now appeared is that the mathematical concept of probability is, in most cases, inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term "Likelihood" to designate this quantity; since both the words "likelihood" and "probability" are loosely used in common speech to cover both kinds of relationship. [A more special application of the likelihood is its use, under the name of "power function," for comparing the sensitiveness, in some chosen respect, of different possible tests of significance.]
The solutions of problems of distribution (which may be regarded as purely deductive problems in the theory of probability) not only enable us to make critical tests of the significance of statistical results, and of the adequacy of the hypothetical distributions upon which our methods of numerical inference are based, but afford real guidance in the choice of appropriate statistics for purposes of estimation. Such statistics may be divided into classes according to the behaviour of their distributions in large samples.
If we calculate a statistic, such, for example, as the mean, from a very large sample, we are accustomed to ascribe to it great accuracy; and indeed it will usually, but not always, be true, that if a number of such statistics can be obtained and compared, the discrepancies among them will grow less and less, as the samples from which they are drawn are made larger and larger. In fact, as the samples are made larger without limit, the statistic will usually tend to some fixed value characteristic of the population, and, therefore, expressible in terms of the parameters of the population. If, therefore, such a statistic is to be used to estimate these parameters, there is only one parametric function to which it can properly be equated. If it be equated to some other parametric function, we shall be using a statistic which even from an infinite sample does not give the correct value; it tends indeed to a fixed value, but to a value which is erroneous from the point of view with which it was used. Such statistics are termed Inconsistent Statistics; except when the error is extremely minute, as in the use of Sheppard's adjustments, inconsistent statistics should be regarded as outside the pale of decent usage.
Consistent statistics, on the other hand, all tend more and more nearly to give the correct values, as the sample is more and more increased; at any rate, if they tend to any fixed value it is not to an incorrect one. In the simplest cases, with which we shall be concerned, they not only tend to give the correct value, but the errors, for samples of a given size, tend to be distributed in a well-known distribution known as the Normal Law of Frequency of Error, or more simply as the normal distribution. The liability to error may, in such cases, be expressed by calculating the mean value of the squares of these errors, a value which is known as the variance; and in the class of cases with which we are concerned, the variance falls off with increasing samples, in inverse proportion to the number in the sample.
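The statement that the variance falls off in inverse proportion to the number in the sample may be checked by simulation. The following is a minimal sketch in Python, assuming a normal population with zero mean and unit variance; the sample sizes, the number of replicates, and the seed are arbitrary choices made for illustration. If the variance of the mean is inversely proportional to n, the product of n and the variance should stay close to the population variance, here 1.

```python
import random
import statistics


def variance_of_mean(n, replicates, rng):
    """Variance, over repeated samples of size n, of the sample mean."""
    means = []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.pvariance(means)


rng = random.Random(42)  # fixed seed so the run is repeatable

# n multiplied by the variance of the mean, for three sample sizes.
scaled = {n: n * variance_of_mean(n, 2000, rng) for n in (10, 40, 160)}

for n, value in scaled.items():
    print(f"n = {n:3d}: n * variance of mean = {value:.3f}")
```

The printed products fluctuate from run to run with the seed, since only 2000 replicates are used, but they should not drift systematically as n is increased.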
Now, for the purpose of estimating any parameter, such as the centre of a normal distribution, it is usually possible to invent any number of statistics such as the arithmetic mean, or the median, etc., which shall be consistent in the sense defined above, and each of which has in large samples a variance falling off inversely with the size of the sample. But for large samples of a fixed size the variance of these different statistics will generally be different. Consequently, a special importance belongs to a smaller group of statistics, the error distributions of which tend to the normal distribution, as the sample is increased, with the least possible variance. We may thus separate off from the general body of consistent statistics a group of especial value, and these are known as efficient statistics.
The reason for this term may be made apparent by an example. If from a large sample of (say) 1000 observations we calculate an efficient statistic, A, and a second consistent statistic, B, having twice the variance of A, then B will be a valid estimate of the required parameter, but one definitely inferior to A in its accuracy. Using the statistic B, a sample of 2000 values would be required to obtain as good an estimate as is obtained by using the statistic A from a sample of 1000 values. We may say, in this sense, that the statistic B makes use of 50 per cent of the relevant information available in the observations; or, briefly, that its efficiency is 50 per cent. The term "efficient" in its absolute sense is reserved for statistics the efficiency of which is 100 per cent.
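The notion of efficiency may be illustrated with the two statistics already named, the arithmetic mean and the median, used to estimate the centre of a normal distribution. The sketch below, in Python, estimates the efficiency of the median as the ratio of the two sampling variances; the sample size, number of replicates, and seed are arbitrary assumptions. The large-sample value of this ratio is known to be 2/π, or about 64 per cent, a result not stated in the text above.

```python
import random
import statistics


def sampling_variances(n, replicates, rng):
    """Sampling variances of the mean and the median for normal samples."""
    means, medians = [], []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


rng = random.Random(7)  # fixed seed so the run is repeatable
var_mean, var_median = sampling_variances(n=101, replicates=4000, rng=rng)

# Efficiency of the median relative to the mean.
efficiency = var_mean / var_median

# Sample size the median would need to match the mean from 1000 values.
equivalent_sample = 1000 / efficiency

print(f"efficiency of the median: {efficiency:.2f}")
print(f"observations needed to match 1000 with the mean: {equivalent_sample:.0f}")
```

On this reckoning the median plays the part of statistic B in the example, though with an efficiency somewhat above the 50 per cent supposed there.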
Statistics having efficiency less than 100 per cent may be legitimately used for many purposes. It is conceivable, for example, that it might in some cases be less laborious to increase the number of observations than to apply a more elaborate method of calculation to the results. It may often happen that an inefficient statistic is accurate enough to answer the particular questions at issue. There is, however, one limitation to the legitimate use of inefficient statistics which should be noted in advance. If we are to make accurate tests of goodness of fit, the methods of fitting employed must not introduce errors of fitting comparable to the errors of random sampling; when this requirement is investigated, it appears that when tests of goodness of fit are required, the statistics employed in fitting must be not only consistent, but must be of 100 per cent efficiency. This is a very serious limitation to the use of inefficient statistics, since in the examination of any body of data it is desirable to be able at any time to test the validity of one or more of the provisional assumptions which have been made.
Numerous examples of the calculation of statistics will be given in the following chapters, and, in these illustrations of method, efficient statistics have been chosen. The discovery of efficient statistics in new types of problem may require some mathematical investigation. The researches of the author have led him to the conclusion that an efficient statistic can in all cases be found by the Method of Maximum Likelihood; that is, by choosing statistics so that the estimated population should be that for which the likelihood is greatest. In view of the mathematical difficulty of some of the problems which arise it is also useful to know that approximations to the maximum likelihood solution are also in most cases efficient statistics. Some simple examples of the application of the method of maximum likelihood, and other methods, to genetical problems are developed in the final chapter.
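The Method of Maximum Likelihood may be sketched numerically for the simplest case. The Python fragment below assumes a Poisson series with a single parameter m and a small set of hypothetical counts; it evaluates the logarithm of the likelihood over a grid of trial values of m and retains the value for which it is greatest. The grid search is a crude device chosen for transparency, not the procedure of the later chapters.

```python
import math


def poisson_log_likelihood(m, counts):
    """Log-likelihood of the parameter m of a Poisson series for the counts."""
    return sum(x * math.log(m) - m - math.lgamma(x + 1) for x in counts)


# Hypothetical counts, invented for illustration only.
counts = [2, 3, 0, 1, 4, 2, 2, 1, 3, 2]

# Trial values of m from 0.01 to 10.00 in steps of 0.01.
grid = [i / 100 for i in range(1, 1001)]

# The estimate is the trial value for which the likelihood is greatest.
m_hat = max(grid, key=lambda m: poisson_log_likelihood(m, counts))
sample_mean = sum(counts) / len(counts)

print(f"maximum likelihood estimate: {m_hat}")
print(f"arithmetic mean of the counts: {sample_mean}")
```

For these counts the two printed values coincide, in agreement with the known result that the maximum likelihood estimate of the Poisson parameter is the arithmetic mean.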
For practical purposes it is not generally necessary to press refinement of methods further than the stipulation that the statistics used should be efficient. With large samples it may be shown that all efficient statistics tend to equivalence, so that little inconvenience arises from diversity of practice. There is, however, one class of statistics, including some of the most frequently recurring examples, which is of theoretical interest for possessing the remarkable property that, even in small samples, a statistic of this class alone includes the whole of the relevant information which the observations contain. Such statistics are distinguished by the term sufficient and, in the use of small samples, sufficient statistics, when they exist, are definitely superior to other efficient statistics. Examples of sufficient statistics are the arithmetic mean of samples from the normal distribution, or from the Poisson series; it is the fact of providing sufficient statistics for these two important types of distribution which gives to the arithmetic mean its theoretical importance. The method of maximum likelihood leads to these sufficient statistics when they exist.
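A brief derivation may show why the arithmetic mean is sufficient in the case of the Poisson series. The notation is introduced here for illustration and is not that of the text: x_1, ..., x_n denote n independent counts, m the parameter of the series, and x-bar the arithmetic mean of the counts.

```latex
\begin{align*}
L(m) &= \prod_{i=1}^{n} \frac{e^{-m}\, m^{x_i}}{x_i!}
      = \underbrace{e^{-nm}\, m^{n\bar{x}}}_{\text{involves the data only through } \bar{x}}
        \times
        \underbrace{\frac{1}{\prod_{i=1}^{n} x_i!}}_{\text{free of } m}, \\[4pt]
\frac{d \log L}{dm} &= -n + \frac{n\bar{x}}{m} = 0
  \quad\Longrightarrow\quad \hat{m} = \bar{x}.
\end{align*}
```

Since the only factor containing m involves the observations through their mean alone, two samples of the same size having the same mean give proportional likelihoods for every value of m; and the value of m which maximises the likelihood is that mean itself.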
While diversity of practice within the limits of efficient statistics will not with large samples lead to inconsistencies, it is, of course, of importance in all cases to distinguish clearly the parameter of the population, of which it is desired to estimate the value, from the actual statistic employed as an estimate of its value; and to inform the reader by which of the considerable variety of processes which exist for the purpose the estimate was actually obtained.
JOC/EFR March 2006