VACETS Regular Technical Column

The VACETS Technical Column is contributed by various members, especially those of the VACETS Technical Affairs Committee. Articles are posted regularly on the vacets@peak.org forum. Please send questions, comments and suggestions to vacets-ta@vacets.org

Mon, 11 Nov 1996

Statistical Notes #4: A Matter Of Deviation

By: T. V. Nguyen

The author of Ecclesiastes remarked that "No man can find out the work that God maketh from the beginning to the end". Indeed, we cannot measure cosmic rays everywhere all the time. We cannot try a new drug on everybody. No one can test every shell or bomb that they manufacture. We have to be content with SAMPLES. The measurements involved in every scientific experiment constitute a sample of that unlimited set of measurements which would result if one performed the same experiment over and over indefinitely. This total set of potential measurements is referred to as the POPULATION. Once a sample of data is collected, we are interested in four questions: (i) how can one describe the sample usefully and clearly; (ii) from the evidence of this sample, how does one best infer conclusions concerning the total population; (iii) how reliable are these conclusions; and (iv) how should samples be taken in order that they may be as illuminating and dependable as possible? In this note, I will discuss answers to the first question.

In the previous article, I mentioned the average (or the mean) as a measure of the central position of a data set. I also pointed out a few pitfalls associated with this statistic. For example, the mean income of a certain class of Harvard University is not a very useful figure if the class happens to include one man who has an income of half a million dollars. The average does not tell us the whole story. Consider the following data on the ages (in years) of three children in each of three groups:

(a) 6, 6, 6

(b) 6, 5, 7

(c) 2, 1, 15

The mean of all three groups is 6 years. But, as you can see, this mean certainly does not adequately reflect the true picture of each data set. In (a) the three values are the same; in (b) the mean seems to be a reasonable representation; while in (c) the mean is a hopeless statistic. We want to know more from the data. We want to know the extent to which the values differ from this mean. The term DISPERSION is used to describe the degree to which a set of values varies about its mean. Other terms that convey this same concept are VARIATION, SCATTER and SPREAD. When a set of values are all close to the mean, they exhibit less dispersion than when some of the values are much larger and/or much smaller than the mean. Four descriptive measures used to express the amount of dispersion in a set of data are the range, the average deviation, the variance and the standard deviation.

The RANGE is defined as the difference between the largest and smallest values in a data set. In the above example, the range in (a) is 0, in (b) 2 and in (c) 14. The range, although easy to compute, is usually an unsatisfactory measure of dispersion, since only two values are used in the computation. In other words, the range does not make use of all the information available in the data it is supposed to describe.
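As a quick check, the ranges quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch; the names `groups` and `value_range` are chosen here for convenience and do not come from the original column.

```python
# Ages (in years) of the three groups of children from the text.
groups = {
    "a": [6, 6, 6],
    "b": [6, 5, 7],
    "c": [2, 1, 15],
}


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


ranges = {name: value_range(ages) for name, ages in groups.items()}
print(ranges)  # {'a': 0, 'b': 2, 'c': 14}
```

Note that only the two extreme values of each group enter the calculation, which is exactly the weakness described above.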

The AVERAGE DEVIATION expresses the average amount by which a set of values differ from their mean. It takes into account the deviation of each value from the mean, x(i) - mean. However, the sum of these deviations, and hence their mean, is always equal to zero. Therefore, some modification must be made in the procedure if it is to lead to a valuable measure of dispersion. An appropriate modification is to ignore the signs, that is, to take the mean of the absolute values of the deviations. The procedure is expressed in the following formula:

Ave Dev = Sum of |x(i) - mean| / N

In the above examples, the average deviation is calculated as:

(a) Ave Dev = [ |6-6| + |6-6| + |6 - 6|] / 3 = 0

(b) Ave Dev = [ |6-6| + |5-6| + |7 - 6| ] / 3 = 0.67

(c) Ave Dev = [ |2-6| + |1-6| + |15 - 6| ] / 3 = 6

Although the average deviation is an intuitive measure of dispersion, its usefulness is limited because the absolute value does not lend itself to further mathematical manipulation. Consequently, it is seldom used as a measure of dispersion.
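The average deviation calculations for the three groups can be sketched in Python as follows. The function name `average_deviation` is hypothetical, introduced only for this illustration; the loop also confirms that the signed deviations sum to zero in every group.

```python
from statistics import mean

# Ages (in years) of the three groups of children from the text.
groups = {
    "a": [6, 6, 6],
    "b": [6, 5, 7],
    "c": [2, 1, 15],
}


def average_deviation(values):
    """Mean of the absolute deviations of the values from their mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


for name, ages in groups.items():
    m = mean(ages)
    signed_sum = sum(x - m for x in ages)  # always zero
    print(name, signed_sum, round(average_deviation(ages), 2))
```

Rounded to two decimals, the results agree with the hand calculations: 0 for (a), 0.67 for (b) and 6 for (c).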

The VARIANCE, like the average deviation, makes use of the deviations of the individual values, x(i), from their mean, that is, x(i) - mean. In computing the variance, negative differences are avoided by squaring, rather than by taking absolute values. The variance of a sample of data, then, may be computed from the formula:

Var = Sum of [x(i) - mean]**2 / N

Thus, the variance is simply the average of the squared deviations of the individual values from their mean. The numerator is called the SUM OF SQUARES ABOUT THE MEAN. The sample variance can also be used to estimate the (unknown) population variance. When it is used in this way, the denominator is (N-1) rather than N, and the symbol s**2 is used to designate it, i.e.

s**2 = Sum of [x(i) - mean]**2 / (N-1)

In the above example, the sample variance is:

(a) s^2 = [(6-6)^2 + (6-6)^2 + (6-6)^2] / 2 = 0

(b) s^2 = [(6-6)^2 + (5-6)^2 + (7-6)^2] / 2 = 1

(c) s^2 = [(2-6)^2 + (1-6)^2 + (15-6)^2] / 2 = 61

(please note ^2 and **2 both denote square)
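The two variance formulas (denominator N and denominator N-1) can be compared side by side in a short Python sketch. The helper names `sum_of_squares`, `variance_n` and `sample_variance` are made up for this example; the standard library functions `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import math
from statistics import pvariance, variance

# Ages (in years) of the three groups of children from the text.
groups = {
    "a": [6, 6, 6],
    "b": [6, 5, 7],
    "c": [2, 1, 15],
}


def sum_of_squares(values):
    """Sum of squared deviations of the values about their mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def variance_n(values):
    """Variance with denominator N."""
    return sum_of_squares(values) / len(values)


def sample_variance(values):
    """Variance with denominator N-1 (estimate of the population variance)."""
    return sum_of_squares(values) / (len(values) - 1)


for name, ages in groups.items():
    # Cross-check against the standard library implementations.
    assert math.isclose(variance_n(ages), pvariance(ages))
    assert math.isclose(sample_variance(ages), variance(ages))
    print(name, round(variance_n(ages), 2), sample_variance(ages))
```

With only three observations per group the two denominators give noticeably different answers, for instance about 40.67 versus 61 for group (c); the gap shrinks as N grows.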

Note that the variance has a unit of years^2 (years squared), which is awkward to interpret. It is desirable to convert this back to the original unit (years). This can be done by taking the positive square root of the variance, and the result is called the STANDARD DEVIATION. This is one of the most widely used measures of dispersion in statistics. The standard deviation is often denoted by s, i.e. in the example:

(a) s = sqrt(0) = 0 years

(b) s = sqrt(1) = 1 year

(c) s = sqrt(61) = 7.8 years

Thus, although the three data sets have an identical mean, the standard deviation tells us that there is no variation in (a), while the variability in (c) is almost 8 times that in (b).
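The standard deviations can be verified with Python's standard library. This is a minimal sketch: `statistics.stdev` uses the N-1 denominator, matching the s defined above, and the dictionary name `sd` is just a label chosen for this example.

```python
from statistics import stdev

# Ages (in years) of the three groups of children from the text.
groups = {
    "a": [6, 6, 6],
    "b": [6, 5, 7],
    "c": [2, 1, 15],
}

# stdev() is the square root of the sample variance (denominator N-1).
sd = {name: stdev(ages) for name, ages in groups.items()}

for name, s in sd.items():
    print(name, round(s, 1))

# How many times larger is the spread in (c) than in (b)?
print(round(sd["c"] / sd["b"], 1))
```

The ratio of the standard deviations of (c) and (b) comes out at about 7.8, which is the "almost 8 times" mentioned above.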

The standard deviation has an important implication in the description of data. P. L. Chebyshev (1821-1894), an eminent Russian mathematician, has shown that in any set of observations, given a mean M and standard deviation S,

at least 75% of the observations are expected to fall within M +/- 2S

at least 89% of the observations are expected to fall within M +/- 3S

at least 94% of the observations are expected to fall within M +/- 4S
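Chebyshev's result is usually stated as a general formula: at least 1 - 1/k^2 of the observations lie within k standard deviations of the mean, for any k greater than 1. The sketch below evaluates that bound, which is 75% at k = 2, about 89% at k = 3, 93.75% at k = 4 and 96% at k = 5. The function names are hypothetical, and the check on the age data uses the standard deviation with denominator N (`statistics.pstdev`), an assumption made here because that is the form for which the bound is guaranteed on a finite data set.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def fraction_within(values, k):
    """Observed fraction of values within k standard deviations of the mean."""
    m, s = mean(values), pstdev(values)
    return sum(abs(x - m) <= k * s for x in values) / len(values)


for k in (2, 3, 4, 5):
    print(k, chebyshev_lower_bound(k))

ages = [2, 1, 15]  # group (c) from the text
print(fraction_within(ages, 2) >= chebyshev_lower_bound(2))
```

The bound holds for any distribution, which is why it is fairly loose; a data set with a known shape can support much tighter statements.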

Thus, given a mean, a standard deviation and the number of observations, we can form a reasonable picture of how the data are distributed. If, in fact, the data are distributed symmetrically in a bell shape (that is, approximately normally) as in Figure 1, then Chebyshev's statement can be sharpened considerably:

about 68% of the observations are expected to fall within M +/- S

about 95% of the observations are expected to fall within M +/- 2S

about 99.7% of the observations are expected to fall within M +/- 3S

How can we compare two samples of data which are measured in different units, say, in kg and in cm? The mean and standard deviation provide a way of getting around the problem of different units of measurement. Consider the quantity which statisticians call the "z-score",

z = [x(i) - M] / s

where x(i) is the ith value, M is the mean and s is the standard deviation. As you can see, the units cancel, so the z-score has no unit. Thus, a transformation of data from the original unit to the z-score allows such comparisons to be made.

The following figure graphs the number of women (y-axis) classified by their bone mass (x-axis). The mean bone mass is 80 mg and the standard deviation is 5 mg. For a woman whose bone mass is 70 mg, the z-score is (70 - 80)/5 = -2; for a woman whose bone mass is 80 mg, the z-score is (80 - 80)/5 = 0; and a woman with a bone mass of 95 mg has a z-score of (95 - 80)/5 = 3. This transformation from mg to z-score is shown beneath the original units on the x-axis of the figure. The interesting feature of the z-score is that it always has a mean of 0 and a standard deviation of 1. Thus, by simply knowing whether a z-score is positive or negative, we know whether the bone mass is above or below the mean. The larger the absolute value of the z-score, the further that bone mass is from the mean. It should be noted that this is also the method that most educational authorities use to standardise students' marks across schools.

Number of women

  200 +                 ***
      |                 ***
      |             *** *** ***
      |             *** *** ***
      |             *** *** ***
  100 +         *** *** *** *** ***
      |         *** *** *** *** ***
      |         *** *** *** *** ***
      |     *** *** *** *** *** *** ***
      | *** *** *** *** *** *** *** *** ***
      +------------------------------------
mg       60  65  70  75  80  85  90  95 100
z-score  -4  -3  -2  -1   0  +1  +2  +3  +4

FIGURE 1: Graph of a distribution of bone mass with a mean of 80 mg and a standard deviation of 5 mg.
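The z-score transformation for the bone mass example can be sketched in Python as follows. The function name `z_score` and the constants `MEAN_MG` and `SD_MG` are illustrative names only. The second half of the sketch standardises the ages of group (c) from earlier in this note to show that a full set of z-scores has mean 0 and standard deviation 1.

```python
from statistics import mean, stdev


def z_score(x, m, s):
    """Number of standard deviations s by which x lies above the mean m."""
    return (x - m) / s


# Bone mass example: mean 80 mg, standard deviation 5 mg.
MEAN_MG, SD_MG = 80, 5
for bone_mass in (70, 80, 95):
    print(bone_mass, z_score(bone_mass, MEAN_MG, SD_MG))

# Standardising a whole sample: the z-scores have mean 0 and SD 1
# (up to floating-point rounding).
ages = [2, 1, 15]
z = [z_score(x, mean(ages), stdev(ages)) for x in ages]
print(round(abs(mean(z)), 6), round(stdev(z), 6))
```

Because the z-score is unitless, the same function could be applied unchanged to measurements in kg, cm or exam marks.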

Now, for bell-shaped data such as those in Figure 1, we can simplify the statement about the spread of the observations even further by using the z-score:

about 68% of the z-scores are expected to fall within +/- 1

about 95% of the z-scores are expected to fall within +/- 2

about 99.7% of the z-scores are expected to fall within +/- 3
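The figures 68%, 95% and 99.7% can be checked numerically under the assumption that the z-scores follow a standard normal (bell-shaped) distribution. For that distribution, the fraction of values within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function. The function name `normal_fraction_within` is a made-up label for this sketch.

```python
import math


def normal_fraction_within(k):
    """Fraction of a normal distribution lying within +/- k standard deviations.

    Assumes the data are exactly normally distributed.
    """
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{normal_fraction_within(k):.2%}")
```

Real data are never exactly normal, so these percentages should be read as approximations rather than guarantees.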

Next time, I will discuss how to use this z-score to work out whether Deng Xiao Ping had chosen Hu Yao Bang (as his successor) by chance.

See you next time.


Tuan Nguyen, Ph.D.
t.nguyen@garvan.unsw.edu.au

For discussion on this column, join vacets-tech@vacets.org


Copyright © 1996 by VACETS and Tuan Nguyen
