Math 340 - Statistics

Lesson 8 - Confidence Intervals and Sample Size

Definitions -

Parameter: A characteristic or measure obtained by using ALL the data values in a population (the mean, for example)

Point estimate: A specific numerical estimate of a parameter.

Interval estimate: An interval or range of values used to estimate a parameter.

Confidence level: The confidence level of an interval estimate is the probability that the interval estimate will contain the parameter.

Confidence interval: A specific interval estimate of a parameter determined by using data obtained from a sample and the specific confidence level of the estimate.

Remember that the Central Limit Theorem stated that as the sample size increases, the shape of the distribution of sample means will approach a normal distribution. This means it will have essentially the same properties as the Normal Distribution, i.e., 68% of the sample means will fall within 1 standard deviation (or standard error) of the population mean.
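
The 68% figure can be checked with a small simulation. The sketch below is only an illustration: the choice of a Uniform(0, 1) population, the sample size of 40, the 2000 repetitions, and the function name are all assumptions made for this example, not part of the lesson.

```python
import random
import statistics

def fraction_within_one_se(n=40, trials=2000, seed=0):
    """Fraction of sample means that land within 1 standard error of the population mean."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    mu = 0.5                       # mean of a Uniform(0, 1) population
    sigma = (1 / 12) ** 0.5        # standard deviation of a Uniform(0, 1) population
    se = sigma / n ** 0.5          # standard error of the mean
    hits = 0
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.random() for _ in range(n))
        if abs(sample_mean - mu) <= se:
            hits += 1
    return hits / trials

print(fraction_within_one_se())    # should come out close to 0.68
```

Even though the population here is flat rather than bell-shaped, the sample means behave approximately like a normal distribution.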

Looking at Table 3, you see that 95% of the sample means will fall within 1.96 standard errors of the population mean (1.96 corresponds to an area of 0.475; this area on each side of the mean adds up to 0.95, or 95%, of the total). Mathematically, this gives the interval:

$$\bar{X} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$

Notes:

$\bar{X}$ is the sample mean

$\pm$ is "plus or minus"

$\sigma$ is the standard deviation

$\sqrt{n}$ is the square root of "n"

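If no table is handy, the link between 1.96 and the area 0.475 can be checked with Python's standard library. This is a minimal sketch; it assumes the table lists the area between the mean and z, which is how the 0.475 figure is used in this lesson.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

# Area between the mean and z = 1.96 (one side of the curve)
area_one_side = std_normal.cdf(1.96) - 0.5

# Going the other way: the z-value whose mean-to-z area is 0.475
z = std_normal.inv_cdf(0.5 + 0.475)

print(round(area_one_side, 3))       # 0.475
print(round(2 * area_one_side, 2))   # 0.95
print(round(z, 2))                   # 1.96
```
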
Remember from Lesson 7 that the z-value for sample means was given by:

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

Using some algebra, you can derive the formula for a confidence interval by following the steps shown. While the math is somewhat complicated, it is important to know what the equation tells us:

“The mean of a population ($\mu$) will be contained within an area of

$$\pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

of the sample mean $\bar{X}$, where $z_{\alpha/2}$ is the z-value (from Table E) that corresponds to that area!”

The term

$$E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

is called the maximum error of the estimate, and is defined as follows:

Maximum error of the estimate (E): The maximum difference between the point estimate of a parameter and the actual value of the parameter

Let’s look at a run-down of how this works...

Say you wish to know the average age of the students at Peru State, and you want to be 95% confident that the answer you give is correct. You take a sample of 40 students, and you note the mean of the ages of these students is 28.5 years. From previous studies, you learn that the standard deviation of age from the mean is 2 years.

1. Since you want the 95% confidence interval of ages that contains the true mean of the entire population (based on your 40-student sample), the first thing you need to do is calculate $z_{\alpha/2}$:

a. $\alpha$ = 1 - confidence level desired (95%, or 0.95, in this case)
= 1 - 0.95 = 0.05

b. $\alpha/2$ = 0.05/2 = 0.025

c. Subtract $\alpha/2$ from 0.5 (1/2 the area under the Normal Distribution curve)
0.5 - 0.025 = 0.475

d. Go to Table E and find the z-value that corresponds to 0.475 (1.96)
This is how you get $z_{\alpha/2}$!

2. Now that we know this, we can solve the problem. Since $\sigma$ = 2 years, and n = 40, we simply substitute these into the formula to get:

$z_{\alpha/2}(\sigma/\sqrt{n}) = (1.96)(2/\sqrt{40}) = 0.62$

3. We know $\bar{X}$, our sample mean, was found to be 28.5 years, so putting this into our formula yields:

(28.5 - 0.6) < $\mu$ < (28.5 + 0.6) -or- 27.9 < $\mu$ < 29.1

or more simply, 28.5 $\pm$ 0.6 years

We can now say, with 95% confidence, that the average age of the students at Peru State is between 27.9 and 29.1 years, based on our 40-student sample!
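
The whole run-down above can be sketched in a few lines of Python. The function names are made up for this illustration, and the table lookup in step d is replaced by the standard library's inverse normal function, so the z-value is 1.95996... instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_alpha_over_2(confidence):
    """z-value for a two-sided interval, following steps a-d above."""
    alpha = 1 - confidence        # step a
    half_alpha = alpha / 2        # step b
    area = 0.5 - half_alpha       # step c: area between the mean and z
    return NormalDist().inv_cdf(0.5 + area)   # step d, in place of the table lookup

def mean_confidence_interval(x_bar, sigma, n, confidence):
    """Return (lower limit, upper limit, maximum error E) for the population mean."""
    e = z_alpha_over_2(confidence) * sigma / sqrt(n)
    return x_bar - e, x_bar + e, e

low, high, e = mean_confidence_interval(x_bar=28.5, sigma=2, n=40, confidence=0.95)
print(round(e, 2), round(low, 1), round(high, 1))   # 0.62 27.9 29.1
```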

**Note: When using a sample mean and a standard deviation, as we will here in this book, always round to the same number of decimal places as the given mean.

On the other side of the coin, given we know what confidence level we wish to achieve, what the population standard deviation is, and our maximum error of the estimate, we can then ask: “What minimum sample size will I have to use to achieve that confidence level?”

What do we know?

1. We have our confidence level...this will let us calculate $z_{\alpha/2}$

2. We have the population standard deviation ($\sigma$)

3. We have the maximum error of the estimate (E)

4. We also know the formula for the maximum error of estimate:

$$E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

so we just have to solve this for n, the required minimum size of our sample

a. Multiply both sides of the equation by the square root of "n"

b. Divide both sides by E

c. Square both sides - this yields the formula for n! (See Example 8-4)
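
Carrying out steps a through c gives $n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$. The sketch below applies that result to the numbers from the age example earlier in the lesson (standard deviation of 2 years, 95% confidence, maximum error 0.62). The function name is an assumption for this illustration, and the answer is rounded up because a sample cannot contain a fraction of a person.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(sigma, max_error, confidence):
    """Smallest n for which z * sigma / sqrt(n) does not exceed max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z for alpha/2 in each tail
    n = (z * sigma / max_error) ** 2
    return ceil(n)   # always round up to the next whole number

print(min_sample_size(sigma=2, max_error=0.62, confidence=0.95))   # 40
```

The result matches the 40-student sample that produced a maximum error of 0.62 in the example.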

When the standard deviation is known and the variable is normally distributed, the process described above will work. It will also work if the standard deviation is not known, as long as the sample size is greater than 30. But what if the standard deviation is not known and the sample size is less than 30? In these cases, we must use a slightly different distribution, known as the t distribution.

(See the green box at the top of page 330 for the characteristics of this distribution)

Confidence Intervals for the Mean - $\sigma$ unknown and n < 30

The t-distribution actually describes a “family” of curves, which differ according to a specific variable, known as the “degrees of freedom”. These degrees of freedom are the number of values in a sample that are free to vary after a statistic (such as the mean) for the sample has been computed.

For example: given a sample of 5 values: 4, 6, 8, 10, 12

The mean of this sample is 8.

Now that we’ve calculated that, throw out the 5 values and start putting in new numbers:

say the first is 7          Can we still build a 5-value sample with a mean of 8?   sure!
say the second is 4      Can we still build a 5-value sample with a mean of 8?   sure!
say the third is 3         Can we still build a 5-value sample with a mean of 8?   sure!
say the fourth is 16     Can we still build a 5-value sample with a mean of 8?   sure!

But - adding these four arbitrary (aka “free”) data values up yields thirty. What value must we put in there that will yield a mean of 8 for the data set? We must use the number 10, since five values with a mean of 8 must total forty.

So we had FOUR degrees of freedom (d.f.) for this sample of FIVE numbers. The degrees of freedom will always be found by subtracting 1 from your sample size! You must take the degrees of freedom into account when using the t-distribution.
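
The same bookkeeping can be written out in code. This is just a sketch of the example above; the function name is made up for the illustration.

```python
def forced_last_value(free_values, sample_size, target_mean):
    """The one remaining value is fixed once the mean and the other values are known."""
    required_total = sample_size * target_mean   # 5 values with mean 8 must total 40
    return required_total - sum(free_values)

free_values = [7, 4, 3, 16]   # the four freely chosen values from the example
last = forced_last_value(free_values, sample_size=5, target_mean=8)

print(sum(free_values), last)   # 30 10
print(len(free_values))         # 4 degrees of freedom
```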

The formula for finding a specific confidence interval when the standard deviation is unknown and your sample size is less than 30 is given in the green box at the top of page 331. The values for $t_{\alpha/2}$ are given in Appendix A, Table F. Notice you’ll need to know the desired confidence level (95%, e.g.) and the degrees of freedom. (Disregard the “One tail” and “Two tails” rows at the top of the chart...we’ll get to them in Chapter 9.)
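
As a rough sketch of how such an interval is computed, the code below uses the usual form $\bar{X} \pm t_{\alpha/2}(s/\sqrt{n})$, which is assumed here to match the textbook's box. The five-value sample from the degrees-of-freedom example is reused purely for illustration. The t-value 2.776 is the standard table entry for 95% confidence and 4 degrees of freedom; it is typed in by hand because Python's standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(data, t_crit):
    """Confidence interval for the mean when sigma is unknown and n is small."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)               # sample standard deviation (divides by n - 1)
    e = t_crit * s / sqrt(n)      # maximum error of the estimate
    return x_bar - e, x_bar + e

sample = [4, 6, 8, 10, 12]        # n = 5, so d.f. = 4
T_CRIT_95_DF4 = 2.776             # looked up: 95% confidence, 4 degrees of freedom

low, high = t_confidence_interval(sample, T_CRIT_95_DF4)
print(round(low, 2), round(high, 2))   # 4.07 11.93
```

The interval is wide because the sample is tiny and the t-value (2.776) is noticeably larger than the 1.96 used with the Normal Distribution.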

If you have trouble knowing when to use the z-values in the Normal Distribution or the t-values in the t-distribution, follow the flow chart at the top of page 333.

Confidence Intervals and Sample Size for Proportions

When we work with proportions (12% of housewives, 20% of doctors, etc.), we use a different method for finding confidence intervals. We obtain the proportions from samples or populations, and proportions have a special set of symbols to help identify them.

Those symbols can be found in the green box at the bottom of page 335.

Here’s an example of how to find the values for $\hat{p}$ and $\hat{q}$:

In a recent survey of 500 Americans, 190 were upset at the way the media was handling the current White House scandal.

Find $\hat{p}$ and $\hat{q}$.

$\hat{p}$ = X/n = 190/500 = 0.38

$\hat{q}$ = 1 - 0.38 = 0.62

To compute a confidence interval when using proportions, we use a slightly modified form of the formula for E:

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

where we have an additional set of criteria similar to what we saw earlier:

$n\hat{p} > 5$ and $n\hat{q} > 5$

**The same method is used when computing $z_{\alpha/2}$ as was discussed earlier. The only difference here is in the rounding: round off to three decimal places when computing the confidence interval for a proportion.
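
Here is a minimal sketch of the calculation for the survey example (190 out of 500). The lesson does not state a confidence level for that survey, so 95% is assumed here, and the function name is made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(x, n, confidence):
    """Confidence interval for a population proportion from x successes in n trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # The method only applies when both of these products are greater than 5
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("need n * p_hat > 5 and n * q_hat > 5")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    e = z * sqrt(p_hat * q_hat / n)   # maximum error of the estimate
    return p_hat - e, p_hat + e

low, high = proportion_confidence_interval(x=190, n=500, confidence=0.95)
print(round(low, 3), round(high, 3))   # roughly 0.337 and 0.423
```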

As before, computing the necessary sample size for a set confidence interval is simply a matter of rearranging the formula above to solve for n, and is given at the top of page 339 in the green box.

**Note: If no approximation of $\hat{p}$ is known or given, use $\hat{p}$ = 0.5. This value makes the product $\hat{p}\hat{q}$ as large as it can be, so it gives a sample size large enough no matter what the true proportion turns out to be.
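
Solving the formula for E above for n gives $n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2$. The sketch below applies it twice. The maximum error of 0.05 is a made-up value for illustration, and the function name is an assumption as well.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size_proportion(max_error, confidence, p_hat=0.5):
    """Smallest n that keeps the maximum error for a proportion within max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    q_hat = 1 - p_hat
    return ceil(p_hat * q_hat * (z / max_error) ** 2)   # round up to a whole number

# No prior estimate of the proportion: fall back on p_hat = 0.5
print(min_sample_size_proportion(max_error=0.05, confidence=0.95))               # 385

# With the survey's estimate of 0.38, a smaller sample is enough
print(min_sample_size_proportion(max_error=0.05, confidence=0.95, p_hat=0.38))   # 363
```

The second result is smaller than the first, which shows why 0.5 is the safe choice when nothing is known about the proportion.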

Confidence Intervals for Variances and Standard Deviations

Since variances and standard deviations are used all the time in industry, the medical professions, farming, etc., it is important that we know how to compute confidence intervals and sample sizes for these as well. But to do that we need yet another type of statistical distribution: the "Chi-square" distribution. Note: $\chi$ is the Greek letter "Chi".

It’s similar to the t-distribution in that it is a family of curves based on degrees of freedom. Table G in Appendix A gives values for the Chi-square distribution. You’ll notice that it looks a little different from the ones we’ve seen. There are five columns to the left and five to the right, and they are used independently to come up with the values we want.

Here’s how it works:

To find the values we need for a 95% confidence interval,

1. Get $\alpha$ by subtracting the confidence level from 1 (here 1 - 0.95 = 0.05)

2. Compute $\alpha/2$: 0.05 / 2 = 0.025. This is the column on the right side of the table, and will be used to determine $\chi_2$

3. Subtract $\alpha/2$ from 1 to get 0.975. This is the column on the left side of the table, and will be used to determine $\chi_1$

4. Find the appropriate number of degrees of freedom (remember d.f. = n-1)

5. Then simply plug these values into the formula listed in the green box in the middle of page 344. Examples 8-13 and 8-14 will help guide you through the process, if you get stuck. Rounding is done to the same number of decimal places as those given in the variance or standard deviation.
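
The steps above can be sketched in code. The formula used is the usual one for a variance interval, (n - 1)s² divided by the right-column table value for the lower limit and by the left-column value for the upper limit; it is assumed here to match the textbook's box. The five-value sample from the degrees-of-freedom example is reused purely for illustration, and the two table values (4 degrees of freedom, columns 0.025 and 0.975) are typed in by hand because Python's standard library has no Chi-square distribution.

```python
from math import sqrt
from statistics import variance

def variance_confidence_interval(data, chi2_right, chi2_left):
    """Confidence interval for the population variance from Chi-square table values."""
    df = len(data) - 1            # degrees of freedom, n - 1
    s_squared = variance(data)    # sample variance (divides by n - 1)
    return df * s_squared / chi2_right, df * s_squared / chi2_left

sample = [4, 6, 8, 10, 12]        # n = 5, so d.f. = 4
CHI2_RIGHT = 11.143               # table value: d.f. = 4, column 0.025
CHI2_LEFT = 0.484                 # table value: d.f. = 4, column 0.975

low, high = variance_confidence_interval(sample, CHI2_RIGHT, CHI2_LEFT)
print(round(low, 1), round(high, 1))               # 3.6 82.6
print(round(sqrt(low), 1), round(sqrt(high), 1))   # 1.9 9.1 (standard deviation)
```

Taking the square root of both limits turns the variance interval into an interval for the standard deviation.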

HOMEWORK:

Read the rest of chapter 6 in the text.