Canadian Family Physician • Le Médecin de famille canadien, Vol 47: July • Juillet 2001

Resources ❖ Ressources

Hypothesis: The Research Page

Part 3: Power, sample size, and clinical significance

Marshall Godwin, MD, CCFP, FCFP

This third article in a series on basic statistics deals with power and sample size, α and β errors, and clinical and statistical significance.1-3

In plain English
Confidence intervals (CIs) and P values help us determine the likelihood that a difference in a study is due to chance. Power deals with the opposite issue: if we do not see a statistical difference, how can we be sure there really is no difference?
The chance factor (P value), set arbitrarily at 5% or .05 and accepted as the standard, is also called α. If we set α too high, say at 10% or .1, we run the risk of making an "α error," where we say a difference exists when in reality the difference was due to chance. If we set α too low, we run the risk of missing a difference that does exist. The possibility of concluding that a difference does not exist when it does is called a "β error."

By convention, a β of .2 or 20% is considered the maximum acceptable. It seems we are more willing to risk making a β error (incorrectly concluding that a difference does not exist) than to risk making an α error (incorrectly concluding that a difference exists).
The power of a study is the degree to which we can be certain that, if we conclude a difference does not exist, it in fact does not exist. It is calculated as 1 minus β (power = 1 - β). Since β is generally set at .2, the accepted level of power is 1 - .2 = .8, or 80%. The way to ensure a power of 80% is to do a sample size calculation, which answers the question, "If I want to be 95% certain that any difference I see is not due to chance, and 80% certain that if I conclude there is no difference I am correct, how many people do I need in this study?"

If an article concludes no difference was found, the authors should tell you the level of certainty (power) with which they can make that conclusion.
Of course, if a statistical difference is seen (P < .05, or the 95% CI does not include 1), then by definition there was sufficient power. If the power is very high (ie, if the sample is huge compared with the number actually needed, so that the power is, say, 99%), statistical differences can be seen even when the real clinical differences are very small. For instance, a relative risk (RR) of 1.2 with a 95% CI of 1.1 to 1.3 might be highly statistically significant; the CI is very narrow because of the large sample size. This means the difference is likely to be real and not due to chance, but is the difference clinically significant? Sometimes it is, depending on the seriousness of the issue. If the RR indicates a child is 1.2 times more likely to die within the next 3 months if exposed to X, it is highly clinically significant. If it indicates that people are 1.2 times more likely to get a runny nose if they go out in cold weather without a hat, it is less important.
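The way a large sample narrows the CI around an RR of 1.2 can be shown with the standard log-method CI for a relative risk. The counts below are hypothetical, chosen to give 12% versus 10% risk (RR = 1.2); only the sample size changes between the two calls:

```python
import math

def rr_ci(n_exposed, cases_exposed, n_unexposed, cases_unexposed):
    """Relative risk with its 95% CI via the standard log method."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    se = math.sqrt(1 / cases_exposed - 1 / n_exposed
                   + 1 / cases_unexposed - 1 / n_unexposed)
    lo, hi = (math.exp(math.log(rr) + z * se) for z in (-1.96, 1.96))
    return rr, lo, hi

# Small study: the CI is wide and crosses 1 (not significant).
print(rr_ci(100, 12, 100, 10))

# Huge study, same 12% vs 10% risks: the CI is narrow, roughly
# 1.1 to 1.3, so the same RR of 1.2 is statistically significant.
print(rr_ci(10_000, 1200, 10_000, 1000))
```

Whether that statistically significant RR of 1.2 matters clinically is, as the text notes, a separate judgment.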
In statistical terms
Suppose the prevalence of angelitis in your city is 10%.
A new drug, angel dust, has been discovered that seems to be effective for treating angelitis. You want to do a study to determine how effective it actually is.
The first thing you need to consider is what a clinically significant decrease in angelitis would be. How much would the drug have to decrease the prevalence of angelitis to make it useful? You decide that, if angel dust can reduce the prevalence of angelitis from 10% to 6%, it would be worthwhile. You now have to determine how many people you need in your study to make your results statistically significant should they show a decrease to 6%.
You know that true prevalence in the untreated population is 10% (Figure 1A). If your city had a population of 100 000 and you randomly sampled 500 of these people, you might not get exactly 10% with angelitis. You might get 9% or 8% or perhaps 12%. If you kept taking different 500-person samples, you would get a range of prevalences that followed a normal curve (Figure 1B), and 95% of all results would fall within two standard deviations (SDs) of the middle of that curve; 5% would be outside (2.5% above and 2.5% below) those two SDs. The larger the samples (eg, 1000-person samples), the narrower and taller this normal curve would look (Figure 1C). If you took smaller samples, the curve would be spread out and flatter (Figure 1D).
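This repeated-sampling idea can be simulated directly. The sketch below (a hypothetical simulation, not from the article) draws many 500-person samples from a population with 10% prevalence, checks that roughly 95% of sample prevalences land within 2 SDs of the middle, and confirms that 1000-person samples give a narrower curve:

```python
import random
import statistics

random.seed(1)     # fixed seed so the sketch is reproducible
TRUE_PREV = 0.10   # true prevalence of angelitis in the city

def sample_prevalence(n):
    """Prevalence observed in one random sample of n people."""
    return sum(random.random() < TRUE_PREV for _ in range(n)) / n

# Repeated 500-person samples scatter around 10% in a bell curve,
# with about 95% of results inside 2 SDs of the middle:
prevs = [sample_prevalence(500) for _ in range(2000)]
mean = statistics.mean(prevs)
sd = statistics.pstdev(prevs)
within = sum(abs(p - mean) <= 2 * sd for p in prevs) / len(prevs)
print(f"mean={mean:.3f}  sd={sd:.4f}  within 2 SDs: {within:.1%}")

# Larger samples give a narrower, taller curve (smaller SD):
sd_1000 = statistics.pstdev([sample_prevalence(1000) for _ in range(2000)])
print(f"sd with 1000-person samples: {sd_1000:.4f}")
```

Running this shows the mean of the sample prevalences sitting very close to 10%, with the 1000-person SD noticeably smaller than the 500-person SD, exactly the narrowing Figures 1B to 1D describe.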
How does this affect your study? Figure 2A shows the 10% prevalence rate with a normal curve
from taking many 200-person samples. The 6% mark comes well within the normal curve of the population with a mean prevalence of 10%. This means that, if you see prevalence reduced to 6% after treatment with angel dust, it could be due to chance, because a sample taken from the population not taking angel dust could also give you the same result. To be 95% certain that angel dust is having an effect, we need to change the shape of that curve so that the 6% mark falls below 2 SDs of the normal curve. To do that, we increase the sample size to 750 and get the result shown in Figure 2B. Now, if the people taking angel dust have a 6% prevalence rate, we can say with 95% certainty that it is truly a difference and not likely to be a chance occurrence. The possibility that we are wrong is less than 5% (P < .05).
There is another problem, however. If, as we hoped, the true prevalence of angelitis in people taking angel dust is 6%, then the results we will get as we do more studies will not always be exactly 6%. They will follow a normal curve around 6% (Figure 3A).
At a sample size of 750, we thought we were safe, and we are if the result we get is 6% or very close to it, because that will show a statistical difference. But what if we get a result of 7.5%? It is inside 2 SDs of the population curve where 10% is the mean (the curve on the right), so we say it is not significant. But it is also well within the normal distribution of the curve where 6% is the mean (the curve on the left). It could be coming from either population, so we could be making a mistake, a β error (where we say a difference does not exist when it does).
To avoid this, we must make the normal curves even narrower to decrease the overlap. We want the overlap to be 20% or less (remember the β of 20% and the power of 80% discussed above). Figure 3B shows the effect of increasing the sample size in both populations to 1200. A result of 7.5% is still not statistically significant, but the likelihood that we are making a mistake is smaller; there is less likelihood that the 7.5% result belongs to the population curve with 6% as its mean. The formulas for calculating sample size can tell us exactly when the degree of overlap is 20% or less.
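A sample size meeting these α and β requirements can be approximated with the standard normal-approximation formula for comparing two proportions. This is a sketch of that textbook formula, not necessarily the exact method behind the article's figures of 750 and 1200; for the 10% to 6% drop it gives roughly 721 per group, in the same range as the article's numbers:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference between two
    proportions, using the textbook normal-approximation formula."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a drop from 10% to 6% with alpha .05 and power 80%:
print(n_per_group(0.10, 0.06))  # 721 per group with this formula
```

In practice, researchers usually hand p1, p2, α, and power to dedicated sample size software, which may use exact or continuity-corrected methods and so give slightly different numbers.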
Figure 1. Estimating prevalence: A) True prevalence is 10%. B) Results of repeated samples follow a normal bell curve. C) Larger samples increase height and decrease width of curve. D) Smaller samples decrease height and increase width of curve.
Figure 2. What happens when sample size is increased? A) Normal curve from many 200-person samples. B) Taller, narrower curve from 750-person samples.
Sometimes it is not possible, for logistical reasons, to increase the sample size. The solution, apparent from the figures, is to accept being able to show statistical significance only for a larger difference. If you decide to look for a decrease to 4% prevalence, you would need a smaller sample size because the means of the curves are further apart. You would have to accept the fact that, if you found that angel dust decreased the prevalence to 6%, you might not be able to say it was statistically significant because of the lower power.
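This trade-off can be checked with the same textbook two-proportion formula used earlier (again a sketch, not the article's own calculation): aiming for a larger difference (10% to 4%) requires far fewer people than aiming for a smaller one (10% to 6%):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    # Standard normal-approximation sample-size formula (a sketch).
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Curves with means further apart need less narrowing, so fewer people:
n_small_effect = n_per_group(0.10, 0.06)  # detect a drop to 6%
n_large_effect = n_per_group(0.10, 0.04)  # detect a drop to 4%
print(n_small_effect, n_large_effect)
```

The price, as the text says, is that a study powered only for the 4% target may be unable to declare a real decrease to 6% statistically significant.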
Dr Godwin is an Associate Professor and Director of Research in the Department of Family Medicine at Queen’s University in Kingston, Ont.
References
1. Norman GR, Streiner DL. PDQ statistics. Philadelphia, Pa: BC Decker Inc; 1986.
2. Abramson JH. Making sense of data: a self-instruction manual on the interpretation of epidemiologic data. 2nd ed. New York, NY: Oxford University Press; 1994.
3. Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine: how to practice and teach EBM. 2nd ed. Toronto, Ont: Churchill Livingstone; 2000.
Figure 3. Reducing β error: A) Results that fall within overlap of bell curves could be a mistake. B) Increasing sample size reduces likelihood of making a mistake.