What is the standard error of the difference in two proportions?
The standard error for the difference in two proportions can take different values and this depends on whether we are finding confidence interval (for the difference in proportions) or whether we are using hypothesis testing (for testing the significance of a difference in the two proportions). The following are three cases for the standard error.
Case 1: The standard error used for the confidence interval of the difference in two proportions is given by:
where is the size of Sample 1, is the size of Sample 2, is the sample proportion of Sample 1 and is the sample proportion of Sample 2.
Case 2: The standard error used for hypothesis testing of difference in proportions with is given by:
where is the pooled sample proportion given by where is the number of successes in Sample 1, is the number of successes in Sample 2, is the size of Sample 1 and is the size of Sample 2.
Case 3: The standard error used for hypothesis testing of difference in proportions with is given by:
where is the size of Sample 1, is the size of Sample 2, is the sample proportion of Sample 1, is the sample proportion of Sample 2 and .
Derivations
In the following we give a step-by-step derivation for the standard error for each case.
Suppose that we have two samples: Sample 1 of size and Sample 2 of size . Let Sample 1 consist of elements . Each element (for ) could take the value 1 representing a success or the value 0 representing a fail. Let be the (true and unknown) population proportion for the elements found in Sample 1. That is, an element (for ) of Sample 1 has a probability of showing a value of 1 (i.e. of being a success). Similarly, Sample 2 is defined by the elements , and is the (true and unknown) population proportion for the elements found in Sample 2. Let us also define to be the number of successes in Sample 1, i.e., and let be the number of successes in Sample 1, i.e., .
We are after:
Case 1: A confidence interval for the difference in the (population) proportions, i.e., ,
Case 2: Testing the hypotheses whether or not the two (population) proportions are equal , or,
Case 3: Testing the hypotheses whether or not the two (population) proportions differ by some particular number .
However we do not known the true values of the population parameters and , and hence we rely on estimates. Let be the sample proportion of successes for Sample 1. Thus:
Let be the sample proportion of successes for Sample 1. Thus:
We are going to assume that the sampled elements are independent (that is, the fact that a sample element is 1 (or 0) has no effect on whether another element is 1 or 0). Note that each element in Sample 1 follows the Bernoulli distribution with parameter and each element in Sample 2 follows the Bernoulli distribution with parameter . Let us find the probability distributions of and . Let us first start with that for and the one for will follow in a similar fashion.
Since each is Bernoulli distributed with parameter , and assuming independence, then follows the binomial distribution with mean and variance . Moreover, since the ‘s are i.i.d. (independently and identically distributed), then by the Central Limit Theorem, for sufficiently large , is normally distributed. Hence:
Thus:
Similarly, we can derive the probability distribution for , which is given by:
From the theory of probability, a well-known results states that the sum (or difference) of two normally-distributed random variable is normally-distributed. Thus the distribution of the difference in sample proportions is normally distributed. The mean of is given by . Moreover and are independent. This follows from the fact that the sample elements are independent. Thus we have . The probability distribution of the difference in sample proportions is given by:
Case 1: We would like to find the confidence interval for the true difference in the two population proportions, that is, .
Since: then:
.
The variance of is unknown as must be estimated in order to derive the confidence interval. We use as an estimate for and as an estimate for . Thus we replace with and with in the standard deviation and obtain the following estimated standard error:
The % confidence level for the difference in population proportions is given by:
where is the stardardised score with a cumulative probability of .
Case 2: Here we would like to test whether there is a significant difference between the population proportion.
This is hypothesis testing using the following null and alternative hypotheses:
:
:
We know that:
and thus:
Let us consider the -statistics in the case of the null hypothesis, i.e., let us see what happens to the value of when we assume that . First of all we replace the in the numerator by 0. We need to replace the and the in the denominator by estimates. We are assuming that and are equal and so we just have to estimate one value. Every element in the sample, be it Sample 1 or Sample 2, has the same probability of being a success (since ). Hence (or pi_2) is estimated by , i.e., the number of successes in Sample 1 plus the number of successes in Sample 2, divided by the sample size. This is called the pooled sample proportion, because, since , we are combining Sample 1 with Sample 2, and thus we have just one pooled sample. So the -statistics becomes:
where .
Hence the (estimated) standard error used for hypothesis testing of a significant difference in proportions is:
Case 3: Here we would like to test whether the difference between the population proportion deviates by a certain value.
This is hypothesis testing using the following null and alternative hypotheses:
:
:
for some pre-defined real number .
We know that:
and thus:
Let us consider the -statistics in the case of the null hypothesis, i.e., let us see what happens to the value of when we assume that . First of all we replace the in the numerator by . We need to replace the and the in the denominator by estimates. We will replace by . In Case 1, for the confidencce interval, we estimated by the sample proportion . However here we are going to use the information that . Thus we are going to estimate by a weighted average of and as follows:
The -statistic then becomes:
Hence the (estimated) standard error used for hypothesis testing of a difference in proportions by a certain value is:
where .