What is the standard error of the difference in two proportions?

The standard error for the difference in two proportions can take different values, depending on whether we are constructing a confidence interval for the difference in proportions or carrying out a hypothesis test on the significance of the difference. The following are three cases for the standard error.

Case 1: The standard error used for the confidence interval of the difference in two proportions is given by:

S.E.=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}

where n_1 is the size of Sample 1, n_2 is the size of Sample 2, p_1 is the sample proportion of Sample 1 and p_2 is the sample proportion of Sample 2.
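As a quick sketch, the Case 1 standard error can be computed directly; the counts below are made up for illustration:

```python
import math

# Hypothetical counts: 40 successes out of 100 in Sample 1,
# 30 successes out of 120 in Sample 2
n1, r1 = 100, 40
n2, r2 = 120, 30
p1, p2 = r1 / n1, r2 / n2  # sample proportions

# Case 1 standard error (used for the confidence interval)
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
```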

Case 2: The standard error used for hypothesis testing of difference in proportions with H_0:\ \pi_1-\pi_2=0 is given by:

S.E.=\sqrt{P(1-P)(\frac{1}{n_1}+\frac{1}{n_2})}

where P is the pooled sample proportion given by P=\frac{r_1+r_2}{n_1+n_2} where r_1 is the number of successes in Sample 1, r_2 is the number of successes in Sample 2, n_1 is the size of Sample 1 and n_2 is the size of Sample 2.
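A minimal sketch of the pooled computation, again with made-up counts:

```python
import math

# Hypothetical counts: 40/100 successes in Sample 1, 30/120 in Sample 2
n1, r1 = 100, 40
n2, r2 = 120, 30

P = (r1 + r2) / (n1 + n2)  # pooled sample proportion
se = math.sqrt(P * (1 - P) * (1 / n1 + 1 / n2))
```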

Case 3: The standard error used for hypothesis testing of difference in proportions with H_0:\ \pi_1-\pi_2=c is given by:

S.E.=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p'_2(1-p'_2)}{n_2}}

where n_1 is the size of Sample 1, n_2 is the size of Sample 2, p_1 is the sample proportion of Sample 1, p_2 is the sample proportion of Sample 2 and p'_2=\frac{n_1(p_1-c)+n_2p_2}{n_1+n_2}.
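Again as a sketch with made-up counts, and using the relation \pi_2=\pi_1-c implied by the null hypothesis:

```python
import math

# Hypothetical counts, testing H0: pi1 - pi2 = c with c = 0.1
n1, r1 = 100, 40
n2, r2 = 120, 30
c = 0.1
p1, p2 = r1 / n1, r2 / n2

# Weighted estimate of pi2 under H0 (H0 implies pi2 = pi1 - c)
p2_prime = (n1 * (p1 - c) + n2 * p2) / (n1 + n2)
se = math.sqrt(p1 * (1 - p1) / n1 + p2_prime * (1 - p2_prime) / n2)
```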

Derivations

In the following we give a step-by-step derivation for the standard error for each case.

Suppose that we have two samples: Sample 1 of size n_1 and Sample 2 of size n_2. Let Sample 1 consist of the n_1 elements X_{11},X_{12},\cdots,X_{1n_1}. Each element X_{1i} (for i=1,2,\cdots,n_1) takes the value 1, representing a success, or the value 0, representing a failure. Let \pi_1 be the (true and unknown) population proportion for the elements found in Sample 1. That is, an element X_{1i} (for i=1,2,\cdots,n_1) of Sample 1 has a probability \pi_1 of showing a value of 1 (i.e. of being a success). Similarly, Sample 2 consists of the elements X_{21}, X_{22}, \cdots, X_{2n_2}, and \pi_2 is the (true and unknown) population proportion for the elements found in Sample 2. Let us also define r_1 to be the number of successes in Sample 1, i.e., r_1=X_{11}+X_{12}+\cdots+X_{1n_1}, and let r_2 be the number of successes in Sample 2, i.e., r_2=X_{21}+X_{22}+\cdots+X_{2n_2}.

We are after:

Case 1: A confidence interval for the difference in the (population) proportions, i.e., \pi_1-\pi_2,

Case 2: Testing the hypothesis that the two (population) proportions are equal, i.e., \pi_1-\pi_2=0, or,

Case 3: Testing the hypothesis that the two (population) proportions differ by some particular number c, i.e., \pi_1-\pi_2=c.

However, we do not know the true values of the population parameters \pi_1 and \pi_2, and hence we rely on estimates. Let p_1 be the sample proportion of successes for Sample 1. Thus:

p_1=\frac{X_{11}+X_{12}+\cdots+X_{1n_1}}{n_1}=\frac{r_1}{n_1}

Let p_2 be the sample proportion of successes for Sample 2. Thus:

p_2=\frac{X_{21}+X_{22}+\cdots+X_{2n_2}}{n_2}=\frac{r_2}{n_2}
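These definitions translate directly into code; the 0/1 samples below are made up for illustration:

```python
# Hypothetical 0/1 samples (1 = success, 0 = failure)
sample1 = [1, 0, 1, 1, 0, 0, 1, 0]
sample2 = [0, 1, 0, 0, 1, 0]

n1, n2 = len(sample1), len(sample2)   # sample sizes
r1, r2 = sum(sample1), sum(sample2)   # numbers of successes
p1, p2 = r1 / n1, r2 / n2             # sample proportions p1 = r1/n1, p2 = r2/n2
```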

We are going to assume that the sampled elements are independent (that is, whether a sample element is 1 or 0 has no effect on whether another element is 1 or 0). Note that each element in Sample 1 follows the Bernoulli distribution with parameter \pi_1 and each element in Sample 2 follows the Bernoulli distribution with parameter \pi_2. Let us find the probability distributions of p_1 and p_2. We first derive that of p_1; the one for p_2 will follow in a similar fashion.

Since each X_{1i} is Bernoulli distributed with parameter \pi_1, and assuming independence, r_1=X_{11}+X_{12}+\cdots+X_{1n_1} follows the binomial distribution with mean n_1\pi_1 and variance n_1\pi_1(1-\pi_1). Moreover, since the X_{1i}'s are i.i.d. (independently and identically distributed), then by the Central Limit Theorem, for sufficiently large n_1, r_1 is approximately normally distributed. Hence:

r_1\sim \mathcal{N}(n_1\pi_1,n_1\pi_1(1-\pi_1))

Thus:

p_1=\frac{r_1}{n_1}\sim \mathcal{N}(\pi_1,\frac{\pi_1(1-\pi_1)}{n_1})

Similarly, we can derive the probability distribution for p_2, which is given by:

p_2\sim \mathcal{N}(\pi_2,\frac{\pi_2(1-\pi_2)}{n_2})

From the theory of probability, a well-known result states that the sum (or difference) of two normally-distributed random variables is normally distributed. Thus the difference in sample proportions p_1-p_2 is normally distributed. The mean of p_1-p_2 is given by \mathbb{E}(p_1-p_2)=\mathbb{E}(p_1)-\mathbb{E}(p_2)=\pi_1-\pi_2. Moreover, p_1 and p_2 are independent. This follows from the fact that the sample elements are independent. Thus we have Var(p_1-p_2)=Var(p_1)+Var(p_2)=\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}. The probability distribution of the difference in sample proportions is given by:

p_1-p_2\sim \mathcal{N}(\pi_1-\pi_2,\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2})
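This sampling distribution can be checked empirically with a small simulation (the values of \pi_1, \pi_2, n_1 and n_2 below are arbitrary): the empirical mean and variance of p_1-p_2 should be close to \pi_1-\pi_2 and \frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}.

```python
import random

random.seed(0)                 # reproducible
pi1, pi2 = 0.6, 0.4            # assumed true population proportions
n1, n2 = 200, 150
reps = 10000

# Draw many pairs of samples and record the difference in sample proportions
diffs = []
for _ in range(reps):
    r1 = sum(random.random() < pi1 for _ in range(n1))
    r2 = sum(random.random() < pi2 for _ in range(n2))
    diffs.append(r1 / n1 - r2 / n2)

mean = sum(diffs) / reps
var = sum((d - mean) ** 2 for d in diffs) / reps
# mean should be near pi1 - pi2 = 0.2,
# var should be near 0.6*0.4/200 + 0.4*0.6/150 = 0.0028
```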

Case 1: We would like to find the confidence interval for the true difference in the two population proportions, that is, \pi_1-\pi_2.

Since p_1-p_2\sim \mathcal{N}(\pi_1-\pi_2,\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}), then:

Z=\frac{p_1-p_2-(\pi_1-\pi_2)}{\sqrt{\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}}}\sim \mathcal{N}(0,1).

The variance of p_1-p_2 is unknown and must be estimated in order to derive the confidence interval. We use p_1 as an estimate for \pi_1 and p_2 as an estimate for \pi_2. Thus we replace \pi_1 with p_1 and \pi_2 with p_2 in the standard deviation and obtain the following estimated standard error:

S.E.=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}

The 100(1-\alpha)\% confidence interval for the difference in population proportions is given by:

p_1-p_2-z_\frac{\alpha}{2} S.E.\leq \pi_1-\pi_2 \leq p_1-p_2+z_\frac{\alpha}{2} S.E.

where z_\frac{\alpha}{2} is the standardised score with a cumulative probability of 1-\frac{\alpha}{2}.
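Putting Case 1 together, here is a sketch of the confidence-interval computation with made-up counts, using Python's statistics.NormalDist to obtain z_\frac{\alpha}{2}:

```python
import math
from statistics import NormalDist

# Hypothetical data: 40/100 successes vs 30/120, 95% confidence (alpha = 0.05)
n1, r1 = 100, 40
n2, r2 = 120, 30
alpha = 0.05
p1, p2 = r1 / n1, r2 / n2

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96
lower = (p1 - p2) - z * se
upper = (p1 - p2) + z * se
```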

Case 2: Here we would like to test whether there is a significant difference between the population proportions.

This is hypothesis testing using the following null and alternative hypotheses:

H_0: \pi_1-\pi_2=0

H_1: \pi_1-\pi_2\neq 0

We know that:

p_1-p_2\sim \mathcal{N}(\pi_1-\pi_2,\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2})

and thus:

Z=\frac{p_1-p_2-(\pi_1-\pi_2)}{\sqrt{\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}}}

Let us consider the Z-statistic under the null hypothesis, i.e., let us see what happens to the value of Z when we assume that \pi_1=\pi_2. First of all, we replace the \pi_1-\pi_2 in the numerator by 0. We need to replace the \pi_1 and the \pi_2 in the denominator by estimates. We are assuming that \pi_1 and \pi_2 are equal, and so we just have to estimate one value. Every element, be it in Sample 1 or Sample 2, has the same probability of being a success (since \pi_1=\pi_2). Hence \pi_1 (or \pi_2) is estimated by P=\frac{r_1+r_2}{n_1+n_2}, i.e., the number of successes in Sample 1 plus the number of successes in Sample 2, divided by the total sample size. This is called the pooled sample proportion because, since \pi_1=\pi_2, we are combining Sample 1 with Sample 2, and thus we have just one pooled sample. So the Z-statistic becomes:

    \begin{equation*} \begin{split} Z&=\frac{p_1-p_2-0}{\sqrt{\frac{P(1-P)}{n_1}+\frac{P(1-P)}{n_2}}}\\ &=\frac{p_1-p_2}{\sqrt{P(1-P)(\frac{1}{n_1}+\frac{1}{n_2})}} \end{split} \end{equation*}

where P=\frac{r_1+r_2}{n_1+n_2}.

Hence the (estimated) standard error used for hypothesis testing of a significant difference in proportions is:

S.E.=\sqrt{P(1-P)(\frac{1}{n_1}+\frac{1}{n_2})}
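A sketch of the full Case 2 test with made-up counts; the two-sided p-value comes from the standard normal CDF:

```python
import math
from statistics import NormalDist

# Hypothetical data: 40/100 successes vs 30/120
n1, r1 = 100, 40
n2, r2 = 120, 30
p1, p2 = r1 / n1, r2 / n2

P = (r1 + r2) / (n1 + n2)                     # pooled sample proportion
se = math.sqrt(P * (1 - P) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                            # Z-statistic under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
```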

Case 3: Here we would like to test whether the difference between the population proportions equals a particular value.

This is hypothesis testing using the following null and alternative hypotheses:

H_0: \pi_1-\pi_2=c

H_1: \pi_1-\pi_2\neq c

for some pre-defined real number c.

We know that:

p_1-p_2\sim \mathcal{N}(\pi_1-\pi_2,\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2})

and thus:

Z=\frac{p_1-p_2-(\pi_1-\pi_2)}{\sqrt{\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}}}

Let us consider the Z-statistic under the null hypothesis, i.e., let us see what happens to the value of Z when we assume that \pi_1-\pi_2=c. First of all, we replace the \pi_1-\pi_2 in the numerator by c. We need to replace the \pi_1 and the \pi_2 in the denominator by estimates. We will replace \pi_1 by p_1=\frac{r_1}{n_1}. In Case 1, for the confidence interval, we estimated \pi_2 by the sample proportion p_2=\frac{r_2}{n_2}. However, here we are going to use the information that, under the null hypothesis, \pi_2=\pi_1-c. Thus we are going to estimate \pi_2 by a weighted average of p_1-c and p_2 as follows:

    \begin{equation*} \begin{split} p'_2&=\frac{n_1}{n_1+n_2}(p_1-c)+\frac{n_2}{n_1+n_2}p_2\\ &=\frac{n_1(p_1-c)+n_2p_2}{n_1+n_2} \end{split} \end{equation*}

The Z-statistic then becomes:

Z=\frac{p_1-p_2-c}{\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p'_2(1-p'_2)}{n_2}}}

Hence the (estimated) standard error used for testing whether the difference in proportions equals a certain value is:

S.E.=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p'_2(1-p'_2)}{n_2}}

where p'_2=\frac{n_1(p_1-c)+n_2p_2}{n_1+n_2}.
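Finally, a sketch of the full Case 3 test with made-up counts, using the relation \pi_2=\pi_1-c implied by the null hypothesis:

```python
import math
from statistics import NormalDist

# Hypothetical data, testing H0: pi1 - pi2 = 0.05
n1, r1 = 100, 40
n2, r2 = 120, 30
c = 0.05
p1, p2 = r1 / n1, r2 / n2

# Weighted estimate of pi2 under H0 (H0 implies pi2 = pi1 - c)
p2_prime = (n1 * (p1 - c) + n2 * p2) / (n1 + n2)
se = math.sqrt(p1 * (1 - p1) / n1 + p2_prime * (1 - p2_prime) / n2)
z = (p1 - p2 - c) / se                        # Z-statistic under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
```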