Sample size for categorical data

I have a population of 200,000 phone calls. There are different reasons for each call, but let's assume the number of reasons is known, e.g. 7 different call reasons: 1) Check on order 2) Cancel order 3) Billing information 4) Account information, etc. I want a sample that is large enough for the calls I listen to to be representative of the distribution of reasons in the overall population of 200,000 calls. What should the sample size be, and which methods can I use to calculate it?

asked Mar 17, 2015 at 18:23

What does "representative of . " mean in this context?

Commented Mar 17, 2015 at 18:29

To clarify: Do you know the reason(s) for each call in advance, or is your purpose to listen and ascertain the reasons? Are the calls ordered in any way (e.g. in time)? Do you have any other advance information about them, e.g. length of call, that might be important?

Commented Mar 17, 2015 at 22:20

@gung - representative means being confident that the sample I have taken will include all the reasons that customers could be calling about, and in the right proportions. I.e. if the distribution in the original population was Check order 20%, Cancel 15%, Billing 10%, etc., then my sample will have the same distribution.

Commented Mar 19, 2015 at 11:29

@SteveSamuels - I don't know the reason for each call beforehand; the purpose is to ascertain the reasons. The calls do have a timestamp, and there is other information such as the length of the call, but let's assume these factors don't make a difference, i.e. calls for cancels and billing are randomly distributed through time and the length of each call is not linked to the reason for the call.

Commented Mar 19, 2015 at 11:34

If you need to ensure that you have, say, 20% check orders, then you need to sample, say, 20 calls at random from within those that were check orders, and 15 calls from within the cancels, etc. If you can't do that, then you need to use all calls to ensure you get those numbers.

Commented Mar 19, 2015 at 15:12

1 Answer

In the absence of auxiliary information, I recommend simple random sampling. It is safe to ignore the fact that you are sampling from a finite population, since the population size $N = 200,000$ is huge compared to the sample size $n$. Essentially, we can treat the draws as independent.
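A quick check shows how little the finite population matters here. The sketch below computes the usual finite-population correction factor $\sqrt{(N-n)/(N-1)}$ for the standard error of a proportion under sampling without replacement. The sample size of 510 is only an illustrative assumption, and the function name is made up for this example.

```python
import math

def finite_population_correction(population_size: int, sample_size: int) -> float:
    """Factor that multiplies the standard error when sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

N = 200_000  # population of calls, from the question
n = 510      # illustrative sample size (an assumption for this check)

fpc = finite_population_correction(N, n)
print(f"correction factor: {fpc:.4f}")
```

A factor this close to 1 shrinks the standard error by only about a tenth of a percent, so treating the draws as independent costs essentially nothing.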

Each call can be classified into one of $k = 7$ categories. With the simple random sampling design, the category counts $a_1, a_2, \ldots, a_k$ follow a multinomial distribution with true proportions $P_j$, $j = 1, \ldots, k$. These will be estimated by the sample proportions $p_j = a_j / n$.

Thompson (1987) studied sample size determination for estimating a simultaneous $1-\alpha$ confidence set for the $P_j$ with multinomial data. The individual intervals each have the same half-width $d$ (full length $2d$). That is, each interval is of the form: $$ \left[\,p_j -d,\, p_j +d\,\right] $$

The analyst chooses $d$ and a protection level $\alpha$, which will be the maximum probability that one or more of the $k$ intervals does not contain its corresponding population proportion; i.e.:

$$ \Pr\left( \bigcup_{j=1}^{k} \left\{\, |p_j - P_j| > d \,\right\} \right) \le \alpha $$

Equivalently, the probability that all the intervals cover the true $P_j$ is at least $1-\alpha$.
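The simultaneous coverage requirement can be checked by simulation. The following is a minimal sketch, not part of Thompson's method: it repeatedly draws a simple random sample, computes the sample proportions, and counts how often every proportion lands within $d$ of its true value. The proportions, $n = 510$, $d = 0.05$, the replication count, and the function name are all assumptions chosen for illustration.

```python
import random

def simultaneous_coverage(true_props, n, d, reps=2000, seed=0):
    """Estimate P(every |p_j - P_j| <= d) under independent draws."""
    rng = random.Random(seed)
    categories = range(len(true_props))
    hits = 0
    for _ in range(reps):
        # One simulated sample of n calls, each assigned a category index.
        draws = rng.choices(categories, weights=true_props, k=n)
        counts = [0] * len(true_props)
        for c in draws:
            counts[c] += 1
        # The sample "covers" only if all k intervals contain their P_j.
        if all(abs(counts[j] / n - true_props[j]) <= d for j in categories):
            hits += 1
    return hits / reps

# Hypothetical call-reason proportions (assumed; they sum to 1).
props = [0.25, 0.20, 0.15, 0.10, 0.10, 0.10, 0.10]
coverage = simultaneous_coverage(props, n=510, d=0.05)
print(f"estimated simultaneous coverage: {coverage:.3f}")
```

Because the event is defined jointly over all categories, a single miss in any one category counts as a failure for the whole sample, which is what distinguishes a simultaneous guarantee from seven separate intervals.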

Thompson showed that the worst-case configuration of the $P_j$ in this setup, the one requiring the maximum $n$, is one in which $m$ of the true proportions are equal to $1/m$ and the rest are zero, where $m$ depends on $\alpha$ (the "min k" column in the table below). He provided a simple table for choosing $n$, or, if resources limit $n$, for balancing the choices of $\alpha$ and $d$. These results apply as long as $k$ is at least that minimum value, which varies with $\alpha$.

    alpha    (d^2 x n)   min k   n if d = 0.05
    -------------------------------------------
    .50      .44129      4       177
    .40      .50729      4       203
    .30      .60123      3       241
    .20      .74739      3       299
    .10      1.00635     3       403
    .05      1.27359     3       510
    .025     1.55963     2       624
    .02      1.65872     2       664
    .01      1.96986     2       788
    .005     2.28514     2       915
    .001     3.02892     2       1212
    .0005    3.33530     2       1342
    .0001    4.11209     2       1645

You can see that $k=7$ is covered by the table, since $7 > 4$. For each $\alpha$ under consideration, choose $d$, then solve for $n$ by dividing the value in column 2 by $d^2$. Alternatively, if your resources or time limit you to a maximum sample size $n_{\max}$, use column 2 to solve for $d$ by taking the square root of the column 2 value divided by $n_{\max}$. If you were to choose $\alpha = 0.20$ and $d = 0.05$, for example, the corresponding value is $n = 0.74739/0.05^2 = 298.956$, rounded up to 299 in column 4.
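The two table lookups described above can be sketched in a few lines. The dictionary holds column 2 of the table as quoted in this answer; the function names `sample_size` and `half_width` are made up for this example and only the tabulated values of $\alpha$ are supported.

```python
import math

# Column 2 of the table: alpha -> d^2 * n (values as quoted in this answer).
D2N = {
    0.50: 0.44129, 0.40: 0.50729, 0.30: 0.60123, 0.20: 0.74739,
    0.10: 1.00635, 0.05: 1.27359, 0.025: 1.55963, 0.02: 1.65872,
    0.01: 1.96986, 0.005: 2.28514, 0.001: 3.02892, 0.0005: 3.33530,
    0.0001: 4.11209,
}

def sample_size(alpha: float, d: float) -> int:
    """Smallest n such that all intervals p_j +/- d hold with level 1 - alpha."""
    return math.ceil(D2N[alpha] / d ** 2)

def half_width(alpha: float, n_max: int) -> float:
    """Half-width d achievable when the sample size is capped at n_max."""
    return math.sqrt(D2N[alpha] / n_max)

print(sample_size(0.20, 0.05))          # the worked example in the text
print(round(half_width(0.05, 400), 4))  # d when only 400 calls can be reviewed
```

Working from column 2 rather than column 4 keeps the calculation general: column 4 is just the special case $d = 0.05$.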

Other methods for computing simultaneous confidence intervals

Sison and Glaz (1995) proposed two methods for finding simultaneous confidence intervals and recommended their Method 1. May and Johnson (2000) published a SAS macro to calculate a version of Sison and Glaz's Method 1. Hou et al. (2003) presented still another method.

References

Hou, Chia-Ding, Jengtung Chiang, and John Jen Tai. 2003. A family of simultaneous confidence intervals for multinomial proportions. Computational Statistics & Data Analysis 43, no. 1: 29-45.

May, Warren L, and William D Johnson. 2000. Constructing two-sided simultaneous confidence intervals for multinomial proportions for small counts in a large number of cells. Journal of Statistical Software 5, no. 6: 1-24. preprint: http://www.jstatsoft.org/v05/i06/paper

Sison, Cristina P, and Joseph Glaz. 1995. Simultaneous confidence intervals and sample size determination for multinomial proportions. Journal of the American Statistical Association 90, no. 429: 366-369. http://140.112.142.232/~purplewoo/Literature/!Methodology/!Distribution_SampleSize/SimultConfidIntervJASA.pdf

Thompson, Steven K. 1987. Sample size for estimating multinomial proportions. The American Statistician 41, no. 1: 42-46.