Saturday, August 16, 2008

What percent of a population do I need to select in my random selection in order to have adequate representation of the population?

I was recently asked about selecting percentages of the population when selecting sample size. Below I provide the response to the question: What percent of a population do I need to select in my random selection in order to have adequate representation of the population?

On the percentage issue: Juzwik et al. ("Writing Into the 21st Century: An Overview of Research on Writing, 1999 to 2004," Written Communication, 2006) discuss percentages for the purpose of validating coding. They used a 10% sample, which is then assumed to represent the larger population.

"Inter-rater reliability on the exclusions was high at 97.5% based on a sample of 10% of the studies." Juzwik et al p. 460

"A sample of 10% of the studies was taken for an exact inter-rater reliability on the coding of the studies that were included in our database. This reliability check determined that the initial coder and the reviewer agreed on 97.5% of the articles that were included in the study, on 96.0% of the age codes, and on 91.0% of the problem codes." Juzwik et al p. 463

In "The ‘Doing boy/girl’ and global/local elements in 10–12 year olds’ drawings and written texts," Qualitative Research, 2007, Pat O'Connor of the University of Limerick studied about 10% of the population of texts but did not state it as a percentage:

"In this article, the focus is on a randomly selected sub-set (n = 341) from the total sample of 3464 texts written by those aged 10–12 years." (p. 234)

Another useful article on the topic is:
Collins et al.
A Mixed Methods Investigation of Mixed Methods Sampling Designs in Social and Health Science Research
Journal of Mixed Methods, 2007

On page 273 is Table 2 which lists typical sample sizes and some rationale.

My further explanation included the following with respect to my dissertation:

That percentage (I used 20%) in part reflects that I wanted to get over 400 participants in order to have a margin of error of plus or minus 5%. I was estimating how many teachers and students I'd get from each program, and I had about 250 programs. So if I started with 20% of the programs (50 programs) and 10 people from each program responded, that would give me 500 respondents. However, as summarized below, I ended up having to select about 60% of the population through the use of insurance samples.
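The arithmetic behind a target like this can be sketched with the standard sample-size formula for a proportion. This is only a minimal sketch, assuming a 95% confidence level (z = 1.96), the most conservative proportion (p = 0.5), and a population large enough that no correction is needed. The function names are hypothetical and chosen for illustration.

```python
import math

def required_sample_size(margin_of_error, z=1.96, p=0.5):
    """Respondents needed for a given margin of error (large-population case).

    Assumes simple random sampling; z=1.96 corresponds to 95% confidence
    and p=0.5 is the most conservative guess about the proportion.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

def expected_respondents(n_programs, fraction, per_program):
    """Rough planning estimate: programs selected times respondents per program."""
    selected_programs = round(n_programs * fraction)
    return selected_programs * per_program

print(required_sample_size(0.05))           # minimum respondents for +/- 5%
print(expected_respondents(250, 0.20, 10))  # 20% of 250 programs, 10 people each
```

Under these assumptions the minimum works out to 385, so aiming for over 400 leaves a small cushion, and the planning estimate of 500 respondents would clear it if every selected program responded as hoped.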

That said, when I was working with texts at a research center, it was understood that you needed at least 10% to have a chance at having a representative sample. But, I've never found this rule in writing because it's more complex than that, ultimately. It depends on how much variability there is in the entire population. The more variability, the larger your sample size should be because then that variability will have a greater chance of being captured. As always, whatever you do you have to be able to contextualize your choices and findings in the data analysis.

The issue is how big the sample needs to be in order to be representative, that is, a sample that shows the same characteristics as the entire population. In my study, the only characteristic I had available to check this was program type: PhD, Masters, Four Year, Two Year, and Certificate.

I was able to show that the final group of respondents fairly closely reflected the larger population's characteristics, except that PhD programs over-responded. You just have to talk about that in the data interpretation.

The larger the sample, the greater the chance you will have a representative sample. So if your entire population is 50% women and 50% men, and your sample ends up being the same, you know that at least in this one respect, it's representative.
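One way to make this kind of check concrete is to compare the respondents' breakdown by category against the population's breakdown with a chi-square goodness-of-fit statistic. The sketch below is only an illustration: the population shares and respondent counts are made-up numbers, not data from my study, and the critical value 9.488 assumes five categories (4 degrees of freedom) at the 0.05 level.

```python
def chi_square_gof(sample_counts, population_shares):
    """Chi-square goodness-of-fit statistic for sample counts vs. population shares."""
    n = sum(sample_counts.values())
    stat = 0.0
    for category, observed in sample_counts.items():
        expected = n * population_shares[category]
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical population shares by program type (must sum to 1).
population_shares = {
    "PhD": 0.10, "Masters": 0.20, "Four Year": 0.40,
    "Two Year": 0.20, "Certificate": 0.10,
}
# Hypothetical respondent counts, with PhD programs over-responding.
sample_counts = {
    "PhD": 60, "Masters": 80, "Four Year": 150,
    "Two Year": 75, "Certificate": 35,
}

CRITICAL_VALUE_DF4 = 9.488  # chi-square critical value, df = 4, alpha = 0.05

statistic = chi_square_gof(sample_counts, population_shares)
print(statistic, statistic > CRITICAL_VALUE_DF4)
```

With these invented numbers the statistic exceeds the critical value, mostly because of the PhD category, which is the sort of skew you would then need to discuss in the interpretation.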

I also had two additional "insurance" samples selected in case I got skewed responses on the first try, or in case I for some reason had a low level of response. Ultimately, I had to select three phases of 20% of the entire population because of lack of response with my initial attempt at 20%. I believe I was close to 60% and was concerned that I'd end up having to select the entire population rather than the randomly selected population. Depending on your situation, I recommend thinking about having one or two insurance samples ready to be selected in addition to the initial sample.

The percentage you choose might also be matched against the number you want in the end (confidence level) - based on that table in Lauer and Asher's book -- page 58.

From Earl Babbie: "The larger the sample selected, the more accurate it is as an estimate of the population from which it was drawn." p. 193 10th edition. This is a book I'd recommend for any student working on this kind of research because Babbie explains random selection and lots of other research methods in really easy to understand terms.

"The kind of sampling procedure used also affects sample size. As we have mentioned, for the same level of precision, stratified samples usually require fewer people than the simple random sample, and cluster samples usually require more" (Survey Research, 2nd edition, Backstrom and Hursh-Cesar).

In my study, I did a stratified sample taking 20% of each of the five categories. That was 20% of the whole as well. But I ended up having to do this 3 times in order to get close to my desired 400.
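A stratified draw with insurance samples held in reserve can be sketched as follows. This is a minimal sketch under stated assumptions: the stratum sizes are hypothetical (chosen only so they total 250), the function name is made up, and the approach of shuffling each stratum once and slicing it into non-overlapping phases is one reasonable way to do it, not necessarily the procedure I followed.

```python
import random

def stratified_phases(strata, fraction=0.20, n_phases=3, seed=0):
    """Draw n_phases non-overlapping samples, each taking `fraction` of every stratum.

    Phase 0 is the initial sample; later phases are "insurance" samples.
    Requires n_phases * fraction <= 1 so the phases cannot overlap.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    phases = [[] for _ in range(n_phases)]
    for name, members in strata.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        k = round(len(shuffled) * fraction)  # units per phase in this stratum
        for i in range(n_phases):
            phases[i].extend(shuffled[i * k:(i + 1) * k])
    return phases

# Hypothetical strata: program IDs grouped by program type.
sizes = {"PhD": 25, "Masters": 50, "Four Year": 100, "Two Year": 50, "Certificate": 25}
strata = {name: [f"{name}-{i}" for i in range(n)] for name, n in sizes.items()}

initial, insurance_1, insurance_2 = stratified_phases(strata)
print(len(initial), len(insurance_1), len(insurance_2))
```

Because each phase takes the same fraction of every stratum, each phase is also that fraction of the whole, and using all three phases covers 60% of the population without selecting any program twice.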

Later I went on to say:

I've never seen a publication actually give a minimum percentage of the population that needs to be selected in order to have a representative sample. I would say that you need between 10 and 20% of the entire population. But level of confidence goes by the *number* in the sample rather than the *percentage* of the entire population. However, percentage of the entire population does matter because the larger the percentage obviously the more representative. I mean, it can never be a simple solution like a percentage because it all depends on the context . . .

The area that might be informative is mass media or comm. arts. I'm attaching the sampling chapter from Riffe, Lacy, and Fico's book. It might be helpful in explaining all the nuances of your question and why no one is willing to just say you are safe if you pick at least 10%. The language I underlined on page 105 might be relevant. They actually mention 20% as being a magic number in that if you have it (or more than 20%), your confidence level goes up even though your actual number of samples is not that high. Lauer & Asher talk about this as well in their book when they discuss the "correction factor" (pp. 58-60). Riffe et al.'s book is cited as well in _What writing does and how it does it_ in case you need to tie any of this into our field.
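The effect described here can be illustrated with the standard finite population correction. This is a sketch only, and it assumes that the "correction factor" in those books is this standard correction; the books' own formulas are not reproduced here. It also assumes a 95% confidence level (z = 1.96) and p = 0.5, and the function name is hypothetical.

```python
import math

def margin_of_error(n, population=None, z=1.96, p=0.5):
    """Margin of error for a proportion from a simple random sample of size n.

    If `population` is given, apply the finite population correction
    sqrt((N - n) / (N - 1)), which shrinks the margin as n approaches N.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

# 100 texts treated as drawn from a very large population ...
print(margin_of_error(100))
# ... versus the same 100 texts when they are 20% of a population of 500.
print(margin_of_error(100, population=500))
```

With these numbers the margin drops from about 9.8 to about 8.8 percentage points, which shows why the percentage of the population starts to matter once the sample is a sizable share of it, even though the number in the sample is still the main driver.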

In a study I worked on previously, Stewart Whittemore and I took 10% of the entire population of texts in order to get inter-rater reliability. I remember being in the same quandary there with respect to how many texts I needed to select in order to have a reliable coding scheme. I could find nothing firm in writing. Bill Hart-Davidson just said 10% minimum, if I remember correctly, but those texts were very, very homogeneous because they'd been written based on a prompt. In part it was ultimately an issue of labor, time, and money, like a lot of these decisions. I don't think I've ever seen anything written up with less than 10%.
