Statistics sampling
On methods of sampling, and pros/cons.
Terminology
sampling-frame: Part of the population you'd like to draw a sample from.
undercoverage: Omitting population members from a sampling-frame when they should have been included.
Sampling Methods
Simple Random Sampling (SRS)
Choosing N items at random, so that every item has the same chance of being picked (ex. drawing names from a hat).
Issues:
- cannot sample if you don't know who is in the population (you need a complete list to draw from)
- list must be representative (ex. a sample of cafeteria clients taken on a Saturday will deviate from weekday norms)
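A minimal sketch of simple random sampling, assuming the sampling-frame is available as a Python list. The function name `simple_random_sample` and the example names are hypothetical; `random.Random.sample` is the standard-library call that draws without replacement.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct items from the frame, each equally likely to be chosen."""
    rng = random.Random(seed)  # seed is optional; fixing it makes the draw repeatable
    return rng.sample(list(frame), n)

# Hypothetical frame: the full list of people we could sample from.
names = ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay"]
srs = simple_random_sample(names, 3, seed=0)
print(srs)
```

If the frame is incomplete, the missing members can never be drawn, which is the undercoverage problem described under Terminology.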
Stratified Sampling
In stratified sampling, you partition your population into subcategories (strata), then draw a sample from each one.
(Ex. first-year students, second-year students, third-year students.)
This helps account for deviations/minorities within your population.
Issues:
- possible to oversample one group (ex. a minority within the data gets equal weight)
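A rough sketch of stratified sampling, assuming proportional allocation (each stratum contributes the same fraction of its members). This is one possible allocation rule, chosen here because it avoids giving a small group equal weight. The function name `stratified_sample` and the student data are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)  # group items by stratum
    sample = []
    for members in strata.values():
        # At least one item per stratum, so small groups are not dropped entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 60 first-years, 30 second-years, 10 third-years.
students = [(f"s{i}", year) for i, year in enumerate([1] * 60 + [2] * 30 + [3] * 10)]
picked = stratified_sample(students, stratum_of=lambda s: s[1], fraction=0.1, seed=0)
per_year = Counter(year for _, year in picked)
print(per_year)
```

Taking a fixed number from each stratum instead (say 3 per year) would give third-years the same weight as first-years, which is the oversampling issue listed above.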
Systematic Sampling
Pick every Nth individual for your sample.
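A small sketch of systematic sampling, assuming the individuals are already in some fixed order (ex. a queue). Starting at a random offset within the first N positions is an assumption added here, not something stated in these notes; the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(frame, step, seed=None):
    """Take every step-th item, starting from a random offset in [0, step)."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # assumed random start, so position 0 is not always picked
    return list(frame)[start::step]

# Hypothetical ordered frame: people numbered 1..20 in arrival order.
queue = list(range(1, 21))
every_fifth = systematic_sample(queue, step=5, seed=0)
print(every_fifth)
```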
Cluster Sampling
Define your population, then divide it into clusters where:
- every measured characteristic is represented within each cluster
- each cluster has a similar distribution of characteristics
- all clusters taken together cover the entire population
Choose random clusters to use as your sample.
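A minimal sketch of the selection step, assuming the clusters are already defined and stored in a dict that maps a cluster name to its members. The block names and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random; every member of a chosen cluster is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names, not individuals
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: households grouped by city block.
blocks = {
    "block_a": ["h1", "h2", "h3"],
    "block_b": ["h4", "h5", "h6"],
    "block_c": ["h7", "h8", "h9"],
    "block_d": ["h10", "h11", "h12"],
}
selected = cluster_sample(blocks, n_clusters=2, seed=0)
print(selected)
```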
    # population        clusters          randomly-selected
    #                                     clusters
    a b c d e           abfg cdhi
    f g h i j     ->                ->    abfg ijno
    k l m n o           klmn ijno
When used:
- When geography is important to sampling
Issues:
- People in a cluster may be similar in a way that makes the problem hard to study
  (ex. other contaminants in a pollution study)
Convenience Sampling
Using results or data conveniently obtained.
Ex. asking friends what they think instead of asking customers.
Issues:
- unreliable: the people who are easy to reach are unlikely to be evenly distributed across the population
Multi-Stage Sampling
Combining multiple sampling methods in a pipeline to produce the final sample.
Ex. you may start with a cluster sample, then narrow the results using simple random sampling, then finally use stratified sampling on those results.
Typically used when the population is very large (ex. often used by governments).
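A rough two-stage sketch of such a pipeline, assuming only the first two steps of the example (cluster selection, then simple random sampling inside each chosen cluster). The district data and the function name `two_stage_sample` are hypothetical; a further stratified stage could be chained onto the result in the same way.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: simple random sample within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # stage 1: cluster sampling
    sample = []
    for name in chosen:
        members = clusters[name]
        k = min(n_per_cluster, len(members))            # never ask for more than exist
        sample.extend(rng.sample(members, k))           # stage 2: SRS inside the cluster
    return sample

# Hypothetical population: 10 districts of 50 people each.
districts = {
    f"district_{d}": [f"d{d}_person{i}" for i in range(50)]
    for d in range(10)
}
surveyed = two_stage_sample(districts, n_clusters=3, n_per_cluster=5, seed=0)
print(len(surveyed))
```

Only 15 of the 500 people are contacted, and only three districts need to be visited, which is the practical appeal when the population is very large.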