3  Sampling Methods

Author: Prof. Dr. Tim Weber
Affiliation: Deggendorf Institute of Technology

3.1 Sample Size

3.1.1 Standard Error

Figure 3.1: The SE for varying sample sizes \(n\): (a) maximum sample size \(n = 200\), (b) sample sizes \(n = 5 \ldots 50\)

Standard error is a statistical measure that quantifies the variation or uncertainty in sample statistics, particularly the mean (average). It is a valuable tool in inferential statistics and provides an estimate of how much the sample mean is expected to vary from the true population mean.

\[ SE = \frac{sd}{\sqrt{n}} \]

A smaller standard error indicates that the sample mean is likely very close to the population mean, while a larger standard error suggests greater variability and less precision in estimating the population mean. Standard error is crucial when constructing confidence intervals and performing hypothesis tests, as it helps in assessing the reliability of sample statistics as estimates of population parameters.
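
As a minimal sketch of the formula above, the following R snippet computes the SE of a single sample and then evaluates the SE for a range of sample sizes with a fixed standard deviation. The helper name `standard_error`, the simulated data and the value `sd = 10` are illustrative assumptions and are not taken from Figure 3.1.

```r
# Hypothetical helper: standard error of the mean for a numeric vector
standard_error <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values before counting n
  sd(x) / sqrt(length(x))
}

set.seed(42)                        # reproducible illustration
x <- rnorm(50, mean = 100, sd = 10) # simulated sample (assumption)
se_x <- standard_error(x)
se_x

# SE as a function of n for a fixed standard deviation of 10
n <- 5:200
se_curve <- 10 / sqrt(n)
head(data.frame(n = n, SE = se_curve))
```

Calling `plot(n, se_curve, type = "l")` afterwards would draw a curve of this kind: steep for small \(n\) and flattening as \(n\) grows.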

Variance vs. Standard Deviation: The standard error formula is based on the standard deviation of the sample, not the variance. The standard deviation is the square root of the variance.

Scaling of Variability: The standard error measures the spread of sample means across repeated samples. For \(n\) independent observations, the variance of the sample mean is \(sd^2/n\), so its standard deviation is \(sd/\sqrt{n}\). Dividing by \(\sqrt{n}\) therefore describes how that variability shrinks as the sample grows: quadrupling the sample size halves the standard error.

Central Limit Theorem: The standard error is closely tied to the Central Limit Theorem, which states that the distribution of sample means approaches a normal distribution as the sample size increases, provided the observations are independent and have finite variance. The standard deviation of that approximately normal distribution is the standard error, which is why \(\sqrt{n}\) appears in confidence intervals and test statistics for the mean.
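
The relationship between the spread of sample means and \(sd/\sqrt{n}\) can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, standard deviation 1), the sample sizes and the number of repetitions are arbitrary choices made for illustration.

```r
set.seed(1)                     # reproducible illustration
n_values <- c(5, 20, 80)        # sample sizes to compare (assumption)
n_rep    <- 5000                # repeated samples per sample size

# Skewed population: exponential with rate 1, so mean = 1 and sd = 1
sim <- sapply(n_values, function(n) {
  means <- replicate(n_rep, mean(rexp(n, rate = 1)))
  c(n = n, empirical_sd = sd(means), theoretical_se = 1 / sqrt(n))
})

# One row per sample size: observed spread of the means vs. sd / sqrt(n)
result <- as.data.frame(t(sim))
result
```

Even though the population is clearly not normal, the empirical standard deviation of the sample means should land close to the theoretical value in each row.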

3.2 Random Sampling

Figure 3.2: The idea of random sampling (Dan Kernler).
  • Definition: Selecting a sample from a population in a purely random manner, where every individual has an equal chance of being chosen.
  • Advantages:
    • Eliminates bias in selection.
    • Results are often representative of the population.
  • Disadvantages:
    • Possibility of unequal representation of subgroups.
    • Requires a complete list of the population, which can be time-consuming to obtain and impractical for large populations.

3.3 Stratified Sampling

Figure 3.3: The idea of stratified sampling (Dan Kernler)
  • Definition: Dividing the population into subgroups or strata based on certain characteristics and then randomly sampling from each stratum.
  • Advantages:
    • Ensures representation from all relevant subgroups.
    • Increased precision in estimating population parameters.
  • Disadvantages:
    • Requires accurate classification of the population into strata.
    • Complexity in implementation and analysis.

3.4 Systematic Sampling

Figure 3.4: The idea of systematic sampling (Dan Kernler)
  • Definition: Choosing every \(k\)th individual from a list after selecting a random starting point.
  • Advantages:
    • Simplicity in execution compared to random sampling.
    • Suitable for large populations.
  • Disadvantages:
    • Susceptible to periodic patterns in the population.
    • If the periodicity aligns with the sampling interval, it can introduce bias.
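
The periodicity problem can be made concrete with a constructed example. The population below is hypothetical: 100 values with a cycle of length 5 in which every fifth value is higher. The sketch compares a sampling interval that matches the cycle with one that does not.

```r
# Hypothetical population: 100 values with a cycle of length 5,
# where every 5th value is systematically higher
population <- rep(c(10, 10, 10, 10, 20), times = 20)
true_mean  <- mean(population)

# Mean of the systematic sample for every possible starting point
systematic_means <- function(k) {
  sapply(1:k, function(start) {
    mean(population[seq(start, length(population), by = k)])
  })
}

means_k5 <- systematic_means(5)   # interval equals the cycle length
means_k4 <- systematic_means(4)   # interval does not align with the cycle

true_mean
means_k5
means_k4
```

With an interval of 5, every possible sample contains only low or only high values, so no single sample recovers the population mean. With an interval of 4, each sample cycles through all positions of the pattern.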

3.5 Cluster Sampling

Figure 3.5: The idea of clustered sampling (Dan Kernler).
  • Definition: Dividing the population into clusters, randomly selecting some clusters, and then including all individuals from the chosen clusters in the sample.

  • Advantages:

    • Cost-effective, especially for geographically dispersed populations.
    • Reduces logistical challenges compared to other methods.
  • Disadvantages:

    • Usually less precise than other methods for the same sample size, because individuals within a cluster tend to resemble each other.
    • Requires accurate information on cluster characteristics.

3.6 Example - The Star Wars dataset

3.6.1 Get to know the data

Table 3.1: The starwars dataset
     name               height           mass          hair_color       
 Length:87          Min.   : 66.0   Min.   :  15.00   Length:87         
 Class :character   1st Qu.:167.0   1st Qu.:  55.60   Class :character  
 Mode  :character   Median :180.0   Median :  79.00   Mode  :character  
                    Mean   :174.6   Mean   :  97.31                     
                    3rd Qu.:191.0   3rd Qu.:  84.50                     
                    Max.   :264.0   Max.   :1358.00                     
                    NA's   :6       NA's   :28                          
  skin_color         eye_color           birth_year         sex           
 Length:87          Length:87          Min.   :  8.00   Length:87         
 Class :character   Class :character   1st Qu.: 35.00   Class :character  
 Mode  :character   Mode  :character   Median : 52.00   Mode  :character  
                                       Mean   : 87.57                     
                                       3rd Qu.: 72.00                     
                                       Max.   :896.00                     
                                       NA's   :44                         
    gender           homeworld           species         
 Length:87          Length:87          Length:87         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
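
The code behind Table 3.1 is not shown. One plausible way to obtain a summary of this shape, assuming the `starwars` dataset shipped with the dplyr package, is sketched below. The name `starwars_flat` is introduced here for illustration only.

```r
library(dplyr)   # provides the starwars dataset and the pipe

# Keep the 11 atomic columns; films, vehicles and starships are list columns
starwars_flat <- starwars %>%
  select(name:species)

summary(starwars_flat)
```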
                                                         
                                                         
                                                         
                                                         

3.6.2 Simple Random Sampling without replacement

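library(dplyr)                        # provides starwars, %>% and sample_n()
# Draw 5 rows at random; sample_n() samples without replacement by default
starwars_srswor <- starwars %>%
  sample_n(size = 5)
starwars_srswor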
# A tibble: 5 × 11
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Jek Tono…    180   110 brown      fair       blue              NA <NA>  <NA>  
2 Rey           NA    NA brown      light      hazel             NA fema… femin…
3 Shmi Sky…    163    NA black      fair       brown             72 fema… femin…
4 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
5 Yoda          66    17 white      green      brown            896 male  mascu…
# ℹ 2 more variables: homeworld <chr>, species <chr>

3.6.3 Simple Random Sampling with replacement

# Draw 5 rows with replacement, so the same row can appear more than once
starwars_srswr <- starwars %>%
  sample_n(size = 5,
           replace = TRUE)
starwars_srswr
# A tibble: 5 × 11
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Zam Wese…    168    55 blonde     fair, gre… yellow            NA fema… femin…
2 Ben Quad…    163    65 none       grey, gre… orange            NA male  mascu…
3 Ben Quad…    163    65 none       grey, gre… orange            NA male  mascu…
4 Mas Amed…    196    NA none       blue       blue              NA male  mascu…
5 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
# ℹ 2 more variables: homeworld <chr>, species <chr>

3.6.4 Sampling with replacement, sample larger than original data

# With replacement, the sample (200 rows) can be larger than the data (87 rows)
starwars_srswr2 <- starwars %>%
  sample_n(size = 200,
           replace = TRUE)
starwars_srswr2
# A tibble: 200 × 11
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Jocasta…    167    NA white      fair       blue              NA fema… femin…
 2 Ric Olié    183    NA brown      fair       blue              NA male  mascu…
 3 IG-88       200   140 none       metal      red               15 none  mascu…
 4 Jocasta…    167    NA white      fair       blue              NA fema… femin…
 5 IG-88       200   140 none       metal      red               15 none  mascu…
 6 Cordé       157    NA brown      light      brown             NA <NA>  <NA>  
 7 Poe Dam…     NA    NA brown      light      brown             NA male  mascu…
 8 Palpati…    170    75 grey       pale       yellow            82 male  mascu…
 9 Padmé A…    185    45 brown      light      brown             46 fema… femin…
10 Rey          NA    NA brown      light      hazel             NA fema… femin…
# ℹ 190 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
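# Mean height in the original data
mean(starwars$height, na.rm = TRUE)
[1] 174.6049
# Mean height in the resampled data: close to the original, but not identical
mean(starwars_srswr2$height, na.rm = TRUE)
[1] 173.172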

3.6.5 Systematic Sampling

Select every 5th row, starting from a randomly chosen row among the first five.

# Pick a random start among rows 1 to 5, then take every 5th row from there
starwars_syst <- starwars %>%
  slice(seq(sample(1:5, 1),
            nrow(starwars),
            by = 5))
starwars_syst
# A tibble: 17 × 11
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Biggs D…    183    84 black      light      brown           24   male  mascu…
 3 Han Solo    180    80 brown      fair       brown           29   male  mascu…
 4 Yoda         66    17 white      green      brown          896   male  mascu…
 5 Lando C…    177    79 black      dark       brown           31   male  mascu…
 6 Wicket …     88    20 brown      brown      brown            8   male  mascu…
 7 Padmé A…    185    45 brown      light      brown           46   fema… femin…
 8 Watto       137    NA black      blue, grey yellow          NA   male  mascu…
 9 Bib For…    180    NA none       pale       pink            NA   male  mascu…
10 Ben Qua…    163    65 none       grey, gre… orange          NA   male  mascu…
11 Adi Gal…    184    50 none       dark       blue            NA   fema… femin…
12 Gregar …    185    85 black      dark       brown           NA   <NA>  <NA>  
13 Barriss…    166    50 black      yellow     blue            40   fema… femin…
14 Zam Wes…    168    55 blonde     fair, gre… yellow          NA   fema… femin…
15 R4-P17       96    NA none       silver, r… red, blue       NA   none  femin…
16 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
17 Rey          NA    NA brown      light      hazel           NA   fema… femin…
# ℹ 2 more variables: homeworld <chr>, species <chr>

3.6.6 Stratified Sampling

# Stratum sizes; table() leaves out the 4 rows where sex is NA
table(starwars$sex)

        female hermaphroditic           male           none 
            16              1             60              6 
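# Sample 30 % of each stratum; group sizes are rounded, so the single
# hermaphroditic individual (0.3 rounds to 0) drops out of the sample
starwars_strat <- starwars %>%
  group_by(sex) %>%
  sample_frac(size = 0.3)
starwars_strat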
# A tibble: 26 × 11
# Groups:   sex [4]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Ayla Se…    178  55   none       blue       hazel             48 fema… femin…
 2 Luminar…    170  56.2 black      yellow     blue              58 fema… femin…
 3 Jocasta…    167  NA   white      fair       blue              NA fema… femin…
 4 Shmi Sk…    163  NA   black      fair       brown             72 fema… femin…
 5 Taun We     213  NA   none       grey       black             NA fema… femin…
 6 Finn         NA  NA   black      dark       dark              NA male  mascu…
 7 Rugor N…    206  NA   none       green      orange            NA male  mascu…
 8 Lobot       175  79   none       light      blue              37 male  mascu…
 9 Jar Jar…    196  66   none       orange     orange            52 male  mascu…
10 Qui-Gon…    193  89   brown      fair       blue              92 male  mascu…
# ℹ 16 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
# Resulting stratum sizes; the one sampled row with sex = NA is not shown
table(starwars_strat$sex)

female   male   none 
     5     18      2 

3.6.7 Clustered Sampling

Table 3.2: The starwars dataset with clustered sampling
     name               height           mass         hair_color       
 Length:19          Min.   : 97.0   Min.   :  32.0   Length:19         
 Class :character   1st Qu.:169.5   1st Qu.:  75.0   Class :character  
 Mode  :character   Median :178.0   Median :  79.0   Mode  :character  
                    Mean   :173.9   Mean   : 171.2                     
                    3rd Qu.:188.0   3rd Qu.: 116.5                     
                    Max.   :216.0   Max.   :1358.0                     
                                    NA's   :4                          
  skin_color         eye_color           birth_year         sex           
 Length:19          Length:19          Min.   : 19.00   Length:19         
 Class :character   Class :character   1st Qu.: 37.00   Class :character  
 Mode  :character   Mode  :character   Median : 47.00   Mode  :character  
                                       Mean   : 93.29                     
                                       3rd Qu.: 72.00                     
                                       Max.   :600.00                     
                                       NA's   :6                          
    gender           homeworld           species         
 Length:19          Length:19          Length:19         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
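
The code behind Table 3.2 is not shown. A minimal sketch of cluster sampling on this dataset could look as follows. It assumes that each `homeworld` forms one cluster and that 10 clusters are drawn. Both choices are illustrative, so the result will not reproduce the 19 rows of the table.

```r
library(dplyr)

set.seed(123)                                   # reproducible illustration

# Assumption: each homeworld is one cluster; missing homeworlds are ignored
clusters <- starwars %>%
  filter(!is.na(homeworld)) %>%
  distinct(homeworld) %>%
  pull(homeworld)

# Randomly select some clusters (the number 10 is arbitrary)
chosen_clusters <- sample(clusters, size = 10)

# Include every individual from the selected clusters
starwars_clust <- starwars %>%
  filter(homeworld %in% chosen_clusters)

summary(starwars_clust %>% select(name:species))
```

Because whole clusters are taken, the number of sampled rows is not fixed in advance. It depends on how many individuals belong to the chosen homeworlds.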
                                                         
                                                         
                                                         
                                                         

3.7 Bootstrapping

Figure 3.6: The idea of bootstrapping (Biggerj1, Marsupilami)
  • Definition: Estimating the sampling distribution of a statistic by repeatedly drawing new samples, with replacement, from the observed data. This gives insight into the variability of the statistic without strict assumptions about the population distribution.

  • Advantages:

    • Non-parametric: Works without assuming a specific data distribution.
    • Confidence Intervals: Facilitates easy estimation of confidence intervals.
    • Robustness: Useful when the data distribution is unknown or the sample is moderately small, although very small samples still limit what can be learned.
  • Disadvantages:

    • Computationally Intensive: Resource-intensive for large datasets.
    • Result quality depends on the representativeness of the initial sample (garbage in, garbage out).
    • Cannot compensate for inadequate information in the original sample.
    • Not Always Optimal: Traditional methods may be better in cases meeting distribution assumptions.
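
The definition above can be sketched in a few lines of R. The example bootstraps the mean height from the `starwars` data used earlier in this chapter. The number of resamples (`n_boot = 2000`) and the percentile method for the confidence interval are assumptions made for illustration, since other interval types exist.

```r
set.seed(2024)                                   # reproducible illustration

heights <- dplyr::starwars$height
heights <- heights[!is.na(heights)]              # keep observed heights only

n_boot <- 2000                                   # number of resamples (assumption)

# Resample the data with replacement and store the mean of each resample
boot_means <- replicate(
  n_boot,
  mean(sample(heights, size = length(heights), replace = TRUE))
)

boot_se <- sd(boot_means)                        # bootstrap standard error
boot_ci <- quantile(boot_means, c(0.025, 0.975)) # 95% percentile interval

# Compare the bootstrap SE with the formula-based SE
c(mean         = mean(heights),
  se_formula   = sd(heights) / sqrt(length(heights)),
  se_bootstrap = boot_se)
boot_ci
```

The bootstrap standard error should be close to \(sd/\sqrt{n}\) here. The approach becomes more interesting for statistics such as the median, for which no simple formula is available.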