3 Sampling Methods – Advanced Statistical Methods and Optimization

3.1 Sample Size

3.1.1 Standard Error

Standard Error (SE) is a statistical measure that quantifies the variation or uncertainty in sample statistics, particularly the mean (average). It is a valuable tool in inferential statistics and provides an estimate of how much the sample mean is expected to vary from the true population mean.

\[SE = \frac{sd}{\sqrt{n}}\]

A smaller SE indicates that the sample mean is likely very close to the population mean, while a larger standard error suggests greater variability and less precision in estimating the population mean. SE is crucial when constructing confidence intervals and performing hypothesis tests, as it helps in assessing the reliability of sample statistics as estimates of population parameters.
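
The formula can be tried out directly. The following is a minimal sketch in Python; the function name `standard_error` and the ages in the list are made up for illustration and do not come from the classroom data used later in this chapter.

```python
import math
import statistics


def standard_error(values):
    """Return the standard error of the mean: sample sd divided by sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return sd / math.sqrt(n)


# Hypothetical ages in years, for illustration only.
ages = [28, 31, 34, 29, 33]
se = standard_error(ages)
print(f"mean = {statistics.fmean(ages):.2f}, SE = {se:.2f}")
```

The sketch uses the sample standard deviation (denominator \(n - 1\)), which is the usual choice when the population standard deviation is unknown.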

Variance vs. Standard Deviation: The standard error formula is based on the standard deviation of the sample, not the variance. The standard deviation is the square root of the variance.

Scaling of Variability: The purpose of the standard error is to measure the variability or spread of sample means. The square root of the sample size reflects how that variability decreases as the sample size increases. When the sample size is larger, the sample mean is expected to be closer to the population mean, and the standard error becomes smaller to reflect this reduced variability.

Central Limit Theorem: The \(\sqrt{n}\) in the standard error formula follows from the fact that the variance of the mean of \(n\) independent observations is the population variance divided by \(n\), so its standard deviation is \(sd/\sqrt{n}\). The Central Limit Theorem adds that the distribution of sample means approaches a normal distribution as the sample size increases, which is why the SE can be used to construct confidence intervals and test statistics.
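
The \(\sqrt{n}\) scaling can be checked by simulation. The sketch below assumes a hypothetical, clearly non-normal population (exponential with mean 10), draws many samples of each size, and compares the spread of the sample means with \(sd/\sqrt{n}\). The population, the sample sizes, and the number of replicates are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population (exponential, mean 10), not data from this chapter.
population = [random.expovariate(1 / 10) for _ in range(10_000)]
sigma = statistics.pstdev(population)  # population standard deviation


def sd_of_sample_means(n, replicates=2_000):
    """Empirical standard deviation of the means of many samples of size n."""
    means = [
        statistics.fmean(random.choices(population, k=n))  # draws with replacement
        for _ in range(replicates)
    ]
    return statistics.stdev(means)


results = {n: sd_of_sample_means(n) for n in (5, 20, 80)}
for n, empirical in results.items():
    theoretical = sigma / math.sqrt(n)
    print(f"n={n:>2}  empirical={empirical:.2f}  sd/sqrt(n)={theoretical:.2f}")
```

Each fourfold increase in sample size should roughly halve the spread of the sample means, in line with the formula.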

3.2 Random Sampling

Figure 3.2: The idea of random sampling (Dan Kernler).

Definition: Selecting a sample from a population in a purely random manner, where every individual has an equal chance of being chosen.
Advantages:
- Eliminates bias in selection.
- Results are often representative of the population.
Disadvantages:
- Possibility of unequal representation of subgroups.
- Time-consuming and may not be practical for large populations.
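
A simple random sample can be sketched with Python's standard library. The helper name `simple_random_sample` and the list of 40 student IDs are assumptions made for this illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)  # sampling without replacement


# Hypothetical sampling frame: student IDs 1..40.
student_ids = list(range(1, 41))
sample = simple_random_sample(student_ids, 10, seed=42)
print(sorted(sample))
```

`random.sample` draws without replacement, so no student can appear twice in the sample.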

3.3 Stratified Sampling

Figure 3.3: The idea of stratified sampling (Dan Kernler)

Definition: Dividing the population into subgroups or strata based on certain characteristics and then randomly sampling from each stratum.
Advantages:
- Ensures representation from all relevant subgroups.
- Increased precision in estimating population parameters.
Disadvantages:
- Requires accurate classification of the population into strata.
- Complexity in implementation and analysis.
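
The definition above can be sketched as follows: group the units by stratum, then draw a simple random sample of the same fraction from each group. The function `stratified_sample`, the two strata "A" and "B", and the 25% fraction are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # classify each unit into its stratum
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 40 units, 24 in stratum "A" and 16 in stratum "B".
people = [(i, "A" if i < 24 else "B") for i in range(40)]
picked = stratified_sample(people, lambda p: p[1], 0.25, seed=0)
print(sum(1 for p in picked if p[1] == "A"), sum(1 for p in picked if p[1] == "B"))
```

Because each stratum is sampled separately, no subgroup can be missed by chance, which is the main difference from simple random sampling.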

3.4 Systematic Sampling

Figure 3.4: The idea of systematic sampling (Dan Kernler)

Definition: Choosing every kth individual from a list after selecting a random starting point.
Advantages:
- Simplicity in execution compared to random sampling.
- Suitable for large populations.
Disadvantages:
- Susceptible to periodic patterns in the population.
- If the periodicity aligns with the sampling interval, it can introduce bias.
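
A minimal sketch of systematic sampling, together with a small demonstration of the periodicity problem. The function name `systematic_sample` and both example lists are hypothetical.

```python
import random
import statistics


def systematic_sample(items, k, seed=None):
    """Pick a random start in the first k positions, then take every k-th item."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting index in 0..k-1
    return items[start::k]


# Hypothetical list of 40 units; an interval of 8 yields 5 units.
ids = list(range(40))
sample = systematic_sample(ids, 8, seed=3)
print(sample)

# Periodicity problem: the values repeat with period 4, the same as the interval.
periodic_values = [10, 20, 30, 40] * 10
periodic_sample = systematic_sample(periodic_values, 4, seed=3)
print(statistics.fmean(periodic_values), statistics.fmean(periodic_sample))
```

In the second example every sampled value comes from the same position in the cycle, so the sample mean cannot equal the population mean of 25, whatever the starting point.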

3.5 Cluster Sampling

Figure 3.5: The idea of clustered sampling (Dan Kernler).

Definition: Dividing the population into clusters, randomly selecting some clusters, and then including all individuals from the chosen clusters in the sample.
Advantages:
- Cost-effective, especially for geographically dispersed populations.
- Reduces logistical challenges compared to other methods.
Disadvantages:
- Usually less precise than other methods for the same sample size, because individuals within a cluster tend to resemble each other.
- Requires accurate information on cluster characteristics.
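
One-stage cluster sampling, as defined above, can be sketched like this. The function `cluster_sample` and the layout of 8 clusters with 5 units each are assumptions for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and include every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # choose cluster names at random
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample


# Hypothetical population: 8 clusters with 5 units each.
clusters = {f"col{c}": [c * 5 + r for r in range(5)] for c in range(8)}
chosen, sample = cluster_sample(clusters, 2, seed=7)
print(chosen, sample)
```

The randomness enters only at the cluster level: once a cluster is chosen, all of its units are taken.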

3.6 Example - Classroom Sampling

3.6.1 Descriptive Statistics (Population)

Table 3.1: The Classroom data

Characteristic	N = 40¹
Age	30 (28, 34)
Gender
Female	23 (58%)
Male	17 (43%)
¹ Median (Q1, Q3); n (%)

3.6.1.1 Histogram

3.6.1.2 Classroom Age

Figure 3.7: Where students are seated (Age)

3.6.1.3 Classroom Gender

Figure 3.8: Where students are seated (Gender)

3.6.1.4 Population Values

\[\mu_{Age} = 30.4\] \[sd_{Age} = 4.18\]

These are the values we want to estimate using the sampling strategies introduced above.

3.6.2 Simple Random Sampling

Table 3.2: The means and standard deviations of the classroom data for n = 5, 10, 15, 20

n	mean in years	standard deviation in years
5	29.80	5.36
10	30.50	3.37
15	30.93	5.09
20	31.00	4.00
population	30.40	4.18
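
As a worked illustration, the standard error formula from Section 3.1.1 can be applied to the sample standard deviations in Table 3.2. The dictionary name `table_3_2` is just a label for this sketch; the numbers are copied from the table.

```python
import math

# Sample size -> sample standard deviation in years, copied from Table 3.2.
table_3_2 = {5: 5.36, 10: 3.37, 15: 5.09, 20: 4.00}

# SE = sd / sqrt(n) for each sample.
standard_errors = {n: sd / math.sqrt(n) for n, sd in table_3_2.items()}

for n, se in standard_errors.items():
    print(f"n={n:>2}  sd={table_3_2[n]:.2f}  SE={se:.2f}")
```

The resulting standard errors are roughly 2.40, 1.07, 1.31, and 0.89 years. They do not fall monotonically, because the sample with n = 15 happens to have a larger standard deviation than the one with n = 10.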

3.6.2.1 \(n = 5\)

Figure 3.9: Where students are seated (Age) and the 5 samples

3.6.2.2 \(n = 10\)

Figure 3.10: Where students are seated (Age) and the 10 samples

3.6.2.3 \(n = 15\)

Figure 3.11: Where students are seated (Age) and the 15 samples

3.6.2.4 \(n = 20\)

Figure 3.12: Where students are seated (Age) and the 20 samples

3.6.2.5 Data

Figure 3.13: Comparison of sample to the population

3.6.2.6 Mean Comparison

Figure 3.14: The difference in means at different sample sizes

3.6.2.7 SD Comparison

Figure 3.15: The difference in sd at different sample sizes

3.6.3 Systematic Sampling

Sample every \(k\)th student.

Table 3.3: The output of systematic sampling for every 8th, 5th, 3rd, and 2nd student

k	mean in years	standard deviation in years
8	33.40	3.58
5	30.62	4.41
3	31.57	4.07
2	30.95	4.59
population	30.40	4.18

3.6.3.1 \(k = 8\)

Figure 3.16: Where students are seated (Age) and every 8th sample

3.6.3.2 \(k = 5\)

Figure 3.17: Where students are seated (Age) and every 5th sample

3.6.3.3 \(k = 2\)

Figure 3.18: Where students are seated (Age) and every 2nd sample

3.6.3.4 Data

3.6.3.5 Mean Comparison

Figure 3.20: The difference in means at different sampling intervals

3.6.3.6 SD Comparison

Figure 3.21: The difference in sd at different sampling intervals

3.6.4 Stratified Sampling

Draw a sample stratified by a characteristic, here gender, so that each gender is represented in the sample in the same proportion as in the population.
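
How many students come from each stratum can be worked out from the gender counts in Table 3.1 (23 female, 17 male). The sketch below uses a largest-remainder rule to turn the exact proportional shares into whole numbers. That rounding rule and the function name `proportional_allocation` are assumptions; the source does not state how the allocation was rounded.

```python
import math


def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    exact = {s: n * size / total for s, size in stratum_sizes.items()}
    alloc = {s: math.floor(x) for s, x in exact.items()}  # round down first
    leftover = n - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


# Stratum sizes from Table 3.1.
classroom = {"Female": 23, "Male": 17}
allocations = {n: proportional_allocation(classroom, n) for n in (5, 10, 15, 20)}
for n, alloc in allocations.items():
    print(n, alloc)
```

For n = 5, for example, the exact shares are 2.875 and 2.125, which this rule rounds to 3 female and 2 male students.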

3.6.4.1 \(\text{proportion} = 12\% \rightarrow n = 5\)

Figure 3.22: Where students are seated (Age), stratified according to gender (\(12\%\))

3.6.4.2 \(\text{proportion} = 25\% \rightarrow n = 10\)

Figure 3.23: Where students are seated (Age), stratified according to gender (\(25\%\))

3.6.4.3 \(\text{proportion} = 38\% \rightarrow n = 15\)

Figure 3.24: Where students are seated (Age), stratified according to gender (\(38\%\))

3.6.4.4 \(\text{proportion} = 50\% \rightarrow n = 20\)

Figure 3.25: Where students are seated (Age), stratified according to gender (\(50\%\))

3.6.4.5 Data

Figure 3.26: The data of the stratified sampling

3.6.4.6 Mean Comparison

Figure 3.27: The difference in means at different sample sizes

3.6.4.7 SD Comparison

Figure 3.28: The difference in sd at different sample sizes

3.6.5 Clustered Sampling

Clusters are logical units which are sampled in order to save sampling resources.

In our case, the clusters are columns of students.

3.6.5.1 One Cluster

Figure 3.29: Where students are seated (Age), cluster one

3.6.5.2 Two Clusters

Figure 3.30: Where students are seated (Age), two clusters

3.6.5.3 Three Clusters

Figure 3.31: Where students are seated (Age), three clusters

3.6.5.4 Four Clusters

Figure 3.32: Where students are seated (Age), four clusters

3.6.5.5 Data

Figure 3.33: The data of the clustered sampling

3.6.5.6 Mean Comparison

Figure 3.34: The difference in means at different sample sizes

3.6.5.7 SD Comparison

Figure 3.35: The difference in sd at different sample sizes

3.6.6 Overall Comparison of Sampling Strategies (Mean)

Figure 3.36: A graphical comparison of the absolute difference in means per sample strategy

3.6.7 Overall Comparison of Sampling Strategies (SD)

Figure 3.37: A graphical comparison of the absolute difference in standard deviations per sample strategy

3.7 Bootstrapping

Figure 3.38: The idea of bootstrapping (Biggerj1, Marsupilami)

Definition: Estimating the sampling distribution of a statistic by repeatedly drawing new samples, with replacement, from the observed data. This gives insight into the variability of the statistic without strict assumptions about the population distribution.
Advantages:
- Non-parametric: Works without assuming a specific data distribution.
- Confidence Intervals: Facilitates easy estimation of confidence intervals.
- Robustness: Useful when the data distribution is unknown or when analytical formulas are unavailable; with very small samples the results should be interpreted with caution.
Disadvantages:
- Computationally Intensive: Resource-intensive for large datasets.
- The quality of the results depends on the representativeness of the initial sample (garbage in, garbage out).
- Cannot compensate for inadequate information in the original sample.
- Not Always Optimal: Traditional methods may be better in cases meeting distribution assumptions.
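
A minimal bootstrap sketch for the mean, using a simple percentile interval. The function names, the ten ages, the number of replicates, and the choice of the percentile method are all assumptions for illustration; other interval constructions exist.

```python
import random
import statistics


def bootstrap_means(sample, replicates=5_000, seed=None):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(replicates)]


def percentile_interval(values, level=0.95):
    """Simple percentile interval from the sorted bootstrap statistics."""
    ordered = sorted(values)
    alpha = (1 - level) / 2
    lower = ordered[int(alpha * len(ordered))]
    upper = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lower, upper


# Hypothetical observed sample of ages in years, for illustration only.
ages = [27, 28, 28, 29, 30, 30, 31, 33, 34, 36]
means = bootstrap_means(ages, seed=0)
low, high = percentile_interval(means)
print(f"bootstrap SE = {statistics.stdev(means):.2f}")
print(f"95% percentile interval for the mean: ({low:.2f}, {high:.2f})")
```

The standard deviation of the bootstrap means serves as an estimate of the standard error, and it can be compared with the analytical \(sd/\sqrt{n}\) from Section 3.1.1.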