4  Inferential Statistics


Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Inferential statistics involves making predictions, generalizations, or inferences about a population based on a sample of data. These techniques are used when researchers want to draw conclusions beyond the specific data they have collected. Inferential statistics help answer questions about relationships, differences, and associations within a population.

4.1 Hypothesis Testing - Basics

Figure 4.1: We are hypotheses.

Null Hypothesis (H0): This is the default or status quo assumption. It represents the belief that there is no significant change, effect, or difference in the production process. It is often denoted as a statement of equality (e.g., the mean production rate is equal to a certain value).

Alternative Hypothesis (Ha): This is the claim or statement we want to test. It represents the opposite of the null hypothesis, suggesting that there is a significant change, effect, or difference in the production process (e.g., the mean production rate is not equal to a certain value).

4.1.1 The drive shaft exercise - Hypotheses

During the QC of the drive shaft \(n=100\) samples are taken and the diameter is measured with an accuracy of \(\pm 0.01mm\). Is the true mean of all produced drive shafts within the specification?

For this we can formulate the hypotheses.

H0:
The drive shaft diameter is within the specification.
Ha:
The drive shaft diameter is not within the specification.

In the following, we will explore how to test these hypotheses.

4.2 Confidence Intervals

A Confidence Interval is a statistical concept used to estimate a range of values within which a population parameter, such as a population mean or proportion, is likely to fall. It provides a way to express the uncertainty or variability in our sample data when making inferences about the population. In other words, it quantifies the level of confidence we have in our estimate of a population parameter.

Confidence intervals are typically expressed as a range with an associated confidence level. The confidence level, often denoted as \(1-\alpha\), represents the probability that the calculated interval contains the true population parameter. Common confidence levels include \(90\%\), \(95\%\), and \(99\%\).

There are different ways of calculating CI.

  1. For the population mean \(\mu_0\) when the population standard deviation \(\sigma_0\) is known (\(\eqref{ci01}\)).

\[\begin{align} CI = \bar{X} \pm Z \frac{\sigma_0}{\sqrt{n}} \label{ci01} \end{align}\]

  • \(\bar{X}\) is the sample mean.

  • \(Z\) is the critical value from the standard normal distribution corresponding to the desired confidence level (e.g., \(1.96\) for a \(95\%\) confidence interval).

  • \(\sigma_0\) is the population standard deviation

  • \(n\) is the sample size

  2. For the population mean \(\mu_0\) when the population standard deviation \(\sigma_0\) is unknown (t-confidence interval), see \(\eqref{ci02}\).

\[\begin{align} CI = \bar{X} \pm t \frac{sd}{\sqrt{n}} \label{ci02} \end{align}\]

  • \(\bar{X}\) is the sample mean.

  • \(t\) is the critical value from the t-distribution with \(n-1\) degrees of freedom corresponding to the desired confidence level

  • \(sd\) is the sample standard deviation

  • \(n\) is the sample size

  3. For a population proportion \(p\), see \(\eqref{ci03}\).

\[\begin{align} CI = \hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \label{ci03} \end{align}\]

  • \(\hat{p}\) is the sample proportion

  • \(Z\) is the critical value from the standard normal distribution corresponding to the desired confidence level

  • \(n\) is the sample size

The method for calculating confidence intervals varies with the parameter being estimated. When estimating a population median or the difference between two population means, other statistical techniques may be used.
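As a minimal sketch (assuming a hypothetical numeric sample x of drive shaft diameters), the t-confidence interval from \(\eqref{ci02}\) could be computed in R as follows.

set.seed(1)
x <- rnorm(100, mean = 12, sd = 0.1)   # hypothetical sample of n = 100 diameters

n     <- length(x)
x_bar <- mean(x)                       # sample mean
s     <- sd(x)                         # sample standard deviation
alpha <- 0.05

t_crit <- qt(1 - alpha / 2, df = n - 1)         # critical t value, n - 1 degrees of freedom

ci <- x_bar + c(-1, 1) * t_crit * s / sqrt(n)   # 95% confidence interval for the mean
ci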

4.2.1 The drive shaft exercise - Confidence Intervals

Figure 4.2: The 95% CI for the drive shaft data.

The \(95\%\) CI for the drive shaft data is shown in Figure 4.2. For comparison, the histogram with an overlaid density curve is plotted. The highlighted area spans the lower and upper limits of the CI, and the calculated mean is shown as a dashed line.

4.3 Significance Level

The significance level \(\alpha\) is a critical component of hypothesis testing in statistics. It represents the maximum acceptable probability of making a Type I error, which is the error of rejecting a null hypothesis when it is actually true. In other words, \(\alpha\) is the probability of concluding that there is an effect or relationship when there isn’t one. Commonly used significance levels include \(0.05 (5\%)\), \(0.01 (1\%)\), and \(0.10 (10\%)\). The choice of \(\alpha\) depends on the context of the study and the desired balance between making correct decisions and minimizing the risk of Type I errors.

4.4 False negative - risk

The risk of a false negative outcome is called the \(\beta\)-risk. It is calculated using statistical power analysis. Statistical power is the probability of correctly rejecting a null hypothesis when it is false, which is essentially the complement of beta (\(\beta\)).

\[\begin{align} \beta = 1 - \text{Power} \end{align}\]

4.5 Power Analysis

Statistical power is calculated using software, statistical tables, or calculators specifically designed for this purpose. Generally speaking, the greater the statistical power, the more likely the study is to detect a true effect and thus to provide evidence for rejecting \(H_0\). Power analysis is also very useful for determining the sample size before the actual experiments are conducted. Below is an example of a power calculation for a two-sample t-test.

\[ \text{Power} = 1 - \beta = P\left(\frac{{|\bar{X}_1 - \bar{X}_2|}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} > Z_{\frac{\alpha}{2}} - \frac{\delta}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\right) \]

  1. Effect Size: This represents the magnitude of the effect you want to detect. Larger effects are easier to detect than smaller ones.

  2. Significance Level (\(\alpha\)): This is the predetermined level of significance that defines how confident you want to be in rejecting the null hypothesis (e.g., typically set at 0.05).

  3. Sample Size (\(n\)): The number of observations or participants in your study. Increasing the sample size generally increases the power of the test.

  4. Power (\(1 - \beta\)): This is the probability of correctly rejecting the null hypothesis when it is false. Higher power is desirable, as it minimizes the chances of a Type II error (failing to detect a true effect).

  5. Type I Error (\(\alpha\)): The probability of incorrectly rejecting the null hypothesis when it is true. This is typically set at \(0.05\) or \(5\%\) in most studies.

  6. Type II Error (\(\beta\)): The probability of failing to reject the null hypothesis when it is false. Power is the complement of \(\beta\) (\(Power = 1 - \beta\)).

Figure 4.3: The coin toss with the respective probabilites (Champely 2020).
H0:
The coin is fair and lands heads \(50\%\) of the time.
Ha:
The coin is loaded and lands heads more than \(50\%\) of the time.
library(pwr)   # power analysis functions (Champely 2020)

# solve for the sample size of a one-proportion test (arcsine-transformed effect size)
pwr.p.test(h = ES.h(p1 = 0.75, p2 = 0.50),
           sig.level = 0.05,
           power = 0.80,
           alternative = "greater")

     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.5235988
              n = 22.55126
      sig.level = 0.05
          power = 0.8
    alternative = greater

A sample size of \(n = 23\), i.e. \(23\) coin flips, yields a statistical power of \(80\%\) at a significance level of \(\alpha = 0.05\) (\(\beta = 1-\text{power} = 0.2 = 20\%\)). But what if the sample size varies? This is the subject of Figure 4.4. On the x-axis the power is shown (or the \(\beta\)-risk on the upper x-axis), whereas the sample size \(n\) is depicted on the y-axis. To increase the power by \(10\%\) to \(90\%\), the sample size must be increased by \(11\). A further power increase of \(5\%\) would in turn require a sample size of \(n = 40\). This highlights the non-linear nature of power calculations and why they are important for experimental planning.
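The calculation can also be run the other way around: fixing the sample size and solving for the achieved power. A minimal sketch with the same (assumed) proportions as above; looping such a call over several values of n reproduces curves like the one in Figure 4.4.

library(pwr)

# power achieved with n = 23 flips (roughly 0.8, consistent with the calculation above)
pwr.p.test(h = ES.h(p1 = 0.75, p2 = 0.50),
           n = 23,
           sig.level = 0.05,
           alternative = "greater")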

Figure 4.4: The power vs. the sample size
Figure 4.5: The power vs. the sample size for different effect sizes

4.5.1 A word on Effect Size

Cohen (Cohen 2013) describes effect size as “the degree to which the null hypothesis is false.” In the coin flipping example, this is the difference between \(75\%\) and \(50\%\). We could say the effect was 25% but recall we had to transform the absolute difference in proportions to another quantity using the ES.h function. This is a crucial part of doing power analysis correctly: An effect size must be provided on the expected scale. Doing otherwise will produce wrong sample size and power calculations.

When in doubt, Conventional Effect Sizes can be used. These are pre-determined effect sizes for “small”, “medium”, and “large” effects, see Cohen (2013).

4.6 p-value

Figure 4.6: Type I and Type II error in the context of inferential statistics.

The p-value is a statistical measure that quantifies the evidence against a null hypothesis. It represents the probability of obtaining test results as extreme or more extreme than the ones observed, assuming the null hypothesis is true. In hypothesis testing, a smaller p-value indicates stronger evidence against the null hypothesis. If the p-value is less than or equal to \(\alpha\) (\(p \leq \alpha\)), you reject the null hypothesis. If the p-value is greater than \(\alpha\) ( \(p > \alpha\) ), you fail to reject the null hypothesis. A common threshold for determining statistical significance is to reject the null hypothesis when \(p\leq\alpha\).

The p-value however does not say anything about the effect size, which can be practically negligible (Nuzzo 2014). The p-value quantifies the evidence against \(H_0\); it is not a measure of the magnitude or relative importance of an effect. Therefore, the CI and the effect size should always be reported together with a p-value. Some researchers even claim that most published research findings are false (Ioannidis 2005). In practice, especially in the manufacturing industry, the p-value and its use remain popular. Before implementing any measures in a series production, these questions will be asked. The confident and reliable engineer asks them beforehand and is always their own greatest critic.

4.7 Statistical errors

Figure 4.7: The statistical Errors (Type I and Type II).

A Type I error occurs when a null hypothesis that is actually true is rejected. In other words, it’s a false alarm. It is concluded that there is a significant effect or difference when there is none. The probability of committing a Type I error is denoted by the significance level \(\alpha\). Example: Imagine a drug trial where the null hypothesis is that the drug has no effect (it’s ineffective), but due to random chance, the data appears to show a significant effect, and you incorrectly conclude that the drug is effective (Type I error).

A Type II error occurs when a null hypothesis that is actually false is not rejected. It means failing to detect a significant effect or difference when one actually exists. The probability of committing a Type II error is denoted by the symbol \(\beta\). Example: In a criminal trial, the null hypothesis might be that the defendant is innocent, but they are actually guilty. If the jury fails to find enough evidence to convict the guilty person, it is a Type II error.

A Type I error is falsely concluding that there is an effect or difference when there is none (false positive). A Type II error is failing to conclude that there is an effect or difference when there actually is one (false negative).

The relationship between Type I and Type II errors is often described as a trade-off. As the risk of Type I errors is reduced by lowering the significance level (\(\alpha\)), the risk of Type II errors (\(\beta\)) is typically increased (Figure 4.6). This trade-off is inherent in hypothesis testing, and the choice of significance level depends on the specific goals and context of the study. Researchers often aim to strike a balance between these two types of errors based on the consequences and costs associated with each. This balance is a critical aspect of the design and interpretation of statistical tests.

4.8 Parametric and Non-parametric Tests

Parametric and non-parametric tests in statistics are methods used for analyzing data. The primary difference between them lies in the assumptions they make about the underlying data distribution:

  1. Parametric Tests:
    • These tests assume that the data follows a specific probability distribution, often the normal distribution.
    • Parametric tests make assumptions about population parameters like means and variances.
    • They are more powerful when the data truly follows the assumed distribution.
    • Examples of parametric tests include t-tests, ANOVA, regression analysis, and parametric correlation tests.
  2. Non-Parametric Tests:
    • Non-parametric tests make minimal or no assumptions about the shape of the population distribution.
    • They are more robust and can be used when data deviates from a normal distribution or when dealing with ordinal or nominal data.
    • Non-parametric tests are generally less powerful compared to parametric tests but can be more reliable in certain situations.
    • Examples of non-parametric tests include the Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test, and Spearman’s rank correlation.

The choice between parametric and non-parametric tests depends on the nature of the data and the assumptions. Parametric tests are appropriate when data follows the assumed distribution, while non-parametric tests are suitable when dealing with non-normally distributed data or ordinal data. Some examples for parametric and non-parametric tests are given in Table 4.1.

Table 4.1: Some parametric and non-parametric statistical tests.

Parametric Tests      Non-Parametric Tests
One-sample t-test     Wilcoxon signed rank test
Paired t-test         Mann-Whitney U test
Two-sample t-test     Kruskal Wallis test
One-Way ANOVA         Welch Test

4.9 Paired and Independent Tests

Figure 4.8: The difference between paired and independent Tests.
  1. Paired Statistical Test:
  • Paired tests are used when there is a natural pairing or connection between two sets of data points. This pairing is often due to repeated measurements on the same subjects or entities.
  • They are designed to assess the difference between two related samples, such as before and after measurements on the same group of individuals.
  • The key idea is to reduce variability by considering the differences within each pair, which can increase the test sensitivity.
  2. Independent Statistical Test:
  • Independent tests, also known as unpaired or two-sample tests, are used when there is no inherent pairing between the two sets of data.
  • These tests are typically applied to compare two separate and unrelated groups or samples.
  • They assume that the data in each group is independent of the other, meaning that the value in one group doesn’t affect the value in the other group.

A typical example of a paired test is comparing measurements on the same subjects at two different points in time (see Figure 4.8).

4.10 Distribution Tests

The importance of testing for normality (or other distributions) lies in the fact that various statistical techniques, such as parametric tests (e.g., t-tests, ANOVA), are based on distributional assumptions, for example normality. When data deviates significantly from a normal distribution, using these parametric methods can lead to incorrect conclusions and biased results. Therefore, it is essential to determine how a dataset is approximately distributed before applying such techniques.

Several tests for normality are available, with the most common ones being the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the Anderson-Darling test. These tests provide a quantitative measure of how well the data conforms to a normal distribution.

In practice, it is important to interpret the results of these tests cautiously. Sometimes, a minor departure from normality may not affect the validity of parametric tests, especially when the sample size is large. In such cases, using non-parametric methods may be an alternative. However, in cases where normality assumptions are crucial, transformations of the data or choosing appropriate non-parametric tests may be necessary to ensure the reliability of statistical analyses.

Tests for normality do not free you from the burden of thinking for yourself.

4.10.1 Quantile-Quantile plots

Quantile-Quantile plots are a graphical tool used in statistics to assess whether a dataset follows a particular theoretical distribution, typically the normal distribution. They provide a visual comparison between the observed quantiles1 of the data and the quantiles expected from the chosen theoretical distribution.

A step-by-step explanation of how QQ plots work follows.

4.10.1.1 Sample data

In Table 4.2 \(n=10\) datapoints are shown as a sample dataset.

Table 4.2: 10 randomly sampled datapoints for the creation of the QQ-plot
x smpl_no
-0.56047565 1
-0.23017749 2
1.55870831 3
0.07050839 4
0.12928774 5
1.71506499 6
0.46091621 7
-1.26506123 8
-0.68685285 9
-0.44566197 10

4.10.1.2 Data Sorting

To create a QQ plot, the data must be sorted in ascending order.

Table 4.3: The sorted data points.
x smpl_no
-1.26506123 8
-0.68685285 9
-0.56047565 1
-0.44566197 10
-0.23017749 2
0.07050839 4
0.12928774 5
0.46091621 7
1.55870831 3
1.71506499 6

4.10.1.3 Theoretical Quantiles

Theoretical quantiles are calculated based on the chosen distribution (e.g., the normal distribution). These quantiles represent the expected values if the data perfectly follows that distribution.
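A sketch of how such theoretical quantiles could be obtained in R for the sample from Table 4.2; the plotting positions produced by ppoints() are one common convention and may differ slightly from the values tabulated below.

x <- c(-0.56047565, -0.23017749, 1.55870831, 0.07050839, 0.12928774,
       1.71506499, 0.46091621, -1.26506123, -0.68685285, -0.44566197)

x_sorted <- sort(x)                    # observed (sample) quantiles
p        <- ppoints(length(x_sorted))  # plotting positions in (0, 1)
q_theo   <- qnorm(p)                   # theoretical quantiles of the standard normal

cbind(sample = x_sorted, theoretical = q_theo)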

Table 4.4: The calculated theoretical quantiles
x smpl_no x_norm x_thrtcl
-1.26506123 8 -1.404601888 0.08006985
-0.68685285 9 -0.798376211 0.21232610
-0.56047565 1 -0.665875352 0.25274539
-0.44566197 10 -0.545498338 0.29270541
-0.23017749 2 -0.319572479 0.37464622
0.07050839 4 -0.004316756 0.49827787
0.12928774 5 0.057310762 0.52285118
0.46091621 7 0.405008410 0.65726434
1.55870831 3 1.555994430 0.94014529
1.71506499 6 1.719927421 0.95727718

4.10.1.4 Plotting Points

Figure 4.9: The QQ points as calculated before.

For each data point, a point is plotted in the QQ plot. The x-coordinate of the point corresponds to the theoretical quantile, and the y-coordinate corresponds to the observed quantile from the data, see Figure 4.9.

4.10.1.5 Perfect Normal Distribution

Figure 4.10: A perfect normal distribution would be indicated if all points would fall on this straight line.

In the case of a perfect normal distribution, all the points would fall along a straight line at a 45-degree angle. If the data deviates from normality, the points may deviate from this line in specific ways, see Figure 4.10.

4.10.1.6 Interpretation

Figure 4.11: The QQ line as plotted using the theoretical and sample quantiles.

Deviations from the straight line suggest departures from the assumed distribution. For example, if points curve upward, it indicates that the data has heavier tails than a normal distribution. If points curve downward, it suggests lighter tails. S-shaped curves or other patterns can reveal additional information about the data’s distribution. In Figure 4.11 the QQ-points are shown together with the respective QQ-line and a line of perfectly normally distributed points. Some deviations can be seen, but it is hard to judge whether the data is normally distributed or not.

4.10.1.7 Confidence Interval

Figure 4.12: The QQ plot with confidence bands.

Because it is hard to judge from Figure 4.11 whether the points are normally distributed, it makes sense to add limits for normally distributed points. This is shown in Figure 4.12. The gray area depicts the (\(95\%\)) confidence bands for a normal distribution. All the points fall into this area, as does the line. This indicates that the points are likely to be normally distributed.

4.10.1.8 The drive shaft exercise

Figure 4.13: The QQ plots for each drive shaft group shown in subplots.

The QQ plot method is extended to the drive shaft exercise in Figure 4.13. In each subplot the plot for the respective group is shown together with the QQ-points, the QQ-line and the respective confidence bands. The scaling of each plot is different to enhance the visibility of every subplot. A line for the perfect normal distribution is also shown in solid linestyle. From group \(1 \ldots 4\) all points fall into the QQ confidence bands. Group05, however, differs: the points form visible categories, which is a strong indicator that the measurement system may be too inaccurate.

4.10.2 Quantitative Methods

Figure 4.14: A visualisation of the KS test using the 10 datapoints from before

The Kolmogorov-Smirnov test for normality, often referred to as the KS test, is a statistical test used to assess whether a dataset follows a normal distribution. It evaluates how closely the cumulative distribution function of the dataset matches the expected CDF of a normal distribution.

  1. Null Hypothesis (H0): The null hypothesis in the KS test states that the sample data follows a normal distribution.

  2. Alternative Hypothesis (Ha): The alternative hypothesis suggests that the sample data significantly deviates from a normal distribution.

  3. Test Statistic (D): The KS test calculates a test statistic, denoted as D which measures the maximum vertical difference between the empirical CDF of the data and the theoretical CDF of a normal distribution. It quantifies how far the observed data diverges from the expected normal distribution. A visualization of the KS-test is shown in Figure 4.14. The red line denotes a perfect normal distribution, whereas the step function shows the empirical CDF of the data itself.

  4. Critical Value: To assess the significance of D, a critical value is determined based on the sample size and the chosen significance level (\(\alpha\)). If D exceeds the critical value, it indicates that the dataset deviates significantly from a normal distribution.

  5. Decision: If D is greater than the critical value, the null hypothesis is rejected, and it is concluded that the data is not normally distributed. If D is less than or equal to the critical value, there is not enough evidence to reject the null hypothesis, suggesting that the data may follow a normal distribution.

It is important to note that the KS test is sensitive to departures from normality in both tails of the distribution. There are other normality tests, like the Shapiro-Wilk test and Anderson-Darling test, which may be more suitable in certain situations. Researchers typically choose the most appropriate test based on the characteristics of their data and the assumptions they want to test.
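A minimal sketch of both tests in R, assuming a hypothetical numeric vector x of measurements; note that estimating the mean and standard deviation from the same data makes the KS test only approximate (a Lilliefors-type correction would be stricter).

set.seed(1)
x <- rnorm(100, mean = 12, sd = 0.1)             # hypothetical measurement data

ks.test(x, "pnorm", mean = mean(x), sd = sd(x))  # KS test against a fitted normal CDF
shapiro.test(x)                                  # Shapiro-Wilk test as an alternative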

4.10.3 Expanding to non-normal distributions

(a) the QQ-plot for the weibull distribution using the drive shaft failure time data
(b) a detrended QQ-plot
Figure 4.15: The QQ-plot can easily be extended to non-normal distributions.

The QQ-plot can easily be extended to non-normal distributions as well. This is shown in Figure 4.15. In Figure 4.15 (a) a classic QQ-plot for the drive shaft failure time data from Figure 2.25 is shown. The same rules as before still apply; they are only extended to the Weibull distribution. In Figure 4.15 (b) a detrended QQ-plot is shown in order to account for visual bias. In this case it is known that the data follows a Weibull distribution with a shape parameter \(\beta=2\) and a scale parameter \(\lambda = 500\), but such distributional parameters can also be estimated (Delignette-Muller and Dutang 2015).

4.11 Test 1 Variable

Figure 4.16: Statistical tests for one variable.

4.11.1 One Proportion Test

Table 4.5: The raw data for the proportion test.
Category Count Total plt_lbl
A 35 100 35 counts 100 trials
B 20 100 20 counts 100 trials

The one proportion test is used on categorical data with a binary outcome, such as success or failure. Its prerequisite is having a known or hypothesized population proportion that the sample proportion shall be compared to. This test helps determine if the sample proportion significantly differs from the population proportion, making it valuable for studies involving proportions and percentages.

Table 4.6: The test results for the proportion test.
estimate1 estimate2 statistic p.value parameter conf.low conf.high alternative
0.350 0.200 4.915 0.027 1.000 0.018 0.282 two.sided
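The results in Table 4.6 are consistent with a two-sample proportion comparison; a sketch of how such a comparison could be run in R (the exact call used for the table is not shown in the text):

prop.test(x = c(35, 20),      # counts of the two categories
          n = c(100, 100))    # number of trials per category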

4.11.2 Chi2 goodness of fit test

Table 4.7: The raw data for the gof \(\chi^2\) test.
group count_n_observed
group01 100.000
group02 100.000
group03 100.000
group04 100.000
group05 100.000
Table 4.8: The test results for the gof \(\chi^2\) test.
statistic p.value parameter
0.000 1.000 4.000

The \(\chi^2\) goodness of Fit Test (gof) is applied on categorical data with expected frequencies. It is suitable for analyzing nominal or ordinal data. This test assesses whether there is a significant difference between the observed and expected frequencies in your dataset, making it useful for determining if the data fits an expected distribution.
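A sketch of the goodness-of-fit test behind Table 4.8, assuming equal expected proportions across the five groups:

observed <- c(group01 = 100, group02 = 100, group03 = 100,
              group04 = 100, group05 = 100)

chisq.test(observed)   # default: equal expected frequencies for all categories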

4.11.3 One-sample t-test

The one-sample t-test is designed for continuous data when you have a known or hypothesized population mean that you want to compare your sample mean to. It relies on the assumption of normal distribution, making it applicable when assessing whether a sample’s mean differs significantly from a specified population mean.

The test can be applied in various settings. One is to test whether measured data comes from a population with a certain mean (for example, a test against a specification). To show the application, the drive shaft data is employed. In Table 4.9 the per-group summarised drive shaft data is shown.

Table 4.9: The raw data for the one sample t-test.
group mean_diameter sd_diameter
group01 12.015 0.111
group02 12.364 0.189
group03 13.002 0.102
group04 11.486 0.094
group05 12.001 0.026

One important prerequisite for the one-sample t-test is normally distributed data. For this, graphical and numerical methods have been introduced in previous chapters. First, a classic QQ-plot is created for every group (see Figure 4.17). At first glance, the data appears to be normally distributed.

Figure 4.17: The qq-plot for the drive shaft data

A more quantitative approach to testing for normality is shown in Table 4.10. Here, each group is tested with the KS-test for normality. \(H_0\) is accepted (the data is normally distributed) because the computed p-value is larger than the significance level (\(\alpha = 0.05\)).

Table 4.10: The results of the KS normality test for each group.

group     statistic   p.value   method                                          alternative
group01   0.048       0.975     Asymptotic one-sample Kolmogorov-Smirnov test   two-sided
group02   0.067       0.754     Asymptotic one-sample Kolmogorov-Smirnov test   two-sided
group03   0.075       0.633     Asymptotic one-sample Kolmogorov-Smirnov test   two-sided
group04   0.060       0.862     Asymptotic one-sample Kolmogorov-Smirnov test   two-sided
group05   0.127       0.081     Asymptotic one-sample Kolmogorov-Smirnov test   two-sided

There is sufficient evidence to assume normally distributed data within each group. The next step is to test whether the data comes from a certain population mean (\(\mu_0\)). In this case, the population mean is the specified drive shaft diameter of \(12\,mm\).
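For a single group, such a test could be run as follows (a sketch; group01 is assumed to be a data frame with a diameter column from the drive shaft dataset). Applied per group, this yields results like those in Table 4.11.

t.test(group01$diameter, mu = 12)   # one-sample t-test against the specification of 12 mm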

Table 4.11: The results for the one sample t-test (against mean = 12 mm).

group     estimate   statistic   p.value   parameter   conf.low   conf.high   method              alternative
group01   12.015     1.391       0.167     99.000      11.993     12.038      One Sample t-test   two.sided
group02   12.364     19.274      0.000     99.000      12.326     12.401      One Sample t-test   two.sided
group03   13.002     97.769      0.000     99.000      12.982     13.022      One Sample t-test   two.sided
group04   11.486     −54.441     0.000     99.000      11.468     11.505      One Sample t-test   two.sided
group05   12.001     0.418       0.677     99.000      11.996     12.006      One Sample t-test   two.sided

4.11.4 One sample Wilcoxon test

For situations where your data may not follow a normal distribution or when dealing with ordinal data, the one-sample Wilcoxon test is a non-parametric alternative to the t-test. It is used to evaluate whether a sample’s median significantly differs from a specified population median.

The wear and tear of drive shafts can occur due to various factors related to the vehicle’s operation and maintenance. Some common causes include:

  1. Normal Usage: Over time, the drive shaft undergoes stress and strain during regular driving. This can lead to gradual wear on components, especially if the vehicle is frequently used.

  2. Misalignment: Improper alignment of the drive shaft can result in uneven distribution of forces, causing accelerated wear. This misalignment may stem from issues with the suspension system or other related components.

  3. Lack of Lubrication: Inadequate lubrication of the drive shaft joints and bearings can lead to increased friction, accelerating wear. Regular maintenance, including proper lubrication, is essential to mitigate this factor.

  4. Contamination: Exposure to dirt, debris, and water can contribute to the degradation of drive shaft components. Contaminants can infiltrate joints and bearings, causing abrasive damage over time.

  5. Vibration and Imbalance: Excessive vibration or imbalance in the drive shaft can lead to increased stress on its components. This may result from issues with the balance of the rotating parts or damage to the shaft itself.

  6. Extreme Operating Conditions: Harsh driving conditions, such as off-road terrain or constant heavy loads, can accelerate wear on the drive shaft. The components may be subjected to higher levels of stress than they were designed for, leading to premature wear and tear.

The wear and tear due to the reasons above can be rated on a scale with discrete values from \(1 \ldots 5\), with \(2\) being the reference value. It is therefore of interest whether the wear and tear rating of \(n=100\) drive shafts per group differs significantly from the reference value \(2\). Because we are dealing with discrete (ordinal) data, the one-sample t-test cannot be used.
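For a single group of ratings, the test against the reference value of 2 could look as follows (a sketch with a hypothetical vector of wear ratings):

set.seed(1)
ratings <- sample(1:5, size = 100, replace = TRUE)      # hypothetical wear ratings

wilcox.test(ratings, mu = 2, alternative = "greater")   # do ratings exceed the reference value?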

Figure 4.18: The wear and tear rating data histograms.
Table 4.12: The results for the one sample Wilcoxon test for every group against the reference value.
group     statistic   p.value   alternative
group01   3,208.500   0.000     greater
group02   5,050.000   0.000     greater
group03   0.000       1.000     greater
group04   3,203.500   0.000     greater
group05   3,003.000   0.000     greater

Table 4.13: The results for the one sample t-test compared to the results of a one sample Wilcoxon test.
group     t_tidy_p.value   wilcox_tidy_p.value
group01   0.167            0.182
group02   0.000            0.000
group03   0.000            0.000
group04   0.000            0.000
group05   0.677            0.803

4.12 Test 2 Variable (Qualitative or Quantitative)

Figure 4.19: Statistical tests for two variables.

4.12.1 Cochran’s Q test

Cochran’s Q test is employed when you have categorical data with three or more related groups, often collected over time or with repeated measurements. It assesses if there is a significant difference in proportions between the related groups.

4.12.2 Chi2 test of independence

This test is appropriate when you have two categorical variables, and you want to determine if there is an association between them. It is useful for assessing whether the two variables are dependent or independent of each other.

In the context of the drive shaft production the example assumes a dataset with categorical variables like “Defects” (Yes/No) and “Operator” (Operator A/B).

4.12.2.1 Contingency tables

A contingency table, also known as a cross-tabulation or crosstab, is a statistical table that displays the frequency distribution of variables. It organizes data into rows and columns to show the frequency or relationship between two or more categorical variables. Each cell in the table represents the count or frequency of occurrences that fall into a specific combination of categories for the variables being analyzed. It is commonly used in statistics to examine the association between categorical variables and to understand patterns within data sets.

Table 4.14: The contingency table for this example.
Defects   Operator A   Operator B
No        2            3
Yes       3            2

4.12.2.2 Test results

With \(p\approx1>0.05\) the \(p\)-value is greater than the significance level of \(\alpha = 0.05\). \(H_0\) is therefore not rejected; there is no evidence for a difference between the operators. The test results are depicted below.


    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table
X-squared = 0, df = 1, p-value = 1
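A sketch of how this result could be reproduced from the contingency table in Table 4.14:

contingency_table <- matrix(c(2, 3,
                              3, 2),
                            nrow = 2, byrow = TRUE,
                            dimnames = list(Defects  = c("No", "Yes"),
                                            Operator = c("Operator A", "Operator B")))

chisq.test(contingency_table)   # Yates' continuity correction is applied for 2x2 tables by default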

4.12.3 Correlation

Figure 4.20: Correlation between two variables and the quantification thereof.

Correlation refers to a statistical measure that describes the relationship between two variables. It indicates the extent to which changes in one variable are associated with changes in another.

Correlation is measured on a scale from -1 to 1:

  • A correlation of 1 implies a perfect positive relationship, where an increase in one variable corresponds to a proportional increase in the other.

  • A correlation of -1 implies a perfect negative relationship, where an increase in one variable corresponds to a proportional decrease in the other.

  • A correlation close to 0 suggests a weak or no relationship between the variables.

Correlation doesn’t imply causation; it only indicates that two variables change together but doesn’t determine if one causes the change in the other.

4.12.3.1 Pearson Correlation

The Pearson correlation coefficient is a normalized version of the covariance.

\[\begin{align} R = \frac{\mathrm{Cov}(X,Y)}{\sigma_x \sigma_y} \end{align}\]

  • Covariance is sensitive to scale (\(mm\) vs. \(cm\))
  • Pearson correlation removes units, allowing for meaningful comparisons across datasets
Figure 4.21: The QQ-plot of both variables. There is strong evidence that they are normally distributed.
Figure 4.22: Correlation between rpm of lathe machine and the diameter of the drive shaft.

    Pearson's product-moment correlation

data:  drive_shaft_rpm_dia$rpm and drive_shaft_rpm_dia$diameter
t = 67.895, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9406732 0.9578924
sample estimates:
 cor 
0.95 

When you have two continuous variables and want to measure the strength and direction of their linear relationship, Pearson correlation is the go-to choice (Pearson 1895). It assumes normally distributed data, is particularly valuable for exploring linear associations between variables, and is calculated via \(\eqref{pearcorr}\).

\[\begin{align} R = \frac{\sum_{i = 1}^{n}(x_i - \bar{x}) \times (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\times \sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \label{pearcorr} \end{align}\]

The Pearson correlation coefficient works best with normally distributed data. The normal distribution of the data is verified in Figure 4.21.
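The output above corresponds to a call of the following form; the data frame drive_shaft_rpm_dia with the columns rpm and diameter is taken from the surrounding example.

cor.test(drive_shaft_rpm_dia$rpm,
         drive_shaft_rpm_dia$diameter,
         method = "pearson")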

4.12.3.2 Spearman Correlation

Spearman (Spearman 1904) correlation is a non-parametric alternative to Pearson correlation. It is used when the data is not normally distributed or when the relationship between variables is monotonic but not necessarily linear.

\[\begin{align} \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \label{spearcorr} \end{align}\]

In Figure 4.23 the example data for a drive shaft production is shown. The Production_Time and the Defects seem to form a relationship, but the data does not appear to be normally distributed. This can also be seen in the QQ-plots of both variables in Figure 4.24.

The Spearman correlation coefficient (\(\rho\)) is based on the Pearson correlation, but applied to ranked data; in \(\eqref{spearcorr}\), \(d_i\) is the difference between the ranks of the \(i\)-th pair of observations.
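A minimal sketch of this relationship with hypothetical vectors x and y: the Spearman coefficient equals the Pearson coefficient computed on the ranks (exactly so when there are no ties).

set.seed(42)
x <- rexp(50)                   # hypothetical, non-normal data
y <- x^2 + rnorm(50, sd = 0.1)  # monotonic but non-linear relationship

cor(x, y, method = "spearman")  # Spearman's rho
cor(rank(x), rank(y))           # identical: Pearson correlation of the ranks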

Figure 4.23: The relationship between the production time and the number of defects.
Figure 4.24: The QQ-plots of both variables.

4.12.3.3 Correlation - methodological limits

While correlation analysis and summary statistics are certainly useful, one must always consider the raw data. The data taken from Davies, Locke, and D’Agostino McGowan (2022) showcases this. The summary statistics in Table 4.15 are practically the same; one would not suspect different underlying data. When the raw data is plotted, though (Figure 4.25), it can be seen that the data is highly non-linear, forming different shapes as well as distinct categories.

Always check the raw data.

Table 4.15: The datasauRus data and the respective summary statistics.
dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
away 54.266 47.835 16.770 26.940 −0.064
bullseye 54.269 47.831 16.769 26.936 −0.069
circle 54.267 47.838 16.760 26.930 −0.068
dino 54.263 47.832 16.765 26.935 −0.064
dots 54.260 47.840 16.768 26.930 −0.060
h_lines 54.261 47.830 16.766 26.940 −0.062
high_lines 54.269 47.835 16.767 26.940 −0.069
slant_down 54.268 47.836 16.767 26.936 −0.069
slant_up 54.266 47.831 16.769 26.939 −0.069
star 54.267 47.840 16.769 26.930 −0.063
v_lines 54.270 47.837 16.770 26.938 −0.069
wide_lines 54.267 47.832 16.770 26.938 −0.067
x_shape 54.260 47.840 16.770 26.930 −0.066
Figure 4.25: The raw data from the datasauRus packages shows, that summary statistics may be misleading.

4.13 Test 2 Variables (2 Groups)

Figure 4.26: Statistical tests for two variables.

4.13.1 Test for equal variance (homoscedasticity)

Figure 4.27: The variances (\(sd^2\)) for the drive shaft data.

Tests for equal variances, also known as tests for homoscedasticity, are used to determine if the variances of two or more groups or samples are equal. Equal variances are an assumption in various statistical tests, such as the t-test and analysis of variance (ANOVA). When the variances are not equal, it can affect the validity of these tests. Three common tests for equal variances are described below; for each, the null hypothesis, the prerequisites, and the decision rule are outlined.


4.13.1.1 F-Test (Hahs-Vaughn and Lomax 2013)

  • Null Hypothesis: The variances of the different groups or samples are equal.
  • Prerequisites:
    • Independence
    • Normality
    • Number of groups \(= 2\)
  • Decisions:
    • \(p> \alpha \rightarrow\) fail to reject H0
    • \(p< \alpha \rightarrow\) reject H0
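The output below can be produced with the var.test() function; ds_wide (one diameter column per group) is assumed from the surrounding analysis.

var.test(ds_wide$group01, ds_wide$group03)   # F test for the ratio of two variances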

    F test to compare two variances

data:  ds_wide$group01 and ds_wide$group03
F = 1.1817, num df = 99, denom df = 99, p-value = 0.4076
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7951211 1.7563357
sample estimates:
ratio of variances 
          1.181736 

4.13.1.2 Bartlett Test (Bartlett 1937)

  • Null Hypothesis: The variances of the different groups or samples are equal.
  • Prerequisites:
    • Independence
    • Normality
    • Number of groups \(> 2\)
  • Decisions:
    • \(p> \alpha \rightarrow\) fail to reject H0
    • \(p< \alpha \rightarrow\) reject H0
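The output below corresponds to a call of the following form (a sketch; drive_shaft_data is an assumed long-format data frame with the columns diameter and group):

bartlett.test(diameter ~ group, data = drive_shaft_data)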

    Bartlett test of homogeneity of variances

data:  diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16

4.13.1.3 Levene Test (Olkin June)

  • Null Hypothesis: The variances of the different groups or samples are equal.
  • Prerequisites:
    • Independence
    • Number of groups \(> 2\)
  • Decisions:
    • \(p> \alpha \rightarrow\) fail to reject H0
    • \(p< \alpha \rightarrow\) reject H0
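The output below could come from the car package (a sketch; by default leveneTest() centers on the group medians, i.e., the Brown-Forsythe variant, and expects the grouping variable to be a factor):

library(car)
leveneTest(diameter ~ group, data = drive_shaft_data)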
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   4  38.893 < 2.2e-16 ***
      495                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.13.2 t-test for independent samples

The independent samples t-test is applied when you have continuous data from two independent groups. It evaluates whether there is a significant difference in means between these groups, assuming a normal distribution of the data.

  • Null Hypothesis: The means of the two samples are equal.
  • Prerequisites:
    • Independence
    • Normal Distribution
    • Number of groups \(=2\)
    • equal variances of the groups

First, the variances are compared in order to check if they are equal using the F-Test (as described in Section 4.13.1.1).


    F test to compare two variances

data:  group01 %>% pull("diameter") and group03 %>% pull("diameter")
F = 1.1817, num df = 99, denom df = 99, p-value = 0.4076
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7951211 1.7563357
sample estimates:
ratio of variances 
          1.181736 

With \(p>\alpha = 0.05\) the \(H_0\) is accepted, the variances are equal.

The next step is to check the data for normality using the KS-test (as described in Section 4.10.2).


    Asymptotic one-sample Kolmogorov-Smirnov test

data:  group01 %>% pull("diameter")
D = 0.048142, p-value = 0.9746
alternative hypothesis: two-sided

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  group03 %>% pull("diameter")
D = 0.074644, p-value = 0.6332
alternative hypothesis: two-sided

With \(p>\alpha = 0.05\) the \(H_0\) is accepted, the data seems to be normally distributed.

Figure 4.28: The data within the two groups for comparing the sample means using the t-test for independent samples.

The formal test is then carried out. With \(p<\alpha=0.05\) \(H_0\) is rejected, the data comes from populations with different means.
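Since the F-test indicated equal variances, the classic two-sample t-test with var.equal = TRUE applies; a sketch of the call behind the output below:

t.test(group01$diameter, group03$diameter, var.equal = TRUE)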


    Two Sample t-test

data:  group01 %>% pull(diameter) and group03 %>% pull(diameter)
t = -65.167, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0164554 -0.9567446
sample estimates:
mean of x mean of y 
  12.0155   13.0021 

4.13.3 Welch t-test for independent samples

Similar to the independent samples t-test, the Welch t-test is used for continuous data with two independent groups (WELCH 1947). However, it is employed when there are unequal variances between the groups, relaxing the assumption of equal variances in the standard t-test.

  • Null Hypothesis: The means of the two samples are equal.
  • Prerequisites:
    • Independence
    • Normal Distribution
    • Number of groups \(=2\)

First, the variances are compared in order to check if they are equal using the F-Test (as described in Section 4.13.1.1).


    F test to compare two variances

data:  group01 %>% pull("diameter") and group02 %>% pull("diameter")
F = 0.34904, num df = 99, denom df = 99, p-value = 3.223e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2348504 0.5187589
sample estimates:
ratio of variances 
         0.3490426 

With \(p<\alpha = 0.05\) \(H_0\) is rejected and \(H_a\) is accepted. The variances are different.

Using the KS-test (see Section 4.10.2) the data is checked for normality.


    Asymptotic one-sample Kolmogorov-Smirnov test

data:  group01 %>% pull("diameter")
D = 0.048142, p-value = 0.9746
alternative hypothesis: two-sided

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  group02 %>% pull("diameter")
D = 0.067403, p-value = 0.7539
alternative hypothesis: two-sided

With \(p>\alpha = 0.05\) \(H_0\) is accepted, the data seems to be normally distributed.

Figure 4.29: The data within the two groups for comparing the sample means using the Welch-test for independent samples.

Then, the formal test is carried out.


    Welch Two Sample t-test

data:  group01 %>% pull(diameter) and group02 %>% pull(diameter)
t = -15.887, df = 160.61, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3912592 -0.3047408
sample estimates:
mean of x mean of y 
  12.0155   12.3635 

With \(p<\alpha = 0.05\) we reject \(H_0\), the data seems to be coming from different population means, even though the variances are overlapping (and different).

4.13.4 Mann-Whitney U test

For non-normally distributed data or small sample sizes, the Mann-Whitney U test serves as a non-parametric alternative to the independent samples t-test (Mann and Whitney 1947). It assesses whether there is a significant difference in medians between two independent groups.

  • Null Hypothesis: The medians of the two samples are equal.
  • Prerequisites:
    • Independence
    • no specific distribution (non-parametric)
    • Number of groups \(=2\)
Figure 4.30: The data within the two groups for comparing the sample medians using the Mann-Whitney-U Test.

This time a graphical method to check for normality is employed (QQ-plot, see Section 4.10.1). From Figure 4.31 it is clear that the data is not normally distributed. Furthermore, the variances seem to be unequal as well.

Figure 4.31: The data within the two groups for comparing the sample medians using the Mann-Whitney-U Test.

Then, the formal test is carried out. With \(p<\alpha = 0.05\) \(H_0\) is rejected, the true location shift is not equal to \(0\).
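In R the Mann-Whitney U test is provided by wilcox.test(); a sketch of the call behind the output below (mw_data is an assumed long-format data frame with the columns diameter and group):

wilcox.test(diameter ~ group, data = mw_data)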


    Wilcoxon rank sum test with continuity correction

data:  diameter by group
W = 7396, p-value = 4.642e-09
alternative hypothesis: true location shift is not equal to 0

4.13.5 t-test for paired samples

The paired samples t-test is suitable when you have continuous data from two related groups or repeated measures. It helps determine if there is a significant difference in means between the related groups, assuming normally distributed data.

  • Null Hypothesis: The true mean difference is equal to 0.
  • Prerequisites:
    • Paired Data
    • Normal Distribution
    • equal variances
    • Number of groups \(=2\)

Using the F-Test, the variances are compared.


    F test to compare two variances

data:  diameter by timepoint
F = 1, num df = 9, denom df = 9, p-value = 1
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2483859 4.0259942
sample estimates:
ratio of variances 
                 1 

With \(p>\alpha = 0.05\) \(H_0\) is accepted, the variances are equal.

Using a QQ-plot the data is checked for normality.

Without a formal test, the data is assumed to be normally distributed.

Figure 4.32: A boxplot of the data, showing the connections between the datapoints.

The formal test is then carried out.
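In base R the paired test could be written as follows (a sketch; t0 and t1 are assumed diameter vectors for the two time points in the same subject order, while the table below comes from a tidy wrapper around the same test):

t.test(t0, t1, paired = TRUE)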

# A tibble: 1 × 8
  .y.      group1 group2    n1    n2 statistic    df           p
* <chr>    <chr>  <chr>  <int> <int>     <dbl> <dbl>       <dbl>
1 diameter t0     t1        10    10     -13.4     9 0.000000296

With \(p<\alpha = 0.05\) \(H_0\) is rejected, the treatment changed the properties of the product.

4.13.6 Wilcoxon signed rank test

For non-normally distributed data or situations involving paired samples, the Wilcoxon signed rank test is a non-parametric alternative to the paired samples t-test. It evaluates whether there is a significant difference in medians between the related groups.

  • Null Hypothesis: The true median difference is equal to 0.
  • Prerequisites:
    • Paired Data
    • Number of groups \(=2\)

# A tibble: 1 × 7
  .y.      group1 group2    n1    n2 statistic       p
* <chr>    <chr>  <chr>  <int> <int>     <dbl>   <dbl>
1 diameter t0     t1        20    20        25 0.00169

4.14 Test 2 Variables (> 2 Groups)

Figure 4.33: Statistical tests for one variable.

4.14.1 Analysis of Variance (ANOVA) - Basic Idea

ANOVA’s ability to compare multiple groups or factors makes it widely applicable across diverse fields for analyzing variance and understanding relationships within data. In the context of the engineering sciences, applications of ANOVA include:

  1. Experimental Design and Analysis: Engineers often conduct experiments to optimize processes, test materials, or evaluate designs. ANOVA aids in analyzing these experiments by assessing the effects of various factors (like temperature, pressure, or material composition) on the performance of systems or products. It helps identify significant factors and their interactions to improve engineering processes.

  2. Product Testing and Reliability: Engineers use ANOVA to compare the performance of products manufactured under different conditions or using different materials. This analysis helps ensure product reliability by identifying which factors significantly impact product quality, durability, or functionality.

  3. Process Control and Improvement: ANOVA plays a crucial role in quality control and process improvement within engineering. It helps identify variations in manufacturing processes, such as assessing the impact of machine settings or production methods on product quality. By understanding these variations, engineers can make informed decisions to optimize processes and minimize defects.

  4. Supply Chain and Logistics: In engineering logistics and supply chain management, ANOVA aids in analyzing the performance of different suppliers or transportation methods. It helps assess variations in delivery times, costs, or product quality across various suppliers or logistical approaches.

  5. Simulation and Modeling: In computational engineering, ANOVA is used to analyze the outputs of simulations or models. It helps understand the significance of different input variables on the output, enabling engineers to refine models and simulations for more accurate predictions.

Figure 4.34: The basic idea of an ANOVA.

Across such fields ANOVA is often used to:

Comparing Means: ANOVA is employed when comparing means between three or more groups. It assesses whether there are statistically significant differences among the means of these groups. For instance, in an experiment testing the effect of different fertilizers on plant growth, ANOVA can determine if there’s a significant difference in growth rates among the groups treated with various fertilizers.

Modeling Dependencies: ANOVA can be extended to model dependencies among variables in more complex designs. For instance, in factorial ANOVA, it’s used to study the interaction effects among multiple independent variables on a dependent variable. This allows researchers to understand how different factors might interact to influence an outcome.

Measurement System Analysis (MSA): ANOVA is integral in MSA to evaluate the variation contributed by different components of a measurement system. In assessing the reliability and consistency of measurement instruments or processes, ANOVA helps in dissecting the total variance into components attributed to equipment variation, operator variability, and measurement error.

As with statistical tests before, the applicability of the ANOVA depends on various factors.

4.14.1.1 Sum of squared error (SSE)

The sum of squared errors is a statistical measure used to assess the goodness of fit of a model to its data. It is calculated by squaring the differences between the observed values and the values predicted by the model for each data point, then summing up these squared differences. The SSE indicates the total variability or dispersion of the observed data points around the fitted regression line or model. Lower SSE values generally indicate a better fit of the model to the data.

\[\begin{align} SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \label{sse} \end{align}\]

Figure 4.35: A graphical depiction of the SSE.

4.14.1.2 Mean squared error (MSE)

The mean squared error is a measure used to assess the average squared difference between the predicted and actual values in a dataset. It is frequently employed in regression analysis to evaluate the accuracy of a predictive model. The MSE is calculated by taking the average of the squared differences between predicted values and observed values. A lower MSE indicates that the model’s predictions are closer to the actual values, reflecting better accuracy.

\[\begin{align} MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \label{mse} \end{align}\]
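As a minimal sketch, both quantities can be computed from the residuals of a fitted model (a hypothetical linear model on made-up data):

set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 3)   # hypothetical response with noise

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)     # sum of squared errors, see the SSE equation above
mse <- sse / length(y)           # mean squared error as defined above
c(SSE = sse, MSE = mse)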

4.14.2 One-way ANOVA

The one-way analysis of variance (ANOVA) is used for continuous data with three or more independent groups. It assesses whether there are significant differences in means among these groups, assuming a normal distribution.

  • Null Hypothesis: All population means are equal.
  • Prerequisites:
    • equal variances
    • Number of groups \(>2\)
    • One response, one predictor variable
Figure 4.36: The basic idea of a One-way ANOVA.

The most important prerequisite for a One-way ANOVA is equal variances. Because there are more than two groups, the Bartlett test (as introduced in Section 4.13.1.2) is chosen (the data is normally distributed).


    Bartlett test of homogeneity of variances

data:  diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16

Because \(p<\alpha = 0.05\) the variances are different.

Figure 4.37: The groups with equal variance are highlighted.

    Bartlett test of homogeneity of variances

data:  diameter by group
Bartlett's K-squared = 2.7239, df = 2, p-value = 0.2562

With \(p>\alpha=0.05\) \(H_0\) is accepted, the variances of group01, group02 and group03 are equal.

Of course, many software packages provide an automated way of performing a One-way ANOVA, but the calculation will first be explained in detail. The general model for a One-way ANOVA is shown in \(\eqref{onewayanova}\).

\[\begin{align} Y \sim X + \epsilon \label{onewayanova} \end{align}\]

  • \(H_0\): All population means are equal.
  • \(H_a\): Not all population means are equal.

For a One-way ANOVA the predictor variable \(X\) is the mean (\(\bar{x}\)) of all datapoints \(x_i\).

First, the SSE and the MSE are calculated for the complete model (\(H_a\) is true), see Table 4.16. In the complete model, a separate mean is calculated for every group, and the \(SSE\) is computed according to \(\eqref{sse}\).

Figure 4.38: Computation of error for the complete model (mean per group as model)
Figure 4.39: Computation of error for the reduced model (overall mean as model)
Table 4.16: The SSE and MSE for the complete model.
sse df n p mse
3.150 297.000 300.000 3.000 0.011

Then, the SSE and the MSE are calculated for the reduced model (\(H_0\) is true). In the reduced model, the mean is not calculated per group; instead, the overall mean is used (results in Table 4.17).

Table 4.17: The SSE and MSE from the reduced model.
sse df n p mse
121.506 299.000 300.000 1.000 0.406

The \(SSE\), \(df\) and \(MSE\) explained by the complete model are calculated:

\[\begin{align} SSE_{explained} &= SSE_{reduced}-SSE_{complete} = 118.36 \\ df_{explained} &= df_{reduced} - df_{complete} = 2 \\ MSE_{explained} &= \frac{SSE_{explained}}{df_{explained}} = 59.18 \end{align}\]

The ratio of the variance (MSE) explained by the complete model to that of the reduced model is then calculated, yielding the F-statistic. The probability of observing this statistic if \(H_0\) is true is calculated afterwards.

[1] 2.762026e-236

The probability of observing an F-statistic of \(F = 5579.207\) under \(H_0\) is practically \(0\).

A crosscheck with an automated solution (the aov function) yields the results shown in Table 4.18.
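A sketch of that call (assuming a long-format data frame ds_reduced that contains only group01, group02 and group03):

fit <- aov(diameter ~ group, data = ds_reduced)
summary(fit)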

Table 4.18: The ANOVA results from the aov function.
term df sumsq meansq statistic p.value
group 2.000 118.356 59.178 5,579.207 0.000
Residuals 297.000 3.150 0.011 NA NA

Some sanity checks are of course required to ensure the validity of the results. First, the variance of the residuals must be equal along the groups (see Figure 4.40).

Figure 4.40: The variances of the residuals.

Also, the residuals from the model must be normally distributed (see Figure 4.41).

Figure 4.41: The distribution of the residuals.

The model seems to be valid (equal variances of residuals, normal distributed residuals).

With \(p<\alpha = 0.05\) \(H_0\) can be rejected, the means come from different populations.

4.14.3 Welch ANOVA

Welch ANOVA: Similar to one-way ANOVA, the Welch ANOVA is employed when there are unequal variances between the groups being compared. It relaxes the assumption of equal variances, making it suitable for situations where variance heterogeneity exists.

  • Null Hypothesis: All population means are equal.
  • Prerequisites:
    • Number of groups \(>2\)
    • One response, one predictor variable

The Welch ANOVA drops the prerequisite of equal variances in groups. Because there are more than two groups, the Bartlett test (as introduced in Section 4.13.1.2) is chosen (data is normally distributed).


    Bartlett test of homogeneity of variances

data:  diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16

With \(p<\alpha = 0.05\) \(H_0\) can be rejected, the variances are not equal.
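The Welch ANOVA corresponds to oneway.test() with var.equal = FALSE (a sketch; drive_shaft_data is the assumed long-format data frame):

oneway.test(diameter ~ group, data = drive_shaft_data, var.equal = FALSE)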

The ANOVA table for the Welch ANOVA is shown in Table 4.19.

Table 4.19: The ANOVA results from the ANOVA Welch Test (not assuming equal variances).
num.df   den.df    statistic   p.value   method
4.000    215.085   3,158.109   0.000     One-way analysis of means (not assuming equal variances)

4.14.4 Kruskal Wallis

Kruskal-Wallis Test: When dealing with non-normally distributed data, the Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA. It is used to evaluate whether there are significant differences in medians among three or more independent groups.

In this example the drive shaft strength is measured using three-point bending. Three different methods are employed to increase the strength of the drive shaft.

Figure 4.42: The mechanical Background for a three-point bending test
  • Method A: baseline material
  • Method B: different geometry
  • Method C: different material

In Figure 4.43 the raw drive shaft strength data for Method A, B and C is shown. At first glance, the data does not appear to be normally distributed.

Figure 4.43: The raw data from the drive shaft strength testing.

In Figure 4.44 the visual test for normal distribution is performed. The data does not appear to be normally distributed.

Figure 4.44: The qq-plot for the drive shaft strength testing data.

The Kruskal-Wallis test is then carried out. With \(p< \alpha = 0.05\) it is shown that the groups come from different populations. The next step is to find which of the groups differ using a post-hoc analysis.


    Kruskal-Wallis rank sum test

data:  strength by group
Kruskal-Wallis chi-squared = 107.65, df = 2, p-value < 2.2e-16

The Kruskal-Wallis test (like the ANOVA) can only tell you whether there is a significant difference between the groups, not which groups are different. Post-hoc tests can determine this, but must be used with a correction for multiple testing (see Tamhane 1977).
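A sketch of the pairwise post-hoc comparison behind the output below, using the Bonferroni correction for multiple testing:

pairwise.wilcox.test(kw_shaft_data$strength,
                     kw_shaft_data$group,
                     p.adjust.method = "bonferroni")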


    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  kw_shaft_data$strength and kw_shaft_data$group 

         Method_A Method_B
Method_B < 2e-16  -       
Method_C 6.8e-14  2.0e-10 

P value adjustment method: bonferroni 

Because \(p<\alpha = 0.05\) for every pairwise comparison, it can be concluded that all groups differ from each other.

4.14.5 repeated measures ANOVA

Repeated Measures ANOVA: The repeated measures ANOVA is applicable when you have continuous data with multiple measurements within the same subjects or units over time. It is used to assess whether there are significant differences in means over the repeated measurements, under the assumptions of sphericity and normal distribution.

In this example, the diameter of \(n = 20\) drive shafts is measured after three different steps.

  • Before Machining
  • After Machining
  • After Inspection
Figure 4.45: The raw data for the repeated measures ANOVA.

First, outliers are identified. There is no strict rule for identifying outliers; in this case a classical measure based on the interquartile range (IQR) is applied according to \(\eqref{outlierrule}\).

\[\begin{align} \text{outlier} &= \begin{cases} x_i & >Q3 + 1.5 \cdot IQR \\ x_i & <Q1 - 1.5 \cdot IQR \end{cases} \label{outlierrule} \end{align}\]

# A tibble: 1 × 5
  timepoint        Subject_ID diameter is.outlier is.extreme
  <chr>            <fct>         <dbl> <lgl>      <lgl>     
1 After_Inspection 15             12.9 TRUE       FALSE     
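The rule in \(\eqref{outlierrule}\) could be implemented as follows (a sketch on a hypothetical vector of diameters):

set.seed(1)
x <- c(rnorm(19, mean = 12, sd = 0.1), 12.9)   # hypothetical diameters with one extreme value

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- IQR(x)

is_outlier <- x > q3 + 1.5 * iqr | x < q1 - 1.5 * iqr
x[is_outlier]                                  # values flagged as outliers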

A check for normality is done employing the Shapiro-Wilk test (Shapiro and Wilk 1965).

timepoint variable statistic p
After_Inspection diameter 0.968 0.727
After_Machining diameter 0.954 0.456
Before_Machining diameter 0.968 0.741

The next step is to check the dataset for sphericity, i.e., whether the variances of the differences between all pairs of timepoints are equal. For this the Mauchly test for sphericity is employed (Mauchly 1940).

     Effect     W     p p<.05
1 timepoint 0.927 0.524      

With \(p>\alpha = 0.05\) \(H_0\) is accepted, the sphericity assumption holds. Otherwise, sphericity corrections must be applied (Greenhouse and Geisser 1959).

The next step is to perform the repeated measures ANOVA, which yields the following results.

Effect DFn DFd F p p<.05 ges
timepoint 2.000 36.000 18.081 0.000 * 0.444

With \(p<\alpha = 0.05\) \(H_0\) is rejected, the different timepoints yield different diameters. Which groups are different is then determined using a post-hoc test, including a correction for the significance level (Bonferroni 1936).

In this case, the assumptions for a t-test are met, so the pairwise t-test can be used.

group1 group2 n1 n2 statistic df p p.adj signif
After_Inspection After_Machining 19 19 0.342 18 0.736 1.000 ns
After_Inspection Before_Machining 19 19 −4.803 18 0.000 0.000 ***
After_Machining Before_Machining 19 19 −6.283 18 0.000 0.000 ****

With \(p<\alpha = 0.05\) \(H_0\) is rejected for the comparisons Before_Machining - After_Machining and After_Inspection - Before_Machining. It can therefore be concluded that the machining has a significant influence on the diameter, whereas the inspection has none.

4.14.6 Friedman test

The Friedman test is a non-parametric alternative to repeated measures ANOVA (Friedman 1937). It is utilized when dealing with non-normally distributed data and multiple measurements within the same subjects. This test helps determine if there are significant differences in medians over the repeated measurements.

The same data as for the repeated measures ANOVA will be used.
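In base R the test can be written with a blocking formula (a sketch; rm_data with the columns diameter, timepoint and Subject_ID is assumed from the repeated measures example, while the table below comes from a tidy wrapper around the same test):

friedman.test(diameter ~ timepoint | Subject_ID, data = rm_data)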

.y. n statistic df p method
diameter 20.000 16.900 2.000 0.000 Friedman test

With \(p<\alpha = 0.05\) \(H_0\) is rejected; the timepoints have a significant effect on the drive shaft diameter.


  1. A quantile is a statistical concept used to divide a dataset into equal-sized subsets or intervals.↩︎