
4 Inferential Statistics
Inferential statistics involves making predictions, generalizations, or inferences about a population based on a sample of data. These techniques are used when researchers want to draw conclusions beyond the specific data they have collected. Inferential statistics help answer questions about relationships, differences, and associations within a population.
4.1 Hypothesis Testing - Basics

Null Hypothesis (H0): This is the default or status quo assumption. It represents the belief that there is no significant change, effect, or difference in the production process. It is often denoted as a statement of equality (e.g., the mean production rate is equal to a certain value).
Alternative Hypothesis (Ha): This is the claim or statement we want to test. It represents the opposite of the null hypothesis, suggesting that there is a significant change, effect, or difference in the production process (e.g., the mean production rate is not equal to a certain value).
4.1.1 The drive shaft exercise - Hypotheses
During the QC of the drive shaft \(n=100\) samples are taken and the diameter is measured with an accuracy of \(\pm 0.01mm\). Is the true mean of all produced drive shafts within the specification?
For this we can formulate the hypotheses.
- H0:
- The drive shaft diameter is within the specification.
- Ha:
- The drive shaft diameter is not within the specification.
In the following, we will explore how to test these hypotheses.
4.2 Confidence Intervals
A Confidence Interval is a statistical concept used to estimate a range of values within which a population parameter, such as a population mean or proportion, is likely to fall. It provides a way to express the uncertainty or variability in our sample data when making inferences about the population. In other words, it quantifies the level of confidence we have in our estimate of a population parameter.
Confidence intervals are typically expressed as a range with an associated confidence level. The confidence level, often denoted as \(1-\alpha\), represents the probability that the calculated interval contains the true population parameter. Common confidence levels include \(90\%\), \(95\%\), and \(99\%\).
There are different ways of calculating a CI, depending on the parameter of interest and on what is known about the population.
- For the population mean \(\mu_0\) when the population standard deviation \(\sigma_0\) is known (\(\eqref{ci01}\)).
\[\begin{align} CI = \bar{X} \pm Z \frac{\sigma_0}{\sqrt{n}} \label{ci01} \end{align}\]
\(\bar{X}\) is the sample mean.
\(Z\) is the critical value from the standard normal distribution corresponding to the desired confidence level (e.g., \(1.96\) for a \(95\%\) confidence interval).
\(\sigma_0\) is the population standard deviation
\(n\) is the sample size
- For the population mean \(\mu_0\) when the population standard deviation \(\sigma_0\) is unknown (t-confidence interval), see \(\eqref{ci02}\).
\[\begin{align} CI = \bar{X} \pm t \frac{sd}{\sqrt{n}} \label{ci02} \end{align}\]
\(\bar{X}\) is the sample mean.
\(t\) is the critical value from the t-distribution with \(n-1\) degrees of freedom corresponding to the desired confidence level
\(sd\) is the sample standard deviation
\(n\) is the sample size
- For a population proportion p, see \(\eqref{ci03}\).
\[\begin{align} CI = \hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \label{ci03} \end{align}\]
\(\hat{p}\) is the sample proportion
\(Z\) is the critical value from the standard normal distribution corresponding to the desired confidence level
\(n\) is the sample size
- The method for calculating confidence intervals may vary depending on the estimated parameter. When estimating a population median or the difference between two population means, other statistical techniques may be used.
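To make the first two intervals concrete, here is a minimal R sketch on hypothetical data (none of the object names or numbers come from the text): it computes the t-based interval for a mean and the normal-approximation interval for a proportion.

# Minimal sketch with made-up data: CI for a mean (sigma unknown) and for a proportion
set.seed(1)
x <- rnorm(100, mean = 12, sd = 0.1)           # hypothetical diameter measurements
alpha  <- 0.05
n      <- length(x)
t_crit <- qt(1 - alpha / 2, df = n - 1)
mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)  # t-confidence interval for the mean

p_hat  <- 35 / 100                             # hypothetical sample proportion
z_crit <- qnorm(1 - alpha / 2)
p_hat + c(-1, 1) * z_crit * sqrt(p_hat * (1 - p_hat) / 100)  # CI for a proportion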
4.2.1 The drive shaft exercise - Confidence Intervals

The \(95\%\) CI for the drive shaft data is shown in Figure 4.2. For comparison, the histogram with an overlaid density curve is plotted. The highlighted area marks the lower and upper CI bounds; the calculated mean is shown as a dashed line.
4.3 Significance Level
The significance level \(\alpha\) is a critical component of hypothesis testing in statistics. It represents the maximum acceptable probability of making a Type I error, which is the error of rejecting a null hypothesis when it is actually true. In other words, \(\alpha\) is the probability of concluding that there is an effect or relationship when there isn’t one. Commonly used significance levels include \(0.05 (5\%)\), \(0.01 (1\%)\), and \(0.10 (10\%)\). The choice of \(\alpha\) depends on the context of the study and the desired balance between making correct decisions and minimizing the risk of Type I errors.
4.4 False negative - risk
The risk of a false negative outcome is called the \(\beta\)-risk. It is calculated using statistical power analysis. Statistical power is the probability of correctly rejecting a null hypothesis when it is false, which is the complement of \(\beta\).
\[\begin{align} \beta = 1 - \text{Power} \end{align}\]
4.5 Power Analysis
Statistical power is calculated using software, statistical tables, or calculators specifically designed for this purpose. Generally speaking: the greater the statistical power, the stronger the evidence the study provides for retaining or rejecting \(H_0\). Power analysis is also very useful for determining the sample size before the actual experiments are conducted. Below is an example of a power calculation for a two-sample t-test.
\[ \text{Power} = 1 - \beta = P\left(\frac{{|\bar{X}_1 - \bar{X}_2|}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} > Z_{\frac{\alpha}{2}} - \frac{\delta}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}\right) \]
Effect Size: This represents the magnitude of the effect you want to detect. Larger effects are easier to detect than smaller ones.
Significance Level (\(\alpha\)): This is the predetermined level of significance that defines how confident you want to be in rejecting the null hypothesis (e.g., typically set at 0.05).
Sample Size (\(n\)): The number of observations or participants in your study. Increasing the sample size generally increases the power of the test.
Power (\(1 - \beta\)): This is the probability of correctly rejecting the null hypothesis when it is false. Higher power is desirable, as it minimizes the chances of a Type II error (failing to detect a true effect).
Type I Error (\(\alpha\)): The probability of incorrectly rejecting the null hypothesis when it is true. This is typically set at \(0.05\) or \(5\%\) in most studies.
Type II Error (\(\beta\)): The probability of failing to reject the null hypothesis when it is false. Power is the complement of \(\beta\) (\(Power = 1 - \beta\)).

- H0:
- The coin is fair and lands heads \(50\%\) of the time.
- Ha:
- The coin is loaded and lands heads more than \(50\%\) of the time.
library(pwr)   # provides ES.h() and pwr.p.test()
pwr.p.test(h = ES.h(p1 = 0.75, p2 = 0.50),
sig.level = 0.05,
power = 0.80,
alternative = "greater")
proportion power calculation for binomial distribution (arcsine transformation)
h = 0.5235988
n = 22.55126
sig.level = 0.05
power = 0.8
alternative = greater
The sample size \(n = 23\) means that with \(23\) coin flips the statistical power is \(80\%\) at an \(\alpha = 0.05\) significance level (\(\beta = 1-\text{power} = 0.2 = 20\%\)). But what if the sample size varies? This is the subject of Figure 4.4. On the x-axis the power is shown (or the \(\beta\)-risk on the upper x-axis), whereas the sample size \(n\) is depicted on the y-axis. To increase the power by \(10\%\) to \(90\%\), the sample size must be increased by \(11\). A further power increase of \(5\%\) would in turn require a sample size of \(n = 40\). This highlights the non-linear nature of power calculations and why they are important for experimental planning.
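The power curve in Figure 4.4 can, in principle, be reproduced with the pwr package (assumed to be installed); the sketch below evaluates the power over a range of sample sizes and extracts the values for \(n = 23\), \(34\) and \(40\) discussed above.

library(pwr)
h  <- ES.h(p1 = 0.75, p2 = 0.50)     # effect size as in the example above
n  <- 5:60
pw <- sapply(n, function(ni)
  pwr.p.test(h = h, n = ni, sig.level = 0.05, alternative = "greater")$power)
data.frame(n = n, power = round(pw, 3))[n %in% c(23, 34, 40), ]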


4.5.1 A word on Effect Size
Cohen (Cohen 2013) describes effect size as “the degree to which the null hypothesis is false.” In the coin flipping example, this is the difference between \(75\%\) and \(50\%\). We could say the effect was 25% but recall we had to transform the absolute difference in proportions to another quantity using the ES.h function. This is a crucial part of doing power analysis correctly: An effect size must be provided on the expected scale. Doing otherwise will produce wrong sample size and power calculations.
When in doubt, Conventional Effect Sizes can be used. These are pre-determined effect sizes for “small”, “medium”, and “large” effects, see Cohen (2013).
4.6 p-value

The p-value is a statistical measure that quantifies the evidence against a null hypothesis. It represents the probability of obtaining test results as extreme or more extreme than the ones observed, assuming the null hypothesis is true. In hypothesis testing, a smaller p-value indicates stronger evidence against the null hypothesis. If the p-value is less than or equal to \(\alpha\) (\(p \leq \alpha\)), you reject the null hypothesis. If the p-value is greater than \(\alpha\) ( \(p > \alpha\) ), you fail to reject the null hypothesis. A common threshold for determining statistical significance is to reject the null hypothesis when \(p\leq\alpha\).
The p-value, however, says nothing about the effect size, which can be practically negligible (Nuzzo 2014). The p-value is not the probability that \(H_a\) is true, nor is it a measure of the magnitude or relative importance of an effect. Therefore the CI and the effect size should always be reported together with a p-value. Some researchers even claim that most of the research today is false (Ioannidis 2005). In practice, especially in the manufacturing industry, the p-value and its use remain popular. Before implementing any measures in a series production, these questions will be asked. The confident and reliable engineer asks them beforehand and is always their own greatest critic.
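A short simulation (hypothetical data, not the drive shaft dataset) illustrates this point: with a large sample even a negligible effect produces a very small p-value, which is exactly why the CI and the effect size should be reported as well.

set.seed(2)
x <- rnorm(1e5, mean = 0.02, sd = 1)   # true effect of only 0.02 standard deviations
res <- t.test(x, mu = 0)
res$p.value    # far below 0.05, i.e. "significant"
res$estimate   # ... yet the estimated effect is tiny
res$conf.int   # the CI makes the small magnitude explicit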
4.7 Statistical errors

- Type I Error (False Positive, see Figure 4.7):
A Type I error occurs when a null hypothesis that is actually true is rejected. In other words, it’s a false alarm. It is concluded that there is a significant effect or difference when there is none. The probability of committing a Type I error is denoted by the significance level \(\alpha\). Example: Imagine a drug trial where the null hypothesis is that the drug has no effect (it’s ineffective), but due to random chance, the data appears to show a significant effect, and you incorrectly conclude that the drug is effective (Type I error).
- Type II Error (False Negative, see Figure 4.7):
A Type II error occurs when a null hypothesis that is actually false is not rejected. It means failing to detect a significant effect or difference when one actually exists. The probability of committing a Type II error is denoted by the symbol \(\beta\). Example: In a criminal trial, the null hypothesis might be that the defendant is innocent, but they are actually guilty. If the jury fails to find enough evidence to convict the guilty person, it is a Type II error.
A Type I error is falsely concluding that there is an effect or difference when there is none (false positive). A Type II error is failing to conclude that there is an effect or difference when there actually is one (false negative).
The relationship between Type I and Type II errors is often described as a trade-off. As the risk of Type I errors is reduced by lowering the significance level (\(\alpha\)), the risk of Type II errors (\(\beta\)) is typically increased (Figure 4.6). This trade-off is inherent in hypothesis testing, and the choice of significance level depends on the specific goals and context of the study. Researchers often aim to strike a balance between these two types of errors based on the consequences and costs associated with each. This balance is a critical aspect of the design and interpretation of statistical tests.
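The trade-off can be made tangible with a short simulation; this is a sketch with arbitrary settings (sample size, effect size) that are not taken from the original text.

set.seed(3)
alpha <- 0.05
p_h0 <- replicate(2000, t.test(rnorm(30, mean = 0.0), mu = 0)$p.value)   # H0 true
p_h1 <- replicate(2000, t.test(rnorm(30, mean = 0.3), mu = 0)$p.value)   # H0 false
mean(p_h0 < alpha)    # empirical Type I error rate, approximately alpha
mean(p_h1 >= alpha)   # empirical Type II error rate (beta) for this effect size
mean(p_h1 >= 0.01)    # a stricter alpha lowers the Type I risk but raises beta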
4.8 Parametric and Non-parametric Tests
Parametric and non-parametric tests in statistics are methods used for analyzing data. The primary difference between them lies in the assumptions they make about the underlying data distribution:
- Parametric Tests:
- These tests assume that the data follows a specific probability distribution, often the normal distribution.
- Parametric tests make assumptions about population parameters like means and variances.
- They are more powerful when the data truly follows the assumed distribution.
- Examples of parametric tests include t-tests, ANOVA, regression analysis, and parametric correlation tests.
- Non-Parametric Tests:
- Non-parametric tests make minimal or no assumptions about the shape of the population distribution.
- They are more robust and can be used when data deviates from a normal distribution or when dealing with ordinal or nominal data.
- Non-parametric tests are generally less powerful compared to parametric tests but can be more reliable in certain situations.
- Examples of non-parametric tests include the Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test, and Spearman’s rank correlation.
The choice between parametric and non-parametric tests depends on the nature of the data and the assumptions. Parametric tests are appropriate when data follows the assumed distribution, while non-parametric tests are suitable when dealing with non-normally distributed data or ordinal data. Some examples for parametric and non-parametric tests are given in Table 4.1.
Parametric Tests | Non-Parametric Tests |
---|---|
One-sample t-test | One-sample Wilcoxon test |
Paired t-test | Wilcoxon signed rank test |
Two-sample t-test | Mann-Whitney U test |
One-Way ANOVA | Kruskal-Wallis test |
4.9 Paired and Independent Tests

- Paired Statistical Test:
- Paired tests are used when there is a natural pairing or connection between two sets of data points. This pairing is often due to repeated measurements on the same subjects or entities.
- They are designed to assess the difference between two related samples, such as before and after measurements on the same group of individuals.
- The key idea is to reduce variability by considering the differences within each pair, which can increase the test sensitivity.
- Independent Statistical Test:
- Independent tests, also known as unpaired or two-sample tests, are used when there is no inherent pairing between the two sets of data.
- These tests are typically applied to compare two separate and unrelated groups or samples.
- They assume that the data in each group is independent of the other, meaning that the value in one group doesn’t affect the value in the other group.
An example of a paired test is comparing data from the same group at two different points in time (see Figure 4.8).
4.10 Distribution Tests
The importance of testing for normality (or other distributions) lies in the fact that various statistical techniques, such as parametric tests (e.g., t-tests, ANOVA), rely on distributional assumptions, most commonly normality. When data deviates significantly from a normal distribution, using these parametric methods can lead to incorrect conclusions and biased results. Therefore, it is essential to determine how a dataset is approximately distributed before applying such techniques.
Several tests for normality are available, with the most common ones being the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the Anderson-Darling test. These tests provide a quantitative measure of how well the data conforms to a normal distribution.
In practice, it is important to interpret the results of these tests cautiously. Sometimes, a minor departure from normality may not affect the validity of parametric tests, especially when the sample size is large. In such cases, using non-parametric methods may be an alternative. However, in cases where normality assumptions are crucial, transformations of the data or choosing appropriate non-parametric tests may be necessary to ensure the reliability of statistical analyses.
Tests for normality do not free you from the burden of thinking for yourself.
4.10.1 Quantile-Quantile plots
Quantile-Quantile plots are a graphical tool used in statistics to assess whether a dataset follows a particular theoretical distribution, typically the normal distribution. They provide a visual comparison between the observed quantiles1 of the data and the quantiles expected from the chosen theoretical distribution.
A step-by-step explanation of how QQ plots work:
4.10.1.1 Sample data
In Table 4.2 \(n=10\) datapoints are shown as a sample dataset.
x | smpl_no |
---|---|
-0.56047565 | 1 |
-0.23017749 | 2 |
1.55870831 | 3 |
0.07050839 | 4 |
0.12928774 | 5 |
1.71506499 | 6 |
0.46091621 | 7 |
-1.26506123 | 8 |
-0.68685285 | 9 |
-0.44566197 | 10 |
4.10.1.2 Data Sorting
To create a QQ plot, the data must be sorted in ascending order.
x | smpl_no |
---|---|
-1.26506123 | 8 |
-0.68685285 | 9 |
-0.56047565 | 1 |
-0.44566197 | 10 |
-0.23017749 | 2 |
0.07050839 | 4 |
0.12928774 | 5 |
0.46091621 | 7 |
1.55870831 | 3 |
1.71506499 | 6 |
4.10.1.3 Theoretical Quantiles
Theoretical quantiles are calculated based on the chosen distribution (e.g., the normal distribution). These quantiles represent the expected values if the data perfectly follows that distribution.
x | smpl_no | x_norm | x_thrtcl |
---|---|---|---|
-1.26506123 | 8 | -1.404601888 | 0.08006985 |
-0.68685285 | 9 | -0.798376211 | 0.21232610 |
-0.56047565 | 1 | -0.665875352 | 0.25274539 |
-0.44566197 | 10 | -0.545498338 | 0.29270541 |
-0.23017749 | 2 | -0.319572479 | 0.37464622 |
0.07050839 | 4 | -0.004316756 | 0.49827787 |
0.12928774 | 5 | 0.057310762 | 0.52285118 |
0.46091621 | 7 | 0.405008410 | 0.65726434 |
1.55870831 | 3 | 1.555994430 | 0.94014529 |
1.71506499 | 6 | 1.719927421 | 0.95727718 |
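For reference, one common way to obtain theoretical normal quantiles in R is sketched below. The seed is an assumption (it reproduces values matching Table 4.2), and the exact convention behind the x_norm and x_thrtcl columns in Table 4.3 may differ from this sketch.

set.seed(123)
x <- rnorm(10)                    # sample data as in Table 4.2 (assumed seed)
x_sorted <- sort(x)               # data sorting, Section 4.10.1.2
p        <- ppoints(length(x))    # plotting positions (one common convention)
x_thrtcl <- qnorm(p)              # theoretical quantiles of the standard normal
cbind(x_sorted, x_thrtcl)
# qqnorm(x); qqline(x)            # the equivalent base-R QQ plot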
4.10.1.4 Plotting Points

For each data point, a point is plotted in the QQ plot. The x-coordinate of the point corresponds to the theoretical quantile, and the y-coordinate corresponds to the observed quantile from the data, see Figure 4.9.
4.10.1.5 Perfect Normal Distribution

In the case of a perfect normal distribution, all the points would fall along a straight line at a 45-degree angle. If the data deviates from normality, the points may deviate from this line in specific ways, see Figure 4.10.
4.10.1.6 Interpretation

Deviations from the straight line suggest departures from the assumed distribution. For example, if points curve upward, it indicates that the data has heavier tails than a normal distribution. If points curve downward, it suggests lighter tails. S-shaped curves or other patterns can reveal additional information about the data’s distribution. In Figure 4.11 the QQ-points are shown together with the respective QQ-line and a line of perfectly normal distributed points. Some deviations can be seen, but it is hard to judge, if the data is normally distributed or not.
4.10.1.7 Confidence Interval

Because it is hard to judge from Figure 4.11 whether the points are normally distributed, it makes sense to add limits for normally distributed points. This is shown in Figure 4.12. The gray area depicts the \(95\%\) confidence bands for a normal distribution. All the points, as well as the line, fall within this area, which indicates that the points are likely to be normally distributed.
4.10.1.8 The drive shaft exercise

The QQ plot method is extended to the drive shaft exercise in Figure 4.13. In each subplot the plot for the respective group is shown together with the QQ-points, the QQ-line and the respective confidence bands. The scaling for each plot is different to enhance the visibility of every subplot. A line for the perfect normal distribution is also shown with a solid linestyle. For groups \(1 \ldots 4\) all points fall within the QQ confidence bands. Group05 differs, however: the points form visible categories (steps), which is a strong indicator that the measurement system may be too inaccurate.
4.10.2 Quantitative Methods

The Kolmogorov-Smirnov test for normality, often referred to as the KS test, is a statistical test used to assess whether a dataset follows a normal distribution. It evaluates how closely the cumulative distribution function of the dataset matches the expected CDF of a normal distribution.
Null Hypothesis (H0): The null hypothesis in the KS test states that the sample data follows a normal distribution.
Alternative Hypothesis (Ha): The alternative hypothesis suggests that the sample data significantly deviates from a normal distribution.
Test Statistic (D): The KS test calculates a test statistic, denoted as D which measures the maximum vertical difference between the empirical CDF of the data and the theoretical CDF of a normal distribution. It quantifies how far the observed data diverges from the expected normal distribution. A visualization of the KS-test is shown in Figure 4.14. The red line denotes a perfect normal distribution, whereas the step function shows the empirical CDF of the data itself.
Critical Value: To assess the significance of D, a critical value is determined based on the sample size and the chosen significance level (\(\alpha\)). If D exceeds the critical value, it indicates that the dataset deviates significantly from a normal distribution.
Decision: If D is greater than the critical value, the null hypothesis is rejected, and it is concluded that the data is not normally distributed. If D is less than or equal to the critical value, there is not enough evidence to reject the null hypothesis, suggesting that the data may follow a normal distribution.
It is important to note that the KS test is sensitive to departures from normality in both tails of the distribution. There are other normality tests, like the Shapiro-Wilk test and Anderson-Darling test, which may be more suitable in certain situations. Researchers typically choose the most appropriate test based on the characteristics of their data and the assumptions they want to test.
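A hedged sketch of both tests in R on hypothetical data follows; note the usual caveat that estimating the distribution parameters from the same sample makes the KS test conservative.

set.seed(123)
x <- rnorm(100, mean = 12, sd = 0.1)              # hypothetical diameter data
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))   # KS test against a fitted normal
shapiro.test(x)                                   # Shapiro-Wilk test for comparison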
4.10.3 Expanding to non-normal distributions


The QQ-plot can easily be extended to non-normal distributions as well. This is shown in Figure 4.15. In Figure 4.15 (a) a classic QQ-plot for Figure 2.25 is shown. The same rules as before still apply; they are only extended to the Weibull distribution. In Figure 4.15 (b) a detrended QQ-plot is shown in order to account for visual bias. It is of course known that the data follows a Weibull distribution with a shape parameter \(\beta=2\) and a scale parameter \(\lambda = 500\), but such distributional parameters can also be estimated (Delignette-Muller and Dutang 2015).
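A sketch of how such an estimate could be obtained with the fitdistrplus package cited above; the data is simulated with the stated parameters and all object names are assumptions.

library(fitdistrplus)   # Delignette-Muller and Dutang (2015); assumed installed
set.seed(123)
lifetime <- rweibull(500, shape = 2, scale = 500)   # simulated Weibull data
fit <- fitdist(lifetime, "weibull")
fit$estimate    # estimated shape and scale, close to 2 and 500
qqcomp(fit)     # QQ plot against the fitted Weibull distribution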
4.11 Test 1 Variable

4.11.1 One Proportion Test
Category | Count | Total | plt_lbl |
---|---|---|---|
A | 35 | 100 | 35 counts 100 trials |
B | 20 | 100 | 20 counts 100 trials |
The one proportion test is used on categorical data with a binary outcome, such as success or failure. Its prerequisite is having a known or hypothesized population proportion that the sample proportion shall be compared to. This test helps determine if the sample proportion significantly differs from the population proportion, making it valuable for studies involving proportions and percentages.
estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | alternative |
---|---|---|---|---|---|---|---|
0.350 | 0.200 | 4.915 | 0.027 | 1.000 | 0.018 | 0.282 | two.sided |
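The column names in the table above (estimate1/estimate2, conf.low/conf.high) suggest a comparison of the two proportions from the counts shown earlier, possibly tidied with a helper such as broom::tidy(); a hedged sketch of that call and of a genuine one-proportion test:

prop.test(x = c(35, 20), n = c(100, 100))   # compares the proportions of A and B
# A one-proportion test against a hypothesized value, e.g. p0 = 0.25 (assumed):
prop.test(x = 35, n = 100, p = 0.25)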
4.11.2 Chi2 goodness of fit test
group | count_n_observed |
---|---|
group01 | 100.000 |
group02 | 100.000 |
group03 | 100.000 |
group04 | 100.000 |
group05 | 100.000 |
statistic | p.value | parameter |
---|---|---|
0.000 | 1.000 | 4.000 |
The \(\chi^2\) goodness-of-fit test (GOF) is applied to categorical data with expected frequencies. It is suitable for analyzing nominal or ordinal data. This test assesses whether there is a significant difference between the observed and expected frequencies in your dataset, making it useful for determining if the data fits an expected distribution.
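A sketch of the corresponding call on the group counts shown above; the assumption of equal expected frequencies is mine.

observed <- c(group01 = 100, group02 = 100, group03 = 100,
              group04 = 100, group05 = 100)
chisq.test(observed, p = rep(1/5, 5))   # equal expected frequencies
# identical observed counts give X-squared = 0 and p = 1, matching the result above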
4.11.3 One-sample t-test
The one-sample t-test is designed for continuous data when you have a known or hypothesized population mean that you want to compare your sample mean to. It relies on the assumption of normal distribution, making it applicable when assessing whether a sample’s mean differs significantly from a specified population mean.
The test can be applied in various settings. One is to test whether measured data comes from a population with a certain mean (for example, a test against a specification). To show the application, the drive shaft data is employed. In Table 4.9 the per-group summarised drive shaft data is shown.
group | mean_diameter | sd_diameter |
---|---|---|
group01 | 12.015 | 0.111 |
group02 | 12.364 | 0.189 |
group03 | 13.002 | 0.102 |
group04 | 11.486 | 0.094 |
group05 | 12.001 | 0.026 |
One important prerequisite for the one-sample t-test is normally distributed data. For this, graphical and numerical methods have been introduced in previous chapters. First, a classic QQ-plot is created for every group (see Figure 4.17). At first glance, the data appears to be normally distributed.

A more quantitative approach to tests for normality is shown in Table 4.10. Here, each group is tested with the KS-test for normality. H0 is accepted (the data is normally distributed) because the computed p-value is larger than the significance level (\(\alpha = 0.05\)).
group | statistic | p.value | method | alternative |
---|---|---|---|---|
There is sufficient evidence to assume normally distributed data within each group. The next step is to test whether the data comes from a certain population mean (\(\mu_0\)). In this case, the population mean is the specification of the drive shaft diameter of \(12\,mm\).
group | estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|
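Since the result rows are not reproduced here, a hedged sketch of how such a per-group test could be run is given below. The data is a simulated stand-in based on the rounded summary values in Table 4.9, and all object and column names are assumptions.

library(dplyr)   # assumed available
library(broom)
set.seed(42)
shaft <- tibble(
  group    = rep(sprintf("group%02d", 1:5), each = 100),
  diameter = rnorm(500,
                   mean = rep(c(12.02, 12.36, 13.00, 11.49, 12.00), each = 100),
                   sd   = rep(c(0.11, 0.19, 0.10, 0.09, 0.03), each = 100)))
shaft %>%
  group_by(group) %>%
  group_modify(~ tidy(t.test(.x$diameter, mu = 12)))   # test each group against 12 mm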
4.11.4 One sample Wilcoxon test
For situations where your data may not follow a normal distribution or when dealing with ordinal data, the one-sample Wilcoxon test is a non-parametric alternative to the t-test. It is used to evaluate whether a sample’s median significantly differs from a specified population median.
The wear and tear of drive shafts can occur due to various factors related to the vehicle’s operation and maintenance. Some common causes include:
Normal Usage: Over time, the drive shaft undergoes stress and strain during regular driving. This can lead to gradual wear on components, especially if the vehicle is frequently used.
Misalignment: Improper alignment of the drive shaft can result in uneven distribution of forces, causing accelerated wear. This misalignment may stem from issues with the suspension system or other related components.
Lack of Lubrication: Inadequate lubrication of the drive shaft joints and bearings can lead to increased friction, accelerating wear. Regular maintenance, including proper lubrication, is essential to mitigate this factor.
Contamination: Exposure to dirt, debris, and water can contribute to the degradation of drive shaft components. Contaminants can infiltrate joints and bearings, causing abrasive damage over time.
Vibration and Imbalance: Excessive vibration or imbalance in the drive shaft can lead to increased stress on its components. This may result from issues with the balance of the rotating parts or damage to the shaft itself.
Extreme Operating Conditions: Harsh driving conditions, such as off-road terrain or constant heavy loads, can accelerate wear on the drive shaft. The components may be subjected to higher levels of stress than they were designed for, leading to premature wear and tear.
The wear and tear caused by the reasons above can be rated on a scale with discrete values from \(1 \ldots 5\), with \(2\) being the reference value. It is therefore of interest whether the wear and tear rating of \(n=100\) drive shafts per group differs significantly from the reference value \(2\). Because we are dealing with discrete data, the one-sample t-test cannot be used.

group | statistic | p.value | alternative |
---|---|---|---|
group | t_tidy_p.value | wilcox_tidy_p.value |
---|---|---|
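A minimal sketch with simulated ratings (the distribution of the ratings is an assumption):

set.seed(7)
rating <- sample(1:5, size = 100, replace = TRUE,
                 prob = c(0.10, 0.40, 0.30, 0.15, 0.05))   # hypothetical wear ratings
wilcox.test(rating, mu = 2, exact = FALSE)   # test the location against the reference 2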
4.12 Test 2 Variable (Qualitative or Quantitative)

4.12.1 Cochrane’s Q-test
Cochran’s Q test is employed when you have categorical data with three or more related groups, often collected over time or with repeated measurements. It assesses if there is a significant difference in proportions between the related groups.
4.12.2 Chi2 test of independence
This test is appropriate when you have two categorical variables, and you want to determine if there is an association between them. It is useful for assessing whether the two variables are dependent or independent of each other.
In the context of the drive shaft production the example assumes a dataset with categorical variables like “Defects” (Yes/No) and “Operator” (Operator A/B).
4.12.2.1 Contingency tables
A contingency table, also known as a cross-tabulation or crosstab, is a statistical table that displays the frequency distribution of variables. It organizes data into rows and columns to show the frequency or relationship between two or more categorical variables. Each cell in the table represents the count or frequency of occurrences that fall into a specific combination of categories for the variables being analyzed. It is commonly used in statistics to examine the association between categorical variables and to understand patterns within data sets.
Defects | Operator A | Operator B |
---|---|---|
4.12.2.2 test results
With \(p\approx1>0.05\) the \(p\)-value is greater than the significance level of \(\alpha = 0.05\). \(H_0\) therefore cannot be rejected; there is no evidence for a difference between the operators. The test results are depicted below.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 0, df = 1, p-value = 1
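A sketch of how such a result could be produced. The counts below are assumptions chosen only to reproduce the X-squared = 0 output above, since the original rows of the contingency table are not preserved.

# counts are hypothetical; the original contingency table rows were lost
contingency_table <- matrix(c(10, 90, 11, 89), nrow = 2,
                            dimnames = list(Defects  = c("Yes", "No"),
                                            Operator = c("Operator A", "Operator B")))
contingency_table
chisq.test(contingency_table)   # Yates' continuity correction is applied for 2x2 tables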
4.12.3 Correlation

Correlation refers to a statistical measure that describes the relationship between two variables. It indicates the extent to which changes in one variable are associated with changes in another.
Correlation is measured on a scale from -1 to 1:
A correlation of 1 implies a perfect positive relationship, where an increase in one variable corresponds to a proportional increase in the other.
A correlation of -1 implies a perfect negative relationship, where an increase in one variable corresponds to a proportional decrease in the other.
A correlation close to 0 suggests a weak or no relationship between the variables.
Correlation doesn’t imply causation; it only indicates that two variables change together but doesn’t determine if one causes the change in the other.
4.12.3.1 Pearson Corrrelation
The Pearson correlation coefficient is a normalized version of the covariance.
\[\begin{align} R = \frac{\mathrm{Cov}(X,Y)}{\sigma_x \sigma_y} \end{align}\]
- Covariance is sensitive to scale (\(mm\) vs. \(cm\))
- Pearson correlation removes units, allowing for meaningful comparisons across datasets


Pearson's product-moment correlation
data: drive_shaft_rpm_dia$rpm and drive_shaft_rpm_dia$diameter
t = 67.895, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9406732 0.9578924
sample estimates:
cor
0.95
When you have two continuous variables and want to measure the strength and direction of their linear relationship, Pearson correlation is the go-to choice (Pearson 1895). It assumes normally distributed data and is particularly valuable for exploring linear associations between variables and is calculated via \(\eqref{pearcorr}\).
\[\begin{align} R = \frac{\sum_{i = 1}^{n}(x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \label{pearcorr} \end{align}\]
The Pearson correlation coefficient works best with normally distributed data. The normal distribution of the data is verified in Figure 4.21.
4.12.3.2 Spearman Correlation
Spearman (Spearman 1904) correlation is a non-parametric alternative to Pearson correlation. It is used when the data is not normally distributed or when the relationship between variables is monotonic but not necessarily linear.
\[\begin{align} \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \label{spearcorr} \end{align}\]
In Figure 4.23 the example data for a drive shaft production is shown. The Production_Time and the Defects seem to form a relationship, but the data does not appear to be normally distributed. This can also be seen in the QQ-plots of both variables in Figure 4.24.
The Spearman correlation coefficient (\(\rho\)) is based on the Pearson correlation, but applied to ranked data.
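A small sketch with simulated, skewed data illustrates that the Spearman coefficient is simply the Pearson coefficient of the ranks; the variable names are hypothetical stand-ins for Production_Time and Defects.

set.seed(12)
production_time <- rexp(200, rate = 1 / 50)                  # skewed, not normal
defects         <- rpois(200, lambda = production_time / 10) # monotonic relationship
cor(production_time, defects, method = "spearman")
cor(rank(production_time), rank(defects))                    # Pearson on ranks: same value
cor.test(production_time, defects, method = "spearman", exact = FALSE)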


4.12.3.3 Correlation - methodological limits
While correlation analysis and summary statistics are certainly useful, one must always consider the raw data. The data taken from Davies, Locke, and D’Agostino McGowan (2022) showcases this. The summary statistics in Table 4.15 are practically the same; one would not suspect different underlying data. When the raw data is plotted, though (Figure 4.25), it can be seen that the data is highly non-linear, forming distinct shapes and even separate categories.
Always check the raw data.
dataset | mean_x | mean_y | std_dev_x | std_dev_y | corr_x_y |
---|---|---|---|---|---|

4.13 Test 2 Variables (2 Groups)

4.13.1 Test for equal variance (homoscedasticity)

Tests for equal variances, also known as tests for homoscedasticity, are used to determine if the variances of two or more groups or samples are equal. Equal variances are an assumption in various statistical tests, such as the t-test and analysis of variance (ANOVA). When the variances are not equal, it can affect the validity of these tests. Three common tests for equal variances are outlined below, together with their null hypotheses and prerequisites:
4.13.1.1 F-Test (Hahs-Vaughn and Lomax 2013)
- Null Hypothesis: The variances of the different groups or samples are equal.
- Prerequisites:
- Independence
- Normality
- Number of groups \(= 2\)
F test to compare two variances
data: ds_wide$group01 and ds_wide$group03
F = 1.1817, num df = 99, denom df = 99, p-value = 0.4076
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7951211 1.7563357
sample estimates:
ratio of variances
1.181736
4.13.1.2 Bartlett Test (Bartlett 1937)
- Null Hypothesis: The variances of the different groups or samples are equal.
- Prerequisites:
- Independence
- Normality
- Number of groups \(> 2\)
Bartlett test of homogeneity of variances
data: diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16
4.13.1.3 Levene Test (Olkin June)
- Null Hypothesis: The variances of the different groups or samples are equal.
- Prerequisites:
- Independence
- Number of groups \(> 2\)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 38.893 < 2.2e-16 ***
495
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
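A combined sketch of the three tests on simulated data; the group structure, the object names and the use of the car package for leveneTest() are assumptions.

library(car)   # provides leveneTest(); assumed installed
set.seed(21)
d <- data.frame(group    = factor(rep(c("group01", "group02", "group03"), each = 100)),
                diameter = rnorm(300, mean = 12,
                                 sd = rep(c(0.10, 0.20, 0.10), each = 100)))
var.test(diameter ~ group, data = droplevels(subset(d, group != "group02")))  # 2 groups
bartlett.test(diameter ~ group, data = d)                                     # > 2 groups, normal data
leveneTest(diameter ~ group, data = d)                                        # > 2 groups, robust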
4.13.2 t-test for independent samples
The independent samples t-test is applied when you have continuous data from two independent groups. It evaluates whether there is a significant difference in means between these groups, assuming a normal distribution of the data.
- Null Hypothesis: The means of the two samples are equal.
- Prerequisites:
- Independence
- Normal Distribution
- Number of groups \(=2\)
- equal Variances of the groups
First, the variances are compared in order to check if they are equal using the F-Test (as described in Section 4.13.1.1).
F test to compare two variances
data: group01 %>% pull("diameter") and group03 %>% pull("diameter")
F = 1.1817, num df = 99, denom df = 99, p-value = 0.4076
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7951211 1.7563357
sample estimates:
ratio of variances
1.181736
With \(p>\alpha = 0.05\) the \(H_0\) is accepted, the variances are equal.
The next step is to check the data for normality using the KS-test (as described in Section 4.10.2).
Asymptotic one-sample Kolmogorov-Smirnov test
data: group01 %>% pull("diameter")
D = 0.048142, p-value = 0.9746
alternative hypothesis: two-sided
Asymptotic one-sample Kolmogorov-Smirnov test
data: group03 %>% pull("diameter")
D = 0.074644, p-value = 0.6332
alternative hypothesis: two-sided
With \(p>\alpha = 0.05\) the \(H_0\) is accepted, the data seems to be normally distributed.

The formal test is then carried out. With \(p<\alpha=0.05\) \(H_0\) is rejected, the data comes from populations with different means.
Two Sample t-test
data: group01 %>% pull(diameter) and group03 %>% pull(diameter)
t = -65.167, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.0164554 -0.9567446
sample estimates:
mean of x mean of y
12.0155 13.0021
4.13.3 Welch t-test for independent samples
Similar to the independent samples t-test, the Welch t-test is used for continuous data with two independent groups (Welch 1947). However, it is employed when there are unequal variances between the groups, relaxing the assumption of equal variances in the standard t-test.
- Null Hypothesis: The means of the two samples are equal.
- Prerequisites:
- Independence
- Normal Distribution
- Number of groups \(=2\)
First, the variances are compared in order to check if they are equal using the F-Test (as described in Section 4.13.1.1).
F test to compare two variances
data: group01 %>% pull("diameter") and group02 %>% pull("diameter")
F = 0.34904, num df = 99, denom df = 99, p-value = 3.223e-07
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2348504 0.5187589
sample estimates:
ratio of variances
0.3490426
With \(p<\alpha = 0.05\) \(H_0\) is rejected and \(H_a\) is accepted. The variances are different.
Using the KS-test (see Section 4.10.2) the data is checked for normality.
Asymptotic one-sample Kolmogorov-Smirnov test
data: group01 %>% pull("diameter")
D = 0.048142, p-value = 0.9746
alternative hypothesis: two-sided
Asymptotic one-sample Kolmogorov-Smirnov test
data: group02 %>% pull("diameter")
D = 0.067403, p-value = 0.7539
alternative hypothesis: two-sided
With \(p>\alpha = 0.05\) \(H_0\) is accepted, the data seems to be normally distributed.

Then, the formal test is carried out.
Welch Two Sample t-test
data: group01 %>% pull(diameter) and group02 %>% pull(diameter)
t = -15.887, df = 160.61, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3912592 -0.3047408
sample estimates:
mean of x mean of y
12.0155 12.3635
With \(p<\alpha = 0.05\) we reject \(H_0\), the data seems to be coming from different population means, even though the variances are overlapping (and different).
4.13.4 Mann-Whitney U test
For non-normally distributed data or small sample sizes, the Mann-Whitney U test serves as a non-parametric alternative to the independent samples t-test (Mann and Whitney 1947). It assesses whether there is a significant difference in medians between two independent groups.
- Null Hypothesis: The medians of the two samples are equal.
- Prerequisites:
- Independence
- no specific distribution (non-parametric)
- Number of groups \(=2\)

This time a graphical method to check for normality is employed (QQ-plot, see Section 4.10.1). From Figure 4.31 it is clear that the data is not normally distributed. Furthermore, the variances seem to be unequal as well.

Then, the formal test is carried out. With \(p<\alpha = 0.05\) \(H_0\) is rejected, the true location shift is not equal to \(0\).
Wilcoxon rank sum test with continuity correction
data: diameter by group
W = 7396, p-value = 4.642e-09
alternative hypothesis: true location shift is not equal to 0
4.13.5 t-test for paired samples
The paired samples t-test is suitable when you have continuous data from two related groups or repeated measures. It helps determine if there is a significant difference in means between the related groups, assuming normally distributed data.
- Null Hypothesis: The true mean difference is equal to 0.
- Prerequisites:
- Paired Data
- Normal Distribution
- equal variances
- Number of groups \(=2\)
Using the F-Test, the variances are compared.
F test to compare two variances
data: diameter by timepoint
F = 1, num df = 9, denom df = 9, p-value = 1
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2483859 4.0259942
sample estimates:
ratio of variances
1
With \(p>\alpha = 0.05\) \(H_0\) is accepted, the variances are equal.
Using a QQ-plot the data is checked for normality.
Without a formal test, the data is assumed to be normally distributed.

The formal test is then carried out.
# A tibble: 1 × 8
.y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 diameter t0 t1 10 10 -13.4 9 0.000000296
With \(p<\alpha = 0.05\) \(H_0\) is rejected, the treatment changed the properties of the product.
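A self-contained sketch with hypothetical before/after measurements; it also shows that the paired t-test is equivalent to a one-sample t-test on the differences.

set.seed(31)
before <- rnorm(10, mean = 12.9, sd = 0.05)            # hypothetical t0 diameters
after  <- before - rnorm(10, mean = 0.4, sd = 0.05)    # treatment reduces the diameter
t.test(before, after, paired = TRUE)
t.test(before - after, mu = 0)                         # equivalent one-sample test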
4.13.6 Wilcoxon signed rank test
For non-normally distributed data or situations involving paired samples, the Wilcoxon signed rank test is a non-parametric alternative to the paired samples t-test. It evaluates whether there is a significant difference in medians between the related groups.
- Null Hypothesis: The true median difference is equal to 0.
- Prerequisites:
- Paired Data
- Number of groups \(=2\)
# A tibble: 1 × 7
.y. group1 group2 n1 n2 statistic p
* <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 diameter t0 t1 20 20 25 0.00169
4.14 Test 2 Variables (> 2 Groups)

4.14.1 Analysis of Variance (ANOVA) - Basic Idea
ANOVA’s ability to compare multiple groups or factors makes it widely applicable across diverse fields for analyzing variance and understanding relationships within data. In the context of the engineering sciences, applications of ANOVA include:
Experimental Design and Analysis: Engineers often conduct experiments to optimize processes, test materials, or evaluate designs. ANOVA aids in analyzing these experiments by assessing the effects of various factors (like temperature, pressure, or material composition) on the performance of systems or products. It helps identify significant factors and their interactions to improve engineering processes.
Product Testing and Reliability: Engineers use ANOVA to compare the performance of products manufactured under different conditions or using different materials. This analysis helps ensure product reliability by identifying which factors significantly impact product quality, durability, or functionality.
Process Control and Improvement: ANOVA plays a crucial role in quality control and process improvement within engineering. It helps identify variations in manufacturing processes, such as assessing the impact of machine settings or production methods on product quality. By understanding these variations, engineers can make informed decisions to optimize processes and minimize defects.
Supply Chain and Logistics: In engineering logistics and supply chain management, ANOVA aids in analyzing the performance of different suppliers or transportation methods. It helps assess variations in delivery times, costs, or product quality across various suppliers or logistical approaches.
Simulation and Modeling: In computational engineering, ANOVA is used to analyze the outputs of simulations or models. It helps understand the significance of different input variables on the output, enabling engineers to refine models and simulations for more accurate predictions.

Across such fields, ANOVA is often used for:
Comparing Means: ANOVA is employed when comparing means between three or more groups. It assesses whether there are statistically significant differences among the means of these groups. For instance, in an experiment testing the effect of different fertilizers on plant growth, ANOVA can determine if there’s a significant difference in growth rates among the groups treated with various fertilizers.
Modeling Dependencies: ANOVA can be extended to model dependencies among variables in more complex designs. For instance, in factorial ANOVA, it’s used to study the interaction effects among multiple independent variables on a dependent variable. This allows researchers to understand how different factors might interact to influence an outcome.
Measurement System Analysis (MSA): ANOVA is integral in MSA to evaluate the variation contributed by different components of a measurement system. In assessing the reliability and consistency of measurement instruments or processes, ANOVA helps in dissecting the total variance into components attributed to equipment variation, operator variability, and measurement error.
As with statistical tests before, the applicability of the ANOVA depends on various factors.
4.14.1.1 Sum of squared error (SSE)
The sum of squared errors is a statistical measure used to assess the goodness of fit of a model to its data. It is calculated by squaring the differences between the observed values and the values predicted by the model for each data point, then summing up these squared differences. The SSE indicates the total variability or dispersion of the observed data points around the fitted regression line or model. Lower SSE values generally indicate a better fit of the model to the data.
\[\begin{align} SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \label{sse} \end{align}\]

4.14.1.2 Mean squared error (MSE)
The mean squared error is a measure used to assess the average squared difference between the predicted and actual values in a dataset. It is frequently employed in regression analysis to evaluate the accuracy of a predictive model. The MSE is calculated by taking the average of the squared differences between predicted values and observed values. A lower MSE indicates that the model’s predictions are closer to the actual values, reflecting better accuracy.
\[\begin{align} MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \label{mse} \end{align}\]
4.14.2 One-way ANOVA
The one-way analysis of variance (ANOVA) is used for continuous data with three or more independent groups. It assesses whether there are significant differences in means among these groups, assuming a normal distribution.
- Null Hypothesis: True mean difference is equal to 0.
- Prerequisites:
- equal variances
- normally distributed data
- Number of groups \(>2\)
- One response, one predictor variable

The most important prerequisite for a one-way ANOVA is equal variances. Because there are more than two groups, the Bartlett test (as introduced in Section 4.13.1.2) is chosen (the data is normally distributed).
Bartlett test of homogeneity of variances
data: diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16
Because \(p<\alpha = 0.05\) the variances are different.

Bartlett test of homogeneity of variances
data: diameter by group
Bartlett's K-squared = 2.7239, df = 2, p-value = 0.2562
With \(p>\alpha=0.05\) \(H_0\) is accepted, the variances of group01, group02 and group03 are equal.
Of course, many software packages provide an automated way of performing a one-way ANOVA, but the calculation is first explained in detail here. The general model for a one-way ANOVA is shown in \(\eqref{onewayanova}\).
\[\begin{align} Y \sim X + \epsilon \label{onewayanova} \end{align}\]
- \(H_0\): All population means are equal.
- \(H_a\): Not all population means are equal.
For a one-way ANOVA the predictor variable \(X\) is the group membership; the fitted value for each datapoint \(x_i\) is the corresponding group mean (complete model) or the overall mean \(\bar{x}\) (reduced model).
First, the SSE and the MSE are calculated for the complete model (\(H_a\) is true), see Table 4.16. In the complete model, a separate mean is calculated for every group and the \(SSE\) is computed according to \(\eqref{sse}\).


sse | df | n | p | mse |
---|---|---|---|---|
Then, the SSE and the MSE are calculated for the reduced model (\(H_0\) is true). In the reduced model, no per-group means are calculated; only the overall mean is used (results in Table 4.17).
sse | df | n | p | mse |
---|---|---|---|---|
The \(SSE\), \(df\) and \(MSE\) explained by the complete model are calculated:
\[\begin{align} SSE_{explained} &= SSE_{reduced}-SSE_{complete} = 118.36 \\ df_{explained} &= df_{reduced} - df_{complete} = 2 \\ MSE_{explained} &= \frac{SSE_{explained}}{df_{explained}} = 59.18 \end{align}\]
The F-statistic is then calculated as the ratio of the variance explained by the complete model (\(MSE_{explained}\)) to the residual variance of the complete model (\(MSE_{complete}\)). Afterwards, the probability of observing such a statistic if \(H_0\) is true is calculated.
[1] 2.762026e-236
The probability of an F-statistic of \(F = 5579.207\) under \(H_0\) is practically \(0\) (\(p \approx 2.8 \times 10^{-236}\)).
A crosscheck with an automated solution (the aov function) yields the results shown in Table 4.18.
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
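To make the complete-vs-reduced-model logic above concrete, here is a self-contained sketch on simulated (not drive shaft) data; the aov() call at the end serves as the crosscheck described above.

set.seed(41)
d <- data.frame(group = factor(rep(c("g1", "g2", "g3"), each = 100)),
                y     = rnorm(300, mean = rep(c(12.0, 12.4, 13.0), each = 100), sd = 0.1))
fit_complete <- ave(d$y, d$group)                 # per-group means (complete model)
sse_c <- sum((d$y - fit_complete)^2); df_c <- nrow(d) - nlevels(d$group)
sse_r <- sum((d$y - mean(d$y))^2);    df_r <- nrow(d) - 1      # overall mean (reduced model)
F_stat <- ((sse_r - sse_c) / (df_r - df_c)) / (sse_c / df_c)
pf(F_stat, df1 = df_r - df_c, df2 = df_c, lower.tail = FALSE)  # p-value under H0
summary(aov(y ~ group, data = d))                              # crosscheck: same F and p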
Some sanity checks are of course required to ensure the validity of the results. First, the variance of the residuals must be equal along the groups (see Figure 4.40).

Also, the residuals from the model must be normally distributed (see Figure 4.41).

The model seems to be valid (equal variances of the residuals, normally distributed residuals).
With \(p<\alpha = 0.05\) \(H_0\) can be rejected, the means come from different populations.
4.14.3 Welch ANOVA
Welch ANOVA: Similar to one-way ANOVA, the Welch ANOVA is employed when there are unequal variances between the groups being compared. It relaxes the assumption of equal variances, making it suitable for situations where variance heterogeneity exists.
- Null Hypothesis: All population means are equal.
- Prerequisites:
- Number of groups \(>2\)
- One response, one predictor variable
The Welch ANOVA drops the prerequisite of equal variances in the groups. Because there are more than two groups, the Bartlett test (as introduced in Section 4.13.1.2) is used to check whether the variances are indeed unequal (the data is normally distributed).
Bartlett test of homogeneity of variances
data: diameter by group
Bartlett's K-squared = 275.61, df = 4, p-value < 2.2e-16
With \(p<\alpha = 0.05\) \(H_0\) can be rejected, the variances are not equal.
The ANOVA table for the Welch ANOVA is shown in Table 4.19.
num.df | den.df | statistic | p.value | method |
---|---|---|---|---|
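In base R, a Welch ANOVA can be obtained with oneway.test() and var.equal = FALSE; the sketch below uses simulated data with deliberately unequal variances (the table above may have been produced by a different helper function).

set.seed(51)
d <- data.frame(group = factor(rep(c("g1", "g2", "g3"), each = 50)),
                y     = rnorm(150, mean = rep(c(12.0, 12.3, 12.5), each = 50),
                              sd   = rep(c(0.05, 0.15, 0.30), each = 50)))
oneway.test(y ~ group, data = d, var.equal = FALSE)   # Welch ANOVA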
4.14.4 Kruskal Wallis
Kruskal-Wallis Test: When dealing with non-normally distributed data, the Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA. It is used to evaluate whether there are significant differences in medians among three or more independent groups.
In this example the drive strength is measured using three-point bending. Three different methods are employed to increase the strength of the drive shaft.

- Method A: baseline material
- Method B: different geometry
- Method C: different material
In Figure 4.43 the raw drive shaft strength data for Method A, B and C is shown. At first glance, the data does not appear to be normally distributed.

In Figure 4.44 the visual test for normal distribution is performed. The data does not appear to be normally distributed.

The Kruskal-Wallis test is then carried out. With \(p<\alpha = 0.05\) it is shown that at least one group comes from a population with a different median. The next step is to find which of the groups differ using a post-hoc analysis.
Kruskal-Wallis rank sum test
data: strength by group
Kruskal-Wallis chi-squared = 107.65, df = 2, p-value < 2.2e-16
The Kruskal-Wallis test (like the ANOVA) can only tell you whether there is a significant difference between the groups, not which groups are different. Post-hoc tests are able to determine this, but must be used with a correction for multiple testing (see Tamhane 1977).
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: kw_shaft_data$strength and kw_shaft_data$group
Method_A Method_B
Method_B < 2e-16 -
Method_C 6.8e-14 2.0e-10
P value adjustment method: bonferroni
Because \(p<\alpha = 0.05\) for every pairwise comparison, it can be concluded that all groups differ from each other.
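A sketch of the corresponding calls on simulated, Weibull-distributed strength data; the distribution parameters and object names are assumptions.

set.seed(61)
kw_demo <- data.frame(
  group    = factor(rep(c("Method_A", "Method_B", "Method_C"), each = 60)),
  strength = rweibull(180, shape = 2, scale = rep(c(400, 550, 480), each = 60)))
kruskal.test(strength ~ group, data = kw_demo)
pairwise.wilcox.test(kw_demo$strength, kw_demo$group, p.adjust.method = "bonferroni")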
4.14.5 repeated measures ANOVA
Repeated Measures ANOVA: The repeated measures ANOVA is applicable when you have continuous data with multiple measurements within the same subjects or units over time. It is used to assess whether there are significant differences in means over the repeated measurements, under the assumptions of sphericity and normal distribution.
In this example, the diameter of \(n = 20\) drive shafts is measured at three different stages.
- Before Machining
- After Machining
- After Inspection

First, outliers are identified. There is no strict rule for identifying outliers; in this case a classical criterion is applied according to \(\eqref{outlierrule}\).
\[\begin{align} \text{outlier} &= \begin{cases} x_i & >Q3 + 1.5 \cdot IQR \\ x_i & <Q1 - 1.5 \cdot IQR \end{cases} \label{outlierrule} \end{align}\]
# A tibble: 1 × 5
timepoint Subject_ID diameter is.outlier is.extreme
<chr> <fct> <dbl> <lgl> <lgl>
1 After_Inspection 15 12.9 TRUE FALSE
A check for normality is done employing the Shapiro-Wilk test (Shapiro and Wilk 1965).
timepoint | variable | statistic | p |
---|---|---|---|
The next step is to check the dataset for sphericity, i.e. whether the variances of the differences between all pairs of timepoints are equal. For this, the Mauchly test for sphericity is employed (Mauchly 1940).
Effect W p p<.05
1 timepoint 0.927 0.524
With \(p>\alpha = 0.05\) \(H_0\) is accepted, the sphericity assumption holds. Otherwise, sphericity corrections must be applied (Greenhouse and Geisser 1959).
The next step is to perform the repeated measures ANOVA, which yields the following results.
Effect | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|
With \(p<\alpha = 0.05\) \(H_0\) is rejected, the different timepoints yield different diameters. Which groups are different is then determined using a post-hoc test, including a correction for the significance level (Bonferroni 1936).
In this case, the assumptions for a t-test are met, so the pairwise t-test can be used.
group1 | group2 | n1 | n2 | statistic | df | p | p.adj | signif |
---|---|---|---|---|---|---|---|---|
With \(p<\alpha = 0.05\) \(H_0\) is rejected for the comparisons Before_Machining - After_Machining and After_Inspection - Before_Machining. It can therefore be concluded that the machining has a significant influence on the diameter, whereas the inspection has none.
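The tables above appear to come from a dedicated helper package; as an alternative, a repeated measures ANOVA can be sketched in base R with an Error() term. All data below is simulated and the effect structure (machining reduces the diameter, inspection does not) is an assumption mirroring the conclusion above.

set.seed(71)
rm_data <- data.frame(
  Subject_ID = factor(rep(1:20, times = 3)),
  timepoint  = factor(rep(c("Before_Machining", "After_Machining", "After_Inspection"),
                          each = 20)),
  diameter   = 12.5 + rep(c(0.4, 0.0, 0.0), each = 20) + rnorm(60, sd = 0.05))
summary(aov(diameter ~ timepoint + Error(Subject_ID / timepoint), data = rm_data))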
4.14.6 Friedman test
The Friedman test is a non-parametric alternative to repeated measures ANOVA (Friedman 1937). It is utilized when dealing with non-normally distributed data and multiple measurements within the same subjects. This test helps determine if there are significant differences in medians over the repeated measurements.
The same data as for the repeated measures ANOVA will be used.
.y. | n | statistic | df | p | method |
---|---|---|---|---|---|
With \(p<\alpha = 0.05\) \(H_0\) is rejected; the timepoints have a significant influence on the drive shaft diameter.
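A base-R sketch on simulated data, where the rows of the matrix are the subjects (blocks) and the columns are the timepoints (groups); the values are hypothetical.

set.seed(81)
m <- cbind(Before_Machining = rnorm(20, mean = 12.9, sd = 0.05),
           After_Machining  = rnorm(20, mean = 12.5, sd = 0.05),
           After_Inspection = rnorm(20, mean = 12.5, sd = 0.05))
friedman.test(m)   # rows = subjects (blocks), columns = timepoints (groups)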
A quantile is a statistical concept used to divide a dataset into equal-sized subsets or intervals.↩︎