Gradient Descent

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Use Cases: General

  1. Machine Learning: Used to train models like neural networks, improving accuracy by minimizing error functions.

  2. Deep Learning: Vital for optimizing complex neural network architectures in tasks like image recognition and natural language processing.

  3. Reinforcement Learning: Enables agents to learn optimal strategies by updating policies or value functions in interaction with environments.

  4. Optimization Problems: Applies to various domains, including physics, engineering, finance, and healthcare, to minimize costs or maximize utility functions.

  5. Computer Vision: Key for training convolutional neural networks to classify objects, detect features, and segment images accurately.

  6. Finance and Economics: Helps in financial modeling, algorithmic trading, and risk management by optimizing trading strategies and pricing financial instruments.

  7. Healthcare and Biology: Utilized in drug discovery, genomic analysis, and medical image analysis to optimize models for diagnosis and treatment planning.

Use Cases: Statistical Learning

  1. Parameter Estimation: Used in maximum likelihood and maximum a posteriori estimation to find model parameters that best fit observed data.

  2. Regression: Employed in linear and logistic regression to minimize errors between observed and predicted values.

  3. Regularization: Implements L1 and L2 regularization to prevent overfitting in models.

  4. Dimensionality Reduction: Utilized in PCA and factor analysis to reduce data dimensionality while preserving information.

  5. Clustering: Optimizes objective functions in k-means and Gaussian mixture models for better cluster formation.

  6. Survival Analysis: Estimates parameters in survival models like Cox proportional hazards model for analyzing time-to-event data.

The idea

The loss function - formulation

  • The loss function is at the heart of the gradient descent algorithm. We start with linear regression.

\[\begin{align} \hat{y}(x_i) = \theta_0 + \theta_1 x_i \end{align}\]

  • The loss function needs to be differentiable and convex, which is satisfied for the MSE.

\[\begin{align} J(\theta_0,\theta_1) = \overbrace{\frac{1}{n}\sum_{i=1}^n(\hat{y}(x_i)-y_i)^2}^{\text{MSE}} \end{align}\]
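As a quick numerical check, the MSE can be written as a small R function. The helper name `mse` and the toy data below are illustrative, not from the slides:

```r
# MSE of a candidate (theta_0, theta_1) on toy data -- illustrative sketch
mse <- function(theta_0, theta_1, x, y) {
  pred <- theta_0 + theta_1 * x   # model predictions
  mean((pred - y)^2)              # mean squared residual
}

x <- c(1, 2, 3)
y <- c(3, 5, 7)      # generated exactly by y = 1 + 2x
mse(1, 2, x, y)      # true parameters -> loss 0
mse(0, 0, x, y)      # poor parameters -> large loss (83/3)
```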

The loss function - partial derivative

To solve for the parameters, we differentiate the MSE and get the partial derivatives:

\[\begin{align} \frac{\partial \text{MSE}}{\partial \theta_0} &= -\frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i) \\ \frac{\partial \text{MSE}}{\partial \theta_1} &= -\frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i) x_i \end{align}\]

\(\theta_0\) and \(\theta_1\) are then iteratively updated:

\[\begin{align} \theta_0 &\leftarrow \theta_0 - \alpha \frac{\partial \text{MSE}}{\partial \theta_0} \\ \theta_0 &\leftarrow \theta_0 + \alpha \left( \frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i) \right)\\ \theta_1 &\leftarrow \theta_1 - \alpha \frac{\partial \text{MSE}}{\partial \theta_1}\\ \theta_1 &\leftarrow \theta_1 + \alpha \left( \frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i) x_i \right) \end{align}\]
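A quick way to trust these derivatives is a finite-difference check. In the sketch below, the helper names (`mse`, `grad`) and the simulated data are my own; it compares the analytic gradient to central differences:

```r
# Compare the analytic MSE gradient to central finite differences.
# Helper names (mse, grad) and the simulated data are illustrative.
mse  <- function(th0, th1, x, y) mean((y - (th0 + th1 * x))^2)
grad <- function(th0, th1, x, y) {
  res <- y - (th0 + th1 * x)
  c(-2 * mean(res),       # dMSE/dtheta_0
    -2 * mean(res * x))   # dMSE/dtheta_1
}

set.seed(1)
x <- rnorm(20)
y <- 5 + 2 * x + rnorm(20)
h <- 1e-6
num <- c((mse(1 + h, 1, x, y) - mse(1 - h, 1, x, y)) / (2 * h),
         (mse(1, 1 + h, x, y) - mse(1, 1 - h, x, y)) / (2 * h))
all.equal(num, grad(1, 1, x, y))   # TRUE up to floating-point noise
```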

The learning rate \(\alpha\)

  • \(\alpha\) is a hyperparameter: it is set before training, not learned from the data
  • it scales the gradient and thereby controls the step size of each parameter update
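The choice of \(\alpha\) decides between convergence and divergence. A minimal sketch on the quadratic \(J(\theta)=\theta^2\) (a toy example of my own, not from the slides):

```r
# Gradient descent on J(theta) = theta^2, whose gradient is 2*theta.
# Toy example: the function, start value and step count are illustrative.
descend <- function(alpha, theta = 10, steps = 25) {
  for (i in 1:steps) theta <- theta - alpha * 2 * theta
  theta
}
descend(0.1)   # small alpha: iterates shrink toward the minimum at 0
descend(1.1)   # too large: each step overshoots and |theta| blows up
```

Here each step multiplies \(\theta\) by \(1 - 2\alpha\), so the iteration converges only when \(|1 - 2\alpha| < 1\), i.e. \(0 < \alpha < 1\) for this particular loss.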

Classroom

Algorithm

\begin{algorithm}
\caption{Gradient descent}
\begin{algorithmic}
\Require cost function $J(\theta)$, learning rate $\alpha$, number of iterations $n$
\State initialize $\theta$ randomly
\Procedure{GradientDescent}{$J(\theta), \theta, \alpha, n$}
\For{$i = 1$ \To $n$}
\For{$j = 1$ \To number of training examples $m$}
\State \textbf{Compute the gradient of $J$ wrt $\theta$:}
\State gradient $= \nabla J(\theta, x^j, y^j)$
\State \textbf{Update the parameters $\theta$:}
\State $\theta = \theta - \alpha \times \text{gradient}$
\EndFor
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}
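The inner loop above updates \(\theta\) once per training example, i.e. stochastic gradient descent. A direct R transcription might look like this (the function and variable names, and the simulated data, are my own):

```r
# Per-example (stochastic) updates, mirroring the pseudocode's double loop.
sgd <- function(x, y, alpha = 0.01, n_iter = 50) {
  theta <- c(0, 0)                        # start (the pseudocode uses random theta)
  for (it in 1:n_iter) {
    for (j in seq_along(x)) {             # one update per training example
      pred <- theta[1] + theta[2] * x[j]
      g <- c(-2 * (y[j] - pred),          # d(y_j - pred)^2 / dtheta_0
             -2 * (y[j] - pred) * x[j])   # d(y_j - pred)^2 / dtheta_1
      theta <- theta - alpha * g
    }
  }
  theta
}

set.seed(42)
x <- rnorm(200)
y <- 5 + 2 * x + rnorm(200)
sgd(x, y)   # settles near the true c(5, 2)
```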

Simulation

\[\begin{align} Y(x_i) = \theta_0 + \theta_1 x_i \end{align}\]

set.seed(1000) #reproducibility

theta_0 <- 5 # parameter 1
theta_1 <- 2 # parameter 2
n_obs <- 500 # number of datapoints
x <- rnorm(n_obs) # independent variable
y <- theta_1*x + theta_0 + rnorm(n_obs, 0, 3) # simulate data
rm(theta_0, theta_1) # get rid of ground truth

The data

Core functions and parameters

gradient_desc <- function(theta_0, theta_1, x, y){
  N <- length(x)
  pred <- theta_1*x + theta_0           # current predictions
  res <- y - pred                       # residuals
  delta_theta_0 <- (2/N)*sum(res)       # = -dMSE/dtheta_0
  delta_theta_1 <- (2/N)*sum(res*x)     # = -dMSE/dtheta_1
  return(c(delta_theta_0, delta_theta_1))
}
minimize_function <- function(theta_0, theta_1, x, y, alpha){
  gd <- gradient_desc(theta_0, theta_1, x, y)
  d_theta_0 <- gd[1] * alpha            # scaled step for theta_0
  d_theta_1 <- gd[2] * alpha            # scaled step for theta_1
  new_theta_0 <- theta_0 + d_theta_0    # '+' because the deltas are -gradients
  new_theta_1 <- theta_1 + d_theta_1
  return(c(new_theta_0, new_theta_1))
}
alpha <- 0.1
iter <- 100

Optimization

res <- list()
res[[1]] <- c(0, 0) # start at theta_0 = 0, theta_1 = 0

for (i in 2:iter){
  res[[i]] <- minimize_function(
    res[[i-1]][1], res[[i-1]][2], x, y, alpha
  )
}

MSE vs. iteration
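The convergence curve on this slide can be reproduced by evaluating the loss at every iterate. This standalone sketch regenerates the simulated data so it runs on its own:

```r
# MSE at each gradient-descent iterate (same setup as the simulation above).
set.seed(1000)
x <- rnorm(500)
y <- 2 * x + 5 + rnorm(500, 0, 3)
theta <- c(0, 0)
alpha <- 0.1
mse_path <- numeric(100)
for (i in 1:100) {
  res <- y - (theta[1] + theta[2] * x)
  mse_path[i] <- mean(res^2)                                  # loss before the update
  theta <- theta + alpha * c(2 * mean(res), 2 * mean(res * x))
}
plot(mse_path, type = "l", xlab = "iteration", ylab = "MSE")
tail(mse_path, 1)   # settles near the noise variance (sd = 3, so about 9)
```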

The models

Different learning rates
