Density Based Spatial Clustering and Application with Noise (DBSCAN)

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Use Cases

Classical cluster algorithms work best for:

spherical
convex
compact and well separated data

DBSCAN groups data points based on density, making it effective for identifying clusters of various shapes and sizes.

Core Points: Points with at least a minimum number of neighbors (MinPts) within a specified radius (ε).
Border Points: Points within the ε radius of core points but with fewer neighbors.
Noise Points: Outliers that don’t belong to any cluster.

Clusters are formed by connecting core points and their reachable neighbors, separating high-density regions from low-density areas.

(Ester et al. 1996)

The clustering problem

(Ester et al. 1996)

the idea

where it works

we have kmeans don’t we?

DBSCAN

Direct density reachable:: A point “A” is directly density reachable from another point “B” if i) “A” is in the \(\varepsilon\)-neighborhood of “B” and ii) “B” is a core point.
Density reachable:: A point “A” is density reachable from “B” if there are a set of core points leading from “B” to “A.
Density connected:: Two points “A” and “B” are density connected if there are a core point “C”, such that both “A” and “B” are density reachable from “C”.

lets get to it

the role of hyperparameters

A hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.

\(\epsilon\): Reachability Distance
\(MinPts\): Reachability minimum number of points

(Ester et al. 1996)

\(MinPts\)

\(\epsilon\)

parameter study

silhouette computation

finding eps: knn \(\rightarrow k = MinPts\)

\(MinPts?\)

\(MinPts\) is selected based on the domain knowledge.
If you do not have domain understanding, a rule of thumb is to derive \(MinPts\) from the number of dimensions \(D\) in the data set.
- \(MinPts \geq D + 1\)
For 2D data, take \(MinPts = 4\).
For larger datasets, with much noise, it suggested to go with \(MinPts = 2 * D\).

Turns out …

… knowing what you do helps!

Application for Accuracy - Technical Cleaniness

Workflow

TECSA Filter

Output

Repeatability?

Challenge

In order to judge the process we need repeated measurements FROM the actual process.

… so we need to find the same particles again!

particles

clustering?

parameter search

Noise or Cluster?

Null Filter?

Spread?

length +/- sd

Cpk distribution

Six Sigma and Big Data

References

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proc. Of 2nd International Conference on Knowledge Discovery and, 226–31.