Density Based Spatial Clustering and Application with Noise (DBSCAN)

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Use Cases

Classical cluster algorithms work best for:

  • spherical
  • convex
  • compact and well separated data

DBSCAN groups data points based on density, making it effective for identifying clusters of various shapes and sizes.

  1. Core Points: Points with at least a minimum number of neighbors (MinPts) within a specified radius (ε).
  2. Border Points: Points within the ε radius of core points but with fewer neighbors.
  3. Noise Points: Outliers that don’t belong to any cluster.

Clusters are formed by connecting core points and their reachable neighbors, separating high-density regions from low-density areas.

The clustering problem

the idea

where it works

we have kmeans don’t we?

DBSCAN

Direct density reachable:
A point “A” is directly density reachable from another point “B” if i) “A” is in the \(\varepsilon\)-neighborhood of “B” and ii) “B” is a core point.
Density reachable:
A point “A” is density reachable from “B” if there are a set of core points leading from “B” to “A.
Density connected:
Two points “A” and “B” are density connected if there are a core point “C”, such that both “A” and “B” are density reachable from “C”.

lets get to it

the role of hyperparameters

\(MinPts\)

\(\epsilon\)

parameter study

silhouette computation

finding eps: knn \(\rightarrow k = MinPts\)

\(MinPts?\)

  • \(MinPts\) is selected based on the domain knowledge.

  • If you do not have domain understanding, a rule of thumb is to derive \(MinPts\) from the number of dimensions \(D\) in the data set.

    • \(MinPts \geq D + 1\)
  • For 2D data, take \(MinPts = 4\).

  • For larger datasets, with much noise, it suggested to go with \(MinPts = 2 * D\).

Turns out …

… knowing what you do helps!

Application for Accuracy - Technical Cleaniness

Workflow

TECSA Filter

Output

Repeatability?

Challenge

In order to judge the process we need repeated measurements FROM the actual process.

… so we need to find the same particles again!

particles

clustering?

Noise or Cluster?

Null Filter?

Spread?

length +/- sd

Cpk distribution

Six Sigma and Big Data

References

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proc. Of 2nd International Conference on Knowledge Discovery and, 226–31.