DBSCAN using spatial and temporal data

I am looking at data points that have a lat, a lng and an event date/time. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works fine when clustering on lat and lng, my concern is that it will fall apart once time information is included, since time has neither the same scale nor the same kind of distance.

What are my options for including temporal data in the DBSCAN algorithm?



2 answers


Take a look at the generalized DBSCAN by the same authors.

Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2 (2): 169-194. doi:10.1023/A:1009745219419.

For (generalized) DBSCAN, you need two functions:



  • findNeighbors - Get all related objects from your database

  • corePoint - decide if this set is enough to start the cluster

Then you can repeatedly find neighbors to grow the clusters.

Function 1 (findNeighbors) is where you want to hook in, for example by using two thresholds: one geographic and one temporal (e.g. within 100 miles and within 1 hour).
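The two functions above can be sketched in plain Python roughly as follows. This is only a minimal illustration, not the paper's implementation: the names find_neighbors, is_core_point and gdbscan, the min_pts value, the brute-force neighbor scan and the sample points are all assumptions made for the example. Points are assumed to be (lat, lng, datetime) tuples.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8
NOISE = -1

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def find_neighbors(points, i, max_miles=100.0, max_gap=timedelta(hours=1)):
    """Indices of all points close to point i in BOTH space and time (i included)."""
    lat, lng, when = points[i]
    return [j for j, (lat2, lng2, when2) in enumerate(points)
            if abs(when2 - when) <= max_gap
            and haversine_miles((lat, lng), (lat2, lng2)) <= max_miles]

def is_core_point(neighbors, min_pts=3):
    """A point is a core point if its neighborhood is large enough."""
    return len(neighbors) >= min_pts

def gdbscan(points, min_pts=3):
    """Label each point with a cluster id, or NOISE (brute force, O(n^2))."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = find_neighbors(points, i)
        if not is_core_point(neighbors, min_pts):
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned
            labels[j] = cluster
            more = find_neighbors(points, j)
            if is_core_point(more, min_pts):
                queue.extend(more)       # keep growing from core points only
    return labels

t0 = datetime(2020, 1, 1, 12, 0)
points = [
    (40.71, -74.00, t0),
    (40.72, -74.01, t0 + timedelta(minutes=10)),
    (40.73, -73.99, t0 + timedelta(minutes=20)),
    (40.71, -74.00, t0 + timedelta(hours=5)),
    (40.72, -74.01, t0 + timedelta(hours=5, minutes=10)),
    (40.73, -73.99, t0 + timedelta(hours=5, minutes=20)),
    (34.05, -118.24, t0),
]
labels = gdbscan(points)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The same three locations end up in two different clusters because the second batch of events happens five hours later, and the far-away point is left as noise.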



tl;dr: you will have to change your feature set, i.e. scale the date/time to match the magnitude of your geo data.

The input to DBSCAN is just a set of vectors, and the algorithm itself does not know that one dimension (time) is orders of magnitude larger or smaller than the others (distance). So when the density around a data point is computed, the difference in scale will distort the result.

Now, I suppose you could change the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. by providing your own distance function instead of using the standard Euclidean distance.
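One possible shape for such a distance function, as a sketch only: each gap is divided by its own threshold and the larger ratio wins, so a combined distance of at most 1.0 means "close enough in both space and time". The name st_distance, the eps_km and eps_seconds defaults, and the (lat, lng, unix_seconds) point layout are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def st_distance(p, q, eps_km=10.0, eps_seconds=3600.0):
    """Spatio-temporal distance between (lat, lng, unix_seconds) tuples.

    The result is <= 1.0 exactly when the points are within eps_km of each
    other AND within eps_seconds of each other.
    """
    d_space = haversine_km(p[0], p[1], q[0], q[1]) / eps_km
    d_time = abs(p[2] - q[2]) / eps_seconds
    return max(d_space, d_time)

def distance_matrix(points):
    """Full pairwise matrix of st_distance values (O(n^2) memory)."""
    return [[st_distance(p, q) for q in points] for p in points]

p = (40.71, -74.00, 0.0)
q = (40.72, -74.01, 600.0)    # about 1.4 km away, 10 minutes later
r = (40.71, -74.00, 7200.0)   # same place, 2 hours later
matrix = distance_matrix([p, q, r])
```

A matrix like this can be handed to any DBSCAN implementation that accepts precomputed distances, with eps set to 1.0; scikit-learn's DBSCAN, for instance, takes metric="precomputed".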



IMHO, though, it is easiest to scale one of your dimensions to fit the other. Just multiply your time values by a fixed linear factor so that they are on the same order of magnitude as the geo values, and you should be good to go.
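As a rough sketch of that scaling, assuming points are (lat, lng, unix_seconds) tuples: pick a factor so that a time gap you consider "close" (eps_seconds) maps to the same length as a spatial gap you consider "close" (eps_km). The function name, the default thresholds and the flat-earth (equirectangular) projection to kilometres are all assumptions here, and the projection is only reasonable for data covering a small region.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def to_scaled_features(points, eps_km=10.0, eps_seconds=3600.0):
    """Map (lat, lng, unix_seconds) to (x_km, y_km, t_scaled).

    After scaling, a time gap of eps_seconds has the same length as a
    spatial gap of eps_km, so ordinary Euclidean DBSCAN can be used.
    """
    lat0 = sum(p[0] for p in points) / len(points)
    lng0 = sum(p[1] for p in points) / len(points)
    factor = eps_km / eps_seconds            # kilometres per second
    features = []
    for lat, lng, t in points:
        x = (lng - lng0) * KM_PER_DEGREE * math.cos(math.radians(lat0))
        y = (lat - lat0) * KM_PER_DEGREE
        features.append((x, y, t * factor))
    return features

# Two events at the same place, one hour apart, end up about 10 "km" apart.
feats = to_scaled_features([(40.71, -74.00, 0.0), (40.71, -74.00, 3600.0)])
gap = math.dist(feats[0], feats[1])
```

Note that with plain Euclidean distance the neighborhood becomes a sphere in the scaled space rather than the "box" you get from two separate thresholds, so a point that is moderately far in both space and time can fall outside eps even though each gap alone is within its limit.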

More generally, this is part of the feature selection process, which is arguably the most important part of any machine learning solution. Choose the right features and transform them correctly, and you will be more than halfway there.
