ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs #1965

The execution time of DBSCAN on different implementations/approaches is multifactorial. Specifically, it will be influenced by:

  • The dataset itself
    • dataset size
    • spatial density
    • spatial patterns in the data (e.g. some severe spatial concentration). In the graphframes connected components algorithm this can lead to a long right tail for execution time vs number of running tasks. In other words, your whole cluster is waiting for only a few cores to do work.
  • The parameters for DBSCAN:
    • larger values of epsilon result in higher selectivity of the distance join (more matching pairs) and therefore a larger graph for the connected components calculation
    • smaller values of min pts mean a larger graph for the connected component…
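
To make the parameter effects above concrete, here is a minimal sketch in plain Python. It is not how ST_DBSCAN or sklearn.DBSCAN is implemented: it uses a brute-force distance join on random points, and the helper names (`neighbor_pairs`, `core_graph_size`) are hypothetical. It also assumes that a point counts itself toward min pts and that only edges between core points are passed to connected components.

```python
import math
import random


def neighbor_pairs(points, eps):
    """Brute-force distance join: every unordered pair within eps."""
    return [
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if math.dist(points[i], points[j]) <= eps
    ]


def core_graph_size(points, eps, min_pts):
    """Return (core point count, edge count) of the graph that would be
    handed to connected components under the assumptions stated above."""
    pairs = neighbor_pairs(points, eps)
    # Neighbour counts include the point itself (assumed convention).
    counts = [1] * len(points)
    for i, j in pairs:
        counts[i] += 1
        counts[j] += 1
    core = {i for i, c in enumerate(counts) if c >= min_pts}
    # Keep only edges whose endpoints are both core points.
    edges = [(i, j) for i, j in pairs if i in core and j in core]
    return len(core), len(edges)


random.seed(0)
points = [(random.random(), random.random()) for _ in range(300)]

small_eps = core_graph_size(points, eps=0.03, min_pts=5)
large_eps = core_graph_size(points, eps=0.06, min_pts=5)
low_min_pts = core_graph_size(points, eps=0.06, min_pts=3)

for label, (n_core, n_edges) in [
    ("eps=0.03, min_pts=5", small_eps),
    ("eps=0.06, min_pts=5", large_eps),
    ("eps=0.06, min_pts=3", low_min_pts),
]:
    print(f"{label}: {n_core} core points, {n_edges} edges")
```

Raising epsilon or lowering min pts can only keep or grow the set of core points and edges, which is why both changes push more work into the connected components step.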

Answer selected by james-willis