ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs #1965
-
Hi Sedona team, Most of my geospatial ETL workflows run on AWS Glue, and I've successfully integrated Apache Sedona in that environment. For one use case, I needed to perform DBSCAN clustering on partitioned data (each partition corresponds to, say, a city or county). The size of each batch varies significantly: sometimes as low as 100K rows, other times as high as 20M rows. I tested both:
- Sedona's ST_DBSCAN
- a manual fallback to sklearn.DBSCAN on collected batches
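Roughly, the two paths looked like the sketch below. This is a minimal illustration, not my exact Glue job: the column name, parameter values, and S3 path are placeholders, and it assumes Sedona's Python clustering API (`sedona.stats.clustering.dbscan`).

```python
import numpy as np
from sedona.spark import SedonaContext
from sedona.stats.clustering.dbscan import dbscan  # Sedona's distributed DBSCAN
from sklearn.cluster import DBSCAN                  # single-host baseline

# Register Sedona on top of the active Spark session (Glue provides one).
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

df = sedona.read.parquet("s3://my-bucket/partition=city_x/")  # placeholder path

# Path 1: Sedona's distributed DBSCAN, which returns the input DataFrame
# with a cluster-id column appended.
clustered = dbscan(df, epsilon=0.05, min_pts=10, geometry="geometry")

# Path 2: collect one batch to the driver and run sklearn on a single host.
# This is where the 20M-row batches become a memory problem.
pts = np.array(df.selectExpr("ST_X(geometry)", "ST_Y(geometry)").collect())
labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(pts)
```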
What I observed: ST_DBSCAN tended to be slower than sklearn on the smaller batches, while sklearn ran into memory pressure on the largest ones.
Is this behavior expected? And is there any guidance on when to prefer ST_DBSCAN vs a manual fallback to sklearn.DBSCAN, based on batch size or resource availability? I'd prefer not to maintain two separate pipelines for small and large datasets, but optimizing for both memory and performance has been a challenge. Thanks in advance for any insights!
Replies: 1 comment
-
The execution time of DBSCAN on different implementations/approaches is multifactorial. Specifically, it will be influenced by:
- the size of the dataset and the number of core points
- single-host vs distributed execution overhead (especially data shuffle)
- available memory vs per-core throughput
- the connected components algorithm used
- characteristics of the data that affect the distance join
The Sedona implementation of DBSCAN is designed to be performant and robust on large datasets with many core points; it will process datasets that are not feasible in sklearn. As you've found, it will often not outperform sklearn when the data is smaller. sklearn is a single-host solution and so avoids much of the overhead Spark carries as a distributed compute platform. Among other things that impair its per-core throughput, Spark incurs a lot of overhead whenever data is shuffled. But, as you've noticed, sklearn is memory hungry. As is often the case, memory consumption and CPU throughput are directly in tension with each other.

Under the hood, Sedona's DBSCAN uses graphframes to compute connected components, and in most workloads this connected components calculation will dominate the runtime. Out of the box, graphframes uses the algorithm described here; GraphX instead uses a message-passing approach. When graphframes 0.9.0 releases, you will be able to choose between these two algorithms (see this PR I made). The GraphX algorithm will be faster but more memory hungry and less robust, so you can test whether it is a desirable middle ground for your use case.

In this write-up I mostly focused on the connected components element of DBSCAN. There are characteristics of the data that can make the distance join element slower or faster, but since you mention sklearn I assume you are working only with point data and are therefore probably getting good performance on that front.
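To make the tradeoff concrete, here is how the two connected components algorithms can be exercised directly in graphframes today. This is a toy illustration, not Sedona's internals; the graph and checkpoint path are placeholders.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# The default algorithm requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # placeholder path

# A tiny stand-in graph with two components: {0, 1} and {2, 3}.
vertices = spark.createDataFrame([(0,), (1,), (2,), (3,)], ["id"])
edges = spark.createDataFrame([(0, 1), (2, 3)], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Default "graphframes" algorithm: robust, but shuffle-heavy.
cc_default = g.connectedComponents()

# GraphX message-passing algorithm: often faster, but more memory hungry.
cc_graphx = g.connectedComponents(algorithm="graphx")
```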