ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs #1965
-
Hi Sedona team, Most of my geospatial ETL workflows run on AWS Glue, and I've successfully integrated Apache Sedona in that environment. For one use case, I needed to perform DBSCAN clustering on partitioned data (each partition corresponds to, say, a city or county). The size of each batch varies significantly: sometimes as low as 100K rows, other times as high as 20M rows. I tested both:
- Sedona's ST_DBSCAN
- a manual fallback to sklearn.DBSCAN on collected batches
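Roughly, the two paths looked like the sketch below. This is a minimal illustration, not my exact Glue job: the column name, parameter values, and S3 path are placeholders, and it assumes Sedona's Python clustering API (`sedona.stats.clustering.dbscan`).

```python
import numpy as np
from sedona.spark import SedonaContext
from sedona.stats.clustering.dbscan import dbscan  # Sedona's distributed DBSCAN
from sklearn.cluster import DBSCAN                  # single-host baseline

# Register Sedona on top of the active Spark session (Glue provides one).
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

df = sedona.read.parquet("s3://my-bucket/partition=city_x/")  # placeholder path

# Path 1: Sedona's distributed DBSCAN, which returns the input DataFrame
# with a cluster-id column appended.
clustered = dbscan(df, epsilon=0.05, min_pts=10, geometry="geometry")

# Path 2: collect one batch to the driver and run sklearn on a single host.
# This is where the 20M-row batches become a memory problem.
pts = np.array(df.selectExpr("ST_X(geometry)", "ST_Y(geometry)").collect())
labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(pts)
```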
What I observed: ST_DBSCAN tended to be slower than sklearn on the smaller batches, while sklearn ran into memory pressure on the largest ones.
Is this behavior expected? And is there any guidance on when to prefer ST_DBSCAN vs a manual fallback to sklearn.DBSCAN, based on batch size or resource availability? I'd prefer not to maintain two separate pipelines for small and large datasets, but optimizing for both memory and performance has been a challenge. Thanks in advance for any insights!
Replies: 1 comment
-
The execution time of DBSCAN on different implementations/approaches is multifactorial. Specifically, it will be influenced by:
- the size of the dataset and the number of core points
- single-host vs distributed execution overhead (especially data shuffle)
- available memory vs per-core throughput
- the connected components algorithm used
- characteristics of the data that affect the distance join
The Sedona implementation of DBSCAN is designed to be performant and robust on large datasets with many core points; it will process datasets that are not feasible in sklearn. As you've found, it will often not outperform sklearn when the data is smaller. sklearn is a single-host solution and so avoids much of the overhead Spark carries as a distributed compute platform. Among other things that impair its per-core throughput, Spark incurs a lot of overhead whenever data is shuffled. But, as you've noticed, sklearn is memory hungry. As is often the case, memory consumption and CPU throughput are directly in tension with each other.

Under the hood, Sedona's DBSCAN uses graphframes to compute connected components, and in most workloads this connected components calculation will dominate the runtime. Out of the box, graphframes uses the algorithm described here; GraphX instead uses a message-passing approach. When graphframes 0.9.0 releases, you will be able to choose between these two algorithms (see this PR I made). The GraphX algorithm will be faster but more memory hungry and less robust, so you can test whether it is a desirable middle ground for your use case.

In this write-up I mostly focused on the connected components element of DBSCAN. There are characteristics of the data that can make the distance join element slower or faster, but since you mention sklearn I assume you are working only with point data and are therefore probably getting good performance on that front.
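To make the tradeoff concrete, here is how the two connected components algorithms can be exercised directly in graphframes today. This is a toy illustration, not Sedona's internals; the graph and checkpoint path are placeholders.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# The default algorithm requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # placeholder path

# A tiny stand-in graph with two components: {0, 1} and {2, 3}.
vertices = spark.createDataFrame([(0,), (1,), (2,), (3,)], ["id"])
edges = spark.createDataFrame([(0, 1), (2, 3)], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Default "graphframes" algorithm: robust, but shuffle-heavy.
cc_default = g.connectedComponents()

# GraphX message-passing algorithm: often faster, but more memory hungry.
cc_graphx = g.connectedComponents(algorithm="graphx")
```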