Posted to dev@sedona.apache.org by Justin Terry <jt...@gmail.com> on 2021/11/01 20:08:08 UTC

Slow spatial joins

Hi,

I'm using GeoSpark 1.3.2 (I can't upgrade due to conflicts with JTS
dependencies) on Scala 2.12.12 and Spark 3.1.1. I've been running into an
issue with very slow spatial joins using the DataFrame API. For instance,
this is the code I use to spatially join a point DataFrame and a polygon
DataFrame:

    firePointDF
      .join(featureDF)
      .where("ST_Intersects(pointshape, polyshape)")

So it's a pretty simple use of the API. In the past I did this join with
the RDD API, and it ran pretty fast. But now it takes a very long time, and
when I look at Ganglia, CPU usage always looks like this:

[Attached image: Ganglia cluster CPU usage graph showing two large spikes
followed by a long low-CPU tail]

The second big spike is the spatial join. It runs at full CPU for a few
minutes, then CPU usage gradually drops until basically just one or two
executors are left running for a very long time. It looks like something is
off with the partitioning, but I'm using all the GeoSpark defaults. Is this
a common problem, and is there a good way to deal with it?
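For reference, the RDD-API version I used before looked roughly like this
(reconstructed from memory, so variable names are simplified and the actual
RDD construction is omitted):

```scala
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.{PointRDD, PolygonRDD}

// firePointRDD: PointRDD and featureRDD: PolygonRDD, built from the same
// source data as the DataFrames above (construction not shown).

// Collect envelope/count statistics needed before partitioning.
firePointRDD.analyze()
featureRDD.analyze()

// The RDD API makes the spatial partitioning explicit: both sides are
// partitioned on the same grid, so matching geometries end up together.
firePointRDD.spatialPartitioning(GridType.QUADTREE)
featureRDD.spatialPartitioning(firePointRDD.getPartitioner)

// Build an R-tree index on the spatially partitioned point RDD.
firePointRDD.buildIndex(IndexType.RTREE, true)

val usingIndex = true
val considerBoundaryIntersection = true // match ST_Intersects semantics
val resultRDD = JoinQuery.SpatialJoinQueryFlat(
  firePointRDD, featureRDD, usingIndex, considerBoundaryIntersection)
```

With that version, the partitioning and indexing were spelled out
explicitly, which is why I suspect the DataFrame path is doing something
different with the defaults.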

Thanks,
Justin