You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sedona.apache.org by Adam Binford <ad...@gmail.com> on 2021/04/14 10:49:24 UTC

Re: DataFrame joins much slower than SpatialRDD joins

Are you using the 1.0.0 release? If so, there's a bug that prevented
spatial indexing from being used in SQL join queries, which hopefully
explains the difference. Also, there will be broadcast join support too
which could make the SQL join even faster than RDD join for large-small
joins.

Adam

On Tue, Apr 13, 2021 at 7:20 PM Andrew Brooks <an...@flightaware.com>
wrote:

> I've noticed that performing joins with the DataFrame API tends to be
> significantly slower than using the SpatialRDD API directly. To illustrate,
> I've put together a simple benchmark, which generates 10k points and 10k
> envelopes at random, then counts the number of envelope/point pairs such
> that the point is contained in the envelope:
> https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
>
>
>
> On my laptop, the DataFrame-based implementation in this benchmark takes
> nearly 10 times as long to execute as the SpatialRDD-based implementation
> (536 vs. 53 seconds).
>
>
>
> Is the performance discrepancy caused by misuse of the API, some inherent
> limitation of the DataFrame-based API, a Sedona bug, or something else
> entirely?
>
>
>
> If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using
> the latest commit on the Sedona master branch.
>
>
>
> Best regards,
>
> Andrew Brooks
>


-- 
Adam Binford

Re: DataFrame joins much slower than SpatialRDD joins

Posted by Jia Yu <ji...@apache.org>.
Hi folks,

Some thoughts regarding this.

1. DataFrame Geom serializer was changed to WKB serializer from Sedona
original SHAPE serializer, in 1.0.0 [1]. Based on our test, the old
serializer was several times faster than the WKB serializer [2]. We are
working on a PR to support both SHAPE serializer and WKB serializer [3].
But Sedona RDD API still uses the SHAPE serializer. Serializer is used
intensively in join.
2. DataFrame Scala / Java API automatically does spatial partitioning and
builds an index on one side of the join (Python API 1.0.0 has a bug that
does not use index) . RDD API requires the user to do so. But in the code
from Andrew, he only does spatial partitioning on RDD join without
indexing. On a small dataset used in such case, building index may take
more time than the version without index. Maybe Andrew could try to build
index in RDD join as well. I believe the time difference will be decreased.

Thanks,
Jia


[1]
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/sedona/sql/utils/GeometrySerializer.scala#L37
[2] https://gist.github.com/netanel246/f85777761ebfc0a5ddef54170ea62f11
[3] https://github.com/apache/incubator-sedona/pull/516

On Wed, Apr 14, 2021 at 3:49 AM Adam Binford <ad...@gmail.com> wrote:

> Are you using the 1.0.0 release? If so, there's a bug that prevented
> spatial indexing from being used in SQL join queries, which hopefully
> explains the difference. Also, there will be broadcast join support too
> which could make the SQL join even faster than RDD join for large-small
> joins.
>
> Adam
>
> On Tue, Apr 13, 2021 at 7:20 PM Andrew Brooks <
> andrew.brooks@flightaware.com>
> wrote:
>
> > I've noticed that performing joins with the DataFrame API tends to be
> > significantly slower than using the SpatialRDD API directly. To
> illustrate,
> > I've put together a simple benchmark, which generates 10k points and 10k
> > envelopes at random, then counts the number of envelope/point pairs such
> > that the point is contained in the envelope:
> > https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
> >
> >
> >
> > On my laptop, the DataFrame-based implementation in this benchmark takes
> > nearly 10 times as long to execute as the SpatialRDD-based implementation
> > (536 vs. 53 seconds).
> >
> >
> >
> > Is the performance discrepancy caused by misuse of the API, some inherent
> > limitation of the DataFrame-based API, a Sedona bug, or something else
> > entirely?
> >
> >
> >
> > If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using
> > the latest commit on the Sedona master branch.
> >
> >
> >
> > Best regards,
> >
> > Andrew Brooks
> >
>
>
> --
> Adam Binford
>