Posted to dev@sedona.apache.org by pietro greselin <p....@gmail.com> on 2021/07/27 09:17:56 UTC

Spatial join performances

To whom it may concern,

we observed the following Sedona behaviour and would like to ask your
opinion on how we can optimize it.

Our aim is to perform an inner spatial join between points_df and
polygons_df, matching a point from points_df with every polygon from
polygons_df that contains it.
Below you can find more details about the two DataFrames we are considering:
- points_df: it contains 50 million events, with latitude and longitude
rounded to the third decimal place;
- polygons_df: it contains 10k multi-polygons with 600 vertices on average.

The issue we are reporting is a very long computation time: the spatial
join query never completes, even when running on a cluster with 40 workers
with 4 cores each.
No error is printed by the driver, but we receive the following
warning:
WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
but no index exists. Will build index on the fly.

We were able to run the same spatial join successfully when considering
only a very small sample of events.
Do you have any suggestions on how we can achieve the same result on higher
volumes of data, or is there a way we can optimize the join?

Attached you can find the pseudo-code we are running.
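
In short, the join looks roughly like the following (the geometry columns
pointshape and polygonshape are built beforehand from the raw coordinates
and polygon geometries):

    import pyspark.sql.functions as f

    # Inner join: keep each (point, polygon) pair where the polygon contains the point
    joined_df = points_df.alias('points').join(
        polygons_df.alias('polygons'),
        f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))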

Looking forward to hearing from you.

Kind regards,
Pietro Greselin

Re: Spatial join performances

Posted by Jia Yu <ji...@apache.org>.
Hi Pietro,

As you can see from our conversation, for the time being you can disable
Spark Adaptive Query Execution by setting "spark.sql.adaptive.enabled=false".
I believe this will fix the issue.
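
For example, from PySpark (the same property can also be passed with --conf
at spark-submit time):

    # Turn off Adaptive Query Execution for the current Spark session
    spark.conf.set("spark.sql.adaptive.enabled", "false")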

Adam and I will dig deeper into this issue and fix the bug.

Thanks,
Jia


On Thu, Aug 5, 2021 at 3:10 PM Adam Binford <ad...@gmail.com> wrote:

> I don't think that's the issue. The join detection is the same for both
> broadcast and non-broadcast, so the same match statement needs to run
> either way. I created an issue for what I found from the stack trace (don't
> have a copy of the stack trace to share easily):
> https://issues.apache.org/jira/browse/SEDONA-56
>
> Adam
>
> On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <ji...@gmail.com> wrote:
>
>> Hi Adam,
>>
>> I believe the issue is caused by this chunk of code:
>> https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109
>>
>> If we move the broadcast join detection as the first part of the detector
>> and set other join detection to the "else". Will it fix the issue?
>>
>> if (broadcast XXX)
>>
>> else {
>> all other join detection.
>> }
>>
>> Thanks,
>> Jia
>>
>> On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <ad...@gmail.com> wrote:
>>
>>> Okay I actually did encounter it today. It happens when you have AQE
>>> enabled. Looked into it a little bit and might have to rework the
>>> SpatialIndexExec node to extend BroadcastExchangeLike or maybe even
>>> directly BroadcastExchangeExec, but that might only be compatible with
>>> Spark 3+, so not sure what to do about that. I'm not sure if there's
>>> specific AQE rules or optimizations that can be disabled to get it to
>>> work,
>>> but if you just disable it completely it should work for now. I'm also
>>> not
>>> at all familiar with the inner workings of AQE to know what the right way
>>> to properly work with that is.
>>>
>>> Adam
>>>
>>> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <ad...@gmail.com> wrote:
>>>
>>> > I haven't encountered any issues with it but I can investigate with the
>>> > full stacktrace. Also which version of Spark is this with?
>>> >
>>> > Adam
>>> >
>>> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
>>> >
>>> >> Hi Pietro,
>>> >>
>>> >> Can you please share the full stacktrace of this scala.MatchError? I
>>> tried
>>> >> a couple test cases but wasn't able to reproduce this error on my
>>> end. In
>>> >> fact, another user complained about the same issue a while back. I
>>> suspect
>>> >> there is a bug for this part.
>>> >>
>>> >> I also CCed the contributor of Sedona broadcast join. @
>>> adamq43@gmail.com
>>> >> <ad...@gmail.com> Hi Adam, do you have any idea about this issue?
>>> >>
>>> >> Thanks,
>>> >> Jia
>>> >>
>>> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.greselin@gmail.com
>>> >
>>> >> wrote:
>>> >>
>>> >> > Hello Jia,
>>> >> >
>>> >> > thank you so much for your support.
>>> >> >
>>> >> > We have been able to complete our task and to perform a few runs
>>> with
>>> >> > different number of partitions.
>>> >> > At the moment we obtained the best performance when running on 20
>>> nodes
>>> >> > and setting the number of partitions to be 2000. With this
>>> >> configuration,
>>> >> > it took approximately 45 minutes to write the join's output.
>>> >> >
>>> >> > Then we tried to perform the same join through broadcast as you
>>> >> suggested
>>> >> > to see whether we could achieve better results but actually we
>>> obtained
>>> >> the
>>> >> > following error when calling an action like broadcast_join.show()
>>> on the
>>> >> > output
>>> >> >
>>> >> > Py4JJavaError: An error occurred while calling o699.showString.
>>> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry,
>>> QUADTREE,
>>> >> [id=#3312]
>>> >> >
>>> >> > We would be grateful if you can support us on this.
>>> >> >
>>> >> >
>>> >> > The broadcast join was performed as follows: broadcast_join =
>>> >>
>>> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'),
>>> >> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>>> >> >
>>> >> > Attached you can find the pseudo code we used to test broadcast
>>> join.
>>> >> >
>>> >> >
>>> >> > Looking forward to hearing from you.
>>> >> >
>>> >> >
>>> >> > Kind regards,
>>> >> >
>>> >> > Pietro Greselin
>>> >> >
>>> >> >
>>> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>>> >> >
>>> >> >> Hi Pietro,
>>> >> >>
>>> >> >> A few tips to optimize your join:
>>> >> >>
>>> >> >> 1. Mix DF and RDD together and use RDD API for the join part. See
>>> the
>>> >> >> example here:
>>> >> >>
>>> >>
>>> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>>> >> >>
>>> >> >> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4),
>>> try to
>>> >> >> use a large number of partitions (say 1000 or more)
>>> >> >>
>>> >> >> If this approach doesn't work, consider broadcast join if needed.
>>> >> >> Broadcast the polygon side:
>>> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Jia
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <
>>> p.greselin@gmail.com>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> To whom it may concern,
>>> >> >>>
>>> >> >>> we reported the following Sedona behaviour and would like to ask
>>> your
>>> >> >>> opinion on how we can otpimize it.
>>> >> >>>
>>> >> >>> Our aim is to perform a inner spatial join between a points_df
>>> and a
>>> >> >>> polygon_df when a point in points_df is contained in a polygon
>>> from
>>> >> >>> polygons_df.
>>> >> >>> Below you can find more details about the 2 dataframes we are
>>> >> >>> considering:
>>> >> >>> - points_df: it contains 50mln events with latitude and longitude
>>> >> >>> approximated to the third decimal digit;
>>> >> >>> - polygon_df: it contains 10k multi-polygons having 600 vertexes
>>> on
>>> >> >>> average.
>>> >> >>>
>>> >> >>> The issue we are reporting is a very long computing time and the
>>> >> spatial
>>> >> >>> join query never completing even when running on cluster with 40
>>> >> workers
>>> >> >>> with 4 cores each.
>>> >> >>> No error is being print by driver but we are receiving the
>>> following
>>> >> >>> warning:
>>> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is
>>> >> true,
>>> >> >>> but no index exists. Will build index on the fly.
>>> >> >>>
>>> >> >>> Actually we were able to run successfully the same spatial join
>>> when
>>> >> >>> only considering a very small sample of events.
>>> >> >>> Do you have any suggestion on how we can archive the same result
>>> on
>>> >> >>> higher volumes of data or if there is a way we can optimize the
>>> join?
>>> >> >>>
>>> >> >>> Attached you can find the pseudo-code we are running.
>>> >> >>>
>>> >> >>> Looking forward to hearing from you.
>>> >> >>>
>>> >> >>> Kind regards,
>>> >> >>> Pietro Greselin
>>> >> >>>
>>> >> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Adam Binford
>>> >
>>>
>>>
>>> --
>>> Adam Binford
>>>
>>
>
> --
> Adam Binford
>

Re: Spatial join performances

Posted by Adam Binford <ad...@gmail.com>.
I don't think that's the issue. The join detection is the same for both
broadcast and non-broadcast joins, so the same match statement needs to run
either way. I created an issue for what I found from the stack trace (I
don't have a copy of the stack trace handy to share):
https://issues.apache.org/jira/browse/SEDONA-56

Adam

On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <ji...@gmail.com> wrote:

> Hi Adam,
>
> I believe the issue is caused by this chunk of code:
> https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109
>
> If we move the broadcast join detection as the first part of the detector
> and set other join detection to the "else". Will it fix the issue?
>
> if (broadcast XXX)
>
> else {
> all other join detection.
> }
>
> Thanks,
> Jia
>
> On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <ad...@gmail.com> wrote:
>
>> Okay I actually did encounter it today. It happens when you have AQE
>> enabled. Looked into it a little bit and might have to rework the
>> SpatialIndexExec node to extend BroadcastExchangeLike or maybe even
>> directly BroadcastExchangeExec, but that might only be compatible with
>> Spark 3+, so not sure what to do about that. I'm not sure if there's
>> specific AQE rules or optimizations that can be disabled to get it to
>> work,
>> but if you just disable it completely it should work for now. I'm also not
>> at all familiar with the inner workings of AQE to know what the right way
>> to properly work with that is.
>>
>> Adam
>>
>> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <ad...@gmail.com> wrote:
>>
>> > I haven't encountered any issues with it but I can investigate with the
>> > full stacktrace. Also which version of Spark is this with?
>> >
>> > Adam
>> >
>> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
>> >
>> >> Hi Pietro,
>> >>
>> >> Can you please share the full stacktrace of this scala.MatchError? I
>> tried
>> >> a couple test cases but wasn't able to reproduce this error on my end.
>> In
>> >> fact, another user complained about the same issue a while back. I
>> suspect
>> >> there is a bug for this part.
>> >>
>> >> I also CCed the contributor of Sedona broadcast join. @
>> adamq43@gmail.com
>> >> <ad...@gmail.com> Hi Adam, do you have any idea about this issue?
>> >>
>> >> Thanks,
>> >> Jia
>> >>
>> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p....@gmail.com>
>> >> wrote:
>> >>
>> >> > Hello Jia,
>> >> >
>> >> > thank you so much for your support.
>> >> >
>> >> > We have been able to complete our task and to perform a few runs with
>> >> > different number of partitions.
>> >> > At the moment we obtained the best performance when running on 20
>> nodes
>> >> > and setting the number of partitions to be 2000. With this
>> >> configuration,
>> >> > it took approximately 45 minutes to write the join's output.
>> >> >
>> >> > Then we tried to perform the same join through broadcast as you
>> >> suggested
>> >> > to see whether we could achieve better results but actually we
>> obtained
>> >> the
>> >> > following error when calling an action like broadcast_join.show() on
>> the
>> >> > output
>> >> >
>> >> > Py4JJavaError: An error occurred while calling o699.showString.
>> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry,
>> QUADTREE,
>> >> [id=#3312]
>> >> >
>> >> > We would be grateful if you can support us on this.
>> >> >
>> >> >
>> >> > The broadcast join was performed as follows: broadcast_join =
>> >>
>> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'),
>> >> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>> >> >
>> >> > Attached you can find the pseudo code we used to test broadcast join.
>> >> >
>> >> >
>> >> > Looking forward to hearing from you.
>> >> >
>> >> >
>> >> > Kind regards,
>> >> >
>> >> > Pietro Greselin
>> >> >
>> >> >
>> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>> >> >
>> >> >> Hi Pietro,
>> >> >>
>> >> >> A few tips to optimize your join:
>> >> >>
>> >> >> 1. Mix DF and RDD together and use RDD API for the join part. See
>> the
>> >> >> example here:
>> >> >>
>> >>
>> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>> >> >>
>> >> >> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try
>> to
>> >> >> use a large number of partitions (say 1000 or more)
>> >> >>
>> >> >> If this approach doesn't work, consider broadcast join if needed.
>> >> >> Broadcast the polygon side:
>> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>> >> >>
>> >> >> Thanks,
>> >> >> Jia
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <
>> p.greselin@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> To whom it may concern,
>> >> >>>
>> >> >>> we reported the following Sedona behaviour and would like to ask
>> your
>> >> >>> opinion on how we can otpimize it.
>> >> >>>
>> >> >>> Our aim is to perform a inner spatial join between a points_df and
>> a
>> >> >>> polygon_df when a point in points_df is contained in a polygon from
>> >> >>> polygons_df.
>> >> >>> Below you can find more details about the 2 dataframes we are
>> >> >>> considering:
>> >> >>> - points_df: it contains 50mln events with latitude and longitude
>> >> >>> approximated to the third decimal digit;
>> >> >>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
>> >> >>> average.
>> >> >>>
>> >> >>> The issue we are reporting is a very long computing time and the
>> >> spatial
>> >> >>> join query never completing even when running on cluster with 40
>> >> workers
>> >> >>> with 4 cores each.
>> >> >>> No error is being print by driver but we are receiving the
>> following
>> >> >>> warning:
>> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is
>> >> true,
>> >> >>> but no index exists. Will build index on the fly.
>> >> >>>
>> >> >>> Actually we were able to run successfully the same spatial join
>> when
>> >> >>> only considering a very small sample of events.
>> >> >>> Do you have any suggestion on how we can archive the same result on
>> >> >>> higher volumes of data or if there is a way we can optimize the
>> join?
>> >> >>>
>> >> >>> Attached you can find the pseudo-code we are running.
>> >> >>>
>> >> >>> Looking forward to hearing from you.
>> >> >>>
>> >> >>> Kind regards,
>> >> >>> Pietro Greselin
>> >> >>>
>> >> >>
>> >>
>> >
>> >
>> > --
>> > Adam Binford
>> >
>>
>>
>> --
>> Adam Binford
>>
>

-- 
Adam Binford

Re: Spatial join performances

Posted by Jia Yu <ji...@gmail.com>.
Hi Adam,

I believe the issue is caused by this chunk of code:
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109

If we move the broadcast join detection to the first part of the detector
and put all other join detection in the "else" branch, will that fix the issue?

if (broadcast XXX) {
  ...
} else {
  // all other join detection
}

Thanks,
Jia

On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <ad...@gmail.com> wrote:

> Okay I actually did encounter it today. It happens when you have AQE
> enabled. Looked into it a little bit and might have to rework the
> SpatialIndexExec node to extend BroadcastExchangeLike or maybe even
> directly BroadcastExchangeExec, but that might only be compatible with
> Spark 3+, so not sure what to do about that. I'm not sure if there's
> specific AQE rules or optimizations that can be disabled to get it to work,
> but if you just disable it completely it should work for now. I'm also not
> at all familiar with the inner workings of AQE to know what the right way
> to properly work with that is.
>
> Adam
>
> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <ad...@gmail.com> wrote:
>
> > I haven't encountered any issues with it but I can investigate with the
> > full stacktrace. Also which version of Spark is this with?
> >
> > Adam
> >
> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
> >
> >> Hi Pietro,
> >>
> >> Can you please share the full stacktrace of this scala.MatchError? I
> tried
> >> a couple test cases but wasn't able to reproduce this error on my end.
> In
> >> fact, another user complained about the same issue a while back. I
> suspect
> >> there is a bug for this part.
> >>
> >> I also CCed the contributor of Sedona broadcast join. @
> adamq43@gmail.com
> >> <ad...@gmail.com> Hi Adam, do you have any idea about this issue?
> >>
> >> Thanks,
> >> Jia
> >>
> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p....@gmail.com>
> >> wrote:
> >>
> >> > Hello Jia,
> >> >
> >> > thank you so much for your support.
> >> >
> >> > We have been able to complete our task and to perform a few runs with
> >> > different number of partitions.
> >> > At the moment we obtained the best performance when running on 20
> nodes
> >> > and setting the number of partitions to be 2000. With this
> >> configuration,
> >> > it took approximately 45 minutes to write the join's output.
> >> >
> >> > Then we tried to perform the same join through broadcast as you
> >> suggested
> >> > to see whether we could achieve better results but actually we
> obtained
> >> the
> >> > following error when calling an action like broadcast_join.show() on
> the
> >> > output
> >> >
> >> > Py4JJavaError: An error occurred while calling o699.showString.
> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE,
> >> [id=#3312]
> >> >
> >> > We would be grateful if you can support us on this.
> >> >
> >> >
> >> > The broadcast join was performed as follows: broadcast_join =
> >>
> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'),
> >> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
> >> >
> >> > Attached you can find the pseudo code we used to test broadcast join.
> >> >
> >> >
> >> > Looking forward to hearing from you.
> >> >
> >> >
> >> > Kind regards,
> >> >
> >> > Pietro Greselin
> >> >
> >> >
> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
> >> >
> >> >> Hi Pietro,
> >> >>
> >> >> A few tips to optimize your join:
> >> >>
> >> >> 1. Mix DF and RDD together and use RDD API for the join part. See the
> >> >> example here:
> >> >>
> >>
> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
> >> >>
> >> >> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try
> to
> >> >> use a large number of partitions (say 1000 or more)
> >> >>
> >> >> If this approach doesn't work, consider broadcast join if needed.
> >> >> Broadcast the polygon side:
> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
> >> >>
> >> >> Thanks,
> >> >> Jia
> >> >>
> >> >>
> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <
> p.greselin@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> To whom it may concern,
> >> >>>
> >> >>> we reported the following Sedona behaviour and would like to ask
> your
> >> >>> opinion on how we can otpimize it.
> >> >>>
> >> >>> Our aim is to perform a inner spatial join between a points_df and a
> >> >>> polygon_df when a point in points_df is contained in a polygon from
> >> >>> polygons_df.
> >> >>> Below you can find more details about the 2 dataframes we are
> >> >>> considering:
> >> >>> - points_df: it contains 50mln events with latitude and longitude
> >> >>> approximated to the third decimal digit;
> >> >>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
> >> >>> average.
> >> >>>
> >> >>> The issue we are reporting is a very long computing time and the
> >> spatial
> >> >>> join query never completing even when running on cluster with 40
> >> workers
> >> >>> with 4 cores each.
> >> >>> No error is being print by driver but we are receiving the following
> >> >>> warning:
> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is
> >> true,
> >> >>> but no index exists. Will build index on the fly.
> >> >>>
> >> >>> Actually we were able to run successfully the same spatial join when
> >> >>> only considering a very small sample of events.
> >> >>> Do you have any suggestion on how we can archive the same result on
> >> >>> higher volumes of data or if there is a way we can optimize the
> join?
> >> >>>
> >> >>> Attached you can find the pseudo-code we are running.
> >> >>>
> >> >>> Looking forward to hearing from you.
> >> >>>
> >> >>> Kind regards,
> >> >>> Pietro Greselin
> >> >>>
> >> >>
> >>
> >
> >
> > --
> > Adam Binford
> >
>
>
> --
> Adam Binford
>

Re: Spatial join performances

Posted by Adam Binford <ad...@gmail.com>.
Okay, I actually did encounter it today. It happens when you have AQE
enabled. I looked into it a little bit; we might have to rework the
SpatialIndexExec node to extend BroadcastExchangeLike, or maybe even
BroadcastExchangeExec directly, but that might only be compatible with
Spark 3+, so I'm not sure what to do about that. I'm also not sure whether
there are specific AQE rules or optimizations that can be disabled to make
it work, but if you just disable AQE completely it should work for now. I'm
not familiar enough with the inner workings of AQE to know the right way to
handle it properly.

Adam

On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <ad...@gmail.com> wrote:

> I haven't encountered any issues with it but I can investigate with the
> full stacktrace. Also which version of Spark is this with?
>
> Adam
>
> On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Pietro,
>>
>> Can you please share the full stacktrace of this scala.MatchError? I tried
>> a couple test cases but wasn't able to reproduce this error on my end. In
>> fact, another user complained about the same issue a while back. I suspect
>> there is a bug for this part.
>>
>> I also CCed the contributor of Sedona broadcast join. @adamq43@gmail.com
>> <ad...@gmail.com> Hi Adam, do you have any idea about this issue?
>>
>> Thanks,
>> Jia
>>
>> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p....@gmail.com>
>> wrote:
>>
>> > Hello Jia,
>> >
>> > thank you so much for your support.
>> >
>> > We have been able to complete our task and to perform a few runs with
>> > different number of partitions.
>> > At the moment we obtained the best performance when running on 20 nodes
>> > and setting the number of partitions to be 2000. With this
>> configuration,
>> > it took approximately 45 minutes to write the join's output.
>> >
>> > Then we tried to perform the same join through broadcast as you
>> suggested
>> > to see whether we could achieve better results but actually we obtained
>> the
>> > following error when calling an action like broadcast_join.show() on the
>> > output
>> >
>> > Py4JJavaError: An error occurred while calling o699.showString.
>> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE,
>> [id=#3312]
>> >
>> > We would be grateful if you can support us on this.
>> >
>> >
>> > The broadcast join was performed as follows: broadcast_join =
>> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'),
>> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>> >
>> > Attached you can find the pseudo code we used to test broadcast join.
>> >
>> >
>> > Looking forward to hearing from you.
>> >
>> >
>> > Kind regards,
>> >
>> > Pietro Greselin
>> >
>> >
>> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>> >
>> >> Hi Pietro,
>> >>
>> >> A few tips to optimize your join:
>> >>
>> >> 1. Mix DF and RDD together and use RDD API for the join part. See the
>> >> example here:
>> >>
>> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>> >>
>> >> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
>> >> use a large number of partitions (say 1000 or more)
>> >>
>> >> If this approach doesn't work, consider broadcast join if needed.
>> >> Broadcast the polygon side:
>> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>> >>
>> >> Thanks,
>> >> Jia
>> >>
>> >>
>> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p....@gmail.com>
>> >> wrote:
>> >>
>> >>> To whom it may concern,
>> >>>
>> >>> we reported the following Sedona behaviour and would like to ask your
>> >>> opinion on how we can otpimize it.
>> >>>
>> >>> Our aim is to perform a inner spatial join between a points_df and a
>> >>> polygon_df when a point in points_df is contained in a polygon from
>> >>> polygons_df.
>> >>> Below you can find more details about the 2 dataframes we are
>> >>> considering:
>> >>> - points_df: it contains 50mln events with latitude and longitude
>> >>> approximated to the third decimal digit;
>> >>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
>> >>> average.
>> >>>
>> >>> The issue we are reporting is a very long computing time and the
>> spatial
>> >>> join query never completing even when running on cluster with 40
>> workers
>> >>> with 4 cores each.
>> >>> No error is being print by driver but we are receiving the following
>> >>> warning:
>> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is
>> true,
>> >>> but no index exists. Will build index on the fly.
>> >>>
>> >>> Actually we were able to run successfully the same spatial join when
>> >>> only considering a very small sample of events.
>> >>> Do you have any suggestion on how we can archive the same result on
>> >>> higher volumes of data or if there is a way we can optimize the join?
>> >>>
>> >>> Attached you can find the pseudo-code we are running.
>> >>>
>> >>> Looking forward to hearing from you.
>> >>>
>> >>> Kind regards,
>> >>> Pietro Greselin
>> >>>
>> >>
>>
>
>
> --
> Adam Binford
>


-- 
Adam Binford

Re: Spatial join performances

Posted by Adam Binford <ad...@gmail.com>.
I haven't encountered any issues with it, but I can investigate with the
full stack trace. Also, which version of Spark is this?

Adam

On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:

> Hi Pietro,
>
> Can you please share the full stacktrace of this scala.MatchError? I tried
> a couple test cases but wasn't able to reproduce this error on my end. In
> fact, another user complained about the same issue a while back. I suspect
> there is a bug for this part.
>
> I also CCed the contributor of Sedona broadcast join. @adamq43@gmail.com
> <ad...@gmail.com> Hi Adam, do you have any idea about this issue?
>
> Thanks,
> Jia
>
> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p....@gmail.com>
> wrote:
>
> > Hello Jia,
> >
> > thank you so much for your support.
> >
> > We have been able to complete our task and to perform a few runs with
> > different number of partitions.
> > At the moment we obtained the best performance when running on 20 nodes
> > and setting the number of partitions to be 2000. With this configuration,
> > it took approximately 45 minutes to write the join's output.
> >
> > Then we tried to perform the same join through broadcast as you suggested
> > to see whether we could achieve better results but actually we obtained
> the
> > following error when calling an action like broadcast_join.show() on the
> > output
> >
> > Py4JJavaError: An error occurred while calling o699.showString.
> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE,
> [id=#3312]
> >
> > We would be grateful if you can support us on this.
> >
> >
> > The broadcast join was performed as follows: broadcast_join =
> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'),
> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
> >
> > Attached you can find the pseudo code we used to test broadcast join.
> >
> >
> > Looking forward to hearing from you.
> >
> >
> > Kind regards,
> >
> > Pietro Greselin
> >
> >
> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
> >
> >> Hi Pietro,
> >>
> >> A few tips to optimize your join:
> >>
> >> 1. Mix DF and RDD together and use RDD API for the join part. See the
> >> example here:
> >>
> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
> >>
> >> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
> >> use a large number of partitions (say 1000 or more)
> >>
> >> If this approach doesn't work, consider broadcast join if needed.
> >> Broadcast the polygon side:
> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
> >>
> >> Thanks,
> >> Jia
> >>
> >>
> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p....@gmail.com>
> >> wrote:
> >>
> >>> To whom it may concern,
> >>>
> >>> we reported the following Sedona behaviour and would like to ask your
> >>> opinion on how we can otpimize it.
> >>>
> >>> Our aim is to perform a inner spatial join between a points_df and a
> >>> polygon_df when a point in points_df is contained in a polygon from
> >>> polygons_df.
> >>> Below you can find more details about the 2 dataframes we are
> >>> considering:
> >>> - points_df: it contains 50mln events with latitude and longitude
> >>> approximated to the third decimal digit;
> >>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
> >>> average.
> >>>
> >>> The issue we are reporting is a very long computing time and the
> spatial
> >>> join query never completing even when running on cluster with 40
> workers
> >>> with 4 cores each.
> >>> No error is being print by driver but we are receiving the following
> >>> warning:
> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is
> true,
> >>> but no index exists. Will build index on the fly.
> >>>
> >>> Actually we were able to run successfully the same spatial join when
> >>> only considering a very small sample of events.
> >>> Do you have any suggestion on how we can archive the same result on
> >>> higher volumes of data or if there is a way we can optimize the join?
> >>>
> >>> Attached you can find the pseudo-code we are running.
> >>>
> >>> Looking forward to hearing from you.
> >>>
> >>> Kind regards,
> >>> Pietro Greselin
> >>>
> >>
>


-- 
Adam Binford

Re: Spatial join performances

Posted by Jia Yu <ji...@apache.org>.
Hi Pietro,

Can you please share the full stack trace of this scala.MatchError? I tried
a couple of test cases but wasn't able to reproduce the error on my end. In
fact, another user complained about the same issue a while back, so I
suspect there is a bug in this part.

I also CCed the contributor of the Sedona broadcast join, @adamq43@gmail.com
<ad...@gmail.com>. Hi Adam, do you have any idea about this issue?

Thanks,
Jia

On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p....@gmail.com>
wrote:

> Hello Jia,
>
> thank you so much for your support.
>
> We have been able to complete our task and to perform a few runs with
> different number of partitions.
> At the moment we obtained the best performance when running on 20 nodes
> and setting the number of partitions to be 2000. With this configuration,
> it took approximately 45 minutes to write the join's output.
>
> Then we tried to perform the same join through broadcast as you suggested
> to see whether we could achieve better results but actually we obtained the
> following error when calling an action like broadcast_join.show() on the
> output
>
> Py4JJavaError: An error occurred while calling o699.showString.
> : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
>
> We would be grateful if you can support us on this.
>
>
> The broadcast join was performed as follows: broadcast_join = points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'), f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>
> Attached you can find the pseudo code we used to test broadcast join.
>
>
> Looking forward to hearing from you.
>
>
> Kind regards,
>
> Pietro Greselin
>
>
> On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>
>> Hi Pietro,
>>
>> A few tips to optimize your join:
>>
>> 1. Mix DF and RDD together and use RDD API for the join part. See the
>> example here:
>> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>>
>> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
>> use a large number of partitions (say 1000 or more)
>>
>> If this approach doesn't work, consider broadcast join if needed.
>> Broadcast the polygon side:
>> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>>
>> Thanks,
>> Jia
>>
>>
>> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p....@gmail.com>
>> wrote:
>>
>>> To whom it may concern,
>>>
>>> we reported the following Sedona behaviour and would like to ask your
>>> opinion on how we can otpimize it.
>>>
>>> Our aim is to perform a inner spatial join between a points_df and a
>>> polygon_df when a point in points_df is contained in a polygon from
>>> polygons_df.
>>> Below you can find more details about the 2 dataframes we are
>>> considering:
>>> - points_df: it contains 50mln events with latitude and longitude
>>> approximated to the third decimal digit;
>>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
>>> average.
>>>
>>> The issue we are reporting is a very long computing time and the spatial
>>> join query never completing even when running on cluster with 40 workers
>>> with 4 cores each.
>>> No error is being print by driver but we are receiving the following
>>> warning:
>>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
>>> but no index exists. Will build index on the fly.
>>>
>>> Actually we were able to run successfully the same spatial join when
>>> only considering a very small sample of events.
>>> Do you have any suggestion on how we can archive the same result on
>>> higher volumes of data or if there is a way we can optimize the join?
>>>
>>> Attached you can find the pseudo-code we are running.
>>>
>>> Looking forward to hearing from you.
>>>
>>> Kind regards,
>>> Pietro Greselin
>>>
>>

Re: Spatial join performances

Posted by pietro greselin <p....@gmail.com>.
Hello Jia,

thank you so much for your support.

We have been able to complete our task and to perform a few runs with
different numbers of partitions.
So far we obtained the best performance when running on 20 nodes with the
number of partitions set to 2000. With this configuration, it took
approximately 45 minutes to write the join's output.

Then we tried to perform the same join with a broadcast, as you suggested,
to see whether we could achieve better results, but we obtained the
following error when calling an action such as broadcast_join.show() on the
output:

Py4JJavaError: An error occurred while calling o699.showString.
: scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE,
[id=#3312]

We would be grateful if you could support us on this.


The broadcast join was performed as follows:

    broadcast_join = points_df.alias('points').join(
        f.broadcast(polygons_df).alias('polygons'),
        f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))

Attached you can find the pseudo-code we used to test the broadcast join.


Looking forward to hearing from you.


Kind regards,

Pietro Greselin


On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:

> Hi Pietro,
>
> A few tips to optimize your join:
>
> 1. Mix DF and RDD together and use RDD API for the join part. See the
> example here:
> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>
> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
> use a large number of partitions (say 1000 or more)
>
> If this approach doesn't work, consider broadcast join if needed.
> Broadcast the polygon side:
> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>
> Thanks,
> Jia
>
>
> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p....@gmail.com>
> wrote:
>
>> To whom it may concern,
>>
>> we reported the following Sedona behaviour and would like to ask your
>> opinion on how we can otpimize it.
>>
>> Our aim is to perform a inner spatial join between a points_df and a
>> polygon_df when a point in points_df is contained in a polygon from
>> polygons_df.
>> Below you can find more details about the 2 dataframes we are considering:
>> - points_df: it contains 50mln events with latitude and longitude
>> approximated to the third decimal digit;
>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
>> average.
>>
>> The issue we are reporting is a very long computing time and the spatial
>> join query never completing even when running on cluster with 40 workers
>> with 4 cores each.
>> No error is being print by driver but we are receiving the following
>> warning:
>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
>> but no index exists. Will build index on the fly.
>>
>> Actually we were able to run successfully the same spatial join when only
>> considering a very small sample of events.
>> Do you have any suggestion on how we can archive the same result on
>> higher volumes of data or if there is a way we can optimize the join?
>>
>> Attached you can find the pseudo-code we are running.
>>
>> Looking forward to hearing from you.
>>
>> Kind regards,
>> Pietro Greselin
>>
>

Re: Spatial join performances

Posted by Jia Yu <ji...@apache.org>.
Hi Pietro,

A few tips to optimize your join:

1. Mix the DataFrame and RDD APIs together and use the RDD API for the join
part. See the example here:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb

2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to use
a large number of partitions (say 1000 or more). A rough sketch of this
approach is included below.
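
For reference, here is a rough sketch of this DF/RDD approach in PySpark,
assuming points_df and polygons_df already carry the geometry columns
pointshape and polygonshape (the notebook linked above is the authoritative
example):

    from sedona.core.enums import GridType, IndexType
    from sedona.core.spatialOperator import JoinQuery
    from sedona.utils.adapter import Adapter

    # Convert the DataFrames to SpatialRDDs
    point_rdd = Adapter.toSpatialRdd(points_df, "pointshape")
    polygon_rdd = Adapter.toSpatialRdd(polygons_df, "polygonshape")

    # Spatially partition the (much larger) point side into many partitions,
    # then reuse the same partitioner for the polygon side
    point_rdd.analyze()
    point_rdd.spatialPartitioning(GridType.KDBTREE, 1000)
    polygon_rdd.spatialPartitioning(point_rdd.getPartitioner())

    # Build an index on the spatially partitioned point RDD
    point_rdd.buildIndex(IndexType.QUADTREE, True)

    # useIndex=True, considerBoundaryIntersection=True;
    # result is a pair RDD of matched geometries from the two sides
    result = JoinQuery.SpatialJoinQueryFlat(point_rdd, polygon_rdd, True, True)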

If this approach doesn't work, consider a broadcast join and broadcast the
polygon side:
https://sedona.apache.org/api/sql/Optimizer/#broadcast-join

Thanks,
Jia


On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p....@gmail.com>
wrote:

> To whom it may concern,
>
> we reported the following Sedona behaviour and would like to ask your
> opinion on how we can otpimize it.
>
> Our aim is to perform a inner spatial join between a points_df and a
> polygon_df when a point in points_df is contained in a polygon from
> polygons_df.
> Below you can find more details about the 2 dataframes we are considering:
> - points_df: it contains 50mln events with latitude and longitude
> approximated to the third decimal digit;
> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
> average.
>
> The issue we are reporting is a very long computing time and the spatial
> join query never completing even when running on cluster with 40 workers
> with 4 cores each.
> No error is being print by driver but we are receiving the following
> warning:
> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> but no index exists. Will build index on the fly.
>
> Actually we were able to run successfully the same spatial join when only
> considering a very small sample of events.
> Do you have any suggestion on how we can archive the same result on higher
> volumes of data or if there is a way we can optimize the join?
>
> Attached you can find the pseudo-code we are running.
>
> Looking forward to hearing from you.
>
> Kind regards,
> Pietro Greselin
>