You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@carbondata.apache.org by Indhumathi <in...@gmail.com> on 2021/03/22 10:37:00 UTC

[DISCUSSION] Support JOIN query with spatial index

Hi community,

Currently, carbon supports IN_POLYGON and IN_POLYGON_LIST udf's, 
where user has to manually provide the polygon points(series of latitude 
and longitude pair), to query carbon table based on spatial index.

This feature will support JOIN tables based on IN_POLYGON udf 
filter, where polygon data exists in a table.

Please find below link of design doc. Please check and give 
your inputs/suggestions. 

https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p


Thanks & Regards,
Indhumathi M



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by VenuReddy <k....@gmail.com>.

+1



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Ajantha Bhat <aj...@gmail.com>.

ok.
+1 from my side.

If polygon join query still has performance bottleneck, we can later
optimize it.

Thanks,
Ajantha

On Tue, Apr 27, 2021 at 3:59 PM Indhumathi <in...@gmail.com> wrote:

> Thanks Ajantha for your inputs.
>
> I have modified the design, by adding ToRangeList Udf filter as a implicit
> column projection to the polygon table dataframe and modified the JOIN
> condition with range list udf column, in order to improve performance.
>
> By this way, we can avoid making quadtree from N*M times to M times.
> I have attached new design document in the JIRA.
> CARBONDATA-4166 <https://issues.apache.org/jira/browse/CARBONDATA-4166>
>
> Regards,
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by David CaiQiang <da...@gmail.com>.

+1



-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Indhumathi <in...@gmail.com>.

Thanks Ajantha for your inputs.

I have modified the design, by adding ToRangeList Udf filter as a implicit 
column projection to the polygon table dataframe and modified the JOIN 
condition with range list udf column, in order to improve performance.

By this way, we can avoid making quadtree from N*M times to M times.
I have attached new design document in the JIRA.
CARBONDATA-4166 <https://issues.apache.org/jira/browse/CARBONDATA-4166>  

Regards,
Indhumathi



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi,
I think now the latest document has addressed my previous comments and
questions.

polygon list query and polyline list query design looks ok.

But the design of polygon query with join, I have performance concern.
In this approach, we are using union polygon filter on spatial_table to
prune till blocklet.
It may identify all the rows in blocklet in the worst case and with this
output (N) we will perform join with the polygon table output(M).
which will again check IN_POLYGON condition during join (N*M) times. I too
don't have any different solution at the moment.

But we can optimize the current solution further by below points:
a) Here for the polygon table output you can reduce making quadtree for N*M
times to M times and use the quadtree output as range filter/UDF for join.
b) Also later if we need more improvement, maybe we can try row-level
filtering on the spatial table.

Thanks,
Ajantha

On Thu, Apr 15, 2021 at 9:37 PM Indhumathi <in...@gmail.com> wrote:

> Hello all,
>
> Please find the design document link attached in JIRA,  CARBONDATA-4166
> <https://issues.apache.org/jira/browse/CARBONDATA-4166>
> Any inputs/suggestions from the community is most welcomed.
>
> Regards,
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Indhumathi <in...@gmail.com>.

Hello all,

Please find the design document link attached in JIRA,  CARBONDATA-4166
<https://issues.apache.org/jira/browse/CARBONDATA-4166>  
Any inputs/suggestions from the community is most welcomed.

Regards,
Indhumathi



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Indhumathi <in...@gmail.com>.

Hello all,

Current design is based on Union of polygons identified from 
polygon table.

Based on discussion with customer, need to change the design to
support IN_POLYGON_JOIN in below way.

Apply IN_POLYGON udf on each polygon identified from Polygon table
and apply aggregation/group by each polygon result.

For example:
Select sum(t1.col1),t2.polygon,t2.type
from table1 t1
inner join
(select polygon,type from table2 where type='x') t2
on in_polygon_join(t1, t2.polygon)
group by t2.polygon, t2.type

table1:
------------+-------------------
Col1       + mygeohash |
------------+------------------
1             |01                 |
2             |02                 |
3             |03                 |
4             |04                 |
5             |05                 |
6             |06                 |
------------+------------------

table2:
-----------------+------------------
polygon        + type             |
-----------------+------------------
1_Polygon() |x                   |
2_Polygon() |y                   |
3_Polygon() |x                   |
4_Polygon() |r                    |
------------+-----------------------

If 1_polygon lies in range (0 & 1), 3_polygon in (5 & 6),
result could be like,
-----------------+------------------+------------
sum(t1.col1  + polygon        + type      |
-----------------+------------------+------------
3                  + 1_Polygon() | x             |
11                + 3_Polygon() | x             |
-----------------+-----------------+--------------

To achieve this, a solution could be to, run query with IN_POLYGON
udf of each polygon and  finally make a union of query result.

Will update design on further analysis.
Any opinion or suggestions are welcomed.

Thanks, 
Indhumathi M



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi, I have some doubts and suggestions for the same.

Currently, we support these UDFs --> IN_POLYGON, IN_POLYGON_LIST,
IN_POLYLINE_LIST, IN_POLYGON_RANGE_LIST
but the user needs to give polygon input manually and as polygon can have
many points, it is hard to give manually.
So, your requirement is to give new UDF , IN_POLYGON_JOIN where polygon
inputs are present in another table and you want to join with the main
table.

*please find my doubts below*
a. why to do join ? when you can form a subquery to query the polygons from
table2 and give it as input for IN_POLYGON UDF and other existing UDFs
b. no need to support the same for  IN_POLYLINE_LIST
and IN_POLYGON_RANGE_LIST UDF also ?

*Suggestions:*
a. Table names and queries are not matching, please update
b. The query doesn't look like the union query as explained in the diagram,
please update and explain
c. Please consider some sample data with examples for t1 and t2. Also,
provide the expected query result also
d. Also mention, how to select data from single polygon and multi polygon
from the tables.

Thanks,
Ajantha

On Tue, Mar 30, 2021 at 11:25 AM Kunal Kapoor <ku...@gmail.com>
wrote:

> +1
>
> On Mon, Mar 22, 2021 at 4:07 PM Indhumathi <in...@gmail.com>
> wrote:
>
> > Hi community,
> >
> > Currently, carbon supports IN_POLYGON and IN_POLYGON_LIST udf's,
> > where user has to manually provide the polygon points(series of latitude
> > and longitude pair), to query carbon table based on spatial index.
> >
> > This feature will support JOIN tables based on IN_POLYGON udf
> > filter, where polygon data exists in a table.
> >
> > Please find below link of design doc. Please check and give
> > your inputs/suggestions.
> >
> >
> >
> https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p
> >
> >
> > Thanks & Regards,
> > Indhumathi M
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>

Re: [DISCUSSION] Support JOIN query with spatial index

Posted by Kunal Kapoor <ku...@gmail.com>.

+1

On Mon, Mar 22, 2021 at 4:07 PM Indhumathi <in...@gmail.com> wrote:

> Hi community,
>
> Currently, carbon supports IN_POLYGON and IN_POLYGON_LIST udf's,
> where user has to manually provide the polygon points(series of latitude
> and longitude pair), to query carbon table based on spatial index.
>
> This feature will support JOIN tables based on IN_POLYGON udf
> filter, where polygon data exists in a table.
>
> Please find below link of design doc. Please check and give
> your inputs/suggestions.
>
>
> https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p
>
>
> Thanks & Regards,
> Indhumathi M
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>