Posted to dev@carbondata.apache.org by Venu Reddy <k....@gmail.com> on 2019/10/16 11:31:35 UTC

[DISCUSSION]Support for Geospatial indexing

Hi all,

Databases often contain geographical location data. For instance, telecom
operators need to run analytics based on a particular region or on cell
tower IDs (within a region), and these may include geographical locations
for a particular period of time. At present, Carbon does not have native
support for storing geographical locations/coordinates or for running filter
queries on them. Longitude and latitude can still be treated as independent
columns, sorted hierarchically, and stored.

However, when longitude and latitude are treated independently, the 2D
space is linearized, i.e., points in the two-dimensional domain are ordered
by sorting first on longitude and then on latitude. Thus, data is not
ordered by geospatial proximity. As a result, range queries require a lot of
IO operations and query performance degrades.

To alleviate this, we can use a z-order curve to store geospatial data
points. This tends to place geographically nearby points in the same
block/blocklet, which reduces the IO operations for range queries and
improves query performance. It can also support polygon queries on geodata.
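
For illustration, here is a minimal sketch of how a z-order (Morton) key can be derived from a coordinate pair by interleaving bits. The whole-globe bounds, the 16-bit grid resolution and the function names are assumptions made for this example; they are not taken from the design document.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order code.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def to_cell(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Quantize a coordinate in [lo, hi] to a grid cell index (assumed scheme)."""
    cells = 1 << bits
    index = int((value - lo) / (hi - lo) * cells)
    return min(max(index, 0), cells - 1)  # clamp so that value == hi stays in range


def z_value(lon: float, lat: float, bits: int = 16) -> int:
    """Z-order key for a longitude/latitude pair on a whole-globe grid."""
    return interleave_bits(to_cell(lon, -180.0, 180.0, bits),
                           to_cell(lat, -90.0, 90.0, bits), bits)
```

Sorting rows on such a key instead of on (longitude, latitude) keeps every aligned 2x2, 4x4, ... block of grid cells in one contiguous key range, which is what lets nearby points land in the same block/blocklet more often.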

I have raised a jira, https://issues.apache.org/jira/browse/CARBONDATA-3548,
and attached the design document to it. Please have a look. Your opinions
and suggestions are welcome.

Thanks,

Re: [DISCUSSION]Support for Geospatial indexing

Posted by Jacky Li <ja...@apache.org>.

Definitely +1.

Before going through the design doc, I have two questions:
1. In this domain, there are some open-source solutions with SQL extensions or DSLs designed for geographical analytics, such as geomesa (it also works with spark). So is there any consideration of integrating with geomesa as well? Can geomesa users benefit from the CarbonData spatial index?

2. Besides the Z-order curve, there are other curves that may be useful in some use cases, like the Hilbert curve. To maximize the extensibility of CarbonData, is it possible to have a framework that supports different curve implementations?

Regards,
Jacky 

Re: [DISCUSSION]Support for Geospatial indexing

Posted by Jacky Li <ja...@qq.com>.

I am not familiar with Apache SIS. Is it already integrated with other
storage systems? Are there any pointers to learn about this?

In my opinion, this thread is discussing the indexing part in CarbonData
to accelerate geospatial queries. If Apache SIS offers an integration
framework and can provide more APIs for applications, I'd like to explore
more possibilities to broaden CarbonData's usage.

Regards,
Jacky



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION]Support for Geospatial indexing

Posted by brijoobopanna <br...@huawei.com>.

Thanks for proposing this. I would suggest exploring integration with an
already available library like Apache Spatial Information System (SIS)
rather than developing a new one: https://sis.apache.org/



Re: [DISCUSSION]Support for Geospatial indexing

Posted by Jacky Li <ja...@apache.org>.

Thanks for the analysis. Please be careful about reusing code from other open-source repos, especially with regard to the license.

Regards,
Jacky

On 2019/10/24 06:25:40, Ajantha Bhat <aj...@gmail.com> wrote:
> [...]
> so, we cannot integrate carbon with geomesa directly, but we can reuse
> some of the logic present in it like quadtree formation and look up.
> [...]
> so, we will check further about integrating carbon to GeoSpark or reusing
> some of the code from this.
> [...]

Re: [DISCUSSION]Support for Geospatial indexing

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi Jacky,

We have looked into geomesa.

[image: Screenshot from 2019-10-23 16-25-23.png]

a. Geomesa is tightly coupled with key-value databases like Accumulo,
HBase, Google Bigtable and Cassandra, and is used for OLTP queries.
b. Geomesa's current spark integration covers only the query flow; loading
from spark is not supported. Spark can be used for analytics on a geomesa
store. Here they override spark catalyst optimizer code to intercept filters
from the logical relation and push them down to the geomesa server.
All the query logic, like spatio-temporal curve building (z curve, quadtree),
doesn't happen at the spark layer. It happens in the geoserver layer, which
is coupled with the key-value databases.
https://www.geomesa.org/documentation/user/architecture.html

https://www.geomesa.org/documentation/user/spark/architecture.html

https://www.youtube.com/watch?v=Otf2jwdNaUY

c. Geomesa is for spatio-temporal data, not just spatial data.
So, we cannot integrate carbon with geomesa directly, but we can reuse
some of the logic present in it, like quadtree formation and lookup.

Also, I found another alternative, "GeoSpark". This project is not
coupled with any store.
https://datasystemslab.github.io/GeoSpark/

https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
So, we will look further into integrating carbon with GeoSpark or reusing
some of its code.

Also, regarding the second point: yes, we can make the carbon implementation
a generic framework into which different logic can be plugged.

Thanks,
Ajantha

On Mon, Oct 21, 2019 at 6:34 PM Indhumathi <in...@gmail.com> wrote:

> Hi Venu,
>
> I have some questions regarding this feature.
>
> 1. Is the geospatial index supported on streaming tables? If so, will there
>    be any impact on generating the geoIndex on streaming data?
> 2. Does it have any restrictions on sort_scope?
> 3. Apart from Point and Polygon queries, will the geospatial index also
>    support aggregation queries on geographical location data?
>
> Thanks & Regards,
> Indhumathi

Re: [DISCUSSION]Support for Geospatial indexing

Posted by VenuReddy <k....@gmail.com>.

@Indhumathi Please find my replies inline.

1. Is the geospatial index supported on streaming tables? If so, will there
be any impact on generating the geoIndex on streaming data?
=> Yes, streaming tables can be supported as well. But we shall restrict it
for now and enhance it in the future.
2. Does it have any restrictions on sort_scope?
=> There is no restriction on sort_scope. The existing sort_scope applies
to it unchanged.
3. Apart from Point and Polygon queries, will the geospatial index also
support aggregation queries on geographical location data?
=> For now, we shall restrict it to polygon queries. But IMO, it can be
extended to multiple types of queries.



Re: [DISCUSSION]Support for Geospatial indexing

Posted by Indhumathi <in...@gmail.com>.

Hi Venu,

I have some questions regarding this feature.

1. Is the geospatial index supported on streaming tables? If so, will there
   be any impact on generating the geoIndex on streaming data?
2. Does it have any restrictions on sort_scope?
3. Apart from Point and Polygon queries, will the geospatial index also
   support aggregation queries on geographical location data?

Thanks & Regards,
Indhumathi




Re: [DISCUSSION]Support for Geospatial indexing

Posted by VenuReddy <k....@gmail.com>.

1. Would a table with a geospatial location column be allowed to be updated
with non-geospatial data and vice versa? Or would it follow the existing
behavior, with any unsupported data in the type/column treated as bad
records?
=> Location columns cannot accept invalid data types. Unsupported data in
the type/column can be treated as bad records.

2. Would there be any limitations on configuring the targetColumn as a
local dictionary, inverted index, cache column or range column in table
properties?
=> I think there shouldn't be any such restriction. TargetColumn is just an
additional column generated internally when the INDEX property is specified.

3. Would only measure data types be supported for the targetDataType
parameter? Supported types can be mentioned in the design doc.
=> We can treat the generated geohash column as a dimension column, as it
should be part of the sort columns.



Re: [DISCUSSION]Support for Geospatial indexing

Posted by chetan bhat <ch...@gmail.com>.

Hi Venu,

1. Would a table with a geospatial location column be allowed to be updated with non-geospatial data and vice versa? Or would it follow the existing behavior, with any unsupported data in the type/column treated as bad records?
2. Would there be any limitations on configuring the targetColumn as a local dictionary, inverted index, cache column or range column in table properties?
3. Would only measure data types be supported for the targetDataType parameter? Supported types can be mentioned in the design doc.

Regards
Chetan

Re: [DISCUSSION]Support for Geospatial indexing

Posted by Kumar Vishal <ku...@gmail.com>.

+1
Regards
Kumar Vishal


On Fri, Nov 29, 2019 at 4:36 PM Ajantha Bhat <aj...@gmail.com> wrote:

> @venu: ok. +1
>
> On Fri, Nov 29, 2019 at 3:48 PM VenuReddy <k....@gmail.com>
> wrote:
>
> > @Ajantha
> > Agreed. Have updated the design doc as suggested.
>

Re: [DISCUSSION]Support for Geospatial indexing

Posted by Ajantha Bhat <aj...@gmail.com>.

@venu: ok. +1

On Fri, Nov 29, 2019 at 3:48 PM VenuReddy <k....@gmail.com> wrote:

> @Ajantha
> Agreed. Have updated the design doc as suggested.

Re: [DISCUSSION]Support for Geospatial indexing

Posted by VenuReddy <k....@gmail.com>.

@Ajantha
Agreed. Have updated the design doc as suggested.



Re: [DISCUSSION]Support for Geospatial indexing

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi Venu,

1. Please keep the default implementation independent of grid size and
other parameters. I mean the parameters below:
'INDEX_HANDLER.xxx.gridSize',
'INDEX_HANDLER.xxx.minLongitude',
'INDEX_HANDLER.xxx.maxLongitude',
'INDEX_HANDLER.xxx.minLatitude',
'INDEX_HANDLER.xxx.maxLatitude',

It should work with just longitude, latitude and index type, with float as
the default data type for longitude and latitude.
The quadtree logic can be generic: it takes a geohash id and returns ranges,
so it can work for all implementations.

A custom implementation for grid size and other parameters can be added if
required.

2. In the describe formatted output, instead of listing it under non-schema
columns, we can show it as Custom Index Information.
It would also be better to show the custom index handler name and the
source columns used in the describe output.

# Custom Index Information
custom index handler class :
custom index handler type :
custom index column name :
custom index column data type :
custom index source columns :

We can skip this section entirely if the property is not configured.
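
As a rough illustration of the generic quadtree idea in point 1, the sketch below covers a query rectangle with contiguous ranges of z-order codes. It is only a sketch under assumptions: the input is a rectangle of integer grid cells rather than a geohash id or a polygon, x bits occupy the even code positions and y bits the odd ones, and the name `z_ranges` is made up for this example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive [start, end] range of z-order codes


def z_ranges(qx0: int, qy0: int, qx1: int, qy1: int, bits: int) -> List[Range]:
    """Cover the cell rectangle [qx0, qx1] x [qy0, qy1] with z-order code ranges.

    The grid has 2**bits cells per side. Every quadtree node corresponds to
    one contiguous range of codes, so fully covered nodes are emitted whole.
    """
    ranges: List[Range] = []

    def visit(x0: int, y0: int, size: int, code: int) -> None:
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
            return  # node lies completely outside the query rectangle
        if qx0 <= x0 and x1 <= qx1 and qy0 <= y0 and y1 <= qy1:
            start, end = code, code + size * size - 1
            if ranges and ranges[-1][1] + 1 == start:
                ranges[-1] = (ranges[-1][0], end)  # merge with the previous range
            else:
                ranges.append((start, end))
            return
        half = size // 2
        # Children in z-order: low bit of the quadrant selects x, high bit selects y.
        for quadrant in range(4):
            dx, dy = quadrant & 1, quadrant >> 1
            visit(x0 + dx * half, y0 + dy * half, half,
                  code + quadrant * half * half)

    visit(0, 0, 1 << bits, 0)
    return ranges
```

A filter on the generated index column could then be rewritten as a union of range predicates over the returned ranges. A polygon query would need a polygon/cell intersection test in place of the rectangle checks.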

Thanks,
Ajantha

On Tue, Nov 26, 2019 at 8:38 PM VenuReddy <k....@gmail.com> wrote:

> Hi all,
>
> I've refreshed the design document in jira. Have incorporated changes to
> table properties and fixed review comments.
> Please find  the latest design doc at
> https://issues.apache.org/jira/browse/CARBONDATA-3548
> Request review and let me know your opinion.
>
> Thanks,
> Venu Reddy

Re: [DISCUSSION]Support for Geospatial indexing

Posted by VenuReddy <k....@gmail.com>.

Hi all,

I've refreshed the design document in jira, incorporating the changes to
table properties and addressing the review comments.
Please find the latest design doc at
https://issues.apache.org/jira/browse/CARBONDATA-3548
Please review it and let me know your opinion.

Thanks,
Venu Reddy



Re: [DISCUSSION]Support for Geospatial indexing

Posted by VenuReddy <k....@gmail.com>.

@xuchuanyin
With this new property, we can create a non-schema column internally and
generate a customized value for it, each time a row is added, from the
existing schema column values (i.e., from the source column values). Note
that the source columns are specified with the property.
Since the intent of creating this column was to use it as a sort column
too, we can implicitly add it to the configured sort column list: during
table creation, we append it to the existing sort column list. And if we
want to change the order of the sort columns, we can use the existing alter
table set table properties for sort columns.

The way data is sorted remains the same as at present, except that the
additional non-schema column is also taken into account during the sort.
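
A small sketch of this flow, with hypothetical names throughout (`with_generated_sort_column`, `load`, `toy_handler` and the `geo_id` column are invented for illustration, and the toy handler is a stand-in rather than a real geohash):

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def with_generated_sort_column(sort_columns: Sequence[str], generated: str) -> List[str]:
    """Append the internally generated column to the configured sort columns."""
    columns = list(sort_columns)
    if generated not in columns:
        columns.append(generated)
    return columns


def load(rows: Sequence[Row], source_columns: Sequence[str], generated: str,
         handler: Callable[..., int], sort_columns: Sequence[str]) -> List[Row]:
    """Fill the generated column on every row, then sort on the sort columns."""
    loaded = []
    for row in rows:
        # The value is derived only from the source columns named in the property.
        value = handler(*(row[name] for name in source_columns))
        loaded.append({**row, generated: value})
    return sorted(loaded, key=lambda r: tuple(r[name] for name in sort_columns))


def toy_handler(longitude: float, latitude: float) -> int:
    """Stand-in for a real index handler; NOT an actual geohash."""
    return int(longitude) * 1000 + int(latitude)
```

The sort itself is the ordinary multi-column sort; the only difference is the extra key at the end of the list, which matches the point that the sorting mechanism stays as it is today.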



Re: [DISCUSSION]Support for Geospatial indexing

Posted by xuchuanyin <xu...@apache.org>.

Sorry, I cannot access the document in jira.

In my opinion, with both SORT_COLUMNS in the current implementation and
LOCATION_COLUMNS in the proposal, carbondata tries to organize the data
in some order.

So the kernel of the proposal is that, for SORT_COLUMNS, we can specify
an algorithm for the sort order. By default it is MIN_MAX-INCREASE sort,
and more sorts can be introduced, such as INVERSE_MIN_MAX-INCREASE sort or
Z-ORDER sort.

For the implementation, we can define a sort factory and implement some of
these, and also keep it open for user customization, just like the
column_compressor we implemented before.
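
To make the sort factory suggestion concrete, here is a minimal sketch of a registry of named sort orders that users can extend. The function names and the registry are hypothetical and do not correspond to any existing CarbonData API; the two registered names follow the ones used above.

```python
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]            # (x, y) grid cell, for illustration only
KeyFn = Callable[[Point], object]  # maps a point to its sort key

_SORT_ORDERS: Dict[str, KeyFn] = {}


def register_sort_order(name: str, key_fn: KeyFn) -> None:
    """Register a sort order under a case-insensitive name."""
    _SORT_ORDERS[name.upper()] = key_fn


def get_sort_order(name: str) -> KeyFn:
    """Look up a registered sort order, failing clearly on unknown names."""
    try:
        return _SORT_ORDERS[name.upper()]
    except KeyError:
        raise ValueError(f"unknown sort order: {name}") from None


def _z_order_key(point: Point) -> int:
    """Interleave the low 16 bits of x (even positions) and y (odd positions)."""
    x, y = point
    code = 0
    for i in range(16):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Built-in orders; users could register their own in the same way.
register_sort_order("MIN_MAX-INCREASE", lambda point: point)
register_sort_order("Z-ORDER", _z_order_key)
```

A load could then call `sorted(points, key=get_sort_order(configured_name))`, so supporting another curve (for example a Hilbert curve) only means registering one more key function.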


