Posted to dev@sis.apache.org by Martin Desruisseaux <ma...@geomatys.com> on 2015/11/10 12:09:16 UTC

Long-term thoughts about big-data queries in SIS

Hello all

At the Apache Big Data conference in Budapest, I attended some
meetings about exploiting geospatial big data using the SQL language. I
thought that we could make some long-term plans that could impact the
SIS-180 (Place a crude JDBC driver over Dbase files) work [1]. This
email is not a request for any change now; it is just a proposal about
some possible long-term plans.

In one or two years, Apache SIS will hopefully have some DataStore
implementations ready for production use. But we have a strong request
for the capability to use DataStores with big-data technologies like
Hadoop. This request increases the challenge of writing a SQL driver,
since a sophisticated SQL driver would need to be able to restructure
query plans according to the available clusters.

I had a discussion with people from the Apache Drill project
(https://drill.apache.org/), which already provides SQL parsing
capabilities in various big-data environments. In my understanding,
instead of writing our own SQL parser in SIS we could take the
following approach:

 1. Complete the org.apache.sis.storage.DataStore API (it is currently
    very minimalist).
 2. Have the ShapeFile store extend the abstract SIS DataStore.
 3. In a separate module, write a "SIS DataStore to Drill DataStore"
    adapter. It should work for any SIS DataStore, not only the
    ShapeFile one.

In my understanding, once we have a Drill DataStore implementation (I do
not know yet what the exact name is in the Drill API), we should
automatically get big-data-ready SQL for any SIS DataStore. If for any
reason the Drill DataStore is considered not suitable, we could fall
back on Apache Calcite (http://calcite.apache.org/), which is the SQL
parser used under the hood by Drill. Another project that may be worth
exploring is Magellan: Geospatial Analytics on Spark [2].

My proposal can be summarized as follows: maybe in 2016 or 2017, we
could consider putting the SIS SQL support in its own module and
allowing it to run not only on ShapeFile, but on any SIS DataStore, if
possible using a technology like Drill designed for big-data
environments.

Any thoughts?

    Martin


[1] https://issues.apache.org/jira/browse/SIS-180
[2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/



Re: Long-term thoughts about big-data queries in SIS

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Maybe also look at new big-data tech like AsterixDB in the Incubator,
and also at search tech like Lucene and/or Solr?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Long-term thoughts about big-data queries in SIS

Posted by Marc Le Bihan <ml...@gmail.com>.
Calcite seems impressive, yes!
It's worth studying.




Re: Long-term thoughts about big-data queries in SIS

Posted by Adam Estrada <es...@gmail.com>.
+1 as a separate module or even part of the geometry module.

A


Re: Long-term thoughts about big-data queries in SIS

Posted by Martin Desruisseaux <ma...@geomatys.com>.
On 11/11/15 14:22, Martin Desruisseaux wrote:
> We could also go one step further and try to use
> http://drill.apache.org/ instead of Calcite, in anticipation of
> big data. However since Drill uses Calcite under the hood, it is
> probably fine to start with Calcite for now since it would not
> introduce any additional dependency compared to Drill.

Just for clarification: if we introduce such a dependency, in my opinion
it should not be in the core of SIS (my hope is to have the core depend
on nothing other than GeoAPI and Unit of Measurement). I would rather
see it as a separate SIS module.

    Martin



Re: Long-term thoughts about big-data queries in SIS

Posted by Martin Desruisseaux <ma...@geomatys.com>.
On 10/11/15 19:44, Marc Le Bihan wrote:
>    3) Parsing of statements is now my main difficulty, and this
> subject started a debate a few months ago: if I continue clause by
> clause (attempting to detect a GROUP BY, a HAVING, a LIKE ...
> "manually"), it will be long and difficult.
>    If I use a parser like ANTLR, it will be powerful and complete, but
> this API is known to be really hard to handle and to get working
> perfectly. I have used it four times, but I still fear it each time I
> use it. But I think it is the only solution.

Could http://calcite.apache.org/ free us from this task? If I
understand correctly, we would just need to implement some methods that
are automatically invoked by Calcite. So we would have no SQL parser to
write at all and no JDBC interface to implement ourselves. According to
their documentation, Calcite already implements SELECT, FROM (including
JOIN), WHERE, GROUP BY (including GROUPING SETS), COUNT(DISTINCT …),
FILTER, HAVING, ORDER BY (including NULLS FIRST/LAST), UNION,
INTERSECT, MINUS, sub-queries and more.

However, one open question is whether it is easy or hard to add our own
SQL functions to Calcite, since we will need to provide geometry
functions. I do not know the answer to that question at this time.

Calcite provides an example that uses a CSV file as a database. We
could copy this example and replace the code that reads from the CSV
file with code that reads from a Shapefile.
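As a rough, self-contained sketch of that division of labour (no
Calcite dependency; `ScannableRows` and every class name below are
hypothetical stand-ins, loosely modelled on the kind of table contract
Calcite's CSV example implements): the table only knows how to
enumerate its rows, and the SQL engine applies the parsing, filtering
and projection on top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the table contract: the table only
// enumerates rows; the engine does WHERE, projection, etc. on top.
interface ScannableRows {
    List<String> columnNames();
    Iterator<Object[]> scan();   // full scan; the engine filters the rows
}

// A Shapefile-backed table would read records from the .dbf file here;
// this demo just serves rows from memory.
class DemoFeatureTable implements ScannableRows {
    private final List<Object[]> rows = Arrays.asList(
            new Object[] {"Paris",    2_140_000},
            new Object[] {"Lyon",       515_000},
            new Object[] {"Budapest", 1_750_000});

    public List<String> columnNames() { return Arrays.asList("NAME", "POP"); }
    public Iterator<Object[]> scan()  { return rows.iterator(); }
}

public class ScanDemo {
    // What the engine conceptually does for
    // "SELECT NAME FROM cities WHERE POP > 1000000".
    public static List<Object> select(ScannableRows table, String column,
                                      Predicate<Object[]> where) {
        int i = table.columnNames().indexOf(column);
        List<Object> result = new ArrayList<>();
        for (Iterator<Object[]> it = table.scan(); it.hasNext();) {
            Object[] row = it.next();
            if (where.test(row)) result.add(row[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(select(new DemoFeatureTable(), "NAME",
                row -> (Integer) row[1] > 1_000_000));   // [Paris, Budapest]
    }
}
```

The point is that the table-side code stays trivial: all the SQL
complexity lives in the engine, which is exactly what we would delegate
to Calcite.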

We could also go one step further and try to use
http://drill.apache.org/ instead of Calcite, in anticipation of big
data. However, since Drill uses Calcite under the hood, it is probably
fine to start with Calcite for now, since it would not introduce any
additional dependency compared to Drill.

    Martin



Re: Long-term thoughts about big-data queries in SIS

Posted by Marc Le Bihan <ml...@gmail.com>.
Hello,

About the development of the SQL driver, here is a status update:
    1) The driver currently works for SELECT statements over DBase III
files, provided that the WHERE clause (if any) is limited to simple
conditions.

    2) I currently use it with real DBase III files coming from various
places, in order to challenge it.

    3) Parsing of statements is now my main difficulty, and this
subject started a debate a few months ago: if I continue clause by
clause (attempting to detect a GROUP BY, a HAVING, a LIKE ...
"manually"), it will be long and difficult.
    If I use a parser like ANTLR, it will be powerful and complete, but
this API is known to be really hard to handle and to get working
perfectly. I have used it four times, but I still fear it each time I
use it. But I think it is the only solution.

    4) The UPDATE statement could come quickly.

    5) For DELETE, I have to see whether a logical delete can be done
to avoid rewriting the whole file; and for INSERT, a new entry will
have to be written. It is not easy, because if an index file comes with
the DBase III file, I have to update it too.
    I also have to find a way for the Shapefile reader to keep
following the content of the DBase file: if I delete a record in the
DBase file, the associated entry in the Shapefile should no longer be
valid, for example.
    And removals or insertions in the Shapefile would require changes
in its index files too.

    6) The interfaces and abstract classes used to help implement the
DBase III connection, statement, result set and metadata will be
helpful for developing another driver for another kind of database, if
needed. But these abstract classes may still change:
    I expect many discoveries before the end.

    7) What is the right choice after this first part of the work (the
CRUD operations): being able to handle transactions, or being able to
implement the JPA interfaces? Both are valuable goals.

Regards,

Marc Le Bihan




Re: Long-term thoughts about big-data queries in SIS

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Indeed, we saw many Rasdaman presentations in OGC meetings. My feeling
is that Drill + SIS (after one more year of development) could provide
something similar.

PostGIS raster is a slightly different story, since it stores the
pixel values in the database itself (in my understanding). This is
convenient for getting a backup of everything with a single PostgreSQL
dump. But I tend to prefer keeping the raster data in ordinary files,
if possible in their original format (except when we need to create
tiles and pyramids for performance reasons), and using the database as
an index or a catalogue. I think this is also closer to the Rasdaman
approach. However, we should be able to support both approaches and
let the user choose what (s)he prefers.
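A minimal sketch of that index/catalogue idea (hypothetical names; an
in-memory map stands in for what would really be a spatially indexed
database table): the pixels stay in the original files on disk, and the
catalogue only answers which files intersect a query extent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The catalogue stores only each raster file's path and geographic
// extent; pixel data is read from the files themselves on demand.
public class RasterCatalogue {
    public static final class Extent {
        final double west, south, east, north;
        public Extent(double west, double south, double east, double north) {
            this.west = west; this.south = south;
            this.east = east; this.north = north;
        }
        boolean intersects(Extent o) {
            return west <= o.east && o.west <= east
                && south <= o.north && o.south <= north;
        }
    }

    private final Map<String, Extent> entries = new LinkedHashMap<>();

    public void register(String path, Extent extent) {
        entries.put(path, extent);
    }

    // Returns the files whose extent intersects the query window; the
    // caller then opens only those files to read pixels.
    public List<String> filesCovering(Extent query) {
        List<String> hits = new ArrayList<>();
        entries.forEach((path, e) -> { if (e.intersects(query)) hits.add(path); });
        return hits;
    }

    public static void main(String[] args) {
        RasterCatalogue cat = new RasterCatalogue();
        cat.register("tiles/europe_w.tif", new Extent(-10, 35, 10, 60));
        cat.register("tiles/europe_e.tif", new Extent( 10, 35, 30, 60));
        System.out.println(cat.filesCovering(new Extent(5, 40, 12, 50)));
        // [tiles/europe_w.tif, tiles/europe_e.tif]
    }
}
```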

    Martin




Re: Long-term thoughts about big-data queries in SIS

Posted by Adam Estrada <es...@gmail.com>.
Martin,

This is extremely cool and much needed in the geospatial community! My
company, DigitalGlobe, has done a lot with this and has open-sourced
many of the packages, which can be found on GitHub today. Rasdaman [1]
and PostGIS Raster are other open-source examples of how to do this in
relational databases. We have also done a lot of research on how to
store pixels and query them in HBase/Hadoop and Elasticsearch. There
are many options for this one!

Adam

[1] http://rasdaman.org/
