Posted to geospatial@apache.org by Julian Hyde <jh...@gmail.com> on 2021/06/25 18:50:59 UTC

Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Cc += geospatial@.

I think allowing WKB and WKT is sufficient.

Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID (spatial reference identifier) is almost always needed to qualify a geometry value. It is analogous to how TimeZone is needed (implicitly or explicitly) to qualify a DateTime value.
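A minimal sketch of how such a composite (WKB, SRID) value could look in practice, assuming pyarrow; the field names and the placeholder bytes below are illustrative only, not taken from any spec:

    # Sketch only: pair each WKB-encoded geometry with its SRID in a struct column.
    import pyarrow as pa

    geometry_type = pa.struct([
        pa.field("wkb", pa.binary()),   # geometry serialized as WKB
        pa.field("srid", pa.int32()),   # spatial reference identifier
    ])

    # Two toy rows; the byte strings are placeholders, not real WKB payloads.
    geometries = pa.array(
        [
            {"wkb": b"\x01\x01\x00\x00\x00...", "srid": 4326},
            {"wkb": b"\x01\x02\x00\x00\x00...", "srid": 3857},
        ],
        type=geometry_type,
    )

    table = pa.table({"geometry": geometries, "name": ["a", "b"]})
    print(table.schema)

A dedicated extension type or a schema-metadata convention could serve the same purpose; the struct is simply the most direct way to keep the SRID attached to each geometry value.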

For geospatial queries to perform well, some kind of indexing (and/or clever data organization) is required. Geospatial indexing is very complex, and there is no “one size fits all” approach. So I recommend that Arrow stay out of the indexing business and leave indexing to the engine.

Julian


> On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <ma...@uc.cl.INVALID> wrote:
> 
> Dear Jon
> 
> Thanks for sending this. Based on previous projects, WKB works well with
> SQLite, DuckDB and others, at the expense of larger columns compared to
> PostGIS.
> 
> For experimentation, it could be interesting to use the CENSO 2017
> shapefiles: https://github.com/ropensci/censo2017-cartografias;
> https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
> These include rivers, streets, and so on.
> 
> Given that Arrow installs in a very straightforward way (on Windows, at
> least), creating something based on PostGIS is probably not a bad idea,
> but WKB works fine and integrates with no problems with the sf package.
> I see a clear compression advantage here if we decide to use WKB, as LZ4
> should make the result very lightweight compared to, say, a CSV.
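As an illustration of the compression point above, a sketch assuming pyarrow and shapely; the file name is arbitrary and the data is synthetic:

    # Sketch only: write a WKB column to an LZ4-compressed Feather (Arrow IPC) file.
    import pyarrow as pa
    import pyarrow.feather as feather
    from shapely.geometry import Point

    geometry = pa.array([Point(i, i).wkb for i in range(1000)], type=pa.binary())
    table = pa.table({"id": list(range(1000)), "geometry": geometry})

    # Feather v2 is the Arrow IPC file format; LZ4 keeps the file compact.
    feather.write_feather(table, "geoms.feather", compression="lz4")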
> 
> Best,
> 
> 
> 
> 
> 
> 
> 
> On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jk...@gmail.com> wrote:
> 
>> Hello,
>> 
>> There is an emerging spec[1] for how to store geospatial data in Arrow
>> + pass through parquet files in the geopandas world. There is even a
>> new R package that implements a wrapper to do the same in R[2]. These
>> both define a serialization[3] for storing geospatial data as an Arrow
>> table (and thus also when saving to parquet with Arrow).
>> 
>> I could see a number of ways that we might interact with standards
>> like these, and for any of these that we pursue it would be good to
>> clarify that in our docs:
>> 
>> 1. Point to the standard — we could mention that this standard exists
>> and that if someone is building a geospatial data aware application,
>> they _could_ refer to this standard if they want to.
>> 2. Adopt a/this standard — this could range from stating that we've
>> adopted it as the way that spatial data _ought_ to be stored to asking
>> the creators if maintaining it within the Arrow project itself would
>> be better (either by adopting it or creating a fork — of course
>> communication with the folks working on it now would be critical!)
>> 3. Create extension type(s) for geospatial data — this would require
>> adopting a standard like the one linked, but on top of that providing
>> an extension type within Arrow itself that the various clients could
>> implement as they saw fit.
>> 4. Create new, fully separate type(s) for geospatial data — again,
>> this would require adopting a standard of some sort, but we would
>> implement it as a specific type and presumably support it in all of
>> the clients as we could.
>> 
>> There are of course pros and cons to all of these. This type of data
>> *is* somewhat specialized and I don't think we want to have a huge
>> profusion of types for all of the possible specialized data types out
>> there. But, at a minimum we should acknowledge (or adopt) a standard
>> if it exists and encourage implementations that use Arrow to follow
>> that standard (like sfarrow does to be compatible with geopandas) so
>> that some level of interoperability is there + people aren't needing
>> to reinvent the wheel each time they store spatial data.
>> 
>> Thoughts? Are there other projects out there that already do something
>> like this with Arrow that we should consider?
>> 
>> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
>> [2] https://github.com/wcjochem/sfarrow
>> [3] for now they create a binary WKB column + attach a bit of metadata
>> to the schema indicating what was done, though there are other ways one
>> could encode this, and the spec might include other way(s) to store
>> this data in the future.
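A rough sketch of the approach footnote [3] describes, assuming pyarrow; the metadata key and JSON payload below are hypothetical stand-ins, since the actual keys and layout are defined by the geo-arrow-spec itself:

    # Sketch only: store geometries as a plain binary (WKB) column and mark it
    # via schema metadata so readers know how to interpret it.
    import json
    import pyarrow as pa

    wkb_values = pa.array([b"\x01\x01\x00\x00\x00...", None], type=pa.binary())
    table = pa.table({"geometry": wkb_values, "name": ["a", "b"]})

    # Hypothetical metadata payload; the real spec defines its own keys and layout.
    geo_metadata = {
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "crs": "EPSG:4326"}},
    }
    table = table.replace_schema_metadata({"geo": json.dumps(geo_metadata)})
    print(table.schema.metadata)

Field-level metadata could carry the same information; either way the geometry column itself remains a plain binary column.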
>> 
>> -Jon
>> 
> 
> 
> -- 
> —
> *Mauricio 'Pachá' Vargas Sepúlveda*
> Site: pacha.dev
> Blog: pacha.dev/blog


Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Martin Desruisseaux <ma...@geomatys.com>.
Hello Josh!

On 28/06/2021 at 22:18, Joshua Lieberman wrote:

> This is Josh Lieberman, on the staff of the OGC. I’ve been interested 
> for a while in helping to standardize geospatial data types for 
> various python packages and other mainstream computing tools. The goal 
> is to help base these on common, valid geospatial models in order to 
> make data sharing across tools and platforms more reliable.
>
I'm very interested in that! OGC GeoAPI [1] is actually trying to do 
exactly that and contains a Python package [2] in addition to Java [3], 
covering only metadata and CRS for now. Those Python interfaces have 
not been formally released yet, so the floor is open for feedback, 
changes, etc.

We have not had a GeoAPI SWG meeting at OGC for a while, in part because 
I'm hoping for face-to-face meetings to resume and in part because I 
have difficulty putting enough time into it. If someone is interested in 
being GeoAPI chair or co-chair, I think it would help a lot.

Outside OGC, the OpenEO effort [4] has a similar goal. From my 
understanding of the presentation they gave a few years ago, the two 
projects differ in their approach: GeoAPI takes a more "bottom to top" 
approach, trying to secure the foundations (CRS, metadata, features, 
etc.) before moving to the next level, while OpenEO focuses more 
directly on top-level services.


> Another goal is to help avoid reinventing wheels. For example, PostGIS 
> supports EWKT and EWKB, which already include SRIDs.
>
Yes, but I think those SRIDs cannot really be used outside the 
context of the PostGIS database that contains the EWKT or EWKB. It seems 
to me (I may be wrong) that they are just integer codes without 
authority names. People often assume they are EPSG codes, but this 
is not necessarily the case; we have to check the "spatial_ref_sys" 
table for the actual definition. So if a geometry encoded in EWKB needs 
to be used by a process outside the PostGIS database, the SRID may need 
to be resolved to a full CRS definition if we want to be unambiguous.
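To make that concrete, a sketch assuming psycopg2 and an existing PostGIS database; the connection string and the SRID are placeholders:

    # Sketch only: a bare SRID is just an integer; look it up in spatial_ref_sys
    # to obtain the authority (often, but not always, EPSG) and the full CRS WKT.
    import psycopg2

    conn = psycopg2.connect("dbname=gis user=gis")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT auth_name, auth_srid, srtext FROM spatial_ref_sys WHERE srid = %s",
            (4326,),
        )
        row = cur.fetchone()
        if row is None:
            print("SRID not defined in this database")
        else:
            auth_name, auth_srid, srtext = row
            print(f"{auth_name}:{auth_srid}")  # e.g. EPSG:4326
            print(srtext)                      # full CRS definition in WKT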


> We (OGC) are specifically interested in working with our partners, 
> such as Apache, to host an “Encoding Summit” at our September or 
> December Member Meeting. A principal focus will be JSON encodings, but 
> data type encodings for other environments are also of interest. One 
> outcome could be an Encodings working group that targets “round trip” 
> interoperability between various encodings. Another could simply be 
> joint work items to further these objectives. Let me know if you might 
> be interested in participating.
>
I'm interested, on the assumption that, in the context of Python or Java, 
we interpret "encodings" as "interfaces", with data structures left to 
implementers. There is "round trip" interoperability between Python and 
Java in GeoAPI (still in development) that I would like to continue 
developing.

     Regards,

         Martin

[1] http://www.geoapi.org/
[2] http://www.geoapi.org/snapshot/python/index.html
[3] http://www.geoapi.org/3.0/javadoc/index.html
[4] https://github.com/Open-EO


Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Julian Hyde <jh...@gmail.com>.
I’m kind of a kibitzer in Arrow - I help with the project governance, and may have helpful comments occasionally, but I don’t implement features, and use Arrow only rarely. 

So I don’t think the OGC should respond to my comments specifically. 

I suggest that you monitor that Arrow thread, see whether consensus emerges and a thought leader steps forward, and respond if and when that happens.

Also, people chiming in and nudging the discussion in the right direction, as some have already done, are most welcome.

Julian

> On Jun 29, 2021, at 8:35 AM, George Percivall <pe...@gmail.com> wrote:
> 
> Josh,
> 
> Thanks for inviting the Apache Projects to participate in OGC activities.
> 
> To get things jump started, could you get an OGC response to Julian’s comments?
> 
>> I think allowing WKB and WKT is sufficient.
>> 
>> Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID (spatial reference identifier) is almost always needed to qualify a geometry value. It is analogous to how TimeZone is needed (implicitly or explicitly) to qualify a DateTime value.
>> 
>> For Geospatial queries to perform well requires some kind of indexing (and/or clever data organization). Geospatial indexing is very complex, and there is no “one size fits all” approach. So I recommend that Arrow stays out of the indexing business, and leaves indexing to the engine.
> 
> 
> George Percivall
> GeoRoundtable
> percivall@apache.org
> 
> 
> 
> 
>> On Jun 28, 2021, at 4:18 PM, Joshua Lieberman <jl...@tumblingwalls.com> wrote:
>> 
>> Hi,
>> 
>> This is Josh Lieberman, on the staff of the OGC. I’ve been interested for a while in helping to standardize geospatial data types for various python packages and other mainstream computing tools. The goal is to help base these on common, valid geospatial models in order to make data sharing across tools and platforms more reliable. Another goal is to help avoid reinventing wheels. For example, PostGIS supports EWKT and EWKB, which already include SRIDs.
>> 
>> We (OGC) are specifically interested in working with our partners, such as Apache, to host an “Encoding Summit” at our September or December Member Meeting. A principal focus will be JSON encodings, but data type encodings for other environments are also of interest. One outcome could be an Encodings working group that targets “round trip” interoperability between various encodings. Another could simply be joint work items to further these objectives.
>> 
>> Let me know if you might be interested in participating.
>> 
>> Thanks,
>> 
>> Josh
>> 
>> 
>>> On Jun 28, 2021, at 12:49 PM, Martin Desruisseaux <ma...@geomatys.com> wrote:
>>> 
>>>> On 28/06/2021 at 18:07, Jim Hughes wrote:
>>>> Would the geospatial@apache.org mailing list be a good place to discuss things?
>>> 
>>> Sure please do. I'm not familiar with the arrow project but I could try to participate on the use of international standards (OGC and ISO) on the following topics:
>>> 
>>> * SRID (or more generally Coordinate Reference Systems)
>>> * WKB or WKT for geometries
>>> * Queries using standard spatial filters (defined by OGC)
>>> 
>>> Martin
>>> 
>>> 
>> 
> 

Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by George Percivall <pe...@gmail.com>.
Josh,

Thanks for inviting the Apache Projects to participate in OGC activities.

To get things jump started, could you get an OGC response to Julian’s comments?

> I think allowing WKB and WKT is sufficient.
> 
> Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID (spatial reference identifier) is almost always needed to qualify a geometry value. It is analogous to how TimeZone is needed (implicitly or explicitly) to qualify a DateTime value.
> 
> For Geospatial queries to perform well requires some kind of indexing (and/or clever data organization). Geospatial indexing is very complex, and there is no “one size fits all” approach. So I recommend that Arrow stays out of the indexing business, and leaves indexing to the engine.


George Percivall
GeoRoundtable <https://www.linkedin.com/company/geo-roundtable/>
percivall@apache.org




> On Jun 28, 2021, at 4:18 PM, Joshua Lieberman <jl...@tumblingwalls.com> wrote:
> 
> Hi,
> 
> This is Josh Lieberman, on the staff of the OGC. I’ve been interested for a while in helping to standardize geospatial data types for various python packages and other mainstream computing tools. The goal is to help base these on common, valid geospatial models in order to make data sharing across tools and platforms more reliable. Another goal is to help avoid reinventing wheels. For example, PostGIS supports EWKT and EWKB, which already include SRIDs.
> 
> We (OGC) are specifically interested in working with our partners, such as Apache, to host an “Encoding Summit” at our September or December Member Meeting. A principal focus will be JSON encodings, but data type encodings for other environments are also of interest. One outcome could be an Encodings working group that targets “round trip” interoperability between various encodings. Another could simply be joint work items to further these objectives.
> 
> Let me know if you might be interested in participating.
> 
> Thanks,
> 
> Josh
> 
> 
>> On Jun 28, 2021, at 12:49 PM, Martin Desruisseaux <ma...@geomatys.com> wrote:
>> 
>> On 28/06/2021 at 18:07, Jim Hughes wrote:
>>> Would the geospatial@apache.org mailing list be a good place to discuss things?
>> 
>> Sure please do. I'm not familiar with the arrow project but I could try to participate on the use of international standards (OGC and ISO) on the following topics:
>> 
>> * SRID (or more generally Coordinate Reference Systems)
>> * WKB or WKT for geometries
>> * Queries using standard spatial filters (defined by OGC)
>> 
>> Martin
>> 
>> 
> 


Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Joshua Lieberman <jl...@tumblingwalls.com>.
Hi,

This is Josh Lieberman, on the staff of the OGC. I’ve been interested for a while in helping to standardize geospatial data types for various python packages and other mainstream computing tools. The goal is to help base these on common, valid geospatial models in order to make data sharing across tools and platforms more reliable. Another goal is to help avoid reinventing wheels. For example, PostGIS supports EWKT and EWKB, which already include SRIDs.

We (OGC) are specifically interested in working with our partners, such as Apache, to host an “Encoding Summit” at our September or December Member Meeting. A principal focus will be JSON encodings, but data type encodings for other environments are also of interest. One outcome could be an Encodings working group that targets “round trip” interoperability between various encodings. Another could simply be joint work items to further these objectives.

Let me know if you might be interested in participating.

Thanks,

Josh


> On Jun 28, 2021, at 12:49 PM, Martin Desruisseaux <ma...@geomatys.com> wrote:
> 
> On 28/06/2021 at 18:07, Jim Hughes wrote:
>> Would the geospatial@apache.org mailing list be a good place to discuss things?
> 
> Sure please do. I'm not familiar with the arrow project but I could try to participate on the use of international standards (OGC and ISO) on the following topics:
> 
> * SRID (or more generally Coordinate Reference Systems)
> * WKB or WKT for geometries
> * Queries using standard spatial filters (defined by OGC)
> 
> Martin
> 
> 


Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Martin Desruisseaux <ma...@geomatys.com>.
On 28/06/2021 at 18:07, Jim Hughes wrote:
> Would the geospatial@apache.org mailing list be a good place to 
> discuss things?

Sure, please do. I'm not familiar with the Arrow project, but I could 
try to help with the use of international standards (OGC and ISO) on 
the following topics:

  * SRID (or more generally Coordinate Reference Systems)
  * WKB or WKT for geometries
  * Queries using standard spatial filters (defined by OGC)

Martin



Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Jim Hughes <jh...@ccri.com>.
Hi all,

I'd add two points:

First, while indexing is complex, does Arrow maintain/create stats on 
columns of data?  If so, capturing an MBR (minimum bounding rectangle) 
of the geometries would be awesome to help with very rough 
filtering/pruning.
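Whatever the answer on built-in statistics, an MBR can be derived from a WKB column in a single pass; a sketch assuming pyarrow and shapely, with the helper name being illustrative:

    # Sketch only: compute a minimum bounding rectangle (MBR) for a WKB column,
    # e.g. to attach as metadata for coarse filtering/pruning.
    import pyarrow as pa
    from shapely import wkb
    from shapely.geometry import Point

    def column_mbr(wkb_array):
        """Return (minx, miny, maxx, maxy) covering all non-null WKB geometries."""
        minx = miny = float("inf")
        maxx = maxy = float("-inf")
        for scalar in wkb_array:
            if not scalar.is_valid:          # skip nulls
                continue
            gminx, gminy, gmaxx, gmaxy = wkb.loads(scalar.as_py()).bounds
            minx, miny = min(minx, gminx), min(miny, gminy)
            maxx, maxy = max(maxx, gmaxx), max(maxy, gmaxy)
        return minx, miny, maxx, maxy

    # Example usage with shapely-generated WKB points.
    points = pa.array([Point(1, 2).wkb, Point(3, 4).wkb, None], type=pa.binary())
    print(column_mbr(points))                # (1.0, 2.0, 3.0, 4.0)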

Second, I'm very keen to understand the connections between Arrow and 
Parquet that the spec PR discusses.  I work on GeoMesa, where we've added 
spatial support to Arrow, Parquet, ORC, and Spark.  Spatial dataframes 
in Spark can be saved to Parquet and ORC.  I mention that to point out 
that the "round trip" consideration can be extended to at least these 
four Apache projects.

Recently, I've seen another developer starting from scratch on adding 
Parquet support to Sedona and another group interested in sorting out a 
"GeoParquet" format.  It seems like we ought to sort out a place to 
discuss this.

Would the geospatial@apache.org mailing list be a good place to discuss 
things?

Cheers,

Jim

On 6/25/2021 2:50 PM, Julian Hyde wrote:
> Cc += geospatial@.
>
> I think allowing WKB and WKT is sufficient.
>
> Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID (spatial reference identifier) is almost always needed to qualify a geometry value. It is analogous to how TimeZone is needed (implicitly or explicitly) to qualify a DateTime value.
>
> For Geospatial queries to perform well requires some kind of indexing (and/or clever data organization). Geospatial indexing is very complex, and there is no “one size fits all” approach. So I recommend that Arrow stays out of the indexing business, and leaves indexing to the engine.
>
> Julian
>
>
>> On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <ma...@uc.cl.INVALID> wrote:
>>
>> Dear Jon
>>
>> Thanks for sending this. Based on previous projects, WKB works well with
>> SQLite, DuckDB and others, at the expense of creating heavier size columns
>> compared to PostGIS.
>>
>> In order to experiment with, it can be interesting to use the CENSO 2017
>> shape files: https://github.com/ropensci/censo2017-cartografias;
>> https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
>> This includes rivers, streets, etc etc.
>>
>> Provided that Arrow is installed in a very straightforward way (for
>> Windows, at least), creating something based on PostGIS is probably not a
>> bad idea, but WKB works ok, and it integrates with 0 problems with the SF
>> package. I clearly see a great compression advantage here if we decide to
>> use WKB, as LZ4 shall make it very lightweight compared to, say, a CSV.
>>
>> Best,
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jk...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> There is an emerging spec[1] for how to store geospatial data in Arrow
>>> + pass through parquet files in the geopandas world. There is even a
>>> new R package that implements a wrapper to do the same in R[2]. These
>>> both define a serialization[3] for storing geospatial data as an Arrow
>>> table (and thus also when saving to parquet with Arrow).
>>>
>>> I could see a number of ways that we might interact with standards
>>> like these, and for any of these that we pursue it would be good to
>>> clarify that in our docs:
>>>
>>> 1. Point to the standard — we could mention that this standard exists
>>> and that if someone is building a geospatial data aware application,
>>> they _could_ refer to this standard if they want to.
>>> 2. Adopt a/this standard — this could range from stating that we've
>>> adopted it as the way that spatial data _ought_ to be stored to asking
>>> the creators if maintaining it within the Arrow project itself would
>>> be better (either by adopting it or creating a fork — of course
>>> communication with the folks working on it now would be critical!)
>>> 3. Create extension type(s) for geospatial data — this would require
>>> adopting a standard like the one linked, but on top of that providing
>>> an extension type within Arrow itself that the various clients could
>>> implement as they saw fit.
>>> 4. Create new, fully separate type(s) for geospatial data — again,
>>> this would require adopting a standard of some sort, but we would
>>> implement it as a specific type and presumably support it in all of
>>> the clients as we could.
>>>
>>> There are of course pros and cons to all of these. This type of data
>>> *is* somewhat specialized and I don't think we want to have a huge
>>> profusion of types for all of the possible specialized data types out
>>> there. But, at a minimum we should acknowledge (or adopt) a standard
>>> if it exists and encourage implementations that use Arrow to follow
>>> that standard (like sfarrow does to be compatible with geopandas) so
>>> that some level of interoperability is there + people aren't needing
>>> to reinvent the wheel each time they store spatial data.
>>>
>>> Thoughts? Are there other projects out there that already do something
>>> like this with Arrow that we should consider?
>>>
>>> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
>>> [2] https://github.com/wcjochem/sfarrow
>>> [3] for now they create a binary WKB column + attach a bit of metadata
>>> to the schema that that's what happened, though there are other ways
>>> one could encode this and the spec might include other way(s) to store
>>> this data in the future.
>>>
>>> -Jon
>>>
>>
>> -- 
>> —
>> *Mauricio 'Pachá' Vargas Sepúlveda*
>> Site: pacha.dev
>> Blog: pacha.dev/blog

Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by Max Burke <ma...@urbanlogiq.com>.
We've been using binary field types in Parquet and Arrow for WKB-formatted
data and have found that it works very well. Having a geospatial type in
Arrow that allowed an optional SRID to be passed along would be nice, but
it would be more useful if it came with a corresponding Parquet logical
type annotation too.
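A sketch of that current state, assuming pyarrow and shapely; today the geometry column round-trips through Parquet as plain binary, with no SRID or logical type attached:

    # Sketch only: a WKB geometry column round-trips through Parquet as a plain
    # binary column; any "this is a geometry" information must ride along as
    # schema/field metadata rather than a dedicated logical type.
    import pyarrow as pa
    import pyarrow.parquet as pq
    from shapely.geometry import Point

    geometry = pa.array([Point(0, 0).wkb, Point(1, 1).wkb], type=pa.binary())
    table = pa.table({"id": [1, 2], "geometry": geometry})

    pq.write_table(table, "wkb_example.parquet", compression="zstd")
    roundtrip = pq.read_table("wkb_example.parquet")
    print(roundtrip.schema)  # "geometry" is plain binary; no SRID, no logical type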

On Fri, Jun 25, 2021 at 12:15 PM M. Edward (Ed) Borasky <zn...@znmeb.net>
wrote:

> I don't know about GeoPandas but in R there are two main in-memory GIS
> data types: the old-ish "sp" format and the new "sf" (simple features)
> format. As an R GIS developer, I would expect any Arrow GIS capability
>  to efficiently facilitate "sf" / "tidyverse" operations. See
> https://geocompr.robinlovelace.net/ for the details.
>
> On Fri, Jun 25, 2021 at 11:51 AM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > Cc += geospatial@.
> >
> > I think allowing WKB and WKT is sufficient.
> >
> > Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID).
> SRID (spatial reference identifier) is almost always needed to qualify a
> geometry value. It is analogous to how TimeZone is needed (implicitly or
> explicitly) to qualify a DateTime value.
> >
> > For Geospatial queries to perform well requires some kind of indexing
> (and/or clever data organization). Geospatial indexing is very complex, and
> there is no “one size fits all” approach. So I recommend that Arrow stays
> out of the indexing business, and leaves indexing to the engine.
> >
> > Julian
> >
> >
> > > On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <ma...@uc.cl.INVALID>
> wrote:
> > >
> > > Dear Jon
> > >
> > > Thanks for sending this. Based on previous projects, WKB works well
> with
> > > SQLite, DuckDB and others, at the expense of creating heavier size
> columns
> > > compared to PostGIS.
> > >
> > > In order to experiment with, it can be interesting to use the CENSO
> 2017
> > > shape files: https://github.com/ropensci/censo2017-cartografias;
> > >
> https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
> > > This includes rivers, streets, etc etc.
> > >
> > > Provided that Arrow is installed in a very straightforward way (for
> > > Windows, at least), creating something based on PostGIS is probably
> not a
> > > bad idea, but WKB works ok, and it integrates with 0 problems with the
> SF
> > > package. I clearly see a great compression advantage here if we decide
> to
> > > use WKB, as LZ4 shall make it very lightweight compared to, say, a CSV.
> > >
> > > Best,
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jk...@gmail.com>
> wrote:
> > >
> > >> Hello,
> > >>
> > >> There is an emerging spec[1] for how to store geospatial data in Arrow
> > >> + pass through parquet files in the geopandas world. There is even a
> > >> new R package that implements a wrapper to do the same in R[2]. These
> > >> both define a serialization[3] for storing geospatial data as an Arrow
> > >> table (and thus also when saving to parquet with Arrow).
> > >>
> > >> I could see a number of ways that we might interact with standards
> > >> like these, and for any of these that we pursue it would be good to
> > >> clarify that in our docs:
> > >>
> > >> 1. Point to the standard — we could mention that this standard exists
> > >> and that if someone is building a geospatial data aware application,
> > >> they _could_ refer to this standard if they want to.
> > >> 2. Adopt a/this standard — this could range from stating that we've
> > >> adopted it as the way that spatial data _ought_ to be stored to asking
> > >> the creators if maintaining it within the Arrow project itself would
> > >> be better (either by adopting it or creating a fork — of course
> > >> communication with the folks working on it now would be critical!)
> > >> 3. Create extension type(s) for geospatial data — this would require
> > >> adopting a standard like the one linked, but on top of that providing
> > >> an extension type within Arrow itself that the various clients could
> > >> implement as they saw fit.
> > >> 4. Create new, fully separate type(s) for geospatial data — again,
> > >> this would require adopting a standard of some sort, but we would
> > >> implement it as a specific type and presumably support it in all of
> > >> the clients as we could.
> > >>
> > >> There are of course pros and cons to all of these. This type of data
> > >> *is* somewhat specialized and I don't think we want to have a huge
> > >> profusion of types for all of the possible specialized data types out
> > >> there. But, at a minimum we should acknowledge (or adopt) a standard
> > >> if it exists and encourage implementations that use Arrow to follow
> > >> that standard (like sfarrow does to be compatible with geopandas) so
> > >> that some level of interoperability is there + people aren't needing
> > >> to reinvent the wheel each time they store spatial data.
> > >>
> > >> Thoughts? Are there other projects out there that already do something
> > >> like this with Arrow that we should consider?
> > >>
> > >> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
> > >> [2] https://github.com/wcjochem/sfarrow
> > >> [3] for now they create a binary WKB column + attach a bit of metadata
> > >> to the schema that that's what happened, though there are other ways
> > >> one could encode this and the spec might include other way(s) to store
> > >> this data in the future.
> > >>
> > >> -Jon
> > >>
> > >
> > >
> > > --
> > > —
> > > *Mauricio 'Pachá' Vargas Sepúlveda*
> > > Site: pacha.dev
> > > Blog: pacha.dev/blog
> >
>
>
> --
> Borasky Research Journal https://www.znmeb.mobi
>
> Markovs of the world, unite! You have nothing to lose but your chains!
>


-- 
-Max

Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Posted by "M. Edward (Ed) Borasky" <zn...@znmeb.net>.
I don't know about GeoPandas but in R there are two main in-memory GIS
data types: the old-ish "sp" format and the new "sf" (simple features)
format. As an R GIS developer, I would expect any Arrow GIS capability
 to efficiently facilitate "sf" / "tidyverse" operations. See
https://geocompr.robinlovelace.net/ for the details.

On Fri, Jun 25, 2021 at 11:51 AM Julian Hyde <jh...@gmail.com> wrote:
>
> Cc += geospatial@.
>
> I think allowing WKB and WKT is sufficient.
>
> Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID (spatial reference identifier) is almost always needed to qualify a geometry value. It is analogous to how TimeZone is needed (implicitly or explicitly) to qualify a DateTime value.
>
> For Geospatial queries to perform well requires some kind of indexing (and/or clever data organization). Geospatial indexing is very complex, and there is no “one size fits all” approach. So I recommend that Arrow stays out of the indexing business, and leaves indexing to the engine.
>
> Julian
>
>
> > On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <ma...@uc.cl.INVALID> wrote:
> >
> > Dear Jon
> >
> > Thanks for sending this. Based on previous projects, WKB works well with
> > SQLite, DuckDB and others, at the expense of creating heavier size columns
> > compared to PostGIS.
> >
> > In order to experiment with, it can be interesting to use the CENSO 2017
> > shape files: https://github.com/ropensci/censo2017-cartografias;
> > https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
> > This includes rivers, streets, etc etc.
> >
> > Provided that Arrow is installed in a very straightforward way (for
> > Windows, at least), creating something based on PostGIS is probably not a
> > bad idea, but WKB works ok, and it integrates with 0 problems with the SF
> > package. I clearly see a great compression advantage here if we decide to
> > use WKB, as LZ4 shall make it very lightweight compared to, say, a CSV.
> >
> > Best,
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jk...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> There is an emerging spec[1] for how to store geospatial data in Arrow
> >> + pass through parquet files in the geopandas world. There is even a
> >> new R package that implements a wrapper to do the same in R[2]. These
> >> both define a serialization[3] for storing geospatial data as an Arrow
> >> table (and thus also when saving to parquet with Arrow).
> >>
> >> I could see a number of ways that we might interact with standards
> >> like these, and for any of these that we pursue it would be good to
> >> clarify that in our docs:
> >>
> >> 1. Point to the standard — we could mention that this standard exists
> >> and that if someone is building a geospatial data aware application,
> >> they _could_ refer to this standard if they want to.
> >> 2. Adopt a/this standard — this could range from stating that we've
> >> adopted it as the way that spatial data _ought_ to be stored to asking
> >> the creators if maintaining it within the Arrow project itself would
> >> be better (either by adopting it or creating a fork — of course
> >> communication with the folks working on it now would be critical!)
> >> 3. Create extension type(s) for geospatial data — this would require
> >> adopting a standard like the one linked, but on top of that providing
> >> an extension type within Arrow itself that the various clients could
> >> implement as they saw fit.
> >> 4. Create new, fully separate type(s) for geospatial data — again,
> >> this would require adopting a standard of some sort, but we would
> >> implement it as a specific type and presumably support it in all of
> >> the clients as we could.
> >>
> >> There are of course pros and cons to all of these. This type of data
> >> *is* somewhat specialized and I don't think we want to have a huge
> >> profusion of types for all of the possible specialized data types out
> >> there. But, at a minimum we should acknowledge (or adopt) a standard
> >> if it exists and encourage implementations that use Arrow to follow
> >> that standard (like sfarrow does to be compatible with geopandas) so
> >> that some level of interoperability is there + people aren't needing
> >> to reinvent the wheel each time they store spatial data.
> >>
> >> Thoughts? Are there other projects out there that already do something
> >> like this with Arrow that we should consider?
> >>
> >> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
> >> [2] https://github.com/wcjochem/sfarrow
> >> [3] for now they create a binary WKB column + attach a bit of metadata
> >> to the schema that that's what happened, though there are other ways
> >> one could encode this and the spec might include other way(s) to store
> >> this data in the future.
> >>
> >> -Jon
> >>
> >
> >
> > --
> > —
> > *Mauricio 'Pachá' Vargas Sepúlveda*
> > Site: pacha.dev
> > Blog: pacha.dev/blog
>


-- 
Borasky Research Journal https://www.znmeb.mobi

Markovs of the world, unite! You have nothing to lose but your chains!
