Posted to dev@sedona.apache.org by Jan Rittenbach <Ja...@raytheon.com.INVALID> on 2022/07/01 17:53:28 UTC

Project with Apache Sedona

Dear Sir or Madam,
We are currently evaluating whether Apache Sedona is suitable for a project within Software Engineering at Raytheon Anschütz GmbH, Germany.

Specifically, we are testing Apache Sedona with standard GeoJSON files (RFC 7946).
Some problems during testing prompted this request for help:

We want to use the Python (PySpark) API with a typical GeoJSON file.
As a first approach we are trying to:

  *   read from GeoJSON into an RDD,
  *   run a simple query,
  *   build the native R-tree index and save it to permanent indexed storage,
  *   afterwards, run a range query with a query window that uses the R-tree index.
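
To make the last step concrete, here is a minimal pure-Python sketch of what a range query with a query window returns. It is not Sedona code and uses no index: it compares bounding boxes by brute force, so it only illustrates the expected result on three small sample geometries. All function names and the window values are hypothetical.

```python
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def flatten(coords) -> Iterator[Tuple[float, float]]:
    """Yield every (x, y) position from a nested GeoJSON coordinates array."""
    if isinstance(coords[0], (int, float)):
        yield (coords[0], coords[1])
    else:
        for part in coords:
            yield from flatten(part)


def envelope(geometry: dict) -> BBox:
    """Compute the bounding box of a GeoJSON geometry object."""
    xs, ys = zip(*flatten(geometry["coordinates"]))
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: BBox, b: BBox) -> bool:
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(geometries: List[dict], window: BBox) -> List[dict]:
    """Brute-force window query on bounding boxes (no index involved)."""
    return [g for g in geometries if intersects(envelope(g), window)]


geometries = [
    {"type": "Point", "coordinates": [102.0, 0.5]},
    {"type": "LineString",
     "coordinates": [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]]},
    {"type": "Polygon",
     "coordinates": [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
                      [100.0, 1.0], [100.0, 0.0]]]},
]

window = (101.5, 0.0, 103.0, 1.0)  # hypothetical query window
hits = [g["type"] for g in range_query(geometries, window)]
print(hits)  # ['Point', 'LineString']
```

An R-tree answers the same question without scanning every geometry; the comparison here is on bounding boxes only, not on the exact shapes.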
_________________________________________
Is there a short, good tutorial you can point us to? The Jupyter notebook examples are not very detailed.
_________________________________________

Already the first step, reading a standard GeoJSON file, does not work as expected.
We used GeoJsonReader.readToGeometryRDD from sedona.core.formatMapper.

The file "test.geojson" looks like this (source: https://geojson.org/ ):
  { "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value0",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value0",
           "prop1": {"this": "that"}
           }
         }
       ]
     }

Only through trial and error did we find that the following non-standard "GeoJSON" layout, with one Feature per line, works.
The file "test_updated.geojson" looks like this:
{ "type": "Feature", "geometry": {"type": "Point", "coordinates": [102.0, 0.5]}, "properties": {"prop0": "value0"} }
{ "type": "Feature", "geometry": {"type": "LineString", "coordinates": [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]]}, "properties": {"prop0": "value0","prop1": 0.0}}
{ "type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],[100.0, 1.0], [100.0, 0.0] ]]}, "properties": {"prop0": "value0",  "prop1": {"this": "that"} }}
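
The conversion from a FeatureCollection to this one-Feature-per-line layout can be sketched with the Python standard library alone. This is a minimal illustration, not part of Sedona; the function name and the inline sample are hypothetical, and the whole document is loaded into memory, so it only suits small files.

```python
import json
from typing import Iterator


def feature_collection_to_lines(text: str) -> Iterator[str]:
    """Yield one compact JSON string per Feature in a FeatureCollection."""
    doc = json.loads(text)
    if doc.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    for feature in doc["features"]:
        # Compact separators keep each Feature on a single line.
        yield json.dumps(feature, separators=(",", ":"))


sample = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    { "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0] ]
        ]
      },
      "properties": {"prop0": "value0", "prop1": {"this": "that"}}
    }
  ]
}
"""

lines = list(feature_collection_to_lines(sample))
for line in lines:
    print(line)
```

Writing the yielded strings to a file, one per line, gives the layout shown above.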

To query the file data we used Adapter.toDf from sedona.utils.adapter.
Reading was only possible with this non-standard GeoJSON layout.
_________________________________________
Why does Apache Sedona not support the standard GeoJSON format?
_________________________________________

Our environment:
Ubuntu 22.04 LTS
openjdk version "1.8.0_312"
spark-3.0.3
apache-sedona[spark] 1.2.0
attrs 21.4.0
shapely 1.8.2
pyspark 3.3.0
py4j 0.10.9.5

We are also wondering why the documentation for the Python API is incomplete: Python doc - Apache Sedona(tm) (incubating) <https://sedona.apache.org/api/python-api/>

Have a nice day and best wishes from Kiel/Germany.

Jan Rittenbach
Working Student, Software Engineering (ESW)
Raytheon Anschütz

Jan.Rittenbach@raytheon.com

raytheon-anschuetz.com<https://www.raytheon-anschuetz.com/> | LinkedIn<https://www.linkedin.com/company/raytheon-anschuetz> | Xing<https://www.xing.com/company/raytheonanschuetz>

Raytheon Anschütz GmbH, Zeyestr. 16-24, 24106 Kiel, Germany
Registered office: Kiel, Register court: Amtsgericht Kiel HRB 4086
Managing Director: Michael Schulz, Chair of the Supervisory Board: Kimberly Nicole Ernzen

Our most current Privacy Policy can be found under
https://www.raytheon-anschuetz.com/fileadmin/content/Downloads_Documents/Privacy_Policy.pdf


Re: Project with Apache Sedona

Posted by Martin Andersson <u....@gmail.com>.
Hi Jan,

GeoJSON is a poor Big Data format since it contains an envelope object at
the top level that wraps all features in a single large array. As you found
out, one JSON object per row is a much better choice. Both formats can be
read in Spark/Sedona. Another quirk of the GeoJSON format is that it
supports mixed geometries. If your GeoJSON contains both points and
multipoints, the nesting depth of the coordinates array varies across rows,
so the field has no single consistent array type. One way to work around
that is to supply a schema when reading the file and set the geometry field
to string. Then you can convert the geometry field with ST_GeomFromGeoJSON.

Example for reading GeoJSON:

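from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)  # register ST_* functions if not done yet

# Read each geometry as a raw JSON string; add more fields if needed
schema = "features array<struct<geometry: string>>"

# geojson_path: path to your GeoJSON file. multiLine=True is needed when the
# FeatureCollection is one JSON document spanning several lines.
df = spark.read.json(geojson_path, schema=schema, multiLine=True)
# Explode since the GeoJSON features are wrapped in a single array
df = df.selectExpr("explode(features) as feature")
df.selectExpr("ST_GeomFromGeoJSON(feature.geometry) as geometry").show()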
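
To illustrate why the geometry field is read as a string, here is a small standard-library sketch (not Sedona or Spark code; the helper name is hypothetical) showing that the coordinates arrays of different geometry types nest to different depths, while their JSON text always has the same type.

```python
import json


def nesting_depth(coords) -> int:
    """Count how many list levels wrap the numbers in a coordinates array."""
    depth = 0
    while isinstance(coords, list):
        depth += 1
        if not coords:  # empty array: nothing further to descend into
            break
        coords = coords[0]
    return depth


geometries = [
    {"type": "Point", "coordinates": [102.0, 0.5]},
    {"type": "MultiPoint", "coordinates": [[102.0, 0.5], [103.0, 1.0]]},
    {"type": "Polygon",
     "coordinates": [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
                      [100.0, 1.0], [100.0, 0.0]]]},
]

depths = {g["type"]: nesting_depth(g["coordinates"]) for g in geometries}
print(depths)  # {'Point': 1, 'MultiPoint': 2, 'Polygon': 3}

# Serialized as text, every geometry has the same type regardless of depth.
as_strings = [json.dumps(g) for g in geometries]
print(all(isinstance(s, str) for s in as_strings))  # True
```

A single array type cannot describe all three depths, which is why handing the raw string to ST_GeomFromGeoJSON sidesteps the problem.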

If you need features that are not yet implemented in Sedona, I'm sure the
community would appreciate PRs from you. In this case an update to the
documentation might be enough. The community is great: my colleagues and I
have usually had our pull requests reviewed within hours.

Br,
Martin


-- 
Regards,
Martin