Posted to commits@sedona.apache.org by ji...@apache.org on 2023/03/27 08:02:57 UTC
[sedona] branch master updated: [DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)
This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new a6157738 [DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)
a6157738 is described below
commit a61577388fe552def9d8453f41b8a007c1dddb5a
Author: Martin Andersson <u....@gmail.com>
AuthorDate: Mon Mar 27 10:02:49 2023 +0200
[DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)
---
docs/tutorial/sql.md | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 126 insertions(+), 2 deletions(-)
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index 81b43c9d..3aa197e0 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -212,10 +212,50 @@ root
!!!note
SedonaSQL provides lots of functions to create a Geometry column, please read [SedonaSQL constructor API](../api/sql/Constructor.md).
+
+## Load GeoJSON using Spark JSON Data Source
+
+Spark SQL's built-in JSON data source supports reading GeoJSON data.
+To ensure the geometry is parsed correctly, define a schema in which the geometry property is typed as 'string'.
+This stops Spark's JSON reader from expanding the geometry object into a struct, so the raw GeoJSON string can be passed to the ST_GeomFromGeoJSON function for accurate geometry parsing.
+
+=== "Scala"
+
+ ```scala
+ val schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"
+ sparkSession.read.schema(schema).json(geojson_path)
+ .selectExpr("explode(features) as features") // Explode the features array to get one feature per row.
+ .select("features.*") // Unpack the features struct.
+ .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)")) // Convert the geometry string.
+ .printSchema()
+ ```
+
+=== "Java"
+
+ ```java
+ String schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>";
+ sparkSession.read().schema(schema).json(geojson_path)
+ .selectExpr("explode(features) as features") // Explode the features array to get one feature per row.
+ .select("features.*") // Unpack the features struct.
+ .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)")) // Convert the geometry string.
+ .printSchema();
+ ```
+
+=== "Python"
+
+ ```python
+ schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"
+ (sparkSession.read.json(geojson_path, schema=schema)
+ .selectExpr("explode(features) as features") # Explode the features array to get one feature per row.
+ .select("features.*") # Unpack the features struct.
+ .withColumn("geometry", f.expr("ST_GeomFromGeoJSON(geometry)")) # Convert the geometry string.
+ .printSchema())
+ ```
+
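The schema trick above can be illustrated without Spark: a GeoJSON feature's geometry is itself a JSON object, and keeping it as a raw string preserves it exactly for ST_GeomFromGeoJSON. A minimal sketch in plain Python (the small `FeatureCollection` literal is an invented example):

```python
import json

# A tiny FeatureCollection shaped like the schema above:
# features array<struct<type, geometry, properties>>.
geojson = """{"type": "FeatureCollection",
  "features": [{"type": "Feature",
                "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
                "properties": {"name": "a"}}]}"""

doc = json.loads(geojson)
# Re-serialize each geometry to a string instead of keeping it as a
# parsed object -- this is what typing the column as 'string' achieves.
geometry_strings = [json.dumps(feature["geometry"]) for feature in doc["features"]]
print(geometry_strings[0])  # a JSON string ready for ST_GeomFromGeoJSON
```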
-## Load Shapefile and GeoJSON
+## Load Shapefile and GeoJSON using SpatialRDD
-Shapefile and GeoJSON must be loaded by SpatialRDD and converted to DataFrame using Adapter. Please read [Load SpatialRDD](../rdd/#create-a-generic-spatialrdd) and [DataFrame <-> RDD](#convert-between-dataframe-and-spatialrdd).
+Shapefile and GeoJSON can be loaded by SpatialRDD and converted to DataFrame using Adapter. Please read [Load SpatialRDD](../rdd/#create-a-generic-spatialrdd) and [DataFrame <-> RDD](#convert-between-dataframe-and-spatialrdd).
## Load GeoParquet
@@ -256,6 +296,66 @@ root
Sedona supports spatial predicate push-down for GeoParquet files, please refer to the [SedonaSQL query optimizer](../api/sql/Optimizer.md) documentation for details.
+## Load data from JDBC data sources
+
+The 'query' option in Spark SQL's JDBC data source can be used to convert geometry columns to a format that Sedona can interpret.
+This should work for most spatial JDBC data sources.
+For PostGIS there is no need to add a query to convert geometry types, since it already uses EWKB as its wire format.
+
+=== "Scala"
+
+ ```scala
+ // For any JDBC data source, including PostGIS.
+ val df = sparkSession.read.format("jdbc")
+ // Other options.
+ .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+ .load()
+ .withColumn("geom", expr("ST_GeomFromWKB(geom)"))
+
+ // This is a simplified version that works for PostGIS.
+ val df2 = sparkSession.read.format("jdbc")
+ // Other options.
+ .option("dbtable", "my_table")
+ .load()
+ .withColumn("geom", expr("ST_GeomFromWKB(geom)"))
+ ```
+
+=== "Java"
+
+ ```java
+ // For any JDBC data source, including PostGIS.
+ Dataset<Row> df = sparkSession.read().format("jdbc")
+ // Other options.
+ .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+ .load()
+ .withColumn("geom", expr("ST_GeomFromWKB(geom)"));
+
+ // This is a simplified version that works for PostGIS.
+ Dataset<Row> df2 = sparkSession.read().format("jdbc")
+ // Other options.
+ .option("dbtable", "my_table")
+ .load()
+ .withColumn("geom", expr("ST_GeomFromWKB(geom)"));
+ ```
+
+=== "Python"
+
+ ```python
+ # For any JDBC data source, including PostGIS.
+ df = (sparkSession.read.format("jdbc")
+ # Other options.
+ .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+ .load()
+ .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
+
+ # This is a simplified version that works for PostGIS.
+ df = (sparkSession.read.format("jdbc")
+ # Other options.
+ .option("dbtable", "my_table")
+ .load()
+ .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
+ ```
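The ST_AsBinary / ST_GeomFromWKB round trip above works because WKB is a simple, well-defined byte layout. As an illustration (plain Python, no database needed), the WKB for POINT(1 2) can be assembled by hand:

```python
import struct

# WKB for POINT(1 2): byte-order marker, geometry type, then coordinates.
little_endian = struct.pack("<B", 1)   # 1 = little-endian
point_type = struct.pack("<I", 1)      # 1 = Point
coords = struct.pack("<dd", 1.0, 2.0)  # x, y as IEEE 754 doubles
wkb_hex = (little_endian + point_type + coords).hex()
print(wkb_hex)  # 0101000000000000000000f03f0000000000000040
```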
+
## Transform the Coordinate Reference System
Sedona doesn't control the coordinate unit (degree-based or meter-based) of all geometries in a Geometry column. The unit of all related distances in SedonaSQL is the same as the unit of all geometries in a Geometry column.
@@ -362,6 +462,30 @@ FROM spatialDf
ORDER BY geohash
```
+## Save to Postgis
+
+Unfortunately, the Spark SQL JDBC data source doesn't support creating geometry types in PostGIS using the 'createTableColumnTypes' option.
+Only the Spark built-in types are recognized.
+This means that you'll need to manage your PostGIS schema separately from Spark.
+One way to do this is to create the table with the correct geometry column before writing data to it with Spark.
+Alternatively, you can write your data to the table using Spark and then manually alter the column to be a geometry type afterward.
+
+PostGIS uses EWKB to serialize geometries.
+If you convert your geometries to EWKB format in Sedona, you don't have to do any additional conversion in PostGIS.
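EWKB is plain WKB with an extra flag bit set in the geometry type and the SRID embedded after it, which is why PostGIS can ingest ST_AsEWKB output directly. A hand-rolled sketch in plain Python (SRID 4326 chosen as an example):

```python
import struct

SRID_FLAG = 0x20000000  # flag PostGIS sets when an SRID is embedded

# EWKB for SRID=4326;POINT(1 2).
ewkb = (
    struct.pack("<B", 1)                # little-endian marker
    + struct.pack("<I", 1 | SRID_FLAG)  # Point type with SRID flag
    + struct.pack("<I", 4326)           # the SRID itself
    + struct.pack("<dd", 1.0, 2.0)      # coordinates
)
print(ewkb.hex())  # 0101000020e6100000000000000000f03f0000000000000040
```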
+
+```
+my_postgis_db# create table my_table (id int8, geom geometry);
+
+df.withColumn("geom", expr("ST_AsEWKB(geom)"))
+ .write.format("jdbc")
+ .option("truncate", "true") // Don't let Spark recreate the table.
+ // Other options.
+ .save()
+
+// If you didn't create the table before writing you can change the type afterward.
+my_postgis_db# alter table my_table alter column geom type geometry;
+
+```
## Convert between DataFrame and SpatialRDD