Posted to commits@sedona.apache.org by ji...@apache.org on 2023/03/27 08:02:57 UTC

[sedona] branch master updated: [DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git


The following commit(s) were added to refs/heads/master by this push:
     new a6157738 [DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)
a6157738 is described below

commit a61577388fe552def9d8453f41b8a007c1dddb5a
Author: Martin Andersson <u....@gmail.com>
AuthorDate: Mon Mar 27 10:02:49 2023 +0200

    [DOC] Document how to load GeoJSON using Spark SQL's built-in JSON da… (#809)
---
 docs/tutorial/sql.md | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 126 insertions(+), 2 deletions(-)

diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index 81b43c9d..3aa197e0 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -212,10 +212,50 @@ root
 
 !!!note
 	SedonaSQL provides lots of functions to create a Geometry column, please read [SedonaSQL constructor API](../api/sql/Constructor.md).
+
+## Load GeoJSON using Spark JSON Data Source
+
+Spark SQL's built-in JSON data source supports reading GeoJSON data.
+To ensure that the geometry property is parsed correctly, we can define a schema in which the geometry property has type 'string'.
+This prevents Spark from interpreting the nested geometry object itself and lets us use the ST_GeomFromGeoJSON function for accurate geometry parsing.
+
+=== "Scala"
+
+	```scala
+	import org.apache.spark.sql.functions.expr
+
+	val schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"
+	sparkSession.read.schema(schema).json(geojson_path)
+		.selectExpr("explode(features) as features") // Explode the features array to get one feature per row.
+		.select("features.*") // Unpack the features struct.
+		.withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)")) // Convert the geometry string into a geometry object.
+		.printSchema()
+	```
+
+=== "Java"
+
+	```java
+	import static org.apache.spark.sql.functions.expr;
+
+	String schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>";
+	sparkSession.read().schema(schema).json(geojson_path)
+		.selectExpr("explode(features) as features") // Explode the features array to get one feature per row.
+		.select("features.*") // Unpack the features struct.
+		.withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)")) // Convert the geometry string into a geometry object.
+		.printSchema();
+	```
+
+=== "Python"
+
+	```python
+	from pyspark.sql import functions as f
+
+	schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"
+	(sparkSession.read.json(geojson_path, schema=schema)
+		.selectExpr("explode(features) as features") # Explode the features array to get one feature per row.
+		.select("features.*") # Unpack the features struct.
+		.withColumn("geometry", f.expr("ST_GeomFromGeoJSON(geometry)")) # Convert the geometry string into a geometry object.
+		.printSchema())
+	```
+
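+With the schema above, `printSchema()` should print something like the following, with `geometry` now a true geometry column rather than a string:
+
+```
+root
+ |-- type: string (nullable = true)
+ |-- geometry: geometry (nullable = true)
+ |-- properties: map (nullable = true)
+ |    |-- key: string
+ |    |-- value: string (valueContainsNull = true)
+```
+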
 	
-## Load Shapefile and GeoJSON
+## Load Shapefile and GeoJSON using SpatialRDD
 
-Shapefile and GeoJSON must be loaded by SpatialRDD and converted to DataFrame using Adapter. Please read [Load SpatialRDD](../rdd/#create-a-generic-spatialrdd) and [DataFrame <-> RDD](#convert-between-dataframe-and-spatialrdd).
+Shapefile and GeoJSON can be loaded by SpatialRDD and converted to DataFrame using Adapter. Please read [Load SpatialRDD](../rdd/#create-a-generic-spatialrdd) and [DataFrame <-> RDD](#convert-between-dataframe-and-spatialrdd).
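+
+For example, a minimal sketch for loading a Shapefile into a DataFrame (here `shapefileLocation` is a placeholder path to a folder containing the shapefiles):
+
+```scala
+import org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader
+import org.apache.sedona.sql.utils.Adapter
+import org.apache.spark.api.java.JavaSparkContext
+
+val sc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext)
+val spatialRDD = ShapefileReader.readToGeometryRDD(sc, shapefileLocation)
+val df = Adapter.toDf(spatialRDD, sparkSession) // Convert the SpatialRDD to a DataFrame.
+df.printSchema()
+```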
 
 ## Load GeoParquet
 
@@ -256,6 +296,66 @@ root
 
 Sedona supports spatial predicate push-down for GeoParquet files, please refer to the [SedonaSQL query optimizer](../api/sql/Optimizer.md) documentation for details.
 
+## Load data from JDBC data sources
+
+The 'query' option in Spark SQL's JDBC data source can be used to run a query that converts geometry columns to a format Sedona can interpret, such as WKB.
+This should work for most spatial JDBC data sources.
+For PostGIS there is no need to add such a query, since PostGIS already uses EWKB as its wire format.
+
+=== "Scala"
+
+	```scala
+	import org.apache.spark.sql.functions.expr
+
+	// For any JDBC data source, including PostGIS.
+	val df = sparkSession.read.format("jdbc")
+		// Other options.
+		.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+		.load()
+		.withColumn("geom", expr("ST_GeomFromWKB(geom)"))
+
+	// This is a simplified version that works for PostGIS.
+	val df2 = sparkSession.read.format("jdbc")
+		// Other options.
+		.option("dbtable", "my_table")
+		.load()
+		.withColumn("geom", expr("ST_GeomFromWKB(geom)"))
+	```
+
+=== "Java"
+
+	```java
+	import org.apache.spark.sql.Dataset;
+	import org.apache.spark.sql.Row;
+	import static org.apache.spark.sql.functions.expr;
+
+	// For any JDBC data source, including PostGIS.
+	Dataset<Row> df = sparkSession.read().format("jdbc")
+		// Other options.
+		.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+		.load()
+		.withColumn("geom", expr("ST_GeomFromWKB(geom)"));
+
+	// This is a simplified version that works for PostGIS.
+	Dataset<Row> df2 = sparkSession.read().format("jdbc")
+		// Other options.
+		.option("dbtable", "my_table")
+		.load()
+		.withColumn("geom", expr("ST_GeomFromWKB(geom)"));
+	```
+
+=== "Python"
+
+	```python
+	from pyspark.sql import functions as f
+
+	# For any JDBC data source, including PostGIS.
+	df = (sparkSession.read.format("jdbc")
+		# Other options.
+		.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+		.load()
+		.withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
+
+	# This is a simplified version that works for PostGIS.
+	df = (sparkSession.read.format("jdbc")
+		# Other options.
+		.option("dbtable", "my_table")
+		.load()
+		.withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
+	```
+
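+For completeness, here is a sketch of how the elided options might look for a hypothetical local PostGIS instance (the URL, user, and password below are placeholders):
+
+```scala
+import org.apache.spark.sql.functions.expr
+
+val df = sparkSession.read.format("jdbc")
+	.option("url", "jdbc:postgresql://localhost:5432/my_postgis_db") // Placeholder connection URL.
+	.option("user", "sedona_user")                                   // Placeholder credentials.
+	.option("password", "***")
+	.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
+	.load()
+	.withColumn("geom", expr("ST_GeomFromWKB(geom)"))
+```
+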
 ## Transform the Coordinate Reference System
 
 Sedona doesn't control the coordinate unit (degree-based or meter-based) of all geometries in a Geometry column. The unit of all related distances in SedonaSQL is the same as the unit of all geometries in a Geometry column.
@@ -362,6 +462,30 @@ FROM spatialDf
 ORDER BY geohash
 ```
 
+## Save to PostGIS
+
+Unfortunately, the Spark SQL JDBC data source doesn't support creating geometry types in PostGIS via the 'createTableColumnTypes' option.
+Only the Spark built-in types are recognized.
+This means that you'll need to manage your PostGIS schema separately from Spark.
+One way to do this is to create the table with the correct geometry column before writing data to it with Spark.
+Alternatively, you can write your data to the table using Spark and then manually alter the column to a geometry type afterward.
+
+PostGIS uses EWKB to serialize geometries.
+If you convert your geometries to EWKB format in Sedona, no additional conversion is needed in PostGIS.
+
+```
+my_postgis_db# create table my_table (id int8, geom geometry);
+
+df.withColumn("geom", expr("ST_AsEWKB(geom)"))
+	.write.format("jdbc")
+	.option("truncate", "true") // Don't let Spark recreate the table.
+	// Other options.
+	.save()
+
+// If you didn't create the table before writing, you can change the column type afterward.
+my_postgis_db# alter table my_table alter column geom type geometry;
+```
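+
+A fuller sketch of the write with placeholder connection details; note that Spark's 'truncate' option only takes effect together with overwrite mode:
+
+```scala
+import org.apache.spark.sql.functions.expr
+
+df.withColumn("geom", expr("ST_AsEWKB(geom)"))
+	.write.format("jdbc")
+	.option("url", "jdbc:postgresql://localhost:5432/my_postgis_db") // Placeholder connection URL.
+	.option("dbtable", "my_table")
+	.option("user", "sedona_user")                                   // Placeholder credentials.
+	.option("password", "***")
+	.option("truncate", "true") // Truncate instead of dropping, so the geometry column type survives.
+	.mode("overwrite")
+	.save()
+```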
 
 ## Convert between DataFrame and SpatialRDD