Posted to dev@spark.apache.org by "Fitch, Simeon" <fi...@astraea.io> on 2021/06/03 20:14:45 UTC

Help Migrating BaseRelation to Spark 3.x

Hi,

I'm the tech lead on RasterFrames, which adds geospatial raster data
capability to Apache Spark SQL. We are trying to migrate to Spark 3.x and
are struggling to get our various DataSources working, and I'm hoping
someone might share some tips on what might be going on. Most of our
issues have been with changes to ExpressionEncoder and/or the code that
maps JVM types to Catalyst types.

In this example exception from our [GeoTiffRelation][1], the error is
complaining about a very simple case class (`case class SpatialKey(col:
Int, row: Int)`), which should map exactly to `struct<col:int,row:int>`.
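
Before the exception itself, here's a quick sanity check (a standalone sketch using a local stand-in for the GeoTrellis class, not our production code) showing that the product encoder Spark derives for a case class of this shape does report that schema:

```scala
import org.apache.spark.sql.Encoders

// Local stand-in with the same shape as geotrellis.layer.SpatialKey
case class SpatialKey(col: Int, row: Int)

// The derived product encoder maps the case class to the expected Catalyst type
Encoders.product[SpatialKey].schema.simpleString
// => "struct<col:int,row:int>"
```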

```
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: geotrellis.layer.SpatialKey is not a valid external type for schema of struct<col:int,row:int>
named_struct(col, validateexternaltype(getexternalrowfield(validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, spatial_key), StructField(col,IntegerType,false), StructField(row,IntegerType,false)), 0, col), IntegerType), row, validateexternaltype(getexternalrowfield(validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, spatial_key), StructField(col,IntegerType,false), StructField(row,IntegerType,false)), 1, row), IntegerType)) AS spatial_key#1125
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
  ...

Caused by: java.lang.RuntimeException: geotrellis.layer.SpatialKey is not a valid external type for schema of struct<col:int,row:int>
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.GetExternalRowField_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:209)
```
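
For reference, here is a minimal standalone sketch (again a stand-in, not our actual code, and assuming the RowEncoder path in the stack trace is the one involved) that produces the same message when a Row carries the case class instance where the schema declares a nested struct:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Local stand-in with the same shape as geotrellis.layer.SpatialKey
case class SpatialKey(col: Int, row: Int)

val schema = StructType(Seq(
  StructField("spatial_key", StructType(Seq(
    StructField("col", IntegerType, nullable = false),
    StructField("row", IntegerType, nullable = false))),
    nullable = false)))

// In Spark 3.x the serializer is obtained explicitly from the encoder
val serializer = RowEncoder(schema).createSerializer()

// Works: the nested struct is represented as a nested Row
serializer(Row(Row(1, 2)))

// Throws "... is not a valid external type for schema of struct<col:int,row:int>"
serializer(Row(SpatialKey(1, 2)))
```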

To add to the difficulty, the exception is thrown from generated (CodeGen)
code, so I'm not able to attach a breakpoint. The new
`df.explain("codegen")` is great, but it doesn't emit the code that throws
the exception.
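
The closest I've come to seeing the offending code is raising the codegen logger to DEBUG, which, as far as I can tell (assuming the stock log4j setup), makes Spark print the generated classes, including the encoder's UnsafeProjection, to the driver log:

```scala
import org.apache.log4j.{Level, Logger}

// Assumes the default log4j backend; adjust if your logging is configured differently
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
  .setLevel(Level.DEBUG)
```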

Sorry for the somewhat open-ended question, but this is just one of dozens
of migration problems we're hitting with changes in Encoder semantics, and
we would appreciate any direction or migration tips. My suspicion is that
the methods we use across our DataSources violate some core assumption in
Spark 3.x that happened to work by accident in Spark 2.x, and that we just
need a little more insight into the changes to unlock our migration
efforts.

[1]: https://github.com/locationtech/rasterframes/blob/develop/datasource/src/main/scala/org/locationtech/rasterframes/datasource/geotiff/GeoTiffRelation.scala

Thanks,

Simeon