Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/08 16:44:19 UTC

[GitHub] [spark] metasim commented on pull request #31461: [SPARK-7768][CORE][SQL] Open UserDefinedType as a Developer API

metasim commented on pull request #31461:
URL: https://github.com/apache/spark/pull/31461#issuecomment-775280219


   > @jnh5y @viirya I got a good question from @marmbrus - can you support user-defined types by just defining an Encoder for it? so that it can work in a Dataset?
   
   @srowen I'll jump in here, as I've contended with this quite a lot. If I understand the question, the answer is "no", at least not for `TileUDT`, and I doubt it for `AbstractGeometryUDT` as well (though the use cases are actually quite different). RasterFrames has [a host of `Encoder`s](https://github.com/locationtech/rasterframes/tree/develop/core/src/main/scala/org/locationtech/rasterframes/encoders), and defaults to that modality whenever possible. But the computational dynamics of spatiotemporal raster data require further optimizations, which UDTs provide.
   
   For `TileUDT`, there are a couple of things going on:
    
   1. Geospatial raster data is what I call "heavy data": there are not just many rows, but each row is very thick. Raster data is relatively expensive to read, consumes a lot of memory, and comes in many encodings and arbitrary projections. Often, a lot of reasoning in an analysis can be done over the raster's metadata, filtering out rows or providing opportunities for logical-plan optimization before the raster data is even read into memory, so we defer reading the pixels/cells until absolutely necessary. Therefore, behind `TileUDT` we use old-school OO information hiding and polymorphism to enable lazy evaluation of raster data, [implementing](https://github.com/locationtech/rasterframes/blob/0da9aa936fc4dc0e27a91ca5809bc0416b0c4174/core/src/main/scala/org/apache/spark/sql/rf/TileUDT.scala#L83-L89) essentially two specializations: logically, a "RasterRefTile" (lazy Tile) and a "RealizedTile" (eager Tile). Once the plan identifies the need for pixels/cells, the "RasterRefTile" is converted to a "RealizedTile".
   2. The structure of a `TileUDT` has evolved and is still evolving. We need information hiding to keep users from depending on the internal structure, and to give us backwards-compatible migration paths that `Encoder`s alone are not able to provide.
   3. Serialization to/from Parquet is UDT-aware, so reading in a RasterFrames Parquet file maintains type coherence in a way `Encoder`s alone would not.
   4. Serialization to/from PySpark, while extremely expensive (because Spark's Arrow encoder doesn't support UDTs), does allow for nice user-facing ergonomics whereby we can have a first-class `Tile` Python class mirroring the JVM instance that, again, hides the nasty details of the raster structure while providing NumPy and Pandas interoperability. I don't think we'd be able to provide the same ergonomics if we relied on `Encoder`s alone.
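
   To make point 1 concrete, here is a minimal, hypothetical sketch of the lazy/eager pattern described above, in plain Python rather than the actual RasterFrames Scala code (class names mirror the description; `fake_loader` and the URI are invented for illustration):

   ```python
   from abc import ABC, abstractmethod

   class Tile(ABC):
       """Public interface; callers never see which concrete kind they hold."""
       @abstractmethod
       def cells(self): ...

   class RealizedTile(Tile):
       """Eager tile: pixel data already in memory."""
       def __init__(self, cells):
           self._cells = list(cells)
       def cells(self):
           return self._cells

   class RasterRefTile(Tile):
       """Lazy tile: holds only a reference; reads pixels on demand."""
       def __init__(self, uri, loader):
           self.uri = uri
           self._loader = loader  # injected I/O function, e.g. a raster reader
       def realize(self):
           return RealizedTile(self._loader(self.uri))
       def cells(self):
           # Deferred read: I/O happens only when pixels are actually needed.
           return self.realize().cells()

   reads = []
   def fake_loader(uri):
       reads.append(uri)  # track that expensive I/O occurred
       return [0.0, 1.0, 2.0]

   lazy = RasterRefTile("s3://bucket/scene.tif", fake_loader)
   # Metadata-only work (extent filtering, plan optimization) does no I/O:
   assert reads == []
   # The first request for pixels triggers the read:
   assert lazy.cells() == [0.0, 1.0, 2.0]
   assert reads == ["s3://bucket/scene.tif"]
   ```

   The UDT's serialize/deserialize hooks are what let both specializations travel through Catalyst under one logical type; an `Encoder` for a single concrete case class would pin you to one shape.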
   
   I'll note the PySpark use case is also very pertinent to [interoperability](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) between GeoMesa's `AbstractGeometryUDT` and Python's Shapely library.
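
   The migration-path argument in point 2 can also be sketched in a few lines: a UDT's deserialize step can accept several on-disk layouts while always handing users the same object. This is a hypothetical, plain-Python illustration (the field names and version scheme are invented, not the `TileUDT` format):

   ```python
   def deserialize_tile(row):
       """Rebuild a tile-like object from a serialized row, tolerating old layouts."""
       version = row.get("version", 1)
       if version == 1:
           # Old layout: flat cell array only; cell type was implicit.
           return {"cells": row["cells"], "cell_type": "float64"}
       elif version == 2:
           # New layout: cell type recorded explicitly.
           return {"cells": row["cells"], "cell_type": row["cell_type"]}
       raise ValueError(f"unsupported tile version: {version}")

   old_row = {"cells": [1.0, 2.0]}                                  # written by an old release
   new_row = {"version": 2, "cells": [1.0], "cell_type": "uint8"}   # current layout
   assert deserialize_tile(old_row)["cell_type"] == "float64"
   assert deserialize_tile(new_row)["cell_type"] == "uint8"
   ```

   An `Encoder` derived from a concrete case class has no such seam: its schema *is* the class, so changing the internal structure breaks existing data.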
   
   I hope all this helps and doesn't just confuse things further. In short, we tried to avoid UDTs for quite a while in RasterFrames, but with one of our primary goals being usability by data scientists and analysts (not software engineers), we found UDTs to be the perfect mechanism to hide the sausage-making while allowing us to provide reliability and stability in the execution engine.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org