Posted to dev@spark.apache.org by Aleksander Eskilson <al...@gmail.com> on 2019/04/18 21:33:34 UTC

Open PRs RE: Datasets Typed by Arbitrary Avro

There are now a couple of different pull requests, each attempting to address
the need for an enhancement providing Typed Dataset support for Avro
objects. These PRs and their respective JIRA tickets are:

   - https://github.com/apache/spark/pull/22878 :
   https://issues.apache.org/jira/browse/SPARK-25789 (originally in
   Databricks/spark-avro, https://github.com/databricks/spark-avro/pull/217 :
   https://github.com/databricks/spark-avro/issues/169)
   - https://github.com/apache/spark/pull/24299 :
   https://issues.apache.org/jira/browse/SPARK-27388
   - https://github.com/apache/spark/pull/24367 :
   https://issues.apache.org/jira/browse/SPARK-27457

The approaches differ considerably, and their respective coverage may
not be equal. Some analysis of the tradeoffs, and perhaps a deeper analysis
of workarounds, would be necessary.

Full disclosure: I contributed significantly to Spark#22878/Spark-Avro#217,
so I don't think I'll say more about the topics in this thread, but I would
look to Spark committers for some more direction either here or in
the PR threads. I'd be happy to respond to questions from the community.

The topic of, and request for, Typed Datasets of Avro goes back to
Spark-Avro#169 <https://github.com/databricks/spark-avro/issues/169>. I saw
relatively recently that the project was folded into Spark proper, but the
need for statically typed Dataset support (as opposed to dynamically typed
DataFrame support) continues.

Hoping a resolution can come out of this visibility.

Aleksander Eskilson
https://github.com/bdrillard