Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/10 16:56:12 UTC

[GitHub] [spark] bdrillard commented on issue #24299: [SPARK-27388][SQL] expression encoder for objects defined by properties

URL: https://github.com/apache/spark/pull/24299#issuecomment-481773895
 
 
   Many thanks to @mazeboard for bringing this PR to my attention and for taking a crack at the problem of typed Avro Datasets in Spark!
   
   I feel it's a matter of due diligence for me to point to another PR supporting Avro-typed Datasets in Spark, namely #22878, which (full transparency) is the work of @xuanyuanking and myself. The approaches taken here and there differ, and it would seem so does their coverage of the Avro spec. I'd like to take the time to compare and contrast.
   
   I am more qualified to speak to the approach and capabilities introduced in #22878 (which has a history going back to [Spark-Avro](https://github.com/databricks/spark-avro/pull/217)), so if I misread this PR anywhere in the process, @mazeboard, please do correct my understanding.
   
   I apologize for the length of this reply ahead of time!
   
   ## #24299
   
   I'll summarize my reading of this PR's approach: to extend the existing Java `Bean` Encoder functionality to more broadly support tricky types generated by Avro `SpecificRecord` classes, especially Bytes, which Avro doesn't allow access to via typical getters/setters, and Enums, which have to be encoded as Strings and therefore have to be “discovered” via their class.
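   
   To orient readers on the Bean path being extended, here is a minimal sketch using the existing `Encoders.bean` entry point. The `User` class is a hypothetical stand-in for an Avro-generated class; real `SpecificRecord` classes add the Bytes/Enum wrinkles described above:
   
```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical stand-in for an Avro-generated class: plain getter/setter
// properties are what the existing Bean Encoder already handles.
class User {
  private var name: String = _
  def getName: String = name
  def setName(n: String): Unit = { name = n }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Works today for simple properties; this PR's contribution is extending
// this path to Avro's trickier property types (Bytes, Enums, etc.).
val ds = spark.createDataset(Seq(new User))(Encoders.bean(classOf[User]))
```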
   
   One stated limitation is complex Union types, which Avro represents as a nested Object. It’s stated that there isn’t an Encoder for the Object type (I think a serializer/deserializer could _perhaps_ be made using [ObjectType](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ObjectType.scala) and [NewInstance](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L431), but I can’t say how it would work in this Reflection-driven approach). It’s also said things could get difficult with how Avro serializes those objects when serializers like Kryo are used. I can see that as a limitation of this Bean-based approach even if the DataTypes and Expressions were sorted out.
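   
   For concreteness, here is the kind of schema that triggers the limitation (the record is hypothetical). Avro's code generator types a multi-branch union field as plain `java.lang.Object`, for which Spark has no ready-made Encoder:
   
```scala
import org.apache.avro.Schema

// A complex union (more than just null plus one type): the generated
// Java field for `payload` is typed as java.lang.Object.
val complexUnion = new Schema.Parser().parse(
  """{
    |  "type": "record", "name": "Event", "fields": [
    |    {"name": "payload", "type": ["string", "long", "bytes"]}
    |  ]
    |}""".stripMargin)
```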
   
   Correct my understanding if this is wrong, but because this approach is based on Reflection over the types of the generated SpecificRecord class (viewing the Avro class as a Bean), it would not be “aware”, in a first-class sense, of the AvroSchema that generates the Class. I think this distinction matters, and I’ll discuss it more below.
   
   ## #22878
   
   To summarize the approach of #22878: create an AvroEncoder which generates a pair of serializer/deserializer Expressions based on the AvroSchema.
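   
   As a sketch of the intended usage, assuming the `AvroEncoder.of` entry point as it appeared in the Spark-Avro lineage of the PR (treat the package path and exact signature as illustrative rather than final):
   
```scala
import org.apache.spark.sql.avro.AvroEncoder  // illustrative package path

// MyRecord is a hypothetical Avro-generated SpecificRecord class.
implicit val encoder = AvroEncoder.of[MyRecord](classOf[MyRecord])

// With the implicit in scope, .as[MyRecord] yields a strongly typed Dataset.
val events = spark.read.format("avro").load("/path/to/events.avro").as[MyRecord]
```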
   
   This work stands on the Avro efforts that were recently folded into Spark proper from the (now deprecated) Databricks/Spark-Avro project, which still _does not_ provide a first-class Encoder, with the efficiency and strong/static typing that entails. Being based on Spark-Avro, however, #22878 _does_ gain the benefits of an AvroSchema-driven approach. To avoid confusion, I’m going to use “Spark-Avro” to refer to [this](https://github.com/apache/spark/tree/master/external/avro) folded-in portion of the current Spark project, rather than to the former deprecated project.
   
   ### AvroSchema
   
   Because #22878 generates its Encoder through the AvroSchema, we gain a couple things:
   
   1. An isomorphism between the Dataset Schema and the source of truth for the structure of any Avro record, namely its AvroSchema. The important thing here is that we can generate an Encoder both from the Class and from the AvroSchema in its different representations, like JSON, which opens support for Schema evolution and the use of Schema stores.
   2. Direct support for Encoders of any `GenericRecord`, where an AvroSchema is known but no `SpecificRecord` class can be generated (see the sketch after this list).
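   
   As a sketch of point 2 (again assuming the `AvroEncoder.of` surface described above), an Encoder can be built from nothing but a schema, e.g. one fetched at runtime from a Schema store:
   
```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord

// schemaJson is a hypothetical schema string fetched at runtime
// (e.g. from a schema registry); no generated class exists for it.
val schema = new Schema.Parser().parse(schemaJson)

implicit val encoder = AvroEncoder.of[GenericRecord](schema)
val ds = spark.read.format("avro").load("/path/to/events.avro").as[GenericRecord]
```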
   
   ### Coverage
   
   Spark-Avro’s ability to move between AvroSchema and Spark Dataset Schema also gives us the ability to traverse the DataType to create our Encoder’s Ser/De Expressions rather than using Reflection. This gives us three immediate benefits:
   
   1. The qualities of Bytes and Enums are more transparently represented in the AvroSchema, and so the SerDe expressions can be more directly generated.
   2. Nested Records and Complex Unions (as well as Avro Lists and Maps) are a solved problem in Spark-Avro, so we can generate a Dataset Schema of arbitrary complexity and nesting.
   3. Naturally, we don’t have to extend the Bean Encoder for properties.
   
   These three items mean the AvroEncoder in #22878 can generate an `Encoder` with full coverage of Avro types, and this coverage of the various combinations of types that can appear when nesting Records and Unions is well tested in the PR.
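   
   To make that concrete, here is a hypothetical schema of the sort the tests exercise (the shape is illustrative, not taken from the PR's test suite):
   
```scala
import org.apache.avro.Schema

// A nested Record inside a complex Union, plus Array and Map fields:
// the Encoder must recurse through all of these to build its Ser/De trees.
val nested = new Schema.Parser().parse(
  """{
    |  "type": "record", "name": "Wrapper", "fields": [
    |    {"name": "value", "type": ["null", "string", {
    |      "type": "record", "name": "Inner", "fields": [
    |        {"name": "tags",  "type": {"type": "array", "items": "string"}},
    |        {"name": "attrs", "type": {"type": "map", "values": "long"}}
    |      ]
    |    }]}
    |  ]
    |}""".stripMargin)
```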
   
   ## Last thoughts
   
   This PR goes a long way toward supporting Avro while remaining very concise, which is definitely advantageous from a maintainability perspective. My concerns with a Reflection-based approach are:
   
   1. It’s unknown, and could perhaps prove difficult, how to extend support to (especially) Union types, which are _very_ common in Avro and the _only_ way to express Nullability in Avro (see the snippet after this list).
   2. A Reflection-based approach also forgoes Datasets of `GenericRecord` objects.
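   
   On point 1, it's worth underscoring that nullability in Avro is itself a union, so even the most mundane schemas exercise union handling:
   
```scala
// Every optional field in Avro is a ["null", T] union; an Encoder that
// cannot handle unions cannot handle optional fields at all.
val optionalField =
  """{"name": "middleName", "type": ["null", "string"], "default": null}"""
```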
   
   Parting words for #22878: 
   
   1. While its length (in terms of lines of code) has been discussed, I’d say it’s ultimately very well tested, considering we’re testing a closed set of type combinations described by the AvroSchema, with nested complex types tested via inductive unit tests. Anecdotally, I’ll say we’ve been using it in a fork of Spark-Avro with great success over exceptionally complicated Schemas.
   2. It fits quite naturally in the new Spark-Avro “external” sub-project, and accomplishes the goal: providing an Encoder for Datasets typed by arbitrary Avro (Generic and Specific).
   
   Again, I'm very happy to see where this discussion goes as it evolves (:
