Posted to issues@spark.apache.org by "Fangshi Li (JIRA)" <ji...@apache.org> on 2018/09/05 01:11:00 UTC

[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

    [ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603800#comment-16603800 ] 

Fangshi Li commented on SPARK-24256:
------------------------------------

To summarize our discussion for this PR:
Spark-avro is now merged into Spark as a built-in data source. The upstream community is not going to merge the AvroEncoder to support Avro types in Dataset; instead, the plan is to expose the user-defined type API so that arbitrary user types can be defined in Dataset.

The purpose of this patch is to enable ExpressionEncoder to work together with other types of Encoders, but it seems upstream prefers to go with UDT. Given this, we can close this PR and start the discussion on UDT in another channel.
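
For context, a minimal sketch of what defining a type through a UserDefinedType could look like once that API is exposed (in 2.3 it is still internal). Point and PointUDT are illustrative names chosen for this sketch, not something proposed here:

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// Illustrative user type.
case class Point(x: Double, y: Double)

// A UDT tells Catalyst how to map Point to and from an internal SQL type.
// Note: UserDefinedType is not a public API in Spark 2.3, which is exactly
// what the "expose the user-defined type API" plan above is about.
class PointUDT extends UserDefinedType[Point] {

  // Internal storage: a two-element array of doubles.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): Any =
    new GenericArrayData(Array[Any](p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
{code}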

> ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24256
>                 URL: https://issues.apache.org/jira/browse/SPARK-24256
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Fangshi Li
>            Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as Scala case classes, tuples and Java bean classes. Spark's Dataset natively supports these types, but we find that Dataset is not flexible for other user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. Although we can use AvroEncoder to define a Dataset whose type is an Avro GenericRecord or SpecificRecord, using such an Avro-typed Dataset has many limitations:
>  # We cannot use joinWith on this Dataset, since the result is a tuple and Avro types cannot be fields of that tuple (see the sketch after this list).
>  # We cannot use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields in a case class, which we find is a very common use case.
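> For illustration, a rough sketch of limitation 1. The MyRecord SpecificRecord class and the exact AvroEncoder factory call are placeholders for this sketch, not part of the patch:
> {code:scala}
> import com.databricks.spark.avro.AvroEncoder
> import org.apache.spark.sql.{Dataset, Encoder}
>
> // MyRecord stands in for a generated Avro SpecificRecord class; the exact
> // AvroEncoder factory method is assumed here. spark (a SparkSession) and the
> // userRecords/eventRecords collections are also assumed to be in scope.
> implicit val myRecordEncoder: Encoder[MyRecord] = AvroEncoder.of(classOf[MyRecord])
>
> val users: Dataset[MyRecord] = spark.createDataset(userRecords)
> val events: Dataset[MyRecord] = spark.createDataset(eventRecords)
>
> // joinWith yields Dataset[(MyRecord, MyRecord)]. The tuple is handled by an
> // ExpressionEncoder, which cannot delegate the MyRecord fields to AvroEncoder,
> // so no encoder can be resolved for the result.
> val joined = users.joinWith(events, users("id") === events("id"))
> {code}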
> The underlying limitation is that ExpressionEncoder does not support ser/de of a Scala case class or tuple whose subfields are of some other user-defined type that has its own Encoder.
> To address this issue, we propose a trait as a contract (between ExpressionEncoder and any other user-defined Encoder) that enables case classes/tuples/Java beans to support user-defined types as fields.
> With this proposed patch and a minor modification in AvroEncoder, we remove the above-mentioned limitations by setting the cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
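> For illustration only, one possible shape of such a contract; the trait name and method signatures below are an editor's guess, not the actual trait in the patch:
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.Expression
> import org.apache.spark.sql.types.DataType
>
> // Hypothetical contract: a user-defined Encoder exposes the Catalyst schema and
> // the ser/de expressions for its type, so ExpressionEncoder can delegate
> // case class / tuple fields of that type to it.
> trait FieldEncoderContract[T] {
>   // SQL type the field is stored as in the internal row.
>   def schema: DataType
>   // Expression that serializes the value referenced by `input` into Catalyst's internal format.
>   def serializerFor(input: Expression): Expression
>   // Expression that deserializes the Catalyst value referenced by `input` back into T.
>   def deserializerFor(input: Expression): Expression
> }
> {code}
> Under the conf above, a field whose class is org.apache.avro.specific.SpecificRecord (or a subclass) would presumably be delegated to the registered com.databricks.spark.avro.AvroEncoder$ companion object.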
> This is a patch we have implemented internally, and it has been in use for a few quarters. We want to propose it upstream because we think it is a useful feature that makes Dataset more flexible with respect to user types.



