You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by fangshil <gi...@git.apache.org> on 2018/05/12 06:19:37 UTC

[GitHub] spark pull request #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder...

GitHub user fangshil opened a pull request:

    https://github.com/apache/spark/pull/21310

    [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

    ## What changes were proposed in this pull request?
    
    Right now, ExpressionEncoder supports ser/de of primitive types, as well as scala case class, tuple and java bean class. Spark's Dataset natively supports these mentioned types, but we find Dataset is not flexible for other user-defined types and encoders.
    
    For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. Although we can use AvroEncoder to define Dataset with types being the Avro Generic or Specific Record, using such Avro typed Dataset has many limitations: 
    1. We can not use joinWith on this Dataset since the result is a tuple, but Avro types cannot be the field of this tuple.
    2. We can not use some type-safe aggregation methods on this Dataset, such as KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
    3. We cannot augment an Avro SpecificRecord with additional primitive fields together in a case class, which we find is a very common use case.
    
    The limitation that Spark does not support define a Scala case class/tuple with subfields being any other user-defined type, is because ExpressionEncoder does not discover the implicit Encoder for the user-defined field types, thus can not use any Encoder to serde the user-defined fields in case class/tuple.
    
    To address this issue, we propose a trait as a contract(between ExpressionEncoder and any other user-defined Encoder) to enable case class/tuple/java bean's ExpressionEncoder to discover the serializer/deserializer/schema from the Encoder of the user-defined type.
    
    With this proposed patch and our minor modification in AvroEncoder, we remove these limitations with cluster-default conf spark.expressionencoder.org.apache.avro.specific.SpecificRecord = com.databricks.spark.avro.AvroEncoder$
    
    This is a patch we have implemented internally and has been used for a few quarters. We want to propose to upstream as we think this is a useful feature to make Dataset more flexible to user types.
    
    
    ## How was this patch tested?
    We have tested this patch internally. Did not write unit test since the user-defined Encoder(AvroEncoder) is defined outside Spark. We look for comments on how to write unit tests for this path.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fangshil/spark SPARK-24256

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21310.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21310
    
----
commit 547ff81e0470bed14371996da89924bfed0cc101
Author: Fangshi Li <fl...@...>
Date:   2018-02-02T02:16:14Z

    [SPARK-24256][SQL]ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    I will investigate how can we add test for this. thoughts are welcomed


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    You need to add tests first. Could you?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil closed the pull request at:

    https://github.com/apache/spark/pull/21310


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    @fangshil also can you point me out the PR not merged into spark-avro please so that I can check when I have some time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    @fangshil, Avro was now in Spark. How does it relate to this PR? Should we go forward?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    @HyukjinKwon  thanks for the update. What do you mean by "Avro was now in Spark"?  The PR I mentioned is https://github.com/databricks/spark-avro/pull/215/files. I have been maintaining this PR internally for a while in my company with Spark 2.3


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    @viirya  thanks for the feedback. We internally customized the AvroEncoder based on the open source PR, since it never gets merged into spark-avro. we propose this feature since it should apply to every user-defined Encoder, not limited to AvroEncoder.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    To summarize our discussion in this pr: 
    Spark-avro is now merged into Spark as a built-in data source. Upstream community is not merging the AvroEncoder to support Avro types in Dataset, instead, the plan is to exposing the user-defined type API to support defining arbitrary user types in Dataset. 
    
    The purpose of this patch is to enable ExpressionEncoder to work together with other types of Encoders, while it seems like upstream prefers to go with UDT. Given this, we can close this PR and we will start the discussion on UDT in another channel



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    I think this PR is blocked by adding UDT officially(it's currently internal). Maybe we can start a thread about UDT in the dev list.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    I'm not sure if I look into correct project. But seems spark-avro project doesn't have `AvroEncoder` yet. The PR going to add `AvroEncoder` looks like this one https://github.com/databricks/spark-avro/pull/217.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    I will take a look later today.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21310: [SPARK-24256][SQL] SPARK-24256: ExpressionEncoder should...

Posted by fangshil <gi...@git.apache.org>.
Github user fangshil commented on the issue:

    https://github.com/apache/spark/pull/21310
  
    @viirya  @cloud-fan before I add test, could you guys take a look and advise if the approach taken in this patch is acceptable?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org