Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/05/24 00:06:51 UTC

[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/13269

    [SPARK-15494][SQL] encoder code cleanup

    ## What changes were proposed in this pull request?
    
    Our encoder framework has evolved a lot. This PR cleans up the code to make it more readable and to emphasize that an encoder should be used as a container of serde expressions.
    
    1. Move validation logic into the analyzer instead of the encoder.
    2. Keep only a `resolveAndBind` method in the encoder instead of separate `resolve` and `bind` methods, as we no longer have the encoder life cycle concept.
    3. `Dataset` no longer needs to keep a resolved encoder, as that concept is gone. A bound encoder is still needed to do serialization outside of the query framework.
    
    
    ## How was this patch tested?
    
    Existing tests.
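For illustration, a minimal sketch of change 2, assuming the Spark 2.0-era `ExpressionEncoder` API of this PR's timeframe (the `toRow`/`fromRow` method names are from that era and may differ in later versions); this requires a Spark runtime on the classpath:

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)

// Build an encoder, then resolve and bind its deserializer against the
// encoder's own schema in a single step (formerly separate resolve + bind).
val enc   = ExpressionEncoder[Person]()
val bound = enc.resolveAndBind()

// The bound encoder can now serialize/deserialize outside a query plan.
val row    = bound.toRow(Person("Alice", 30))
val person = bound.fromRow(row)
```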


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark clean-encoder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13269
    
----
commit 73e9c1abec9ac22bb6e1370b0dcd44714b0acf71
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-05-23T23:38:42Z

    encoder code cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-222252647
  
    As we discussed offline, this PR also enables case insensitive encoder resolution. Would be nice to add a test case for it. Basically something like this:
    
    ```scala
    case class A(a: String)
    
    val data = Seq(
      "{ 'A': 'foo' }",
      "{ 'A': 'bar' }"
    )
    
    val df1 = spark.read.json(sc.parallelize(data))
    df1.printSchema()
    // root
    //  |-- A: string (nullable = true)
    
    val ds1 = df1.as[A]
    ds1.printSchema()
    // root
    //  |-- a: string (nullable = true)
    ```




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59863/consoleFull)** for PR 13269 at commit [`efa9616`](https://github.com/apache/spark/commit/efa961673a9c90f144675f5b2af2ae2200f8cfbb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65460601
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala ---
    @@ -208,7 +209,7 @@ object Encoders {
               BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo)),
    --- End diff --
    
    So we are still using `BoundReference` for serializer expressions?




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221132582
  
    **[Test build #59165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59165/consoleFull)** for PR 13269 at commit [`73e9c1a`](https://github.com/apache/spark/commit/73e9c1abec9ac22bb6e1370b0dcd44714b0acf71).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59920/consoleFull)** for PR 13269 at commit [`efe0cd5`](https://github.com/apache/spark/commit/efe0cd57056673fcd3406ffaa468b431fa71eac9).




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65462066
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala ---
    @@ -208,7 +209,7 @@ object Encoders {
               BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo)),
    --- End diff --
    
    yea, I mentioned it in the PR description




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59920/
    Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59904/
    Test FAILed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59920/consoleFull)** for PR 13269 at commit [`efe0cd5`](https://github.com/apache/spark/commit/efe0cd57056673fcd3406ffaa468b431fa71eac9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221131554
  
    cc @marmbrus @liancheng @yhuai @clockfly 




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59772/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-222041073
  
    **[Test build #59435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59435/consoleFull)** for PR 13269 at commit [`13fed35`](https://github.com/apache/spark/commit/13fed3591c9486dadc725c6d3fb95407f3c4f868).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class GetColumnByOrdinal(ordinal: Int, dataType: DataType) extends LeafExpression`




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13269




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59772 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59772/consoleFull)** for PR 13269 at commit [`c294b3b`](https://github.com/apache/spark/commit/c294b3b8b304d3e223bae165819d3d35cda21c84).




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r64842044
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
    @@ -449,152 +449,151 @@ object ScalaReflection extends ScalaReflection {
           }
         }
     
    -    if (!inputObject.dataType.isInstanceOf[ObjectType]) {
    --- End diff --
    
    Put this `if` into the pattern match, to reduce one indent level.
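The suggested restructuring, as a self-contained sketch (the function and case names are illustrative, not from `ScalaReflection`):

```scala
// Before: a guard `if` around the match adds an extra indent level.
def describeBefore(x: Any): String =
  if (!x.isInstanceOf[String]) {
    x match {
      case i: Int => s"int $i"
      case _      => "other"
    }
  } else "string"

// After: fold the type test into the match as its own case,
// flattening the body by one indent level.
def describeAfter(x: Any): String = x match {
  case _: String => "string"
  case i: Int    => s"int $i"
  case _         => "other"
}
```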




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r64617059
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
    @@ -42,17 +42,9 @@ class KeyValueGroupedDataset[K, V] private[sql](
         private val dataAttributes: Seq[Attribute],
         private val groupingAttributes: Seq[Attribute]) extends Serializable {
     
    -  // Similar to [[Dataset]], we use unresolved encoders for later composition and resolved encoders
    -  // when constructing new logical plans that will operate on the output of the current
    -  // queryexecution.
    -
    -  private implicit val unresolvedKEncoder = encoderFor(kEncoder)
    -  private implicit val unresolvedVEncoder = encoderFor(vEncoder)
    -
    -  private val resolvedKEncoder =
    -    unresolvedKEncoder.resolve(groupingAttributes, OuterScopes.outerScopes)
    -  private val resolvedVEncoder =
    -    unresolvedVEncoder.resolve(dataAttributes, OuterScopes.outerScopes)
    +  // Similar to [[Dataset]], we turn the passed in encoder to `ExpressionEncoder` explicitly.
    +  private implicit val kEnc = encoderFor(kEncoder)
    --- End diff --
    
    Is it better to use a full name like `keyEncoder`?




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    So there will be a follow-up for replacing `BoundReference` in serializer expressions with `GetColumnByOrdinal`, right?




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65492290
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,62 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      var maxOrdinal = -1
    +      deserializer.foreach {
    +        case GetColumnByOrdinal(ordinal, _) => if (ordinal > maxOrdinal) maxOrdinal = ordinal
    +        case _ =>
    +      }
    +      if (maxOrdinal >= 0 && maxOrdinal != inputs.length - 1) {
    +        fail(inputs.toStructType, maxOrdinal)
    +      }
    --- End diff --
    
    Is this better?
    
    ```scala
    val ordinals = deserializer.collect {
      case GetColumnByOrdinal(ordinal, _) => ordinal
    }
    
    ordinals.reduceOption(_ max _).foreach { maxOrdinal =>
      if (maxOrdinal != inputs.length - 1) {
      	fail(inputs.toStructType, maxOrdinal)
      }
    }
    ```
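    The `reduceOption` version works because it yields `None` on an empty collection, so the empty case is skipped without the `-1` sentinel of the original code; a quick standalone check:
    
    ```scala
    val noOrdinals   = Seq.empty[Int]
    val someOrdinals = Seq(0, 2, 1)
    
    // reduceOption returns None when the sequence is empty, so a
    // foreach over it simply never runs the validation body.
    assert(noOrdinals.reduceOption(_ max _).isEmpty)
    assert(someOrdinals.reduceOption(_ max _).contains(2))
    ```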




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59789/consoleFull)** for PR 13269 at commit [`b86bd01`](https://github.com/apache/spark/commit/b86bd011dfda32124ec9a0e275d3a691a0355615).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65637799
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,63 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      val ordinals = deserializer.collect {
    +        case GetColumnByOrdinal(ordinal, _) => ordinal
    +      }.distinct.sorted
    +
    +      if (ordinals.nonEmpty && ordinals != inputs.indices) {
    +        fail(inputs.toStructType, ordinals.last)
    +      }
    +    }
    +
    +    /**
    +     * For each inner Tuple field, we use [[GetStructField]] to get its corresponding struct field
    +     * by position.  However, the actual number of struct fields may be different from the number
    +     * of inner Tuple fields.  This method is used to check the number of struct fields and inner
    +     * Tuple fields, and throw an exception if they do not match.
    +     */
    +    private def validateInnerTupleFields(deserializer: Expression): Unit = {
    +      val exprToOrdinals = scala.collection.mutable.HashMap.empty[Expression, ArrayBuffer[Int]]
    +      deserializer foreach {
    +        case g: GetStructField =>
    +          if (exprToOrdinals.contains(g.child)) {
    +            exprToOrdinals(g.child) += g.ordinal
    +          } else {
    +            exprToOrdinals += g.child -> ArrayBuffer(g.ordinal)
    +          }
    --- End diff --
    
    This `if` expression can be simplified to:
    
    ```scala
    exprToOrdinals.getOrElseUpdate(g.child, ArrayBuffer.empty[Int]) += g.ordinal
    ```
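    For reference, `getOrElseUpdate` inserts the default only when the key is absent, which is exactly what the `contains`/update branches did; a standalone check (with a `String` key standing in for the `Expression` child):
    
    ```scala
    import scala.collection.mutable
    import scala.collection.mutable.ArrayBuffer
    
    val exprToOrdinals = mutable.HashMap.empty[String, ArrayBuffer[Int]]
    
    // The first call inserts an empty buffer; later calls reuse it.
    exprToOrdinals.getOrElseUpdate("child", ArrayBuffer.empty[Int]) += 0
    exprToOrdinals.getOrElseUpdate("child", ArrayBuffer.empty[Int]) += 1
    
    assert(exprToOrdinals("child") == ArrayBuffer(0, 1))
    ```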




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65637841
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,63 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      val ordinals = deserializer.collect {
    +        case GetColumnByOrdinal(ordinal, _) => ordinal
    +      }.distinct.sorted
    +
    +      if (ordinals.nonEmpty && ordinals != inputs.indices) {
    +        fail(inputs.toStructType, ordinals.last)
    +      }
    +    }
    +
    +    /**
    +     * For each inner Tuple field, we use [[GetStructField]] to get its corresponding struct field
    +     * by position.  However, the actual number of struct fields may be different from the number
    +     * of inner Tuple fields.  This method is used to check the number of struct fields and inner
    +     * Tuple fields, and throw an exception if they do not match.
    +     */
    +    private def validateInnerTupleFields(deserializer: Expression): Unit = {
    +      val exprToOrdinals = scala.collection.mutable.HashMap.empty[Expression, ArrayBuffer[Int]]
    --- End diff --
    
    Let's import `scala.collection.mutable`.




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65638928
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,63 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      val ordinals = deserializer.collect {
    +        case GetColumnByOrdinal(ordinal, _) => ordinal
    +      }.distinct.sorted
    +
    +      if (ordinals.nonEmpty && ordinals != inputs.indices) {
    +        fail(inputs.toStructType, ordinals.last)
    +      }
    +    }
    +
    +    /**
    +     * For each inner Tuple field, we use [[GetStructField]] to get its corresponding struct field
    +     * by position.  However, the actual number of struct fields may be different from the number
    +     * of inner Tuple fields.  This method is used to check the number of struct fields and inner
    +     * Tuple fields, and throw an exception if they do not match.
    +     */
    +    private def validateInnerTupleFields(deserializer: Expression): Unit = {
    +      val exprToOrdinals = scala.collection.mutable.HashMap.empty[Expression, ArrayBuffer[Int]]
    +      deserializer foreach {
    +        case g: GetStructField =>
    +          if (exprToOrdinals.contains(g.child)) {
    +            exprToOrdinals(g.child) += g.ordinal
    +          } else {
    +            exprToOrdinals += g.child -> ArrayBuffer(g.ordinal)
    +          }
    +        case _ =>
    +      }
    +      exprToOrdinals.foreach {
    +        case (expr, ordinals) =>
    +          val schema = expr.dataType.asInstanceOf[StructType]
    +          val sortedOrdinals: Seq[Int] = ordinals.distinct.sorted
    +          if (sortedOrdinals.nonEmpty && sortedOrdinals != schema.indices) {
    --- End diff --
    
    The `.nonEmpty` check is unnecessary since empty ordinal vectors are never put into the hash map.
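    A standalone sketch of why the check is redundant (using `String` keys in place of `Expression` and a hypothetical two-field struct, both assumptions for illustration): every buffer gains an ordinal at the moment its key is inserted, so it can never be empty when the validation runs:

    ```scala
    import scala.collection.mutable
    import scala.collection.mutable.ArrayBuffer

    // Keys only enter the map together with their first ordinal, so every
    // buffer in the map is non-empty by construction.
    val exprToOrdinals = mutable.HashMap.empty[String, ArrayBuffer[Int]]
    Seq(("s", 0), ("s", 2)).foreach { case (k, v) =>
      exprToOrdinals.getOrElseUpdate(k, ArrayBuffer.empty[Int]) += v
    }
    assert(exprToOrdinals.values.forall(_.nonEmpty))

    // Validation: sorted distinct ordinals must equal the struct's field indices.
    val fieldCount = 2 // hypothetical inner struct with two fields
    val sortedOrdinals = exprToOrdinals("s").distinct.sorted
    assert(sortedOrdinals != (0 until fieldCount)) // ordinals 0 and 2 don't line up
    ```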




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merged build finished. Test PASSed.






[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65485798
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala ---
    @@ -208,7 +209,7 @@ object Encoders {
               BoundReference(0, ObjectType(classOf[AnyRef]), nullable = true), kryo = useKryo)),
    --- End diff --
    
    Oh I see.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r64940498
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
    @@ -191,6 +189,26 @@ case class ExpressionEncoder[T](
     
       if (flat) require(serializer.size == 1)
     
    +  /**
    +   * Returns a new copy of this encoder, where the `deserializer` is resolved and bound to the
    +   * given schema.
    +   *
    +   * Note that, ideally encoder is used as a container of serde expressions, the resolution and
    +   * binding stuff should happen inside query framework.  However, in some cases we need to
    +   * use encoder as a function to do serialization directly(e.g. Dataset.collect), then we can use
    +   * this method to do resolution and binding outside of query framework.
    +   */
    +  def resolveAndBind(
    +      attrs: Seq[Attribute] = schema.toAttributes,
    +      analyzer: Analyzer = SimpleAnalyzer): ExpressionEncoder[T] = {
    --- End diff --
    
    Where do we pass in an existing analyzer?




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Yea, but it may not happen before 2.0. It needs some more refactoring of the object operator execution model. For example, the serializer in `AppendColumns` has no corresponding attribute; its input is an object we got from the given lambda function.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/13269
  
    https://github.com/apache/spark/pull/13402 is merged, and I have one more PR to send.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221131551
  
    **[Test build #59165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59165/consoleFull)** for PR 13269 at commit [`73e9c1a`](https://github.com/apache/spark/commit/73e9c1abec9ac22bb6e1370b0dcd44714b0acf71).




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221158098
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59167/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221132596
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59165/
    Test FAILed.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221158097
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59789/
    Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59906/consoleFull)** for PR 13269 at commit [`6c793a8`](https://github.com/apache/spark/commit/6c793a8ebd5c00680699ecd5b2f08dce2d421b27).




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65462283
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
    @@ -191,6 +189,26 @@ case class ExpressionEncoder[T](
     
       if (flat) require(serializer.size == 1)
     
    +  /**
    +   * Returns a new copy of this encoder, where the `deserializer` is resolved and bound to the
    +   * given schema.
    +   *
    +   * Note that, ideally encoder is used as a container of serde expressions, the resolution and
    +   * binding stuff should happen inside query framework.  However, in some cases we need to
    +   * use encoder as a function to do serialization directly(e.g. Dataset.collect), then we can use
    +   * this method to do resolution and binding outside of query framework.
    +   */
    +  def resolveAndBind(
    +      attrs: Seq[Attribute] = schema.toAttributes,
    +      analyzer: Analyzer = SimpleAnalyzer): ExpressionEncoder[T] = {
    --- End diff --
    
    in Dataset




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/13269
  
    Are we going to break this PR into multiple smaller PRs?




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59863/
    Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Just rebased this branch.




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65639249
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
    @@ -191,6 +189,26 @@ case class ExpressionEncoder[T](
     
       if (flat) require(serializer.size == 1)
     
    +  /**
    +   * Returns a new copy of this encoder, where the `deserializer` is resolved and bound to the
    +   * given schema.
    +   *
    +   * Note that, ideally encoder is used as a container of serde expressions, the resolution and
    +   * binding stuff should happen inside query framework.  However, in some cases we need to
    +   * use encoder as a function to do serialization directly(e.g. Dataset.collect), then we can use
    +   * this method to do resolution and binding outside of query framework.
    +   */
    +  def resolveAndBind(
    +      attrs: Seq[Attribute] = schema.toAttributes,
    +      analyzer: Analyzer = SimpleAnalyzer): ExpressionEncoder[T] = {
    --- End diff --
    
    One thing to note is that `SimpleAnalyzer` uses case-sensitive resolution, and it's hard-coded, while `Analyzer` is configurable and uses case-insensitive resolution by default.
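    To illustrate the behavioral difference with a hypothetical name-matching sketch (the `resolve` helper and its attributes are invented for illustration, not Spark's actual resolution code): a case-sensitive lookup fails to match `Value` against `value`, while a case-insensitive one succeeds:

    ```scala
    // Hypothetical attribute lookup illustrating the two resolution modes.
    def resolve(attrs: Seq[String], name: String, caseSensitive: Boolean): Option[String] =
      if (caseSensitive) attrs.find(_ == name)
      else attrs.find(_.equalsIgnoreCase(name))

    val attrs = Seq("value", "key")
    assert(resolve(attrs, "Value", caseSensitive = true).isEmpty)            // SimpleAnalyzer-like
    assert(resolve(attrs, "Value", caseSensitive = false).contains("value")) // default Analyzer-like
    ```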




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-222030468
  
    **[Test build #59435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59435/consoleFull)** for PR 13269 at commit [`13fed35`](https://github.com/apache/spark/commit/13fed3591c9486dadc725c6d3fb95407f3c4f868).




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-222041212
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59435/
    Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59789/consoleFull)** for PR 13269 at commit [`b86bd01`](https://github.com/apache/spark/commit/b86bd011dfda32124ec9a0e275d3a691a0355615).




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59863/consoleFull)** for PR 13269 at commit [`efa9616`](https://github.com/apache/spark/commit/efa961673a9c90f144675f5b2af2ae2200f8cfbb).




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221157983
  
    **[Test build #59167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59167/consoleFull)** for PR 13269 at commit [`3f51e4d`](https://github.com/apache/spark/commit/3f51e4db7a90dfd6ddea855b04a67bb7ca414beb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.






[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-222041209
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221132593
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59904/consoleFull)** for PR 13269 at commit [`fb26103`](https://github.com/apache/spark/commit/fb261031fd9c4fbd8082c5eac284e0d8a93bd252).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59904/consoleFull)** for PR 13269 at commit [`fb26103`](https://github.com/apache/spark/commit/fb261031fd9c4fbd8082c5eac284e0d8a93bd252).




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59906/
    Test PASSed.




[GitHub] spark pull request: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13269#issuecomment-221132316
  
    **[Test build #59167 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59167/consoleFull)** for PR 13269 at commit [`3f51e4d`](https://github.com/apache/spark/commit/3f51e4db7a90dfd6ddea855b04a67bb7ca414beb).




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59906/consoleFull)** for PR 13269 at commit [`6c793a8`](https://github.com/apache/spark/commit/6c793a8ebd5c00680699ecd5b2f08dce2d421b27).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59769/
    Test PASSed.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59769/consoleFull)** for PR 13269 at commit [`e321e4c`](https://github.com/apache/spark/commit/e321e4c77e7e6456d210ed1e2b0f4854c261d2fe).




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65638864
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,63 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      val ordinals = deserializer.collect {
    +        case GetColumnByOrdinal(ordinal, _) => ordinal
    +      }.distinct.sorted
    +
    +      if (ordinals.nonEmpty && ordinals != inputs.indices) {
    +        fail(inputs.toStructType, ordinals.last)
    +      }
    +    }
    +
    +    /**
    +     * For each inner Tuple field, we use [[GetStructField]] to get its corresponding struct field
    +     * by position.  However, the actual number of struct fields may be different from the number
    +     * of inner Tuple fields.  This method is used to check the number of struct fields and inner
    +     * Tuple fields, and throw an exception if they do not match.
    +     */
    +    private def validateInnerTupleFields(deserializer: Expression): Unit = {
    +      val exprToOrdinals = scala.collection.mutable.HashMap.empty[Expression, ArrayBuffer[Int]]
    +      deserializer foreach {
    +        case g: GetStructField =>
    +          if (exprToOrdinals.contains(g.child)) {
    +            exprToOrdinals(g.child) += g.ordinal
    +          } else {
    +            exprToOrdinals += g.child -> ArrayBuffer(g.ordinal)
    +          }
    +        case _ =>
    +      }
    +      exprToOrdinals.foreach {
    +        case (expr, ordinals) =>
    +          val schema = expr.dataType.asInstanceOf[StructType]
    +          val sortedOrdinals: Seq[Int] = ordinals.distinct.sorted
    +          if (sortedOrdinals.nonEmpty && sortedOrdinals != schema.indices) {
    +            fail(schema, sortedOrdinals.last)
    --- End diff --
    
    This can be simplified to:
    
    ```scala
    val structChildToOrdinals =
      deserializer
        .collect { case g: GetStructField => g }
        .groupBy(_.child)
        .mapValues(_.map(_.ordinal).distinct.sorted)
    
    structChildToOrdinals.foreach { case (expr, ordinals) =>
      val schema = expr.dataType.asInstanceOf[StructType]
      if (ordinals != schema.indices) {
        fail(schema, ordinals.last)
      }
    }
    ```
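    For readers skimming the archive, the `groupBy`-based validation suggested above can be exercised outside Catalyst. The sketch below uses hypothetical stand-ins (`Field`, `Schema`, `firstMismatch`) for `GetStructField` and `StructType`; it only illustrates the ordinal bookkeeping, not the real analyzer code:
    
    ```scala
    // Illustrative stand-ins for Catalyst's GetStructField / StructType.
    case class Field(childName: String, ordinal: Int)
    case class Schema(numFields: Int) { def indices: Range = 0 until numFields }
    
    // Group field accesses by the struct they read from, then require the distinct
    // sorted ordinals to cover the struct's full index range exactly.
    def firstMismatch(fields: Seq[Field], schemaOf: String => Schema): Option[String] = {
      val childToOrdinals = fields
        .groupBy(_.childName)
        .map { case (child, fs) => child -> fs.map(_.ordinal).distinct.sorted }
      childToOrdinals.collectFirst {
        case (child, ordinals) if ordinals != schemaOf(child).indices =>
          s"struct '$child' has ${schemaOf(child).numFields} fields " +
            s"but Tuple${ordinals.last + 1} was expected"
      }
    }
    ```
    
    Reading fields 0 and 1 of a two-field struct yields `None`; reading only field 0 of the same struct yields a mismatch message.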




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59769/consoleFull)** for PR 13269 at commit [`e321e4c`](https://github.com/apache/spark/commit/e321e4c77e7e6456d210ed1e2b0f4854c261d2fe).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    Merging to master and branch-2.0.




[GitHub] spark issue #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13269
  
    **[Test build #59772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59772/consoleFull)** for PR 13269 at commit [`c294b3b`](https://github.com/apache/spark/commit/c294b3b8b304d3e223bae165819d3d35cda21c84).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13269: [SPARK-15494][SQL] encoder code cleanup

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13269#discussion_r65492813
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1884,10 +1884,62 @@ class Analyzer(
               } else {
                 inputAttributes
               }
    -          val unbound = deserializer transform {
    -            case b: BoundReference => inputs(b.ordinal)
    +
    +          validateTupleColumns(deserializer, inputs)
    +          val ordinalResolved = deserializer transform {
    +            case GetColumnByOrdinal(ordinal, _) => inputs(ordinal)
    +          }
    +          val attrResolved = resolveExpression(
    +            ordinalResolved, LocalRelation(inputs), throws = true)
    +          validateInnerTupleFields(attrResolved)
    +          attrResolved
    +      }
    +    }
    +
    +    private def fail(schema: StructType, maxOrdinal: Int): Unit = {
    +      throw new AnalysisException(s"Try to map ${schema.simpleString} to Tuple${maxOrdinal + 1}, " +
    +        "but failed as the number of fields does not line up.")
    +    }
    +
    +    /**
    +     * For each Tuple field, we use [[GetColumnByOrdinal]] to get its corresponding column by
    +     * position.  However, the actual number of columns may be different from the number of Tuple
    +     * fields.  This method is used to check the number of columns and fields, and throw an
    +     * exception if they do not match.
    +     */
    +    private def validateTupleColumns(deserializer: Expression, inputs: Seq[Attribute]): Unit = {
    +      var maxOrdinal = -1
    +      deserializer.foreach {
    +        case GetColumnByOrdinal(ordinal, _) => if (ordinal > maxOrdinal) maxOrdinal = ordinal
    +        case _ =>
    +      }
    +      if (maxOrdinal >= 0 && maxOrdinal != inputs.length - 1) {
    +        fail(inputs.toStructType, maxOrdinal)
    +      }
    --- End diff --
    
    Actually we should also check that each ordinal from 0 to `inputs.length - 1` appears in the deserializer expression:
    
    ```scala
    val ordinals = deserializer.collect {
      case GetColumnByOrdinal(ordinal, _) => ordinal
    }.distinct.sorted
    
    if (ordinals.nonEmpty && ordinals != (0 until inputs.length)) {
      fail(inputs.toStructType, ordinals.max)
    }
    ```
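    The stricter check suggested here is easy to sanity-check in isolation. Below is a minimal sketch (the function name and the bare ordinal list are stand-ins for collecting `GetColumnByOrdinal` ordinals from a deserializer):
    
    ```scala
    // The deserializer's column ordinals must cover exactly 0 until numInputs:
    // no gaps (a column that is never read) and no ordinal past the last input.
    def tupleColumnsLineUp(ordinals: Seq[Int], numInputs: Int): Boolean = {
      val sorted = ordinals.distinct.sorted
      sorted.isEmpty || sorted == (0 until numInputs)
    }
    ```
    
    For example, `tupleColumnsLineUp(Seq(1, 0), 2)` holds, while `tupleColumnsLineUp(Seq(0, 2), 3)` does not: ordinal 1 is never read, even though a max-ordinal-only check would pass it.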

