You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2016/02/28 07:22:38 UTC

[GitHub] spark pull request: [SPARK-13536] [SQL] Missing Sort Attributes in...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/11416

    [SPARK-13536] [SQL] Missing Sort Attributes in Transform Clauses

    #### What changes were proposed in this pull request?
    When missing sort attributes are not in the output of `Transform`, we should not bypass it and check its child nodes. Like `subquery`, we should not continue finding the missing attributes in its child nodes.
    
    #### How was this patch tested?
    This is an internal code bug. So far, unable to trigger the external behavior difference. The newly added code is just for covering and verifying it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark missingSortAttrInTransform

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11416.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11416
    
----
commit 52942e9f6be8ca76bfed5a0c723f993c238c6088
Author: gatorsmile <ga...@gmail.com>
Date:   2016-02-28T06:08:06Z

    Merge remote-tracking branch 'upstream/master' into dataSetPruning

commit 1199ca8ab4d6ab6c15a1a391f4cf63dbc9fd9851
Author: gatorsmile <ga...@gmail.com>
Date:   2016-02-28T06:08:38Z

    missing sort attributes for script transform

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile closed the pull request at:

    https://github.com/apache/spark/pull/11416


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11416#issuecomment-189838804
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52139/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11416#issuecomment-189838735
  
    **[Test build #52139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52139/consoleFull)** for PR 11416 at commit [`1199ca8`](https://github.com/apache/spark/commit/1199ca8ab4d6ab6c15a1a391f4cf63dbc9fd9851).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11416#discussion_r54352359
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -687,7 +687,8 @@ class Analyzer(
             resolved
           } else {
             plan match {
    -          case u: UnaryNode if !u.isInstanceOf[SubqueryAlias] =>
    +          case u: UnaryNode
    --- End diff --
    
    Based on my understanding, the standard is appending/pruning the attributes from `outputSet` does not impact the results of the existing/remaining attributes. Based on this, we can categorize the existing `UnaryNode` into three groups:
    
    - **Group 1**: To add a new attribute into the `outputSet` of one node, we just need to add a new attribute into its child `outputSet`. 
    
      - **Type 1.1**: Adding new attributes will not have any impact on the existing logics of this node. For example, `Filter` and `Sort`.
      - **Type 1.2**: Adding new attributes will impact the parent nodes. For example, `SubqueryAlias`. It will add `alias` into `Quantifier` of attributes in its `outputSet`
    
    - **Group 2**: The `outputSet` of one node is fully/partially controlled by its class parameters. 
    
      - **Type 2.1**: Adding new attributes will not have any impact on the existing logics of this node. For example, `Project` and `Window`.
      - **Type 2.2**: Adding new attributes is restricted by the other class parameters. For example, `Aggregate` and `Generate`. For `Aggregate` nodes, we only can add attributes if they are part of `groupingExpressions`. Adding attributes into `groupingExpressions` will change the results instead of appending new columns.
    
    `ScriptTransformation`, `MapPartitions`, `AppendColumns` and `MapGroups` belong to **Type 2.2**. `script` and `func` restrict us to add new attributes. Thus, I think we should put them into the blacklist.
    
    `EvaluatePython` belongs to **Type 1.1**. Its output is determined by its `child.output` and `resultAttribute`. It should be safe. 
    
    As what I mentioned above, `GroupingSets` and `Pivot` are not visible to this rule. Thus, we do not need to add them into the blacklist. 
    
    Please correct me if my understanding is wrong. @davies Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11416#issuecomment-189838803
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11416#discussion_r54345049
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -687,7 +687,8 @@ class Analyzer(
             resolved
           } else {
             plan match {
    -          case u: UnaryNode if !u.isInstanceOf[SubqueryAlias] =>
    +          case u: UnaryNode
    --- End diff --
    
    how many kinds of `UnaryNode` we need to recursive here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11416#discussion_r54345163
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -687,7 +687,8 @@ class Analyzer(
             resolved
           } else {
             plan match {
    -          case u: UnaryNode if !u.isInstanceOf[SubqueryAlias] =>
    +          case u: UnaryNode
    --- End diff --
    
    There are 31 `UnaryNode` types. Actually, I did check most of `UnaryNode`. 
    
    `ResolveSortReferences` resolving missing attributes if and only if their child nodes have been resolved, as shown below. 
    ```scala
    case s @ Sort(order, _, child) if !s.resolved && child.resolved 
    ```
    
    Thus, most of ineligible `UnaryNode` have been replaced. For example, `GroupingSets` and `Pivot` will not be seen in this resolution. Thus, ignoring them is fine. Besides `Subquery`, I think `ScriptTransformation`, `MapPartitions`, `AppendColumns` and `MapGroups` need to be excluded. 
    
    Or should we use a white list here? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11416#discussion_r54345602
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -687,7 +687,8 @@ class Analyzer(
             resolved
           } else {
             plan match {
    -          case u: UnaryNode if !u.isInstanceOf[SubqueryAlias] =>
    +          case u: UnaryNode
    --- End diff --
    
    How about `EvaluatePython`? What's the specific rule to blacklist a `UnaryNode`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Stop Resolving Missing Sor...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11416#issuecomment-189822335
  
    **[Test build #52139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52139/consoleFull)** for PR 11416 at commit [`1199ca8`](https://github.com/apache/spark/commit/1199ca8ab4d6ab6c15a1a391f4cf63dbc9fd9851).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13536] [SQL] Missing Sort Attributes in...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/11416#issuecomment-189793752
  
    After submitting this PR, I just realized maybe we should disable the other `UnaryNode`: `MapPartitions`, `AppendColumns` and `MapGroups`. Is that right? @davies @cloud-fan Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org