You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/04/27 18:19:51 UTC

[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/21184

    [WIP][SPARK-24051][SQL] Replace Aliases with the same exprId

    ## What changes were proposed in this pull request?
    
    If a user reuses the same column in different selects, it can happen that we have `Alias` with the same `exprId`. These aliases can of course reference different columns/expressions (as in the use case presented in the JIRA). If any of them is a foldable, all of them are replaced with the foldable value in the Optimizer.
    
    The PR proposes to replace the conflicting aliases generating new ids for the conflicting ones.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-24051

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21184
    
----
commit d676b6277a682894d409e314e64ece7857a97841
Author: Marco Gaido <ma...@...>
Date:   2018-04-25T16:14:55Z

    [SPARK-24051][SQL] Replace Aliases with the same exprId

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r202077386
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    I feel the root cause is in `FoldablePropagation`. We should only replace attribute with literal from children, not siblings. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r201961786
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    The main problem which causes the added UT to fail is that `FoldablePropagation` replaces all foldable aliases which are considered to be the same. If the same alias with same exprId is located in different part of the plan (referencing actually different things, despite they have the same id...) this can cause wrong replacement to happen. So in the added UT, the plan is:
    ```
    == Analyzed Logical Plan ==
    a: int, b: int, n: bigint
    Union
    :- Project [a#5, b#17, n#19L]
    :  +- Project [a#5, b#17, n#19L, n#19L]
    :     +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L]
    :        +- Project [a#5, b#6 AS b#17]
    :           +- Project [_1#2 AS a#5, _2#3 AS b#6]
    :              +- LocalRelation [_1#2, _2#3]
    +- Project [a#12, b#17, n#19L]
       +- Project [a#12, b#17, n#19L, n#19L]
          +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L]
             +- Project [a#12, b#14 AS b#17]
                +- Project [a#12, 0 AS b#14]
                   +- Project [value#10 AS a#12]
                      +- LocalRelation [value#10]
    ```
    Please note that in both the branches of the union we have the same `b#17` attribute, but they are referencing different things. As the lower one is a foldable value which evaluates to 0, all the `b#17` are replace with a literal 0, causing a wrong result.
    
    Despite we might fix this specific problem in the related Optimizer rule, I think that in general we assume that items with the same id are the same. So I proposed this solution in order to fix all the possible issues which may arise due to this situation which is not expected.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2727/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r201769371
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
         val df = spark.range(1).select($"id", new Column(Uuid()))
         checkAnswer(df, df.collect())
       }
    +
    +  test("SPARK-24051: using the same alias can produce incorrect result") {
    --- End diff --
    
    This test case failed without your changes? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    **[Test build #89937 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89937/testReport)** for PR 21184 at commit [`d676b62`](https://github.com/apache/spark/commit/d676b6277a682894d409e314e64ece7857a97841).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 closed the pull request at:

    https://github.com/apache/spark/pull/21184


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    @hvanhovell I created this PR to discuss all together the problem in details with a basic solution which can help understanding better the problem. Honestly I am not happy with this change. So if we can find better options, any input would be very appreciated. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    **[Test build #89937 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89937/testReport)** for PR 21184 at commit [`d676b62`](https://github.com/apache/spark/commit/d676b6277a682894d409e314e64ece7857a97841).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    @gatorsmile what do you think about this approach for the problem we hit in https://github.com/apache/spark/pull/21737? Currently, this change doesn't solve all the conflicts, but what do you think about the approach? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r207859370
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    kindly ping @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    **[Test build #93037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93037/testReport)** for PR 21184 at commit [`d676b62`](https://github.com/apache/spark/commit/d676b6277a682894d409e314e64ece7857a97841).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    **[Test build #93037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93037/testReport)** for PR 21184 at commit [`d676b62`](https://github.com/apache/spark/commit/d676b6277a682894d409e314e64ece7857a97841).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89937/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r201957701
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2265,4 +2266,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
         val df = spark.range(1).select($"id", new Column(Uuid()))
         checkAnswer(df, df.collect())
       }
    +
    +  test("SPARK-24051: using the same alias can produce incorrect result") {
    --- End diff --
    
    yes, without the change the result is:
    ```
    +---+---+---+
    |  a|  b|  n|
    +---+---+---+
    |  1|  0|  2|
    |  2|  0|  2|
    |  3|  0|  1|
    +---+---+---+
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r202334300
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    Yes, that is also true. But in many places in the codebase we just compare attributes using `semanticEquals` or in some other cases, even `equals`. Well, if we admit that different attributes can have the same `exprId`, all these places should be checked in order to be sure that the same problem cannot happen there too. Moreover (this is more a nit), the `semanticEquals` or `sameRef` method itself would be wrong according to its semantic, as it may return `true` even when two attributes don't have the same reference. This is the reason why I opted for this solution, which seems to me cleaner as it solves the root cause of the problem. What do you think?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the same ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21184
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93037/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21184: [WIP][SPARK-24051][SQL] Replace Aliases with the ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r201886545
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    what problem does it try to resolve?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org