You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/05/05 14:26:06 UTC

[GitHub] spark pull request #17874: [SPARK-20612][SQL][WIP] Throw exception when ther...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/17874

    [SPARK-20612][SQL][WIP] Throw exception when there is unresolvable attributes in Filter

    ## What changes were proposed in this pull request?
    
    We have a rule in `Analyzer` that adds missing attributes in a Filter into its child plan. It makes the following codes work:
    
        val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
        df.select("y").where("x=1")
    
    It should throw an analysis exception instead of implicitly adding the missing attributes into underlying plan.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-20612

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17874.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17874
    
----
commit 010d046edfc273ecb12a16a36c47af72eab57881
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-05-05T09:29:58Z

    Throw exception when there is unresolvable attributes in Filter.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76495/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17874#discussion_r115136334
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1023,8 +1023,6 @@ class Analyzer(
        * clause.  This rule detects such queries and adds the required attributes to the original
        * projection, so that they will be available during sorting. Another projection is added to
        * remove these attributes after sorting.
    -   *
    -   * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
    --- End diff --
    
    Since we introduce this by accident, I do not think we can remove it now. It could break the applications that are built on it. cc @rxin @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    in postgres, `select a from t where b > 0` can work, I think it's reasonable if `df.select("y").where("x=1")` works in spark.
    
    ```
    Seq(1).toDF("c1").createOrReplaceTempView("onerow")
    sql(
      """
        | select 1
        |        from   (select 1 from onerow t2 LIMIT 1)
        |        where  t2.c1=1""".stripMargin)
    ```
    this one we should not support, we should not add missing attributes though subqueries.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76538/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17874#discussion_r115135849
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1023,8 +1023,6 @@ class Analyzer(
        * clause.  This rule detects such queries and adds the required attributes to the original
        * projection, so that they will be available during sorting. Another projection is added to
        * remove these attributes after sorting.
    -   *
    -   * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
    --- End diff --
    
    For example,
    
    ```SQL
    select type, avg (price)
    from titles
    group by type
    having sum (total_sales) > 10000
    ```
    
    This example is copied from [Sybase ASE](http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc36272.1570/html/commands/X96182.htm). I believe this is part of Transact-SQL


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    **[Test build #76495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76495/testReport)** for PR 17874 at commit [`010d046`](https://github.com/apache/spark/commit/010d046edfc273ecb12a16a36c47af72eab57881).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    yea that example looks very weird and we should fix it, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    Maybe another point of view is, we can split `df.select("y").where("x=1")` to two different DataFrames:
    
        val onlyY = df.select("y")  // The schema of onlyY is just "y" attribute
        onlyY.where("x=1") // Then we can filter on a non-existing attribute
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    **[Test build #76538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76538/testReport)** for PR 17874 at commit [`f19976a`](https://github.com/apache/spark/commit/f19976a7e0818f36768d339bdcd883b31197de7e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17874#discussion_r115135600
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1023,8 +1023,6 @@ class Analyzer(
        * clause.  This rule detects such queries and adds the required attributes to the original
        * projection, so that they will be available during sorting. Another projection is added to
        * remove these attributes after sorting.
    -   *
    -   * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
    --- End diff --
    
    This is by design. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    @cloud-fan This rule could make the query work:
    
        Seq(1).toDF("c1").createOrReplaceTempView("onerow")
        sql(
          """
            | select 1
            |        from   (select 1 from onerow t2 LIMIT 1)
            |        where  t2.c1=1""".stripMargin)
    
    But the where condition should not be able to refer `t2.c1` which is only available in the inner scope.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    @cloud-fan Since you all concern about breaking existing applications, I'd close this. But I think we should not add missing attributes though subqueries like I showed above. I'll create another PR to fix it. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    is it a bug? IIRC this is intentional, so that the dataframe behavior is consistent with SQL.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    **[Test build #76538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76538/testReport)** for PR 17874 at commit [`f19976a`](https://github.com/apache/spark/commit/f19976a7e0818f36768d339bdcd883b31197de7e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    The rule is added by #12235. From the description and code comment, it is just used for HAVING clause that access a grouping column that is not presented in SELECT clause, instead of a general rule to add missing attributes to Filter.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/17874


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL] Throw exception when there is unresol...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    `select a from t where b > 0` works. However, it can be seen logically as:
    
        Project [a]
          Filter [b > 0]
            Relation t [a, b]
    
    It seems to me Spark also parses the above SQL query like this way.
    
    There is an order of evaluation in SQL systems. E.g, MySQL:
    
        select a from test where b > 2;   // works. where is evaluated before select
        select a from test having b > 2;  // not works. having is evaluated after select
    
    `df.select("y").where("x=1")` sematically asks a projection of  just `y`  attribute before filtering. It seems to me that it is different with the SQL query.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17874: [SPARK-20612][SQL][WIP] Throw exception when there is un...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17874
  
    **[Test build #76495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76495/testReport)** for PR 17874 at commit [`010d046`](https://github.com/apache/spark/commit/010d046edfc273ecb12a16a36c47af72eab57881).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17874#discussion_r115136069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1023,8 +1023,6 @@ class Analyzer(
* clause. This rule detects such queries and adds the required attributes to the original
* projection, so that they will be available during sorting. Another projection is added to
* remove these attributes after sorting.
- *
- * The HAVING clause could also used a grouping columns that is not presented in the SELECT.
--- End diff --

> This is by design.

For the HAVING clause could also used a grouping columns that is not presented in the SELECT, yes.

For other general cases, I doubt it.

We have other rule doing this. That is why the tests are passed after this rule is removed. The above query also works without this rule.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17874: [SPARK-20612][SQL] Throw exception when there is ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17874#discussion_r115204690
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala ---
    @@ -83,7 +83,7 @@ class FPGrowthSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
         )).toDF("id", "items")
         val model = new FPGrowth().setMinSupport(0.7).fit(dataset)
         val prediction = model.transform(df)
    -    assert(prediction.select("prediction").where("id=3").first().getSeq[String](0).isEmpty)
    +    assert(prediction.where("id=3").select("prediction").first().getSeq[String](0).isEmpty)
    --- End diff --
    
    I'm worried that existing spark applications may already use this pattern in the code, so no matter it's a bug or not, it seems a feature now and we can't break it...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org