You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2016/06/23 07:35:47 UTC

[GitHub] spark pull request #13872: [SPARK-16164][SQL] Filter pushdown should keep th...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/13872

    [SPARK-16164][SQL] Filter pushdown should keep the ordering in the logical plan

    ## What changes were proposed in this pull request?
    
    Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
    ```scala
    import org.apache.spark.ml.feature._
    val df1 = (0 until 3).map(_.toString).toDF
    val indexer = new StringIndexer()
      .setInputCol("value")
      .setOutputCol("idx")
      .setHandleInvalid("skip")
      .fit(df1)
    val df2 = (0 until 5).map(_.toString).toDF
    val predictions = indexer.transform(df2)
    predictions.show() // this is okay
    predictions.where('idx > 2).show() // this will throw an exception
    ```
    
    Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.
    
    ## How was this patch tested?
    
    Pass the Jenkins tests (including a new testcase).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-16164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13872
    
----
commit 2b21fd7e46af7a082aad3b9c3cb443bda3408ac3
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-06-23T07:32:48Z

    [SPARK-16164][SQL] Filter pushdown should keep the ordering in the logical plan

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Filter pushdown should keep the order...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    **[Test build #61105 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61105/consoleFull)** for PR 13872 at commit [`2b21fd7`](https://github.com/apache/spark/commit/2b21fd7e46af7a082aad3b9c3cb443bda3408ac3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Thank you, @rxin . I hope so, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Filter pushdown should keep the order...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Sure, I fully agree with your view. That's the declarative language.
    
    However, we can provide more *natural* order as a default order like in this PR. As you see, without considering this issue, we had better update `CombineFilters` in this way.
    
    By the way, theoretically, we can reorder the unit predicate of the conjunctive form by their dependency if we can do more sophisticate analysis. But, that's not considered here.
    
    I think there is no harm to change `CombineFilters` like this. And this PR definitely reduces the possibility of this kind of errors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    **[Test build #61105 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61105/consoleFull)** for PR 13872 at commit [`2b21fd7`](https://github.com/apache/spark/commit/2b21fd7e46af7a082aad3b9c3cb443bda3408ac3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Filter pushdown should keep the order...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    I think I had better change the title of this PR. (I just copied from the JIRA.)
    Does that will reduce your concern a little bit?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Filter pushdown should keep the order...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    cc: @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    My major concern is that even the reported case is "fixed" by this PR, in general user applications shouldn't rely on predicate evaluation order. Other than that, the change made in `CombineFilters` LGTM.
    
    @rxin Shall we have this in 2.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Merged into master and branch-2.0. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    I think it'd be difficult to guarantee ordering all the time, but since the change is small we might as well take this tiny step to ensure stronger compatibility.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13872: [SPARK-16164][SQL] Update `CombineFilters` to try...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13872


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    For any conclusion, thank you for review, @mengxr and @liancheng !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    @liancheng @rxin In the code, I didn't use UDF explicitly in a filter expression. It is like the following:
    
    ~~~
    filter b > 0
    set a = udf(b)
    filter a > 2
    ~~~
    
    and the optimizer merged them into
    
    ~~~
    filter udf(b) > 2 and b > 0
    ~~~
    
    This could happen for any UDFs that throw exceptions. Are we assuming that UDFs never throw exceptions?
    
    The patch LGTM and I think we should merge it into 2.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Filter pushdown should keep the order...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    @dongjoon-hyun Thanks for the work! However, I think the optimizer should have the freedom to reorder predicate evaluation order. For example, we may evaluate cheap predicates first in order to shortcut expensive predicates.
    
    The problem here is that, to enable this kind of optimization, we actually require filter predicates to be both deterministic and full, while in case of SPARK-16164, one of the UDF is a partial function (only defined on a partial range of all possible input values), and thus rely on other predicates evaluated before itself to reduce input range.
    
    In a word, it's not safe for applications to make the assumption that filter predicates are always evaluated in exactly the same order they are defined in the original query.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61105/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13872: [SPARK-16164][SQL] Update `CombineFilters` to try to con...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13872
  
    Thank you, @mengxr , @liancheng , and @rxin .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org