Posted to reviews@spark.apache.org by jiangxb1987 <gi...@git.apache.org> on 2016/06/24 12:10:00 UTC

[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/13893

    [SPARK-14172][SQL] Hive table partition predicate not passed down correctly

    ## What changes were proposed in this pull request?
    
    Currently, the partition predicate is not passed down correctly when the condition contains nondeterministic parts. This PR changes the logic in `collectProjectsAndFilters()` to add the deterministic parts to the filters, so that the partition predicate can be passed down correctly.
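    
    For illustration, a minimal sketch of the originally proposed split (not the exact patch; `pushDownCandidates` is a hypothetical helper name, and the discussion below refines the idea to also respect predicate order):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{And, Expression}
    
    // Mirrors Catalyst's PredicateHelper.splitConjunctivePredicates:
    // split a conjunctive condition into its individual conjuncts.
    def splitConjunctivePredicates(condition: Expression): Seq[Expression] = condition match {
      case And(left, right) =>
        splitConjunctivePredicates(left) ++ splitConjunctivePredicates(right)
      case other => other :: Nil
    }
    
    // Keep only the deterministic conjuncts as push-down candidates.
    def pushDownCandidates(condition: Expression): (Seq[Expression], Seq[Expression]) =
      splitConjunctivePredicates(condition).partition(_.deterministic)
    ```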
    
    ## How was this patch tested?
    
    New test in `PruningSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark bugfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13893.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13893
    
----
commit ce6f0b8186dd18301e5087d16a5139057c42ab77
Author: 蒋星博 <ji...@meituan.com>
Date:   2016-06-24T11:53:08Z

    [SPARK-14172][SQL] Hive table partition predicate not passed down correctly

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @cloud-fan could you please have a look at this PR?




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @cloud-fan I pushed a commit that applies predicate pushdown only to the deterministic parts placed before any non-deterministic predicates. Is it safe to do this optimization?




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68538346
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala ---
    @@ -141,6 +141,14 @@ class PruningSuite extends HiveComparisonTest with BeforeAndAfter {
           Seq("2008-04-08", "11"),
           Seq("2008-04-09", "11")))
     
    +  createPruningTest("Partition pruning - with nondeterministic fields",
    +    "SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand(7) < 1",
    --- End diff --
    
    cc @liancheng @yhuai 




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @jiangxb1987 do we still have this bug?




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73821493
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    @cloud-fan I can now insert a `Scanner` operator over `CatalogRelation` in the `Optimizer`, but I noticed a relation may also be something like `l: LogicalRelation(relation: CatalogRelation, _, _)`. In this case we can't inspect the class `LogicalRelation`, because it lives in the `spark-sql` package while the `Optimizer` is in `spark-catalyst`, so we are not able to determine whether a `Scanner` should be added. I don't think we want to add a `Scanner` over every `BaseRelation`.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Ping, @jiangxb1987.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    **[Test build #62596 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62596/consoleFull)** for PR 13893 at commit [`db28228`](https://github.com/apache/spark/commit/db28228800a4d7fe0aafa15543084de4539a8930).




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68557086
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala ---
    @@ -141,6 +141,14 @@ class PruningSuite extends HiveComparisonTest with BeforeAndAfter {
           Seq("2008-04-08", "11"),
           Seq("2008-04-09", "11")))
     
    +  createPruningTest("Partition pruning - with nondeterministic fields",
    +    "SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand(7) < 1",
    --- End diff --
    
    Yes, you are right. I thought the deterministic part could always be pushed down safely, but it cannot; in fact, the order of the parts must also be considered. For example:
    `rand() < 0.01 AND partition_col = 'some_value'`
    should not be pushed down, but
    `partition_col = 'some_value' AND rand() < 0.01`
    still could be.
    Thank you for your kind reply!
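    
    For illustration, a hedged sketch of that order-respecting split (`splitByFirstNondeterministic` is an invented helper name; `span` keeps the longest deterministic prefix):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.Expression
    
    // Given the conjuncts of a condition in their original order, only the
    // prefix before the first non-deterministic predicate may be pushed down.
    // e.g. Seq(partition_col = 'some_value', rand() < 0.01)
    //   -> (Seq(partition_col = 'some_value'), Seq(rand() < 0.01))
    def splitByFirstNondeterministic(
        conjuncts: Seq[Expression]): (Seq[Expression], Seq[Expression]) =
      conjuncts.span(_.deterministic)
    ```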




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73117855
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    After thinking about it more, I don't think it's safe to do so. `collectProjectsAndFilters` should return all deterministic projects and filters above a scan node, and the returned filter conditions are not only used for filter pushdown but are also treated as the complete set of filters above that scan node. So the `rest` conditions here won't get executed.
    
    cc @liancheng to confirm this.




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73156772
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    I also think that silently dropping nondeterministic filters can be dangerous. Maybe we should just return all operators beneath the top-most nondeterministic filter as the bottom operator?
    
    For example, say we have a plan tree like this:
    
    ```
    Project a, b
     Filter a > 1
      Filter b < 3
       Filter RAND(42) > 0.5
        Filter c < 2
         TableScan t
    ```
    
    We should return the following result:
    
    ```
    (
      // Project list
      Seq(a, b),
    
      // Deterministic filters
      Seq(b < 3, a > 1),
    
      // The top-most nondeterministic operator filter with all operators beneath
      Filter(
        RAND(42) > 0.5,
        Filter(c < 2,
          TableScan(t)))
    )
    ```
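    
    For illustration, a rough sketch of that collection strategy under the stated assumption (simplified: no `Project` or alias handling; the first non-deterministic `Filter` and everything beneath it become the bottom operator):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    
    // Collect deterministic filter conditions top-down; stop at the first
    // node that is not a deterministic Filter and return it as the bottom.
    def collectDeterministicFilters(plan: LogicalPlan): (Seq[Expression], LogicalPlan) =
      plan match {
        case Filter(condition, child) if condition.deterministic =>
          val (filters, bottom) = collectDeterministicFilters(child)
          (filters :+ condition, bottom)
        case other =>
          (Nil, other)
      }
    ```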





[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by AlekseiS <gi...@git.apache.org>.
Github user AlekseiS commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68438486
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,12 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
    +        val (deterministicFilters, nondeterministicFilters) = splitConjunctivePredicates(condition)
    +          .partition(_.deterministic)
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +        val substitutedFilters = deterministicFilters.map(substitute(aliases)(_))
    +        (fields, filters ++ deterministicFilters, other, aliases)
    --- End diff --
    
    There's likely a bug here: I think you want to use `substitutedFilters` instead of `deterministicFilters`. I think you can also add a test that covers substitution.
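    
    For clarity, a sketch of the corrected return using the diff's own variable names (so the alias-substituted conditions, not the raw ones, are collected):
    
    ```scala
    val substitutedFilters = deterministicFilters.map(substitute(aliases)(_))
    (fields, filters ++ substitutedFilters, other, aliases)
    ```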




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73318885
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    @cloud-fan Do you mean something like adding the following in `basicLogicalOperators`:
    `case class Scanner(
        projectionList: Seq[NamedExpression],
        filters: Seq[Expression],
        child: LogicalPlan)
      extends UnaryNode`
    And passing that to the planner instead of applying `PhysicalOperation`?
    
    I'm willing to take this work. Thanks!




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73126756
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    @cloud-fan Thanks for your comment! But I did search the codebase, and I found the returned `filters` are only used for predicate pushdown or partition pruning; in both cases it should be safe for us to drop the `rest` condition. Thank you!




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Good point; it looks like we can also improve the `PushDownPredicate` rule accordingly.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Predicates should not be reordered if a condition contains non-deterministic parts. For example, `rand() < 0.1 AND a = 1` should not be reordered to `a = 1 AND rand() < 0.1`, because the number of calls to `rand()` would change and thus different rows would be output.
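    
    For illustration, a self-contained sketch of the effect in plain Scala, with a seeded `scala.util.Random` standing in for `rand()` (illustrative only, not Spark's evaluation code):
    
    ```scala
    import scala.util.Random
    
    val rows = Seq((1, "x"), (2, "y"), (1, "z"))
    
    // rand() < 0.5 AND a = 1: the RNG is consulted once per input row.
    val rng1 = new Random(42)
    val order1 = rows.filter(r => rng1.nextDouble() < 0.5 && r._1 == 1)
    
    // a = 1 AND rand() < 0.5: the RNG is consulted only for rows with a = 1,
    // so it advances a different number of times over the same input.
    val rng2 = new Random(42)
    val order2 = rows.filter(r => r._1 == 1 && rng2.nextDouble() < 0.5)
    
    // order1 and order2 can differ, which is why such reordering is unsafe.
    ```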




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62596/
    Test PASSed.




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73136264
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    Can you write a test for this? The logic in `DataSourceStrategy` shows that, when we get a scan node with projects and filters on top of it, we rebuild the project and filter (with the project lists and filter conditions merged) and wrap the scan node with them. So any filter condition that isn't returned by `collectProjectsAndFilters` won't get executed.
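    
    For illustration, a hedged sketch of the rebuild step being described (`rebuildPlan` is an invented helper, not the actual `DataSourceStrategy` code):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{And, Expression}
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    
    // Merge the collected conditions into a single Filter above the scan.
    // Any condition not present in `filters` never reappears in the plan.
    def rebuildPlan(filters: Seq[Expression], scan: LogicalPlan): LogicalPlan =
      filters.reduceOption(And).map(Filter(_, scan)).getOrElse(scan)
    ```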




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @AlekseiS, I think we should always consider the risk and take care of it, even if it's not a common case. We can't assume what users expect, and we should prepare for the worst case.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    If `PushDownPredicate` should be improved, I would like to send a PR in one or two days. Does a new task need to be created? @cloud-fan




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    No, the predicate order doesn't matter. Our optimizer can reorder the predicates to run them more efficiently.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    ping @jiangxb1987 @cloud-fan 




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73305098
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    After an offline discussion with @liancheng, we think it would be better to have a wrapper node for scans (table scan or file scan), and this wrapper node can also hold the project list and filter conditions. Then, in the optimizer, we can improve the `ColumnPruning` and `PushDownPredicate` rules to push down into this wrapper node. After this we won't need `PhysicalOperation` anymore, and the planner can match on the wrapper node directly.
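    
    For illustration, a rough sketch of what such an optimizer rule might look like, assuming the hypothetical `Scanner` wrapper node sketched earlier in this thread (`CollapseIntoScanner` is an invented name, not a real Spark rule):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.PredicateHelper
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule
    
    object CollapseIntoScanner extends Rule[LogicalPlan] with PredicateHelper {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // Fold deterministic filter conditions into the wrapper node.
        case Filter(condition, s: Scanner) if condition.deterministic =>
          s.copy(filters = s.filters ++ splitConjunctivePredicates(condition))
        // Absorb a projection if the wrapper doesn't hold one yet.
        case Project(projectList, s: Scanner) if s.projectionList.isEmpty =>
          s.copy(projectionList = projectList)
      }
    }
    ```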




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    **[Test build #62596 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62596/consoleFull)** for PR 13893 at commit [`db28228`](https://github.com/apache/spark/commit/db28228800a4d7fe0aafa15543084de4539a8930).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    With [PR#14012](https://github.com/apache/spark/pull/14012), the order between deterministic and non-deterministic predicates will not be changed arbitrarily, so I think we could apply this improvement, which pushes down predicates placed before the non-deterministic parts of partition conditions, so that we can do partition pruning even when the condition contains non-deterministic fields. @liancheng @cloud-fan






[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @cloud-fan I've sent a PR to add the `Scanner` operator in #14619. Please have a look at it when you have time, thanks!




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @jiangxb1987 Please feel free to create a new JIRA ticket and PR for this, thanks!




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68556327
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala ---
    @@ -141,6 +141,14 @@ class PruningSuite extends HiveComparisonTest with BeforeAndAfter {
           Seq("2008-04-08", "11"),
           Seq("2008-04-09", "11")))
     
    +  createPruningTest("Partition pruning - with nondeterministic fields",
    +    "SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand(7) < 1",
    --- End diff --
    
    Good question. I think technically we can't push down any predicates that are placed after a non-deterministic predicate; otherwise the number of input rows may change and lead to wrong query results.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    retest this please




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by AlekseiS <gi...@git.apache.org>.
Github user AlekseiS commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @liancheng I think partition predicates are a bit different. If you explicitly specify a partition predicate, like `date=2016-06-27`, do you really expect other partitions to be scanned, regardless of whether you use a non-deterministic function? Most likely not; so if a partition filter is specified and it's deterministic, it's expected to always be used.
    For this reason, I think it's always correct to push filters that reference only partition columns down to the scan. @jiangxb1987 do you think you could modify the patch to do that?




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    @heary-cao tried to resolve the same issue in https://github.com/apache/spark/pull/18969 
    
    ping @jiangxb1987 




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73176693
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    Thank you @cloud-fan for pointing that out; I realize my previous thoughts were wrong. I fully agree with @liancheng's improvement idea. I will update the related code, along with new test cases, tomorrow.




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68538310
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala ---
    @@ -141,6 +141,14 @@ class PruningSuite extends HiveComparisonTest with BeforeAndAfter {
           Seq("2008-04-08", "11"),
           Seq("2008-04-09", "11")))
     
    +  createPruningTest("Partition pruning - with nondeterministic fields",
    +    "SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand(7) < 1",
    --- End diff --
    
    I'm not sure it's safe to push this down. For non-deterministic expressions, the order (or number) of input rows matters. If we push down the deterministic part of the filter condition, then the input rows to the remaining filter condition will change, which may lead to a wrong answer.




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    Yes, this still exists. Let me find some time to resolve it.




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r73325120
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,17 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +
    +        // Deterministic parts of filter condition placed before non-deterministic predicates could
    +        // be pushed down safely.
    +        val (pushDown, rest) =
    --- End diff --
    
    yup, thanks!




[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/13893
  
    ping @cloud-fan 




[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13893#discussion_r68483394
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -64,10 +64,12 @@ object PhysicalOperation extends PredicateHelper {
             val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
             (Some(substitutedFields), filters, other, collectAliases(substitutedFields))
     
    -      case Filter(condition, child) if condition.deterministic =>
    +      case Filter(condition, child) =>
    +        val (deterministicFilters, nondeterministicFilters) = splitConjunctivePredicates(condition)
    +          .partition(_.deterministic)
             val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
    -        val substitutedCondition = substitute(aliases)(condition)
    -        (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
    +        val substitutedFilters = deterministicFilters.map(substitute(aliases)(_))
    +        (fields, filters ++ deterministicFilters, other, aliases)
    --- End diff --
    
    @AlekseiS You are right! I have fixed this, thank you!

