You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2015/12/12 19:37:10 UTC

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/10278

    [SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when `Not` is included in Parquet filter pushdown

    When applying the operator `Not`, the current generation rule for Parquet filters simply applies `Not` to all the inclusive/underlying filters. 
    
    For example, when the filter is ```"not (a = 2 and b in ('1', '2'))"```, the generated filter is ```not (a=2)```. When we push down this filter to Parquet, it will remove all the eligible rows satisfying the condition ```not(b in ('1', '2'))```
    
    In the current 1.6, the Optimizer's rule BooleanSimplification added the following new rules in the PR(https://github.com/apache/spark/pull/5700): (BTW, should we move this to analyzer?) 
    ```
            not(A and B) => not(A) or not(B)
            not(A or B) => not(A) and not(B)
    ```
    I do not think we should redo it in the Parquet filter generation. Thus, I just added a condition to avoid the incorrect results in case the Optimizer is unable to handle all the cases. 
    
    **Question**: how can we include the PR https://github.com/apache/spark/pull/5700 into 1.5? Do you need me to submit a new PR for 1.5? Or you can do it? This is a critical PR because the result will be incorrect without the fix.
    
    CC the original reviewers of https://github.com/apache/spark/pull/5700: @marmbrus @cloud-fan 
    
    Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark parquetFilterNot

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10278
    
----
commit 79be2c3581551ab24273f3da472269814d0d736e
Author: gatorsmile <ga...@gmail.com>
Date:   2015-12-12T18:10:16Z

    added a condition for `Not` operator in ParquetFilter.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165936243
  
    **[Test build #48036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48036/consoleFull)** for PR 10278 at commit [`64cd5e6`](https://github.com/apache/spark/commit/64cd5e66aba9d5fa05a4cfe9543eabaab0c3b825).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `public class JavaTwitterHashTagJoinSentiments `\n  * `case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)`\n  * `abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback `\n  * `case class UnresolvedWindowExpression(`\n  * `case class WindowExpression(`\n  * `case class Lead(input: Expression, offset: Expression, default: Expression)`\n  * `case class Lag(input: Expression, offset: Expression, default: Expression)`\n  * `abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction `\n  * `abstract class RowNumberLike extends AggregateWindowFunction `\n  * `trait SizeBasedWindowFunction extends AggregateWindowFunction `\n  * `case class RowNumber() extends RowNumberLike `\n  * `case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction `\n  * `case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedW
 indowFunction `\n  * `abstract class RankLike extends AggregateWindowFunction `\n  * `case class Rank(children: Seq[Expression]) extends RankLike `\n  * `case class DenseRank(children: Seq[Expression]) extends RankLike `\n  * `case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164264969
  
    **[Test build #47623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)** for PR 10278 at commit [`50733c6`](https://github.com/apache/spark/commit/50733c6239b721ecb1f0691bb3d4680235c15a18).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165334450
  
    https://github.com/apache/spark/pull/10344 shows that the test fails with out 1.5.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165320679
  
    Yeah, you can say that. 
    
    For example, the original filter is ```not (a = 2 and b in ('1', '2'))```. However, Spark 1.5.2 only pushes down ```not (a = 2)```. Thus, the returned data from Parquet is incomplete and thus data loss happens.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165936254
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165917268
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165944650
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165944480
  
    **[Test build #48045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48045/consoleFull)** for PR 10278 at commit [`64cd5e6`](https://github.com/apache/spark/commit/64cd5e66aba9d5fa05a4cfe9543eabaab0c3b825).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164202727
  
    Ok, will make a try to force it. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165297940
  
    @gatorsmile Can you create a pr for 1.5? We can do this. The first commit is to just have your test case. Then, our jenkins should fail. Finally, we add your fix and jenkins should be good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165297020
  
    Yeah, it works without https://github.com/apache/spark/pull/5700. 
    
    However, I still hope we can backport https://github.com/apache/spark/pull/5700. Without it, it will not push down the these filters to Parquet. That means, it will have a negative performance impact.  
    
    If you need it, I also can create another JIRA for backporting https://github.com/apache/spark/pull/5700
    
    Please let me know your opinions. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164232814
  
    After reading the source codes, it does not make sense we do not push down `IN` to Parquet in the above example:
    ```"not (a = 2 and b in ('1', '2'))"```. 
    
    We should fix these two issues in both 1.5.x and 1.6.x


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10278#discussion_r47877766
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -265,7 +268,10 @@ private[sql] object ParquetFilters {
               rhsFilter <- createFilter(schema, rhs)
             } yield FilterApi.or(lhsFilter, rhsFilter)
     
    -      case sources.Not(pred) =>
    +      // Here, we assume the Optimizer's rule BooleanSimplification has pushed `Not` operator
    +      // to the inner most level.
    +      case sources.Not(pred)
    +        if !pred.isInstanceOf[sources.And] && !pred.isInstanceOf[sources.Or] =>
    --- End diff --
    
    Nit: The following version might be clearer:
    
    ```scala
          // (Copy your comment here)
          case sources.Not(_: sources.And) | sources.Not(_: sources.Or) =>
            None
    
          case sources.Not(pred) =>
            createFilter(schema, pred).map(FilterApi.not)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165885859
  
    You can change it to just handle `In`. Actually, I am wondering what was the problem with the other pr that we decided to not add `In` with that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164175763
  
    **[Test build #47615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)** for PR 10278 at commit [`79be2c3`](https://github.com/apache/spark/commit/79be2c3581551ab24273f3da472269814d0d736e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164190262
  
    **[Test build #47616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)** for PR 10278 at commit [`2ff70bf`](https://github.com/apache/spark/commit/2ff70bfac2c9be9e75cc7840dd3844854f565325).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165298276
  
    Sure, will do it tonight. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164202142
  
    This only happens in 1.5. Do you need me to write a test case for 1.5?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165301819
  
    @gatorsmile So, the problem is Spark SQL generates wrong parquet filter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164188282
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164188283
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47615/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166789471
  
    @liancheng Done. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166811795
  
    Thanks! I'm merging this to master, and will attribute this one to @viirya.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165908776
  
    After reading the contents, that PR was closed due to the issue in String filters. Please correct me if my understanding is wrong. @liancheng 
    
    Do you want to deliver this by submitting another PR, @viirya ? Either is fine for me. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165955121
  
    @gatersmile Yeah, you're right. #8950 initially aimed to fix other issues, and also included the fix for `In`.
    
    BTW, it's almost always strictly better to open small PRs that contains ONLY a single change than bigger ones that contains multiple changes. The former are much easier to review and get merged. (One liner PRs are super welcomed!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166816608
  
    Thanks @gatorsmile @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164213703
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165575480
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165379694
  
    @gatorsmile Sorry for the late reply and thanks for the nice catch!
    
    The `In` predicate push down issue had been tracked by SPARK-11164, and done as part of PR #8956. Unfortunately that we didn't merge that PR due to other problems in it. Could you please add SPARK-11164 to your PR title?
    
    For the `Not` push-down rule:
    
    1. I'm for adding it to branch-1.5 since it's a pretty safe one.
    2. I think we might also want to add more general [CNF][1] conversion rule to master, which should be done in a separate PR, of course.
    
    Since we don't have existential / universal quantifier in our predicates, I think CNF conversion in Spark SQL can be as simple as keeping pushing `Not` and `Or` inward (or downward) using De Morgan's laws and the distributive law:
    
    ```scala
    object CNFConversion extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case filter: Filter =>
          import org.apache.spark.sql.catalyst.dsl.expressions._
    
          filter.copy(condition = filter.condition.transform {
            case Not(x Or y) => !x && !y
            case Not(x And y) => !x || !y
            case (x And y) Or z => (x || z) && (y || z)
            case x Or (y And z) => (x || y) && (x || z)
          })
      }
    }
    ```
    
    (Notice that this version doesn't handle common expression elimination.)
    
    That said, the `Not` push-down rule is actually a subset of CNF conversion. There had once been a PR aimed to add CNF conversion for data source filter push-down only, but wasn't merged (see SPARK-6624 and PR #6713). As @marmbrus commented there, CNF conversion might be worth adding to the optimizer.
    
    @rxin @marmbrus Not super confident about the CNF conversion conclusion above, please correct me if I'm wrong.
    
    [1]: https://en.wikipedia.org/wiki/Conjunctive_normal_form



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165550249
  
    **[Test build #47939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47939/consoleFull)** for PR 10278 at commit [`e219ac1`](https://github.com/apache/spark/commit/e219ac1e098abc581a1bf999802f0ec914967eea).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164265233
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10278#discussion_r47950692
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -256,6 +256,9 @@ private[sql] object ParquetFilters {
           case sources.GreaterThanOrEqual(name, value) =>
             makeGtEq.lift(dataTypeOf(name)).map(_(name, value))
     
    +      case sources.In(name, valueSet) =>
    +        makeInSet.lift(dataTypeOf(name)).map(_(name, valueSet.toSet))
    +
           case sources.And(lhs, rhs) =>
             (createFilter(schema, lhs) ++ createFilter(schema, rhs)).reduceOption(FilterApi.and)
    --- End diff --
    
    Looks like this is the real problem. It is not safe to just push one side at here. This is the place where we drop that `In` because `createFilter(schema, In(...))` returns `None`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164204557
  
    I might find another bug in Parquet pushdown. Will submit another PR later when I can confirm it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164265237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47623/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165575481
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47939/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165955483
  
    Thank you for your suggestions! @liancheng : )
    
    Next time, I will not mix multiple fixes in the same PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164255495
  
    **[Test build #47623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)** for PR 10278 at commit [`50733c6`](https://github.com/apache/spark/commit/50733c6239b721ecb1f0691bb3d4680235c15a18).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164198539
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165575314
  
    **[Test build #47939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47939/consoleFull)** for PR 10278 at commit [`e219ac1`](https://github.com/apache/spark/commit/e219ac1e098abc581a1bf999802f0ec914967eea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165340920
  
    @yhuai Based on my understanding, if including the fix of `IN` in this PR, we have covered all the filters. The only exceptions are the ones explained in https://issues.apache.org/jira/browse/SPARK-11153
    
    Since 1.6 already has the fix (https://github.com/apache/spark/pull/5700) that can push `Not` operator to the inner most level, we can say 1.6 is not affected by the bug even if some filters are not pushed down. 
    
    Please correct me if anything is not appropriate, @liancheng Thank you! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164604885
  
    @liancheng can you look at this?  Seems pretty serious if we are returning wrong answers.
    
    /cc @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165339904
  
    @gatorsmile @liancheng Looks like we only push a part of the predicate down if we do not understand other parts. Is there any other kind of combinations that can trigger this issue? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164198466
  
    **[Test build #47616 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)** for PR 10278 at commit [`2ff70bf`](https://github.com/apache/spark/commit/2ff70bfac2c9be9e75cc7840dd3844854f565325).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164209122
  
    **[Test build #47618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)** for PR 10278 at commit [`c9af771`](https://github.com/apache/spark/commit/c9af771adb998b54c8bfcbdf64ac4fc1b82d14ad).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10278#discussion_r47930611
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -265,7 +268,10 @@ private[sql] object ParquetFilters {
               rhsFilter <- createFilter(schema, rhs)
             } yield FilterApi.or(lhsFilter, rhsFilter)
     
    -      case sources.Not(pred) =>
    +      // Here, we assume the Optimizer's rule BooleanSimplification has pushed `Not` operator
    +      // to the inner most level.
    +      case sources.Not(pred)
    +        if !pred.isInstanceOf[sources.And] && !pred.isInstanceOf[sources.Or] =>
    --- End diff --
    
    Yeah, your version is cleaner! Let me change the code. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164198540
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47616/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165296448
  
    @gatorsmile how about we also create a jira against 1.5? So, we can use that to test the fix (later when we merge PR, we can merge this one if there is no conflict. Otherwise, we merge that one to 1.5 and merge this one to 1.6 and master).
    
    Also, do we need to backport #5700 to 1.5? Without it, your fix also works, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165808370
  
    @yhuai @liancheng Regarding this PR, should I keep it open? This PR also has another fix that can push down the filter `in`. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164213638
  
    **[Test build #47618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)** for PR 10278 at commit [`c9af771`](https://github.com/apache/spark/commit/c9af771adb998b54c8bfcbdf64ac4fc1b82d14ad).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164213704
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47618/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165944651
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48045/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165936393
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165507592
  
    Thank you for your detailed explanation! @liancheng 
    
    I have the same opinion as @marmbrus  . We should include CNF conversion into our optimizer. Some RDBMS systems might put it during the phase of query rewriting. Below is my 2 cents about CNF. 
    
    Generally, CNF conversion is an important concept in query optimization, especially when we support indexing in Spark. When (multi-attribute) indexes exist over some subset of conjucts, we can employ these indexes to improve the selectivity. 
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10278


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164178387
  
    After reading the other push-down PR, I think it also needs a review from @liancheng . Welcome any comment! Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164202075
  
    Do you have a test case that actually shows a wrong answer being computed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164202611
  
    Any bug fix should have a regression test.  We could always change the optimizer in a way that does not hide this bug anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165916173
  
    **[Test build #48036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48036/consoleFull)** for PR 10278 at commit [`64cd5e6`](https://github.com/apache/spark/commit/64cd5e66aba9d5fa05a4cfe9543eabaab0c3b825).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165917270
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48034/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10278#discussion_r47950784
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -256,6 +256,9 @@ private[sql] object ParquetFilters {
           case sources.GreaterThanOrEqual(name, value) =>
             makeGtEq.lift(dataTypeOf(name)).map(_(name, value))
     
    +      case sources.In(name, valueSet) =>
    +        makeInSet.lift(dataTypeOf(name)).map(_(name, valueSet.toSet))
    +
           case sources.And(lhs, rhs) =>
             (createFilter(schema, lhs) ++ createFilter(schema, rhs)).reduceOption(FilterApi.and)
    --- End diff --
    
    @gatorsmile I am also going to try to have a fix. We can later see which one is a more suitable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165936833
  
    **[Test build #48045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48045/consoleFull)** for PR 10278 at commit [`64cd5e6`](https://github.com/apache/spark/commit/64cd5e66aba9d5fa05a4cfe9543eabaab0c3b825).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166813408
  
    Thank you! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-165936255
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48036/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164204488
  
    Great! : )
    
    Let me also post the test case I did in the latest 1.5. Without my fix, the first call of show() did not return the row (2, 0). 
    
    ```scala
        withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
          withTempPath { dir =>
            val path = s"${dir.getCanonicalPath}/table1"
            (1 to 5).map(i => (i, (i%2).toString)).toDF("a", "b").write.parquet(path)
    
            val df = sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))")
            df.show()
    
            val df1 = sqlContext.read.parquet(path).where("not (a = 2) or not(b in ('1'))")
            df1.show()
          }
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164203719
  
    Its fine if the test only fails on 1.5


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166788867
  
    @gatorsmile Could you please update the PR description?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11164] [SQL] Add InSet pushdown filter ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-166737113
  
    @yhuai Do you think it can be merged? 
    
    Please give the credit to @viirya who submitted it in the past. Thank you! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SPARK-11164] [SQL] Fixed the Pa...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10278#discussion_r47954007
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -256,6 +256,9 @@ private[sql] object ParquetFilters {
           case sources.GreaterThanOrEqual(name, value) =>
             makeGtEq.lift(dataTypeOf(name)).map(_(name, value))
     
    +      case sources.In(name, valueSet) =>
    +        makeInSet.lift(dataTypeOf(name)).map(_(name, valueSet.toSet))
    +
           case sources.And(lhs, rhs) =>
             (createFilter(schema, lhs) ++ createFilter(schema, rhs)).reduceOption(FilterApi.and)
    --- End diff --
    
    Glad to see your fix. : ) Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10278#issuecomment-164188245
  
    **[Test build #47615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)** for PR 10278 at commit [`79be2c3`](https://github.com/apache/spark/commit/79be2c3581551ab24273f3da472269814d0d736e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org