You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2015/10/01 13:18:15 UTC

[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/8956

    [SPARK-10895][SQL] Push down string filters to Parquet

    JIRA: https://issues.apache.org/jira/browse/SPARK-10895


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 parquet-stringfilter-pushdown

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8956.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8956
    
----
commit f27288e29211d47e24767cf7731914cdf9865bc1
Author: Liang-Chi Hsieh <vi...@appier.com>
Date:   2015-10-01T11:14:14Z

    Push down string filters to Parquet.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-151361934
  
    @liancheng ok. Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148885016
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/8956


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-149368027
  
    Sorry I am in travel. I will submit the test few days after.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145606518
  
      [Test build #43238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43238/console) for   PR 8956 at commit [`eb134b9`](https://github.com/apache/spark/commit/eb134b993720a42154c430e508847f852882c5c1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class StringFilter(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145571273
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145572774
  
      [Test build #43238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43238/consoleFull) for   PR 8956 at commit [`eb134b9`](https://github.com/apache/spark/commit/eb134b993720a42154c430e508847f852882c5c1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145606676
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176719
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -48,6 +48,34 @@ private[sql] object ParquetFilters {
         override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
       }
     
    +  object StringFilter extends Enumeration {
    +    type Mode = Value
    +    val STARTS_WITH, ENDS_WITH, CONTAINS = Value
    +  }
    +
    +  case class StringFilter(
    +    v: java.lang.String,
    +    mode: StringFilter.Mode) extends UserDefinedPredicate[Binary] with Serializable {
    +
    +    private val compare = mode match {
    +      case StringFilter.STARTS_WITH =>
    +        (x: java.lang.String) => x.startsWith(v)
    +      case StringFilter.ENDS_WITH =>
    +        (x: java.lang.String) => x.endsWith(v)
    +      case StringFilter.CONTAINS =>
    +        (x: java.lang.String) => x.contains(v)
    +    }
    +
    +    override def keep(value: Binary): Boolean = {
    +      val str = value.toStringUsingUTF8()
    +      compare(str)
    +    }
    --- End diff --
    
    Instead of using a enum to dispatch the comparator function, I'd prefer to use simple inheritance here, namely:
    
    ```scala
    abstract class StringFilter extends UserDefinedPredicate[Binary] {
      override def canDrop(statistics: Statistics[Binary]): Boolean = false 
      override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false 
    }
    
    case class StringStartsWithFilter(prefix: String) extends StringFilter {
      override def keep(value: Binary): Boolean = value.toStringUsingUTF8.startsWith(value)
    }
    
    case class StringEndsWithFilter(prefix: String) extends StringFilter {
      override def keep(value: Binary): Boolean = value.toStringUsingUTF8.endsWith(value)
    }
    
    case class StringContainsFilter(prefix: String) extends StringFilter {
      override def keep(value: Binary): Boolean = value.toStringUsingUTF8.contains(value)
    }
    ```
    
    The reasons are:
    
    1.  `keep` is on a performance critical code path, an extra level indirection hurts performance.
    1.  `StringStartsWithFilter.canDrop` can leverage statistics information:
    
        ```scala
        override def canDrop(statistics: Statistics[Binary]): Boolean = {
          val max = statistics.getMax.toStringUsingUTF8
          val min = statistics.getMin.toStringUsingUTF8
          max < prefix || !min.startsWith(prefix) && min > prefix
        }
        ```
    
        It would be easier to override `canDrop` in the above version.
    
    *However*, we should NOT add the above `canDrop` for now, because parquet-mr 1.7.0 suffers from [PARQUET-251] [1], which may cause wrong query results because of corrupted binary statistics information.  We may add it after upgrading parquet-mr to 1.8+ (hopefully 1.8.2 since we observed slight performance regression in 1.8.1).
    
    [1]: https://issues.apache.org/jira/browse/PARQUET-251



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148885010
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176705
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -48,6 +48,34 @@ private[sql] object ParquetFilters {
         override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
       }
     
    +  object StringFilter extends Enumeration {
    +    type Mode = Value
    +    val STARTS_WITH, ENDS_WITH, CONTAINS = Value
    +  }
    +
    +  case class StringFilter(
    +    v: java.lang.String,
    --- End diff --
    
    Scala `String` is just a type alias of `java.lang.String`, so we don't need to be explicit here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145254209
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148533198
  
    Off topic but related, `SetInFilter.canDrop` can also leverage statistics information:
    
    ```scala
        private val min = valueSet.min
        private val max = valueSet.max
    
        override def canDrop(statistics: Statistics[T]): Boolean = {
          statistics.getMax.compareTo(min) < 0 || max.compareTo(statistics.getMin) < 0
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891128
  
      [Test build #43880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43880/console) for   PR 8956 at commit [`02bbab8`](https://github.com/apache/spark/commit/02bbab84a82d645250b2afe9c58bed194523da38).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])`
      * `  abstract class StringFilter extends UserDefinedPredicate[Binary] `
      * `  case class StringStartsWithFilter(prefix: String) extends StringFilter `
      * `  case class StringEndsWithFilter(suffix: String) extends StringFilter `
      * `  case class StringContainsFilter(str: String) extends StringFilter `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-151353688
  
    @viirya I think we can add `StringStartsWithFilter` later after #9225 is merged. Also we are considering removing the defensive filtering. But yeah, for now let's keep this one closed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891462
  
    Are there any performance improvements by pushing this down?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150904443
  
    If we don't observe performance improvements, it's definitely not worth it. Can you post your how you measured it, and performance results? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891749
  
    Thanks - that'd be great.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891670
  
    I can run some performance tests later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145213256
  
    cc @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144701972
  
      [Test build #43149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43149/consoleFull) for   PR 8956 at commit [`f27288e`](https://github.com/apache/spark/commit/f27288e29211d47e24767cf7731914cdf9865bc1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144700717
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42183960
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -48,6 +48,34 @@ private[sql] object ParquetFilters {
         override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
       }
     
    +  object StringFilter extends Enumeration {
    +    type Mode = Value
    +    val STARTS_WITH, ENDS_WITH, CONTAINS = Value
    +  }
    +
    +  case class StringFilter(
    +    v: java.lang.String,
    +    mode: StringFilter.Mode) extends UserDefinedPredicate[Binary] with Serializable {
    +
    +    private val compare = mode match {
    +      case StringFilter.STARTS_WITH =>
    +        (x: java.lang.String) => x.startsWith(v)
    +      case StringFilter.ENDS_WITH =>
    +        (x: java.lang.String) => x.endsWith(v)
    +      case StringFilter.CONTAINS =>
    +        (x: java.lang.String) => x.contains(v)
    +    }
    +
    +    override def keep(value: Binary): Boolean = {
    +      val str = value.toStringUsingUTF8()
    +      compare(str)
    +    }
    --- End diff --
    
    Another thing that concerns me is performance. The `keep` method is called over *every* single string value, while `Binary.toStringUsingUTF8` can be pretty slow. I think we can apply the same trick used in `CatalystStringConverter` and use a mutable `UTF8String` here to do all the string operations. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144700733
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148884863
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43878/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150905344
  
    Hmm, I am not sure about that. Because I supposed that Parquet relation will read all data first if no pushdown filters are applied. Then Spark's `Filter` operation will be applied later. Maybe @liancheng can answer this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148883918
  
    @liancheng Thank you for your detailed comments. I've updated this patch. When the tests are passed, please review it again to see if there is any problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144726522
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43149/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42305463
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -112,6 +112,11 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           checkFilterPredicate('_1 === true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 <=> true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 !== true, classOf[NotEq[_]], false)
    +
    +      checkFilterPredicate(('_1 in(true)).asInstanceOf[Predicate],
    --- End diff --
    
    It is reasonable. I will try to make them as `Predicate` in another PR later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891157
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-149368515
  
    @viirya do you mind closing this and reopening it when it's ready?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148885239
  
      [Test build #43880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43880/consoleFull) for   PR 8956 at commit [`02bbab8`](https://github.com/apache/spark/commit/02bbab84a82d645250b2afe9c58bed194523da38).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145254205
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144726375
  
      [Test build #43149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43149/console) for   PR 8956 at commit [`f27288e`](https://github.com/apache/spark/commit/f27288e29211d47e24767cf7731914cdf9865bc1).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class StringFilter(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145264246
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145264590
  
      [Test build #43215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43215/consoleFull) for   PR 8956 at commit [`4d00ed0`](https://github.com/apache/spark/commit/4d00ed08ab1019d69948685e5c099fa5eaa2156a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145255609
  
      [Test build #43213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43213/console) for   PR 8956 at commit [`4d00ed0`](https://github.com/apache/spark/commit/4d00ed08ab1019d69948685e5c099fa5eaa2156a).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class StringFilter(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-151134514
  
    Thank you @liancheng for clear explanation!
    
    So looks like the only benefit of this patch is the reduced memory footprint under certain cases. If you all think it is not worth merging this, we should keep it closed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148884856
  
      [Test build #43878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43878/console) for   PR 8956 at commit [`02bbab8`](https://github.com/apache/spark/commit/02bbab84a82d645250b2afe9c58bed194523da38).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])`
      * `  abstract class StringFilter extends UserDefinedPredicate[Binary] `
      * `  case class StringStartsWithFilter(prefix: String) extends StringFilter `
      * `  case class StringEndsWithFilter(suffix: String) extends StringFilter `
      * `  case class StringContainsFilter(str: String) extends StringFilter `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42174368
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -48,6 +48,34 @@ private[sql] object ParquetFilters {
         override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
       }
     
    +  object StringFilter extends Enumeration {
    +    type Mode = Value
    +    val STARTS_WITH, ENDS_WITH, CONTAINS = Value
    +  }
    +
    +  case class StringFilter(
    +    v: java.lang.String,
    +    mode: StringFilter.Mode) extends UserDefinedPredicate[Binary] with Serializable {
    +
    +    private val compare = mode match {
    +      case StringFilter.STARTS_WITH =>
    +        (x: java.lang.String) => x.startsWith(v)
    +      case StringFilter.ENDS_WITH =>
    +        (x: java.lang.String) => x.endsWith(v)
    +      case StringFilter.CONTAINS =>
    +        (x: java.lang.String) => x.contains(v)
    +    }
    +
    +    override def keep(value: Binary): Boolean = {
    +      val str = value.toStringUsingUTF8()
    --- End diff --
    
    Nit: remove `()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145254455
  
      [Test build #43213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43213/consoleFull) for   PR 8956 at commit [`4d00ed0`](https://github.com/apache/spark/commit/4d00ed08ab1019d69948685e5c099fa5eaa2156a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145571210
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-149410904
  
    Sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42177620
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -112,6 +112,11 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           checkFilterPredicate('_1 === true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 <=> true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 !== true, classOf[NotEq[_]], false)
    +
    +      checkFilterPredicate(('_1 in(true)).asInstanceOf[Predicate],
    --- End diff --
    
    Another nit, usually we prefer the following styles if we have to wrap method arguments:
    
    ```scala
    checkFilterPredicate(
      '_1.in(true).asInstanceOf[Predicate],
      classOf[UserDefinedByInstance[_, _]],
      true)
    ```
    
    or
    
    ```scala
    checkFilterPredicate(
      '_1.in(true).asInstanceOf[Predicate], classOf[UserDefinedByInstance[_, _]], true)
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150905206
  
    Because we can pre-filtering the data? Without pushdown, the whole data will be loaded into memory and then has been filtered later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150904969
  
    ok. Thanks. Because we found that with pushdown filters, we can avoid the OOM problem when processing large data in our daily usage. I am wondering if it is helpful to others too.
    
    I will post the the performance test later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148884922
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145278786
  
      [Test build #43215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43215/console) for   PR 8956 at commit [`4d00ed0`](https://github.com/apache/spark/commit/4d00ed08ab1019d69948685e5c099fa5eaa2156a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class StringFilter(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42304900
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -216,6 +255,9 @@ private[sql] object ParquetFilters {
           case sources.IsNotNull(name) =>
             makeNotEq.lift(dataTypeOf(name)).map(_(name, null))
     
    +      case sources.In(name, values) =>
    +        makeInSet.lift(dataTypeOf(name)).map(_(name, values.toSet))
    +
    --- End diff --
    
    OK. I will open another ticket.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148891158
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43880/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145264250
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148884244
  
      [Test build #43878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43878/consoleFull) for   PR 8956 at commit [`02bbab8`](https://github.com/apache/spark/commit/02bbab84a82d645250b2afe9c58bed194523da38).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150900439
  
    @rxin I am curious that although I don't observe significant performance improvement from a simple projection + filter operation by now with simple experiment, by making this filters pushed down to Parquet side, do we retrieve less data and reduce the memory footprint? If so, even under the same performance level, is this patch still worth merging?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-144726518
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-151039952
  
    Well, it depends. The situation is a little bit tricky to explain. In general there are two cases:
    
    1.  String filters with high selectivity, namely most records can be dropped
    
        - Performance
    
          Usually, I'd expect there's no noticeable performance gain, because each record is checked against the filter pushed down, and string operations themselves are CPU bound. So the performance should be similar to the case no filter is pushed down at all.
    
          However, a properly implemented `StringStartsWithFilter.canDrop` (as I mentioned in [this comment][1]) can bring big performance win since it can drop entire row groups whenever possible. But this requires us to bump parquet-mr to 1.8+ first, which is done in #9225.
    
        - Memory footprint
    
          What @viirya observed is reasonable.  One benefit of Parquet filters is that, they are performed before record assembling, namely we can drop a record before converting the underlying Parquet column values into an `InternalRow`.  I think that's the reason why @viirya observed that OOM was gone.
    
          (BTW, `ParquetRelation` processes all the data using iterators, so we don't read all the data first and then apply the filters. My theory is that it's the `InternalRow` materialization costs more memory.)
    
    1.  String filters with low selectivity, namely most records can NOT be dropped
    
        - Performance
    
          In this case, I'd expect performance regression. This is because currently Spark SQL tends to be pessimistic, and always applies all the filters again even if some of them are pushed down. In this case, almost all records are filtered twice. Since string operations are CPU bound, this can be time consuming.
    
        - Memory footprint
    
          Since the string filters in this PR "steal" the underlying byte arrays without copying them, I'd expect the memory footprint is similar to the normal case.
    
    [1]: https://github.com/apache/spark/pull/8956#discussion_r42176719



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145606678
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43238/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150905113
  
    How does pushdown avoid OOM?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145255617
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148884862
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176731
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -164,7 +192,18 @@ private[sql] object ParquetFilters {
             FilterApi.gtEq(binaryColumn(n), Binary.fromByteArray(v.asInstanceOf[Array[Byte]]))
       }
     
    +  private val makeStringFilter: PartialFunction[DataType,
    +      (String, String, StringFilter.Mode) => FilterPredicate] = {
    --- End diff --
    
    For types as complex as this partial function one, you may want to define a type alias for better readability. But as mentioned in my comment above, we probably need specialized filter makers for each string UDP here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145263979
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145278883
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43215/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145278882
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42177215
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -112,6 +112,11 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           checkFilterPredicate('_1 === true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 <=> true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 !== true, classOf[NotEq[_]], false)
    +
    +      checkFilterPredicate(('_1 in(true)).asInstanceOf[Predicate],
    --- End diff --
    
    BTW, I noticed that `In`, `StartsWith`, `EndsWith`, `Contains`, `RLike` and `Like` are all `Expression` but not `Predicate`. I don't see a good reason for this, and we can probably make all of them `Predicate`, but I may miss something here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176707
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -48,6 +48,34 @@ private[sql] object ParquetFilters {
         override def inverseCanDrop(statistics: Statistics[T]): Boolean = false
       }
     
    +  object StringFilter extends Enumeration {
    +    type Mode = Value
    +    val STARTS_WITH, ENDS_WITH, CONTAINS = Value
    +  }
    +
    +  case class StringFilter(
    +    v: java.lang.String,
    +    mode: StringFilter.Mode) extends UserDefinedPredicate[Binary] with Serializable {
    --- End diff --
    
    We don't need to extend from `Serializable` explicitly since case classes are inherently serializable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-145255619
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43213/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148883954
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176890
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -112,6 +112,11 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           checkFilterPredicate('_1 === true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 <=> true, classOf[Eq[_]], true)
           checkFilterPredicate('_1 !== true, classOf[NotEq[_]], false)
    +
    +      checkFilterPredicate(('_1 in(true)).asInstanceOf[Predicate],
    --- End diff --
    
    Nit: `('_1 in(true))` looks a little bit weird, maybe `'_1.in(true)`? This also applies to all the `in` predicates below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SQL] Push down string filters to...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8956#discussion_r42176877
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -216,6 +255,9 @@ private[sql] object ParquetFilters {
           case sources.IsNotNull(name) =>
             makeNotEq.lift(dataTypeOf(name)).map(_(name, null))
     
    +      case sources.In(name, values) =>
    +        makeInSet.lift(dataTypeOf(name)).map(_(name, values.toSet))
    +
    --- End diff --
    
    Thanks for adding this back :) I think this is worth a separate JIRA ticket to track. Would you mind to open one and add the ID to the PR title?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-148883952
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10895][SPARK-11164][SQL] Push down InSe...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-150905272
  
    Is that the case? I thought we load them one by one (or small batch at a time) and then apply the filter directly on them?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org