You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/06/23 13:45:15 UTC

[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/21623

    [SPARK-24638][SQL] StringStartsWith support push down

    ## What changes were proposed in this pull request?
    
    `StringStartsWith` support push down. About 50% savings in compute time.
    
    ## How was this patch tested?
    unit tests and manual tests.
    Performance test:
    ```scala
    cat <<EOF > SPARK-24638.scala
    spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string")
    val df = spark.read.parquet("/tmp/spark/parquet/string/")
    spark.sql("set spark.sql.parquet.filterPushdown=true")
    val pushdownEnableStart = System.currentTimeMillis()
    for(i <- 0 until 100) {
      df.where("id like '999998%'").count()
    }
    val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart
    
    spark.sql("set spark.sql.parquet.filterPushdown=false")
    val pushdownDisableStart = System.currentTimeMillis()
    for(i <- 0 until 100) {
      df.where("id like '999998%'").count()
    }
    val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart
    
    val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble
    
    println(s"improvements: ${improvements}")
    
    EOF
    
    bin/spark-shell -i SPARK-24638.scala
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24638

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21623
    
----
commit 5b52ace44c8a41631535c883b7a5c8545959e5e5
Author: Yuming Wang <yu...@...>
Date:   2018-06-23T13:27:30Z

    StringStartsWith support push down

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199043210
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -22,16 +22,23 @@ import java.sql.Date
     import org.apache.parquet.filter2.predicate._
     import org.apache.parquet.filter2.predicate.FilterApi._
     import org.apache.parquet.io.api.Binary
    +import org.apache.parquet.schema.PrimitiveComparator
     
     import org.apache.spark.sql.catalyst.util.DateTimeUtils
     import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.sources
     import org.apache.spark.sql.types._
    +import org.apache.spark.unsafe.types.UTF8String
     
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean) {
    +private[parquet] class ParquetFilters() {
    +
    +  val sqlConf: SQLConf = SQLConf.get
    --- End diff --
    
    You are right. I hit a bug here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92474/testReport)** for PR 21623 at commit [`800fde7`](https://github.com/apache/spark/commit/800fde7b074c3d6bffeef8feebd344c0c26b3168).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by attilapiros <gi...@git.apache.org>.
Github user attilapiros commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198245076
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    There is one rare case when you can drop it with inverse when both min and max starts the perfix. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199078056
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -378,6 +378,14 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)
     
    +  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
    +    buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
    +    .doc("If true, enables Parquet filter push-down optimization for string starts with. " +
    --- End diff --
    
    `for string startsWith function`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r197992151
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -660,6 +660,30 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           assert(df.where("col > 0").count() === 2)
         }
       }
    +
    +  test("filter pushdown - StringStartsWith") {
    +    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
    +      Seq("2", "2s", "2st", "2str", "2str2").foreach { prefix =>
    +        checkFilterPredicate(
    +          '_1.startsWith(prefix).asInstanceOf[Predicate],
    +          classOf[UserDefinedByInstance[_, _]],
    +          "2str2")
    +      }
    +
    +      Seq("2S", "null", "2str22").foreach { prefix =>
    +        checkFilterPredicate(
    +          '_1.startsWith(prefix).asInstanceOf[Predicate],
    +          classOf[UserDefinedByInstance[_, _]],
    +          Seq.empty[Row])
    +      }
    +
    +      assertResult(None) {
    +        parquetFilters.createFilter(
    +          df.schema,
    +          sources.StringStartsWith("_1", null))
    --- End diff --
    
    Thanks @attilapiros , `sources.StringStartsWith("_1", null)` will not matches them, same as before.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by attilapiros <gi...@git.apache.org>.
Github user attilapiros commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198269733
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    @rdblue oh, sorry I have not seen your reply. Yes, in that case we can and you are right it is worth to do.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92454/testReport)** for PR 21623 at commit [`536610e`](https://github.com/apache/spark/commit/536610e2de3addfbb570b1f17e2f854febc70080).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92454/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/483/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/566/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92258/testReport)** for PR 21623 at commit [`5b52ace`](https://github.com/apache/spark/commit/5b52ace44c8a41631535c883b7a5c8545959e5e5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21623


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92362/testReport)** for PR 21623 at commit [`4f25a33`](https://github.com/apache/spark/commit/4f25a331836d0c093b96a791a2c7a860d6f78a21).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198551889
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           assert(df.where("col > 0").count() === 2)
         }
       }
    +
    +  test("filter pushdown - StringStartsWith") {
    +    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
    --- End diff --
    
    I think that all of these tests go through the `keep` method instead of the `canDrop` and `inverseCanDrop`. I think those methods need to be tested. You can do that by constructing a Parquet file with row groups that have predictable statistics, but that would be difficult. An easier way to do this is to define the predicate class elsewhere and create a unit test for it that passes in different statistics values.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by attilapiros <gi...@git.apache.org>.
Github user attilapiros commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198241792
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    No. 
    
    Let me illustrate this with an example: let's assume min="BBB", max="DDD" canDrop() means if your prefix is before "BBB" (like "A") we can stop as there is no reason to search within this range. This is also true for prefixes after "DDD" (like "E").
    
    Now if your operator is negated. What can you say when your prefix is "C" and the range is "BBB" and "DDD"? Can you drop it? No. And if the prefix is "A" or "E". Still not. You see you should check the range.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by stanzhai <gi...@git.apache.org>.
Github user stanzhai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199062132
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -378,6 +378,14 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)
     
    +  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
    +    buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
    --- End diff --
    
    It would be better if we added `.enabled` postfix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92362/testReport)** for PR 21623 at commit [`4f25a33`](https://github.com/apache/spark/commit/4f25a331836d0c093b96a791a2c7a860d6f78a21).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92449/testReport)** for PR 21623 at commit [`e959d1a`](https://github.com/apache/spark/commit/e959d1abb43a60dd8dae9ef8b716e616012b2a6c).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Benchmark result:
    ```
    ###########################[ Pushdown benchmark for StringStartsWith ]###########################
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    
    StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                          10104 / 11125          1.6         642.4       1.0X
    Parquet Vectorized (Pushdown)                 3002 / 3608          5.2         190.8       3.4X
    Native ORC Vectorized                        9589 / 10454          1.6         609.7       1.1X
    Native ORC Vectorized (Pushdown)             9798 / 10509          1.6         622.9       1.0X
    
    StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            8437 / 8563          1.9         536.4       1.0X
    Parquet Vectorized (Pushdown)                  279 /  289         56.3          17.8      30.2X
    Native ORC Vectorized                         7354 / 7568          2.1         467.5       1.1X
    Native ORC Vectorized (Pushdown)              7730 / 7972          2.0         491.4       1.1X
    
    StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet Vectorized                            8290 / 8510          1.9         527.0       1.0X
    Parquet Vectorized (Pushdown)                  260 /  272         60.5          16.5      31.9X
    Native ORC Vectorized                         7361 / 7395          2.1         468.0       1.1X
    Native ORC Vectorized (Pushdown)              7694 / 7811          2.0         489.2       1.1X
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92258/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198230713
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    Why can't this evaluate the inverse of `StartsWith`? If the min and max values exclude the prefix, then this should be able to filter.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92342/testReport)** for PR 21623 at commit [`02f41cc`](https://github.com/apache/spark/commit/02f41cc7190ba5280e8575ec540b1445476242ef).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by attilapiros <gi...@git.apache.org>.
Github user attilapiros commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Just a question regarding  `sources.StringStartsWith("_1", null))`: if you have a nullable string column and some of the values are null will this operator (parameterized with null) matches against them? 
    
    LGTM otherwise.
    
    
     


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92463/testReport)** for PR 21623 at commit [`536610e`](https://github.com/apache/spark/commit/536610e2de3addfbb570b1f17e2f854febc70080).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/498/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92463/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92463/testReport)** for PR 21623 at commit [`536610e`](https://github.com/apache/spark/commit/536610e2de3addfbb570b1f17e2f854febc70080).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92342/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199081727
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           assert(df.where("col > 0").count() === 2)
         }
       }
    +
    +  test("filter pushdown - StringStartsWith") {
    +    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
    +      // Test canDrop()
    --- End diff --
    
    to confirm, do they test `canDrop` or `keep`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/586/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92454/testReport)** for PR 21623 at commit [`536610e`](https://github.com/apache/spark/commit/536610e2de3addfbb570b1f17e2f854febc70080).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r197988392
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +272,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if canMakeFilterOn(name) =>
    --- End diff --
    
    What do you think about adding a configuration to control this and set it true by default? It's basically dependent on an user defined predicate we manually wrote here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199078944
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +273,36 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
    +              }
    +
    +              override def keep(value: Binary): Boolean = {
    +                UTF8String.fromBytes(value.getBytes).startsWith(UTF8String.fromString(v))
    --- End diff --
    
    `UTF8String.fromString(v)` -> `UTF8String.fromBytes(strToBinary.getBytes)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92449/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    @cloud-fan . AFAIK, ORC doesn't support custom filter yet. I'll follow up that stuff in ORC.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92362/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    cc @gszadovszky @nandorKollar


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198553569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -22,16 +22,23 @@ import java.sql.Date
     import org.apache.parquet.filter2.predicate._
     import org.apache.parquet.filter2.predicate.FilterApi._
     import org.apache.parquet.io.api.Binary
    +import org.apache.parquet.schema.PrimitiveComparator
     
     import org.apache.spark.sql.catalyst.util.DateTimeUtils
     import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate
    +import org.apache.spark.sql.internal.SQLConf
     import org.apache.spark.sql.sources
     import org.apache.spark.sql.types._
    +import org.apache.spark.unsafe.types.UTF8String
     
     /**
      * Some utility function to convert Spark data source filters to Parquet filters.
      */
    -private[parquet] class ParquetFilters(pushDownDate: Boolean) {
    +private[parquet] class ParquetFilters() {
    +
    +  val sqlConf: SQLConf = SQLConf.get
    --- End diff --
    
    This should pass in `pushDownDate` and `pushDownStartWith` like the previous version did with just the date setting.
    
    The SQLConf is already available in ParquetFileFormat and it *would* be better to pass it in. The problem is that this class is instantiated in the function (`(file: PartitionedFile) => { ... }`) that gets serialized and sent to executors. That means we don't want SQLConf and its references in the function's closure. The way we got around this before was to put boolean config vals in the closure instead. I think you should go with that approach.
    
    I'm not sure what `SQLConf.get` is for or what a correct use would be. @gatorsmile, can you comment on use of `SQLConf.get`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92342/testReport)** for PR 21623 at commit [`02f41cc`](https://github.com/apache/spark/commit/02f41cc7190ba5280e8575ec540b1445476242ef).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198244664
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    Sorry, I meant if the min and max both *include* the prefix, then we should be able to drop the range. The situation is where both min and max match, so all values must also match the filter. If we are looking for values that do not match the filter, then we can eliminate the row group.
    
    The example is prefix=CCC and values are between min=CCCa and max=CCCZ: all values start with CCC, so the entire row group can be skipped.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    thanks, merging to master!
    
    @dongjoon-hyun is there something similar in ORC?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92449/testReport)** for PR 21623 at commit [`e959d1a`](https://github.com/apache/spark/commit/e959d1abb43a60dd8dae9ef8b716e616012b2a6c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92258/testReport)** for PR 21623 at commit [`5b52ace`](https://github.com/apache/spark/commit/5b52ace44c8a41631535c883b7a5c8545959e5e5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199043411
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           assert(df.where("col > 0").count() === 2)
         }
       }
    +
    +  test("filter pushdown - StringStartsWith") {
    +    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
    --- End diff --
    
    Added `testStringStartsWith` to test that exactly go through the `canDrop` and `inverseCanDrop`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/576/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199116993
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
           assert(df.where("col > 0").count() === 2)
         }
       }
    +
    +  test("filter pushdown - StringStartsWith") {
    +    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
    +      // Test canDrop()
    --- End diff --
    
    Both methods have been executed but it can't be confirmed which method  has taken effect.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    cc @rdblue 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    **[Test build #92474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92474/testReport)** for PR 21623 at commit [`800fde7`](https://github.com/apache/spark/commit/800fde7b074c3d6bffeef8feebd344c0c26b3168).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r199080962
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
    @@ -140,6 +142,32 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
         checkBinaryFilterPredicate(predicate, filterClass, Seq(Row(expected)))(df)
       }
     
    +  // This function tests that exactly go through the `canDrop` and `inverseCanDrop`.
    +  private def testStringStartsWith(dataFrame: DataFrame, filter: String): Unit = {
    +    withTempPath { dir =>
    +      val path = dir.getCanonicalPath
    +      dataFrame.write.option("parquet.block.size", 512).parquet(path)
    +      Seq(true, false).foreach { pushDown =>
    +        withSQLConf(
    +          SQLConf.PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED.key -> pushDown.toString) {
    +          val accu = new NumRowGroupsAcc
    +          sparkContext.register(accu)
    +
    +          val df = spark.read.parquet(path).filter(filter)
    +          df.foreachPartition((it: Iterator[Row]) => it.foreach(v => accu.add(0)))
    +          df.collect
    --- End diff --
    
    what does this `collect` do? `foreachPartition` is already an action


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/438/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/561/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Jenkins, retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Overall, I think this is close. The tests need to cover the row group stats case and we should update how configuration is passed to the filters. Thanks for working on this, @wangyum!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21623
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92474/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org