You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/11/11 04:04:23 UTC

[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/23004

    [SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down

    ## What changes were proposed in this pull request?
    
    [SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds support for Parquet file `StartsWith` predicate push down.
    `InMemoryTable` can also support this feature.
    
    
    ## How was this patch tested?
    
     unit tests and benchmark tests
    
    benchmark test result:
    ```
    ================================================================================================
    Pushdown benchmark for StringStartsWith
    ================================================================================================
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    InMemoryTable Vectorized                    12068 / 14198          1.3         767.3       1.0X
    InMemoryTable Vectorized (Pushdown)           5457 / 8662          2.9         347.0       2.2X
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    InMemoryTable Vectorized                      5246 / 5355          3.0         333.5       1.0X
    InMemoryTable Vectorized (Pushdown)           2185 / 2346          7.2         138.9       2.4X
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
    Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
    StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    InMemoryTable Vectorized                      5112 / 5312          3.1         325.0       1.0X
    InMemoryTable Vectorized (Pushdown)           2292 / 2522          6.9         145.7       2.2X
    ```
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-26004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23004
    
----
commit 7bbdb0713056f387e49cf3921a226554e9af5557
Author: Yuming Wang <yu...@...>
Date:   2018-11-11T03:56:36Z

    InMemoryTable support StartsWith predicate push down

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5168/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23004#discussion_r233033597
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
           if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
           list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
             l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
    +
    +    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
    +      statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
    +        l <= statsFor(a).upperBound.substr(0, Length(l))
    +    case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
    --- End diff --
    
    same question


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23004#discussion_r232945864
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
           if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
           list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
             l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
    +
    +    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
    --- End diff --
    
    can you add some comment to explain it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23004#discussion_r233272718
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
           if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
           list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
             l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
    +
    +    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
    +      statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
    +        l <= statsFor(a).upperBound.substr(0, Length(l))
    +    case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
    --- End diff --
    
    Good question, The last one should be removed, `DataSourceStrategy` has the same logic:  https://github.com/apache/spark/blob/3d6b68b030ee85a0f639dd8e9b68aedf5f27b46f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L512-L513


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23004#discussion_r234857747
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
           if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
           list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
             l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
    +
    +    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
    --- End diff --
    
    Added to pr description.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    **[Test build #98689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99040/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    **[Test build #99040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99040/testReport)** for PR 23004 at commit [`0748deb`](https://github.com/apache/spark/commit/0748deb01a2f2ee5d17a814983b174eeebf3f6c1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23004#discussion_r233012392
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
           if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
           list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
             l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
    +
    +    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
    +      statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
    +        l <= statsFor(a).upperBound.substr(0, Length(l))
    +    case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
    --- End diff --
    
    BTW,  a.startswith(b) and b.startswith(a) are not same but why are they same here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    **[Test build #98689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Looks fine to me


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98689/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    **[Test build #99040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99040/testReport)** for PR 23004 at commit [`0748deb`](https://github.com/apache/spark/commit/0748deb01a2f2ee5d17a814983b174eeebf3f6c1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    cc @cloud-fan @HyukjinKwon @kiszk


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4921/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23004
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org