You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/11/11 04:04:23 UTC
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/23004
[SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down
## What changes were proposed in this pull request?
[SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds support for Parquet file `StartsWith` predicate push down.
`InMemoryTable` can also support this feature.
## How was this patch tested?
unit tests and benchmark tests
benchmark test result:
```
================================================================================================
Pushdown benchmark for StringStartsWith
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '10%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized 12068 / 14198 1.3 767.3 1.0X
InMemoryTable Vectorized (Pushdown) 5457 / 8662 2.9 347.0 2.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized 5246 / 5355 3.0 333.5 1.0X
InMemoryTable Vectorized (Pushdown) 2185 / 2346 7.2 138.9 2.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized 5112 / 5312 3.1 325.0 1.0X
InMemoryTable Vectorized (Pushdown) 2292 / 2522 6.9 145.7 2.2X
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-26004
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23004.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23004
----
commit 7bbdb0713056f387e49cf3921a226554e9af5557
Author: Yuming Wang <yu...@...>
Date: 2018-11-11T03:56:36Z
InMemoryTable support StartsWith predicate push down
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5168/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r233033597
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+ case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+ statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+ l <= statsFor(a).upperBound.substr(0, Length(l))
+ case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --
same question
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r232945864
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+ case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
--- End diff --
can you add some comment to explain it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r233272718
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+ case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+ statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+ l <= statsFor(a).upperBound.substr(0, Length(l))
+ case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --
Good question, The last one should be removed, `DataSourceStrategy` has the same logic: https://github.com/apache/spark/blob/3d6b68b030ee85a0f639dd8e9b68aedf5f27b46f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L512-L513
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r234857747
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+ case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
--- End diff --
Added to pr description.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23004
**[Test build #98689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99040/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23004
**[Test build #99040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99040/testReport)** for PR 23004 at commit [`0748deb`](https://github.com/apache/spark/commit/0748deb01a2f2ee5d17a814983b174eeebf3f6c1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/23004#discussion_r233012392
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+ case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+ statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+ l <= statsFor(a).upperBound.substr(0, Length(l))
+ case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --
BTW, a.startswith(b) and b.startswith(a) are not same but why are they same here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23004
**[Test build #98689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98689/testReport)** for PR 23004 at commit [`7bbdb07`](https://github.com/apache/spark/commit/7bbdb0713056f387e49cf3921a226554e9af5557).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23004
Looks fine to me
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98689/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23004
**[Test build #99040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99040/testReport)** for PR 23004 at commit [`0748deb`](https://github.com/apache/spark/commit/0748deb01a2f2ee5d17a814983b174eeebf3f6c1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/23004
cc @cloud-fan @HyukjinKwon @kiszk
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4921/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23004
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org