Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/03 16:59:15 UTC
[GitHub] [spark] wangyum opened a new pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
wangyum opened a new pull request #29642:
URL: https://github.com/apache/spark/pull/29642
### What changes were proposed in this pull request?
Improve `In` filter pushdown in `ParquetFilters`:
- Remove the `distinct` operation, because it is expensive and duplicate values should already have been removed by `OptimizeIn`.
```scala
import org.apache.spark.benchmark.Benchmark
val N = 5000000
val array = Range(1, N).toArray
val benchmark = new Benchmark(s"Benchmark distinct", valuesPerIteration = N, minNumIters = 30)
benchmark.addCase("array.length") { _ =>
array.length
}
benchmark.addCase("array.distinct.length") { _ =>
array.distinct.length
}
benchmark.run()
```
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.6
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark distinct:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
array.length                                          0              0           0   48076923.1           0.0       1.0X
array.distinct.length                               711           1389         498          7.0         142.2       0.0X
```
- Add an empty check because `values` may be empty.
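The empty check in the last bullet can be illustrated with a minimal sketch. It assumes, for illustration only, that an `In` filter is pushed down as a chain of `OR`-ed equality predicates. `Pred`, `Eq`, `Or`, and `inToOr` are hypothetical names and are not Spark or Parquet APIs.

```scala
// Hypothetical stand-ins for pushed-down filter predicates (not Spark/Parquet classes).
sealed trait Pred
case class Eq(column: String, value: Any) extends Pred
case class Or(left: Pred, right: Pred) extends Pred

object InPushdownSketch {
  // Builds `col = v1 OR col = v2 OR ...` from the IN-list values.
  // An empty list yields None: calling reduceLeft on an empty collection
  // would throw UnsupportedOperationException, so it is guarded explicitly.
  def inToOr(column: String, values: Seq[Any]): Option[Pred] =
    if (values.isEmpty) None
    else Some(values.map(v => Eq(column, v): Pred).reduceLeft(Or(_, _)))
}
```

With this guard, an empty `values` simply produces no pushed-down predicate instead of failing while the filter is being built.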
### Why are the changes needed?
Enhance code robustness.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742603252
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37178/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735503695
**[Test build #131941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738973065
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132235/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736322
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36821/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737596383
**[Test build #132077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738804876
retest this please.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737610313
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36676/
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287
Thank you for updating, @wangyum. As of the last commit, yes, I agree that this PR does not appear to introduce a regression.
One last question: could you point out what improvement this PR brings as of the last commit? It's not clear to me. Do we need to add a specific benchmark case for your contribution?
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312379
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739511388
**[Test build #132295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628965380
##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
Pushdown for many distinct value case
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 0 string row (value IS NULL): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
-Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
-Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
-Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 0 string row ('7864320' < value < '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10406 10461 50 1.5 661.6 1.0X
-Parquet Vectorized (Pushdown) 619 641 22 25.4 39.4 16.8X
-Native ORC Vectorized 8787 8834 57 1.8 558.6 1.2X
-Native ORC Vectorized (Pushdown) 592 608 11 26.6 37.6 17.6X
+Parquet Vectorized 9861 9880 16 1.6 626.9 1.0X
+Parquet Vectorized (Pushdown) 507 529 21 31.0 32.3 19.4X
+Native ORC Vectorized 6871 6938 63 2.3 436.8 1.4X
+Native ORC Vectorized (Pushdown) 453 470 13 34.7 28.8 21.8X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 1 string row (value = '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10632 10694 60 1.5 676.0 1.0X
-Parquet Vectorized (Pushdown) 608 635 22 25.9 38.6 17.5X
-Native ORC Vectorized 8790 8838 37 1.8 558.9 1.2X
-Native ORC Vectorized (Pushdown) 559 584 22 28.1 35.5 19.0X
+Parquet Vectorized 10228 10471 167 1.5 650.3 1.0X
+Parquet Vectorized (Pushdown) 511 519 5 30.8 32.5 20.0X
+Native ORC Vectorized 6700 6865 119 2.3 426.0 1.5X
+Native ORC Vectorized (Pushdown) 436 454 12 36.1 27.7 23.5X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 1 string row (value <=> '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10529 10624 74 1.5 669.4 1.0X
-Parquet Vectorized (Pushdown) 613 631 16 25.7 39.0 17.2X
-Native ORC Vectorized 8746 8816 63 1.8 556.1 1.2X
-Native ORC Vectorized (Pushdown) 589 600 11 26.7 37.5 17.9X
+Parquet Vectorized 10287 10449 144 1.5 654.0 1.0X
+Parquet Vectorized (Pushdown) 467 494 20 33.7 29.7 22.0X
+Native ORC Vectorized 6781 6848 58 2.3 431.1 1.5X
+Native ORC Vectorized (Pushdown) 428 440 10 36.8 27.2 24.1X
Review comment:
No. GitHub Actions runs on different machines, and there is a performance difference between them.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042
> No. GitHub Actions runs on different machines, and there is a performance difference between them.
No, @wangyum. I mean the **ratio** between ORC and Parquet within the same machine run. Previously, ORC and Parquet showed similar performance, but after this PR Parquet has become slower than ORC. For example:
```
- Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
- Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
- Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
- Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+ Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+ Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+ Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+ Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
```
Although the values are small, the generated result shows Parquet slowing down relative to ORC. That was my question.
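As a rough illustration of the ratio in question, the sketch below divides the Parquet best time by the ORC best time for the pushdown rows quoted above. The `Run` and `RatioCheck` names are hypothetical helpers for this illustration, and only the four best-time values are taken from the benchmark output.

```scala
// Best Time (ms) for the "(Pushdown)" rows, copied from the quoted benchmark output.
final case class Run(parquetMs: Double, orcMs: Double) {
  // A ratio above 1.0 means Parquet took longer than ORC within the same run.
  def parquetToOrc: Double = parquetMs / orcMs
}

object RatioCheck {
  val pushdownBefore: Run = Run(parquetMs = 596, orcMs = 592)
  val pushdownAfter: Run = Run(parquetMs = 493, orcMs = 436)

  def main(args: Array[String]): Unit = {
    println(f"before: ${pushdownBefore.parquetToOrc}%.3f")
    println(f"after:  ${pushdownAfter.parquetToOrc}%.3f")
  }
}
```

The ratio moves from about 1.007 to about 1.131. Because both numbers in each ratio come from the same run, a difference between machines alone would not be expected to change it.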
[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803818662
Shall we do this at an earlier stage so that it applies to both Hive partition pruning and data predicates?
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809862832
**[Test build #136682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736356
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36821/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841186474
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43078/
[GitHub] [spark] dongjoon-hyun closed pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199
**[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300576
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835877675
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138307/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729927486
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835833242
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738811606
**[Test build #132235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841160451
**[Test build #138557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834270229
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42769/
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737596383
**[Test build #132077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840259632
@dongjoon-hyun Do you have more comments?
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199
**[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738973065
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132235/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835836092
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42829/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809833487
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41258/
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735523298
**[Test build #131953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729275927
**[Test build #131236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742603252
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37178/
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834234834
**[Test build #138247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841781338
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138582/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809925233
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136682/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686782831
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r483336491
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
Review comment:
Hm, it's actually better not to rely on the optimizer. N = 5000000 in your benchmark looks a bit unrealistic, and the perf degradation doesn't seem very serious compared to the execution time.
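For readers following the hunk above: the removed lines call `values.distinct` twice and use `values.head` with no emptiness guard. A minimal, Spark-free sketch of one possible middle ground (keep the de-duplication instead of relying on the optimizer, compute it once, and guard the empty case) might look like the following. `InFilterSketch` and `valuesToPush` are hypothetical names used only for illustration; they are not part of `ParquetFilters`, and this is not necessarily what the PR ended up doing.

```scala
// Hypothetical, Spark-free model of the decision made in the `In` branch.
object InFilterSketch {
  // Returns the values that would be turned into equality predicates,
  // or None when nothing should be pushed down.
  def valuesToPush[T](values: Seq[T], threshold: Int): Option[Seq[T]] = {
    if (values.isEmpty) {
      // `values.head` would throw on an empty list, so bail out early.
      None
    } else {
      // De-duplicate once and reuse the result for both the check and the push.
      val distinctValues = values.distinct
      if (distinctValues.length <= threshold) Some(distinctValues) else None
    }
  }
}
```

With a threshold of 2, `Seq(1, 1, 2)` still qualifies because only the distinct values are counted, while an empty list never reaches `head`.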
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742566759
**[Test build #132574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409169
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809917727
**[Test build #136682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
* This patch **fails Spark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742597635
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628066242
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##########
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+ Seq(3, 20).foreach { threshold =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {
Review comment:
Done
[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982214
uhoh .. seems like there's a logical conflict with https://github.com/apache/spark/pull/31776:
```
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:489:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error] case ParquetSchemaType(DECIMAL, INT32, _, _) if pushDownDecimal =>
[error] ^
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:496:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error] case ParquetSchemaType(DECIMAL, INT64, _, _) if pushDownDecimal =>
[error] ^
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:503:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error] case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, length, _) if pushDownDecimal =>
[error] ^
[warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:127:39: [deprecation @ org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
[warn] && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
[warn] ^
[warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWrite.scala:89:39: [deprecation @ org.apache.spark.sql.execution.datasources.v2.parquet.ParquetWrite.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
[warn] && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
```
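The errors above say the extractor pattern was written with four sub-patterns while `ParquetSchemaType` is reported as having three fields (`logicalTypeAnnotation`, `primitiveTypeName`, `length`). A small stand-alone sketch of that kind of mismatch, using a hypothetical `SchemaType` case class with plain strings in place of the real Parquet types, could look like this:

```scala
// Hypothetical stand-in for illustration; the real class uses Parquet types.
object PatternAritySketch {
  // Three fields, mirroring the names shown in the compiler message.
  final case class SchemaType(
      logicalTypeAnnotation: String,
      primitiveTypeName: String,
      length: Int)

  def describe(t: SchemaType): String = t match {
    // A four-slot pattern such as
    //   case SchemaType("DECIMAL", "INT32", _, _) => ...
    // does not compile against a three-field case class
    // ("wrong number of arguments for pattern").
    case SchemaType("DECIMAL", "INT32", _) => "decimal stored as int32"
    case SchemaType("DECIMAL", "INT64", _) => "decimal stored as int64"
    case SchemaType("DECIMAL", "FIXED_LEN_BYTE_ARRAY", length) =>
      s"decimal stored in $length bytes"
    case _ => "other"
  }
}
```

Two changes can each merge cleanly as text and still break the build this way if one alters the number of fields and the other adds patterns written against the old shape.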
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729777039
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300592
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43103/
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628899597
##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
Pushdown for many distinct value case
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                              10512          10572          58         1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                     596            621          19        26.4          37.9      17.6X
-Native ORC Vectorized                            8555           8723          97         1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                  592            609          11        26.6          37.7      17.8X
+Parquet Vectorized                               9788          10231         259         1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                     493            536          29        31.9          31.3      19.9X
+Native ORC Vectorized                            6487           6575         137         2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                  436            447          14        36.1          27.7      22.4X
Review comment:
Just a question. Do these results show a Parquet performance regression? This PR touches Parquet only, and the ratios for Parquet and ORC were 17.6x and 17.8x respectively before. Now they are 19.9x and 22.4x. I expected them to stay similar, e.g. `19.x and 19.x` or `22.x and 22.x`.
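As a worked check on the numbers being compared: assuming the `Relative` column is each row's best time measured against the first row of the table (the figures in this hunk are consistent with that reading), the ratios can be recomputed from the `Best Time(ms)` values. The sketch below does that, and also computes each format's pushdown speedup against its own non-pushdown row. `RelativeColumnSketch`, `Run`, and `speedup` are hypothetical names for illustration.

```scala
object RelativeColumnSketch {
  // Ratio of two best times, rounded to one decimal place like the table.
  def speedup(baselineBestMs: Double, bestMs: Double): Double =
    math.round(baselineBestMs / bestMs * 10) / 10.0

  // Best Time(ms) values copied from the hunk above.
  final case class Run(
      parquet: Double,
      parquetPushdown: Double,
      orc: Double,
      orcPushdown: Double)

  val before: Run = Run(10512.0, 596.0, 8555.0, 592.0)
  val after: Run = Run(9788.0, 493.0, 6487.0, 436.0)

  def main(args: Array[String]): Unit = {
    for ((label, r) <- Seq("before" -> before, "after" -> after)) {
      // Same baseline as the Relative column: the first (Parquet) row.
      println(s"$label Parquet pushdown vs Parquet: ${speedup(r.parquet, r.parquetPushdown)}")
      println(s"$label ORC pushdown vs Parquet:     ${speedup(r.parquet, r.orcPushdown)}")
      // Each format against its own non-pushdown row.
      println(s"$label ORC pushdown vs ORC:         ${speedup(r.orc, r.orcPushdown)}")
    }
  }
}
```

Measured against its own baseline, ORC pushdown moves from about 14.5x to 14.9x, while the non-pushdown ORC row itself goes from 1.2X to 1.5X relative to Parquet. One possible reading, not a conclusion drawn in the thread, is that part of the 22.4X figure reflects the shifted ORC baseline on the new machine rather than a change in Parquet.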
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729746475
**[Test build #131289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559980
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716153233
**[Test build #130244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130244/testReport)** for PR 29642 at commit [`c5ab656`](https://github.com/apache/spark/commit/c5ab6569f4b175066613d02b787dc8aaa83ca8d9).
[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803808606
Is this patch still needed? IIRC we already have this in hive partition pruning.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882190
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41264/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737640996
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535237
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132295/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742735942
**[Test build #132574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626143
**[Test build #128266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735553226
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852315
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36835/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809818374
**[Test build #136676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
--
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287
Thank you for updating, @wangyum. As of the last commit, yes, I agree that this PR does not appear to introduce a regression.
One last question: could you point out what improvement this PR brings in the last commit? It's not clear to me. Do we need to add a specific benchmark case for your contribution?
--
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628899709
##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
Pushdown for many distinct value case
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 0 string row (value IS NULL): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
-Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
-Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
-Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 0 string row ('7864320' < value < '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10406 10461 50 1.5 661.6 1.0X
-Parquet Vectorized (Pushdown) 619 641 22 25.4 39.4 16.8X
-Native ORC Vectorized 8787 8834 57 1.8 558.6 1.2X
-Native ORC Vectorized (Pushdown) 592 608 11 26.6 37.6 17.6X
+Parquet Vectorized 9861 9880 16 1.6 626.9 1.0X
+Parquet Vectorized (Pushdown) 507 529 21 31.0 32.3 19.4X
+Native ORC Vectorized 6871 6938 63 2.3 436.8 1.4X
+Native ORC Vectorized (Pushdown) 453 470 13 34.7 28.8 21.8X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 1 string row (value = '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10632 10694 60 1.5 676.0 1.0X
-Parquet Vectorized (Pushdown) 608 635 22 25.9 38.6 17.5X
-Native ORC Vectorized 8790 8838 37 1.8 558.9 1.2X
-Native ORC Vectorized (Pushdown) 559 584 22 28.1 35.5 19.0X
+Parquet Vectorized 10228 10471 167 1.5 650.3 1.0X
+Parquet Vectorized (Pushdown) 511 519 5 30.8 32.5 20.0X
+Native ORC Vectorized 6700 6865 119 2.3 426.0 1.5X
+Native ORC Vectorized (Pushdown) 436 454 12 36.1 27.7 23.5X
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 1 string row (value <=> '7864320'): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10529 10624 74 1.5 669.4 1.0X
-Parquet Vectorized (Pushdown) 613 631 16 25.7 39.0 17.2X
-Native ORC Vectorized 8746 8816 63 1.8 556.1 1.2X
-Native ORC Vectorized (Pushdown) 589 600 11 26.7 37.5 17.9X
+Parquet Vectorized 10287 10449 144 1.5 654.0 1.0X
+Parquet Vectorized (Pushdown) 467 494 20 33.7 29.7 22.0X
+Native ORC Vectorized 6781 6848 58 2.3 431.1 1.5X
+Native ORC Vectorized (Pushdown) 428 440 10 36.8 27.2 24.1X
Review comment:
ditto. `17 vs 17` -> `22 vs 24`.
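The `Relative` figures quoted here look like the first row's best time divided by each row's best time. A quick check against the `<=>` rows above, assuming that definition (`RelativeSpeedupSketch` and `relative` are illustrative names, not part of Spark's benchmark code):

```scala
import java.util.Locale

object RelativeSpeedupSketch {
  // Assumed definition: baseline best time divided by this case's best time,
  // rendered with one decimal place like the benchmark output.
  def relative(baselineBestMs: Double, caseBestMs: Double): String =
    "%.1fX".formatLocal(Locale.ROOT, baselineBestMs / caseBestMs)
}

// Old results: Parquet baseline 10529 ms; pushdown rows 613 ms (Parquet), 589 ms (ORC).
println(RelativeSpeedupSketch.relative(10529, 613))
println(RelativeSpeedupSketch.relative(10529, 589))
// New results: Parquet baseline 10287 ms; pushdown rows 467 ms (Parquet), 428 ms (ORC).
println(RelativeSpeedupSketch.relative(10287, 467))
println(RelativeSpeedupSketch.relative(10287, 428))
```

Note that the JDK build, kernel, and CPU model also differ between the two runs, so the ratios are not a like-for-like comparison.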
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834491086
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138247/
--
[GitHub] [spark] gengliangwang commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731904386
@wangyum @cloud-fan @HyukjinKwon
I have some concerns about this optimization. What if the range is huge, so that the rewritten filter becomes much less selective? E.g.
```
SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, Int.Max)
```
=>
```
SELECT * FROM t WHERE id >= 1 AND id <= ${Int.Max}
```
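To make the concern concrete, here is a minimal sketch of collapsing an IN-list into an inclusive min/max range and measuring how little of that range the original list covers. `InToRangeSketch`, `toRange` and `matchedFraction` are hypothetical names for illustration only; this is not the code in the PR.

```scala
object InToRangeSketch {
  // Collapse an IN-list into an inclusive [min, max] range; None for an empty list.
  def toRange(values: Seq[Int]): Option[(Int, Int)] =
    if (values.isEmpty) None else Some((values.min, values.max))

  // Fraction of the integers inside [min, max] that the IN-list actually names.
  // Long arithmetic avoids overflow when the range spans up to Int.MaxValue.
  def matchedFraction(values: Seq[Int]): Double = toRange(values) match {
    case Some((lo, hi)) => values.distinct.size.toDouble / (hi.toLong - lo.toLong + 1L)
    case None           => 0.0
  }
}

val inList = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, Int.MaxValue)
println(InToRangeSketch.toRange(inList))          // the range bounds
println(InToRangeSketch.matchedFraction(inList))  // a tiny fraction: the range filter prunes almost nothing
```

For a dense list such as `Seq(1, 2, 3)` the fraction is 1.0, which is the case where a range filter loses nothing.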
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735560173
----------------------------------------------------------------
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716246936
cc @cloud-fan @HyukjinKwon @gengliangwang
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626800
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729746475
**[Test build #131289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300592
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809880752
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136676/
--
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528436623
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
Review comment:
It seems that only Parquet lacks good support for `In` predicate pushdown.
Parquet vs ORC: https://github.com/apache/spark/blob/f5118f81e395bde0cd8253dbef6a9e6455c3958a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L439-L482
CSV:
https://github.com/apache/spark/pull/29642#issuecomment-730869008
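For context, the removed lines in the hunk above turn `In` into one equality per distinct value, joined with OR, and the guard calls `values.head`, which throws on an empty list. Below is a dependency-free sketch of that shape with an explicit empty-list case. `Pred`, `Eq`, `Or` and `inAsOrChain` are hypothetical stand-ins, not the Parquet `FilterApi` or the replacement code in this PR.

```scala
object InPushdownSketch {
  // Tiny stand-in predicate model (an assumption for illustration).
  sealed trait Pred
  final case class Eq(column: String, value: Any) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // One equality per distinct value, folded left with OR.
  // Returns None when the list is empty or has more distinct values than the threshold.
  def inAsOrChain(column: String, values: Seq[Any], threshold: Int): Option[Pred] = {
    val distinctValues = values.distinct
    if (distinctValues.length > threshold) None
    else distinctValues.map(v => Eq(column, v): Pred).reduceLeftOption[Pred](Or(_, _))
  }
}

println(InPushdownSketch.inAsOrChain("id", Seq(1, 2, 2), threshold = 10))
println(InPushdownSketch.inAsOrChain("id", Seq.empty, threshold = 10))
```

`reduceLeftOption` yields `None` for an empty list, so no separate `values.head` access is needed in this sketch.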
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809836947
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41258/
--
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735553226
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735560173
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716162655
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882190
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41264/
--
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738672069
**[Test build #132221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736356
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36821/
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841002294
Yeah, these benchmark results were not updated in time. Let's post the benchmark results from before and after this PR in the PR description.
--
[GitHub] [spark] dongjoon-hyun closed pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642
--
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835826320
**[Test build #138307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
--
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841160451
**[Test build #138557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).
--
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832411899
**[Test build #138164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841296880
**[Test build #138557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
.doc("The maximum number of values to filter push-down optimization for IN predicate. " +
- "Large threshold won't necessarily provide much better performance. " +
- "The experiment argued that 300 is the limit threshold. " +
+ "Spark will push-down a value greater than or equal to its minimum value and " +
Review comment:
I think the default value `10` is too small here. What is the default threshold in Impala?
----------------------------------------------------------------
[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
.doc("The maximum number of values to filter push-down optimization for IN predicate. " +
- "Large threshold won't necessarily provide much better performance. " +
- "The experiment argued that 300 is the limit threshold. " +
+ "Spark will push-down a value greater than or equal to its minimum value and " +
Review comment:
I think the default value `10` is small here. What is the default threshold in Impala?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809880752
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136676/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535237
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132295/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735495202
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008
It seems that only Parquet does not support `In` predicate pushdown well. @MaxGekk What do you think?
This is the CSV benchmark:
```scala
val rowsNum = 100 * 1000
val numIters = 3
val colsNum = 100
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
val schema = StructType(StructField("key", IntegerType) +: fields)
def columns(): Seq[Column] = {
  val ts = Seq.tabulate(colsNum) { i =>
    lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
  }
  ($"id" % 1000).as("key") +: ts
}
withTempPath { path =>
  spark.range(rowsNum).select(columns(): _*)
    .write.option("header", true)
    .csv(path.getAbsolutePath)
  def readback = {
    spark.read
      .option("header", true)
      .schema(schema)
      .csv(path.getAbsolutePath)
  }
  def withFilter(filer: String, configEnabled: Boolean): Unit = {
    withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
      readback.filter(filer).noop()
    }
  }
  Seq(5, 10, 50, 100, 500).foreach { count =>
    Seq(10, 50).foreach { distribution =>
      val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
      val benchmark = new Benchmark(title, rowsNum, output = output)
      Seq(false, true).foreach { pushDownEnabled =>
        val name = s"Native CSV Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
        benchmark.addCase(name, numIters) { _ =>
          val filter =
            Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
          val whereExpr = s"key in(${filter.mkString(",")})"
          withFilter(whereExpr, configEnabled = pushDownEnabled)
        }
      }
      benchmark.run()
    }
  }
}
```
Result:
```
================================================================================================
Benchmark to measure CSV read performance
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 13082 17077 1674 0.0 130815.6 1.0X
Native CSV Vectorized (Pushdown) 1172 1192 35 0.1 11719.5 11.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11858 12028 237 0.0 118576.9 1.0X
Native CSV Vectorized (Pushdown) 1165 1172 6 0.1 11652.4 10.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11883 12180 494 0.0 118834.3 1.0X
Native CSV Vectorized (Pushdown) 1142 1156 21 0.1 11418.6 10.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11857 11878 19 0.0 118570.4 1.0X
Native CSV Vectorized (Pushdown) 1169 1174 7 0.1 11692.9 10.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11923 11962 66 0.0 119228.0 1.0X
Native CSV Vectorized (Pushdown) 1196 1225 26 0.1 11960.7 10.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11910 11917 7 0.0 119095.3 1.0X
Native CSV Vectorized (Pushdown) 1191 1194 5 0.1 11908.0 10.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11948 12097 201 0.0 119484.5 1.0X
Native CSV Vectorized (Pushdown) 1250 1284 32 0.1 12501.4 9.6X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11938 11978 39 0.0 119378.8 1.0X
Native CSV Vectorized (Pushdown) 1176 1188 11 0.1 11756.0 10.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11954 12051 124 0.0 119542.9 1.0X
Native CSV Vectorized (Pushdown) 1762 1833 104 0.1 17620.6 6.8X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized 11860 12166 484 0.0 118597.8 1.0X
Native CSV Vectorized (Pushdown) 1417 1434 15 0.1 14171.7 8.4X
```
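The derived columns in these tables follow from the best time and the row count. Below is a small sketch of that arithmetic. It assumes `Relative` is the first case's best time divided by the current case's best time, and `Per Row(ns)` is the best time spread over `rowsNum` rows. The object and method names are illustrative and are not part of Spark's `Benchmark` class.

```scala
import java.util.Locale

object BenchmarkColumnsSketch {
  /** Per-row cost in nanoseconds for a run that took `bestMs` milliseconds over `rows` rows. */
  def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

  /** Speedup of a case over the baseline, formatted like the "Relative" column. */
  def relative(baselineBestMs: Double, caseBestMs: Double): String =
    "%.1fX".formatLocal(Locale.ROOT, baselineBestMs / caseBestMs)
}
```

For the first table, 13082 ms against 1172 ms gives 11.2X. For 500 values at distribution 10, 11954 ms against 1762 ms gives 6.8X. The benefit of CSV pushdown therefore shrinks as the IN list grows, but it stays well above the baseline without pushdown.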
[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287
Thank you for updating, @wangyum. Yes, I agree that as of the last commit this PR does not appear to introduce a regression.
One last question: could you point out what the improvement is in the last commit of this PR? It's not clear to me.
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621198
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43103/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832328455
**[Test build #138164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714
[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982214
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738672069
**[Test build #132221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841183742
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735483068
**[Test build #131941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735495202
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735559996
**[Test build #131953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738779564
**[Test build #132221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735508307
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166190
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34845/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716162655
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835877675
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138307/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312386
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131236/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737640409
**[Test build #132077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).
* This patch **fails PySpark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841986529
Oops.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312310
**[Test build #131236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738661573
```scala
package org.apache.spark.sql.execution.benchmark

import java.io.File

import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

/**
 * Benchmark to measure read performance with InSet filter pushdown.
 * To run this benchmark:
 * {{{
 *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/InSetFilterPushdownBenchmark-results.txt".
 * }}}
 */
object InSetFilterPushdownBenchmark extends SqlBasedBenchmark {

  override def getSparkSession: SparkSession = {
    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      // Since `spark.master` always exists, overrides this value
      .set("spark.master", "local[1]")
      .setIfMissing("spark.driver.memory", "3g")
      .setIfMissing("orc.compression", "snappy")
      .setIfMissing("spark.sql.parquet.compression.codec", "snappy")

    SparkSession.builder().config(conf).getOrCreate()
  }

  private val numRows = 1024 * 1024 * 15
  private val width = 5
  // For Parquet/ORC, we will use the same value for block size and compression size
  private val blockSize = org.apache.parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE

  // The tables below are created with saveAsTable, so they are dropped as tables;
  // dropTempView would leave them in place.
  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally tableNames.foreach(name => spark.sql(s"DROP TABLE IF EXISTS $name"))
  }

  private def prepareTable(dir: File, numRows: Int): Unit = {
    import spark.implicits._
    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
    val df = spark.range(numRows).map(_ => Random.nextLong()).selectExpr(selectExpr: _*)
      .withColumn("value", monotonically_increasing_id())
      .sort("value")

    df.write.mode("overwrite")
      .option("orc.compress.size", blockSize)
      .option("orc.stripe.size", blockSize).format("orc").saveAsTable("orcTable")
    df.write.mode("overwrite")
      .option("parquet.block.size", blockSize).format("parquet").saveAsTable("parquetTable")
    df.write.mode("overwrite").format("csv").saveAsTable("csvTable")
  }

  def filterPushDownBenchmark(
      values: Int,
      title: String,
      whereExpr: String,
      selectExpr: String = "*"): Unit = {
    val benchmark = new Benchmark(title, values, minNumIters = 5, output = output)

    // For each format, run a baseline case (threshold Int.MaxValue) and then the
    // case labeled "(Rewrite InSet)" (threshold 10).
    Seq("Parquet" -> "parquetTable", "ORC" -> "orcTable", "CSV" -> "csvTable").foreach {
      case (format, table) =>
        Seq(Int.MaxValue, 10).foreach { threshold =>
          val name = s"$format ${if (threshold == 10) "(Rewrite InSet)" else ""}"
          benchmark.addCase(name) { _ =>
            withSQLConf("spark.sql.optimizer.inSetRewriteMinMaxThreshold" -> s"$threshold") {
              spark.sql(s"SELECT $selectExpr FROM $table WHERE $whereExpr").noop()
            }
          }
        }
    }

    benchmark.run()
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("Pushdown benchmark for rewrite InSet") {
      withTempPath { dir =>
        withTempTable("orcTable", "parquetTable", "csvTable") {
          prepareTable(dir, numRows)
          Seq(50, 1000, 5000, 20000).foreach { count =>
            Seq(1, 10, 50, 90).foreach { distribution =>
              // Draw `count` random values from the lowest `distribution` percent of the ids.
              val filter =
                Range(0, count).map(_ => Random.nextInt(numRows * distribution / 100))
              val whereExpr = s"value in(${filter.mkString(",")})"
              val title = s"Rewrite InSet (values count: $count, distribution: $distribution)"
              filterPushDownBenchmark(numRows, title, whereExpr)
            }
          }
        }
      }
    }
  }
}
```
Result:
```
================================================================================================
Pushdown benchmark for rewrite InSet
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 50, distribution: 1): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet 8289 8371 68 1.9 527.0 1.0X
Parquet (Rewrite InSet) 598 614 14 26.3 38.0 13.9X
ORC 442 454 20 35.6 28.1 18.8X
ORC (Rewrite InSet) 411 431 20 38.2 26.1 20.2X
CSV 23399 23618 154 0.7 1487.7 0.4X
CSV (Rewrite InSet) 23437 24070 744 0.7 1490.1 0.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 50, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------
Parquet 8191 8244 50 1.9 520.8 1.0X
Parquet (Rewrite InSet) 1166 1178 13 13.5 74.1 7.0X
ORC 500 521 16 31.5 31.8 16.4X
ORC (Rewrite InSet) 514 526 8 30.6 32.7 15.9X
CSV 23447 23704 316 0.7 1490.7 0.3X
CSV (Rewrite InSet) 23639 23821 153 0.7 1502.9 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 50, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------
Parquet 8157 8233 56 1.9 518.6 1.0X
Parquet (Rewrite InSet) 4224 4257 42 3.7 268.6 1.9X
ORC 513 536 25 30.7 32.6 15.9X
ORC (Rewrite InSet) 511 530 18 30.8 32.5 16.0X
CSV 23665 24270 795 0.7 1504.6 0.3X
CSV (Rewrite InSet) 23321 23596 221 0.7 1482.7 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 50, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
----------------------------------------------------------------------------------------------------------------------------------
Parquet 8225 8335 84 1.9 522.9 1.0X
Parquet (Rewrite InSet) 7138 7218 115 2.2 453.8 1.2X
ORC 526 559 36 29.9 33.4 15.6X
ORC (Rewrite InSet) 507 538 24 31.1 32.2 16.2X
CSV 23411 23731 496 0.7 1488.4 0.4X
CSV (Rewrite InSet) 23470 23546 82 0.7 1492.2 0.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 1000, distribution: 1): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------
Parquet 8744 8845 90 1.8 555.9 1.0X
Parquet (Rewrite InSet) 650 656 4 24.2 41.3 13.5X
ORC 535 559 16 29.4 34.0 16.4X
ORC (Rewrite InSet) 532 551 16 29.5 33.9 16.4X
CSV 30467 32289 1496 0.5 1937.0 0.3X
CSV (Rewrite InSet) 23981 24614 596 0.7 1524.7 0.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 1000, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 8383 8468 82 1.9 533.0 1.0X
Parquet (Rewrite InSet) 1351 1362 9 11.6 85.9 6.2X
ORC 1048 1069 19 15.0 66.6 8.0X
ORC (Rewrite InSet) 1052 1071 28 15.0 66.9 8.0X
CSV 30950 32767 1238 0.5 1967.7 0.3X
CSV (Rewrite InSet) 24209 24513 396 0.6 1539.2 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 1000, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 8402 8481 55 1.9 534.2 1.0X
Parquet (Rewrite InSet) 4532 4677 186 3.5 288.1 1.9X
ORC 2621 2659 46 6.0 166.6 3.2X
ORC (Rewrite InSet) 2631 2738 193 6.0 167.2 3.2X
CSV 30098 30226 79 0.5 1913.6 0.3X
CSV (Rewrite InSet) 27913 28481 693 0.6 1774.7 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 1000, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 8420 8468 65 1.9 535.4 1.0X
Parquet (Rewrite InSet) 7621 7781 191 2.1 484.5 1.1X
ORC 3108 3167 53 5.1 197.6 2.7X
ORC (Rewrite InSet) 3089 3175 59 5.1 196.4 2.7X
CSV 30555 32254 1187 0.5 1942.6 0.3X
CSV (Rewrite InSet) 31091 31607 480 0.5 1976.7 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 5000, distribution: 1): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------
Parquet 9125 9170 55 1.7 580.2 1.0X
Parquet (Rewrite InSet) 1206 1234 18 13.0 76.6 7.6X
ORC 1244 1254 7 12.6 79.1 7.3X
ORC (Rewrite InSet) 1236 1250 12 12.7 78.6 7.4X
CSV 350424 355583 1016 0.0 22279.3 0.0X
CSV (Rewrite InSet) 28577 28875 458 0.6 1816.9 0.3X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 5000, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 9162 9408 253 1.7 582.5 1.0X
Parquet (Rewrite InSet) 1911 1930 13 8.2 121.5 4.8X
ORC 1774 1809 41 8.9 112.8 5.2X
ORC (Rewrite InSet) 1769 1785 24 8.9 112.5 5.2X
CSV 364909 368618 NaN 0.0 23200.3 0.0X
CSV (Rewrite InSet) 58985 59425 287 0.3 3750.1 0.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 5000, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 9218 9499 173 1.7 586.1 1.0X
Parquet (Rewrite InSet) 5109 5139 32 3.1 324.8 1.8X
ORC 4089 4137 72 3.8 260.0 2.3X
ORC (Rewrite InSet) 4056 4121 93 3.9 257.9 2.3X
CSV 359994 364490 790 0.0 22887.8 0.0X
CSV (Rewrite InSet) 196472 202225 721 0.1 12491.4 0.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 5000, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 9147 9247 64 1.7 581.6 1.0X
Parquet (Rewrite InSet) 8369 8520 179 1.9 532.1 1.1X
ORC 6267 6305 47 2.5 398.4 1.5X
ORC (Rewrite InSet) 6289 6435 199 2.5 399.8 1.5X
CSV 369254 371915 697 0.0 23476.6 0.0X
CSV (Rewrite InSet) 326837 329082 NaN 0.0 20779.7 0.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 20000, distribution: 1): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------------------
Parquet 11866 11944 105 1.3 754.4 1.0X
Parquet (Rewrite InSet) 3578 3670 81 4.4 227.5 3.3X
ORC 4119 4152 33 3.8 261.9 2.9X
ORC (Rewrite InSet) 4054 4181 84 3.9 257.7 2.9X
CSV 2319345 2350577 153 0.0 147460.0 0.0X
CSV (Rewrite InSet) 55273 56287 821 0.3 3514.2 0.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 20000, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------------------
Parquet 12194 12287 91 1.3 775.3 1.0X
Parquet (Rewrite InSet) 4442 4479 42 3.5 282.4 2.7X
ORC 4805 4847 53 3.3 305.5 2.5X
ORC (Rewrite InSet) 4746 4838 94 3.3 301.7 2.6X
CSV 2958262 2979920 967 0.0 188081.2 0.0X
CSV (Rewrite InSet) 322782 329114 1177 0.0 20521.9 0.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 20000, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------------------
Parquet 12138 12205 69 1.3 771.7 1.0X
Parquet (Rewrite InSet) 7760 7901 160 2.0 493.3 1.6X
ORC 7072 7263 148 2.2 449.6 1.7X
ORC (Rewrite InSet) 7094 7225 87 2.2 451.0 1.7X
CSV 2906664 2948342 220 0.0 184800.7 0.0X
CSV (Rewrite InSet) 1367893 1393413 1348 0.0 86968.3 0.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
Intel Core Processor (Broadwell, IBRS)
Rewrite InSet (values count: 20000, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------------------
Parquet 12230 12593 387 1.3 777.5 1.0X
Parquet (Rewrite InSet) 11262 11580 263 1.4 716.0 1.1X
ORC 9712 9794 75 1.6 617.5 1.3X
ORC (Rewrite InSet) 9658 9763 109 1.6 614.1 1.3X
CSV 2776344 2807999 1140 0.0 176515.2 0.0X
CSV (Rewrite InSet) 2506408 2519162 802 0.0 159353.1 0.0X
```
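A note for reading the tables above: `distribution` is the percentage of the row-id range from which the IN-list values are drawn, so a lower distribution gives a narrower [min, max] range once the list is rewritten. The sketch below reproduces only that sampling step outside Spark. The object `DistributionSketch`, the helper `sampleInValues`, and the fixed seed are assumptions made for illustration and are not part of the PR; it also assumes `value` covers roughly `0 until numRows`.

```scala
import scala.util.Random

object DistributionSketch {
  // Hypothetical helper mirroring the benchmark's sampling expression:
  //   Random.nextInt(numRows * distribution / 100)
  def sampleInValues(count: Int, numRows: Int, distribution: Int, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    // Long arithmetic avoids Int overflow if a larger row count is tried.
    val bound = (numRows.toLong * distribution / 100).toInt
    Seq.fill(count)(rng.nextInt(bound))
  }

  def main(args: Array[String]): Unit = {
    val numRows = 1024 * 1024 * 15
    val narrow = sampleInValues(count = 50, numRows = numRows, distribution = 1, seed = 42L)
    val wide = sampleInValues(count = 50, numRows = numRows, distribution = 90, seed = 42L)
    // A min/max rewrite would cover the interval printed for each list.
    println(s"distribution 1:  [${narrow.min}, ${narrow.max}] of $numRows rows")
    println(s"distribution 90: [${wide.min}, ${wide.max}] of $numRows rows")
  }
}
```

With distribution 1 the interval can span at most about 1% of the rows, which is consistent with the Parquet speedup being largest in those tables and shrinking toward 1.1X at distribution 90.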
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166198
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686782135
**[Test build #128266 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r526092610
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+ values.nonEmpty && canMakeFilterOn(name, values.head) =>
+ if (values.length <= pushDownInFilterThreshold) {
+ values.flatMap { v =>
+ makeEq.lift(nameToParquetField(name).fieldType)
+ .map(_(nameToParquetField(name).fieldNames, v))
+ }.reduceLeftOption(FilterApi.or)
+ } else {
+ sparkSchema.find { f =>
+ if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+ }.map(_.dataType) match {
+ case Some(dataType) =>
+ val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+ createFilterHelper(
+ sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+ sources.LessThanOrEqual(name, sortedValues.last)),
+ canPartialPushDownConjuncts)
Review comment:
Ah, then can we turn it into a util method and use it in all the filter pushdown places?
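A util method of the kind suggested here might look roughly like the sketch below. It is not the code in the PR: the filter classes are simplified, hypothetical stand-ins for Spark's `sources.*` filters, values are restricted to `Long`, and `rewriteIn` is an invented name. It only illustrates the branch structure visible in the diff (threshold check, empty check, OR of equalities, otherwise a min/max range).

```scala
object InRewriteSketch {
  // Hypothetical, simplified stand-ins for Spark's data source filters (Long values only).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Long) extends Filter
  final case class GreaterThanOrEqual(attribute: String, value: Long) extends Filter
  final case class LessThanOrEqual(attribute: String, value: Long) extends Filter
  final case class Or(left: Filter, right: Filter) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  /**
   * Short IN lists become a chain of ORed equalities; longer lists become a
   * [min, max] range. The range is a superset of the IN list, so it can only
   * help skip data: the original predicate still has to be evaluated on the
   * rows that are read.
   */
  def rewriteIn(attribute: String, values: Seq[Long], threshold: Int): Option[Filter] = {
    if (threshold <= 0 || values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      val equalities: Seq[Filter] = values.map(v => EqualTo(attribute, v))
      Some(equalities.reduceLeft[Filter]((l, r) => Or(l, r)))
    } else {
      Some(And(
        GreaterThanOrEqual(attribute, values.min),
        LessThanOrEqual(attribute, values.max)))
    }
  }

  def main(args: Array[String]): Unit = {
    println(rewriteIn("value", Seq(3L, 1L, 2L), threshold = 10))
    println(rewriteIn("value", Seq(7L, 42L, 5L, 19L), threshold = 2))
  }
}
```

Keeping the decision in one function would let every format-specific pushdown path share the same threshold and empty-list handling, which seems to be the point of the suggestion.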
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809839180
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41258/
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528457604
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
.doc("The maximum number of values to filter push-down optimization for IN predicate. " +
- "Large threshold won't necessarily provide much better performance. " +
- "The experiment argued that 300 is the limit threshold. " +
+ "Spark will push-down a value greater than or equal to its minimum value and " +
Review comment:
Impala only optimizes it to `>= minimum value` and `<= maximum value`: https://github.com/apache/impala/commit/aa05c6493b0ff8bbf422a4c38cf780bde34d51c7
[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835819039
Retest this please.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716155498
**[Test build #130245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130245/testReport)** for PR 29642 at commit [`0169114`](https://github.com/apache/spark/commit/0169114d7f71d3a1fc63cf9faa114cff4b415077).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743109098
Test on a real production case:
Before this PR | After this PR
--- | ---
![image](https://user-images.githubusercontent.com/5399861/101891559-2d5fdb00-3bdd-11eb-8dc3-8e5854654660.png) | ![image](https://user-images.githubusercontent.com/5399861/101891620-436d9b80-3bdd-11eb-9290-c6226e76b7c2.png)
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188402
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188402
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042
> No. GitHub Actions runs on different machines, so there is a performance difference between them.
No, @wangyum. I mean the **ratio** between ORC and Parquet in a run on the same machine. Previously, ORC and Parquet showed similar performance, but after this PR Parquet has become slower than ORC. For example:
```
- Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
- Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
- Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
- Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+ Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+ Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+ Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+ Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
```
Although the values are small, the generated result shows a slowdown of Parquet compared with ORC. That was [my question](https://github.com/apache/spark/pull/29642#discussion_r628899597).
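The ratio in question can be worked out directly from the "Best Time(ms)" column of the pushdown rows quoted above. The snippet below is only an arithmetic illustration; the object and method names are made up for this note.

```scala
object RatioSketch {
  // Best Time (ms) of the "(Pushdown)" rows, copied from the quoted diff.
  val parquetBefore = 596.0 // "-" line, Parquet Vectorized (Pushdown)
  val orcBefore = 592.0     // "-" line, Native ORC Vectorized (Pushdown)
  val parquetAfter = 493.0  // "+" line, Parquet Vectorized (Pushdown)
  val orcAfter = 436.0      // "+" line, Native ORC Vectorized (Pushdown)

  // A ratio above 1 means Parquet took longer than ORC in the same run.
  def ratio(parquetMs: Double, orcMs: Double): Double = parquetMs / orcMs

  def main(args: Array[String]): Unit = {
    println(f"before: ${ratio(parquetBefore, orcBefore)}%.3f")
    println(f"after:  ${ratio(parquetAfter, orcAfter)}%.3f")
  }
}
```

Both formats got faster in absolute terms, but the Parquet-to-ORC ratio moves from about 1.01 to about 1.13, which is the relative slowdown being asked about.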
[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042
> No. GitHub Actions runs on different machines, so there is a performance difference between them.
No, @wangyum. I mean the **ratio** between ORC and Parquet in a run on the same machine. Previously, ORC and Parquet showed similar performance, but after this PR Parquet has become slower than ORC. For example:
```
- Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
- Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
- Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
- Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+ Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+ Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+ Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+ Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
```
[GitHub] [spark] github-actions[bot] commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803695381
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803815504
This patch pushes down a filter on the data column when the number of `InSet` values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. Here are the benchmark and its results:
https://github.com/apache/spark/blob/3aa659ce29877f386a24da9d04e66069d04afaa8/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L281-L296
Before:
https://github.com/apache/spark/blob/f5118f81e395bde0cd8253dbef6a9e6455c3958a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L430-L482
After:
https://github.com/apache/spark/blob/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L439-L482
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739511388
**[Test build #132295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852288
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36835/
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r493226163
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
Review comment:
@HyukjinKwon @gengliangwang If we do not rely on the optimizer, we should add an empty check; otherwise `values.head` will throw an exception.
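The empty check described in this comment can be sketched as follows. This is a minimal, dependency-free illustration and not the code in `ParquetFilters`: the object `InFilterGuard`, its methods, and the `threshold` parameter (standing in for `pushDownInFilterThreshold`) are hypothetical names.

```scala
import scala.util.Try

// Hypothetical sketch: the names below are illustrative, not Spark's API.
object InFilterGuard {
  // Returns the values to turn into a pushed-down filter, or None to skip pushdown.
  // The isEmpty check runs first, so nothing downstream calls `head` on an empty array.
  def valuesToPush(values: Array[Any], threshold: Int): Option[Array[Any]] =
    if (values.isEmpty || values.length > threshold) None
    else Some(values)

  // Shows the failure mode: `head` on an empty array throws instead of returning a value.
  def headFails(values: Array[Any]): Boolean = Try(values.head).isFailure
}
```

Putting the `isEmpty` test ahead of any use of `values.head` means the guard no longer depends on the optimizer having removed empty `In` lists beforehand.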
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716190940
**[Test build #130245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130245/testReport)** for PR 29642 at commit [`0169114`](https://github.com/apache/spark/commit/0169114d7f71d3a1fc63cf9faa114cff4b415077).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731937316
It improves performance in most cases, based on the [benchmark result](https://github.com/apache/spark/blob/b8cb1f48f1b38d74475c067a28426502c4e4a87a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L457-L482):
100 values | Relative
-- | --
Top 10% of data | 6.6X
Top 50% of data | 1.9X
Top 90% of data | 1.1X
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735508307
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841759928
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43103/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834491086
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138247/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809879605
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41264/
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809818374
**[Test build #136676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716153233
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841186474
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43078/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832412736
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138164/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621198
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729792609
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738561954
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36785/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521014
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36896/
[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835972741
@dongjoon-hyun This PR only improves the `In` predicate. I have added the improvement details to the PR description.
[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730359987
Shall we implement the logic in `FileSourceStrategy`? Then it would not be Parquet-only.
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626800
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716191230
[GitHub] [spark] wangyum closed pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626143
**[Test build #128266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312379
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832328455
**[Test build #138164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).
[GitHub] [spark] wangyum edited a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743109098
Real production case, `InSet` size = 1918:
Before this PR | After this PR
--- | ---
![image](https://user-images.githubusercontent.com/5399861/101891559-2d5fdb00-3bdd-11eb-8dc3-8e5854654660.png) | ![image](https://user-images.githubusercontent.com/5399861/101891620-436d9b80-3bdd-11eb-9290-c6226e76b7c2.png)
Table statistics:
```
+-------------+-----------------+-----------------+--+
| count(1) | min(SELLER_ID) | max(SELLER_ID) |
+-------------+-----------------+-----------------+--+
| 8344448448 | 9 | 2234460898 |
+-------------+-----------------+-----------------+--+
```
Query statistics:
```
+-----------+-----------------+-----------------+--+
| count(1) | min(SELLER_ID) | max(SELLER_ID) |
+-----------+-----------------+-----------------+--+
| 33978532 | 153377548 | 2180252014 |
+-----------+-----------------+-----------------+--+
```
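As a back-of-the-envelope reading of the two statistics tables above, the sketch below computes the fraction of rows the query returns and the fraction of the table's `SELLER_ID` range that the matched IDs span. The object and value names are illustrative, and any conclusion about pruning is an interpretation rather than something stated in the thread.

```scala
// Numbers copied from the two statistics tables above; names are illustrative.
object SellerIdStats {
  val tableRows = 8344448448L
  val matchedRows = 33978532L
  val tableMin = 9L
  val tableMax = 2234460898L
  val queryMin = 153377548L
  val queryMax = 2180252014L

  // Fraction of the table's rows that the query returns.
  val selectivity: Double = matchedRows.toDouble / tableRows

  // Fraction of the table's SELLER_ID range covered by [queryMin, queryMax].
  val rangeCoverage: Double = (queryMax - queryMin).toDouble / (tableMax - tableMin)
}
```

The query returns well under 1% of the rows, while the matched IDs span roughly nine tenths of the table's ID range. Under that reading, a filter based only on the overall minimum and maximum of the matched IDs would likely exclude little data.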
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528432010
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
Review comment:
If this is supposed to be beneficial in other sources as well, I think it makes more sense to push it to other sources as well anyway.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739516555
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36896/
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042
> No. Github action runs on different machines, there is a performance difference between them.
No, @wangyum. I mean the **ratio** between ORC and Parquet in the same machine run. Previously, ORC and Parquet showed similar performance, but Parquet became slower than ORC after this PR. For example:
```
- Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
- Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
- Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
- Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+ Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+ Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+ Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+ Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
```
Although the difference is small, the generated result shows a slowdown of Parquet compared with ORC. That was my question.
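The ratio in question can be worked out from the pushdown rows of the diff above. The sketch assumes that the `-` lines are the previous results, the `+` lines are the results after this PR, and the first numeric column is the best time in milliseconds (the column order used by the benchmark output in the PR description). The object name is illustrative.

```scala
// Best times (ms) of the "(Pushdown)" rows, copied from the diff above.
object OrcParquetRatio {
  val parquetBefore = 596.0
  val orcBefore = 592.0
  val parquetAfter = 493.0
  val orcAfter = 436.0

  // Parquet time divided by ORC time within the same run; 1.0 means parity.
  val before: Double = parquetBefore / orcBefore
  val after: Double = parquetAfter / orcAfter
}
```

Before the change the two formats are within about 1% of each other; afterwards Parquet takes roughly 13% longer than ORC in the same run, even though both absolute times dropped.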
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729291752
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882171
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41264/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809839180
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41258/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841316402
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138557/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738780315
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132221/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509
@dongjoon-hyun I think the [current benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297) is enough. I have added the benchmark results to the PR description.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834257115
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42769/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742753903
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132574/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738811606
**[Test build #132235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559827
**[Test build #132185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535000
**[Test build #132295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559980
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132185/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852315
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36835/
[GitHub] [spark] gengliangwang commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697151626
@wangyum Do you have any further comments? If not, shall we close this one?
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521014
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36896/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834234834
**[Test build #138247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809925233
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136682/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409564
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42685/
[GitHub] [spark] cloud-fan commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r627978970
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##########
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+ Seq(3, 20).foreach { threshold =>
+ withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {
Review comment:
Shall we update the conf doc of `PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD`? We have a new feature now.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835871614
**[Test build #138307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738555193
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36785/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729926433
**[Test build #131289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835836092
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42829/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841779612
**[Test build #138582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166198
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742566759
**[Test build #132574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738541400
**[Test build #132185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743135912
@cloud-fan @HyukjinKwon @gengliangwang Do you have more comments?
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r511606498
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
Review comment:
Sort performance:
```scala
import org.apache.spark.benchmark.Benchmark
val N = 20000000
// Values wrap around at 10,000,000, so most values occur twice in the array.
val array = Range(1, N).map(_ % 10000000).toArray
val benchmark = new Benchmark(s"Benchmark distinct", valuesPerIteration = N, minNumIters = 30)
// Case 1: sort the whole array.
benchmark.addCase("array.sorted") { _ =>
array.sorted
}
// Case 2: de-duplicate the whole array.
benchmark.addCase("array.distinct") { _ =>
array.distinct
}
benchmark.run()
```
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.6
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark distinct: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
array.sorted 296 821 NaN 67.7 14.8 1.0X
array.distinct 3005 3933 330 6.7 150.2 0.1X
```
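To make the trade-off concrete, here is a minimal sketch of a threshold check that avoids `distinct`. It is not the actual `ParquetFilters` code: `PushedFilter`, `InValues`, `MinMaxRange` and `InPushdownSketch.plan` are hypothetical names, and the fallback to a min/max range above the threshold is an assumption about the intent of this change. The empty check and the absence of `distinct` follow the PR description.

```scala
// Hypothetical types for illustration only; these are not Spark or Parquet classes.
sealed trait PushedFilter
final case class InValues(values: Seq[Int]) extends PushedFilter
final case class MinMaxRange(min: Int, max: Int) extends PushedFilter

object InPushdownSketch {
  // Decide what to push down for `col IN (values)` without calling `distinct`.
  // Duplicates are assumed to have been removed earlier (e.g. by OptimizeIn),
  // so `values.length` is used directly for the threshold check.
  def plan(values: Array[Int], threshold: Int): Option[PushedFilter] = {
    if (values.isEmpty) {
      None // empty check: nothing to push down
    } else if (values.length <= threshold) {
      Some(InValues(values.toSeq)) // small list: keep exact membership
    } else {
      // Assumption: above the threshold, fall back to a coarser range filter.
      Some(MinMaxRange(values.min, values.max))
    }
  }
}
```

With a threshold of 3, `InPushdownSketch.plan(Array(5, 1, 9, 3), 3)` yields a range from 1 to 9, while a list of three values or fewer is kept as is.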
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716159240
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34844/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832412736
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138164/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841781338
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138582/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841316402
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138557/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735483068
**[Test build #131941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834456817
**[Test build #138247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738780315
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132221/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738835079
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36835/
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r525929388
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+ values.nonEmpty && canMakeFilterOn(name, values.head) =>
+ if (values.length <= pushDownInFilterThreshold) {
+ values.flatMap { v =>
+ makeEq.lift(nameToParquetField(name).fieldType)
+ .map(_(nameToParquetField(name).fieldNames, v))
+ }.reduceLeftOption(FilterApi.or)
+ } else {
+ sparkSchema.find { f =>
+ if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+ }.map(_.dataType) match {
+ case Some(dataType) =>
+ val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+ createFilterHelper(
+ sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+ sources.LessThanOrEqual(name, sortedValues.last)),
+ canPartialPushDownConjuncts)
Review comment:
The logic is the same as in HiveShim.scala#L746-L750.
https://github.com/apache/spark/blob/09bb9bedcd27e08b86d63a6aed90d42ca4c606be/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L746-L750
@cloud-fan @dongjoon-hyun @HyukjinKwon This improves the `InSet -> InFilters (values count: 100, distribution: 10)` benchmark case by 6.6X:
```
Parquet Vectorized (Pushdown) 9520 9560 27 1.7 605.3 1.0X
Parquet Vectorized (Pushdown) 873 885 11 18.0 55.5 6.6X
```
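A minimal sketch of the rewrite in the diff above, assuming a simplified, hypothetical filter ADT (`SimpleFilter`, `Eq`, `Gte`, `Lte`, `And`, `Or`) in place of Spark's `sources.Filter` and Parquet's `FilterApi`, and `Int` values only. It mirrors the two branches: an `OR` of equality filters when the value count is within the threshold, and a `[min, max]` range otherwise. The real patch sorts with a type-specific ordering; `min`/`max` is used here only for brevity.

```scala
object InRewriteSketch {
  // Hypothetical, simplified filter ADT; these are NOT Spark's or Parquet's classes.
  sealed trait SimpleFilter
  final case class Eq(name: String, value: Int) extends SimpleFilter
  final case class Gte(name: String, value: Int) extends SimpleFilter
  final case class Lte(name: String, value: Int) extends SimpleFilter
  final case class And(left: SimpleFilter, right: SimpleFilter) extends SimpleFilter
  final case class Or(left: SimpleFilter, right: SimpleFilter) extends SimpleFilter

  // `threshold` stands in for pushDownInFilterThreshold from the diff.
  def rewriteIn(name: String, values: Seq[Int], threshold: Int): Option[SimpleFilter] = {
    if (threshold <= 0 || values.isEmpty) {
      // Nothing can be pushed down: pushdown disabled or an empty IN list.
      None
    } else if (values.length <= threshold) {
      // Few values: name = v1 OR name = v2 OR ...
      val eqs: Seq[SimpleFilter] = values.map(v => Eq(name, v))
      eqs.reduceLeftOption[SimpleFilter](Or(_, _))
    } else {
      // Many values: fall back to the covering range min <= name <= max.
      Some(And(Gte(name, values.min), Lte(name, values.max)))
    }
  }
}
```

For example, `InRewriteSketch.rewriteIn("id", Seq(5, 1, 9), 2)` takes the range branch, because three values exceed a threshold of two.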
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738663385
retest this please.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729792557
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738541400
**[Test build #132185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729792609
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737640996
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809869327
**[Test build #136676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
* This patch **fails PySpark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982543
@wangyum, are you online? Can you take a quick look and either fix or revert this?
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521006
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36896/
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686782831
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735523298
**[Test build #131953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738571117
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36785/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716162644
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528431912
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
Review comment:
@wangyum, the Impala reference sounds good. Can we make this general and push the range filter down to other data sources as well?
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835826320
**[Test build #138307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840985630
@dongjoon-hyun I don't think this performance issue is caused by this change. This PR only changes the `In` predicate, and the benchmark is just as slow without it:
```
OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1047-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Select 0 string row (value IS NULL): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 10623 10994 272 1.5 675.4 1.0X
Parquet Vectorized (Pushdown) 627 657 24 25.1 39.9 16.9X
Native ORC Vectorized 7490 7653 203 2.1 476.2 1.4X
Native ORC Vectorized (Pushdown) 553 606 34 28.4 35.2 19.2X
```
https://github.com/wangyum/spark/runs/2580852093
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509
@dongjoon-hyun I think the [current benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297) is enough. I have updated the PR description with the benchmark results.
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738722940
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36821/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738971991
**[Test build #132235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r483633146
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
Review comment:
+1 to @HyukjinKwon
@wangyum I think this PR can cause a perf regression in Parquet filter pushdown. After the change, `In` filters with redundant elements might no longer be pushed down, because duplicates now count toward `pushDownInFilterThreshold`.
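To make the concern concrete, here is a small sketch assuming a hypothetical threshold of 3 and an `In` list that still contains duplicates (that is, one that was not deduplicated before reaching the Parquet filter conversion). The names `threshold` and `values` are illustrative only.

```scala
object DistinctThresholdSketch {
  // Hypothetical stand-in for pushDownInFilterThreshold.
  val threshold: Int = 3

  // Six elements, but only three distinct values.
  val values: Seq[Int] = Seq(1, 1, 2, 2, 3, 3)

  // Old guard: duplicates are ignored, so the filter qualifies for pushdown.
  val pushedWithDistinct: Boolean = values.distinct.length <= threshold

  // New guard: duplicates count, so the same filter exceeds the threshold.
  val pushedWithoutDistinct: Boolean = values.length <= threshold
}
```

Whether this matters in practice depends on whether duplicates are always removed upstream, which is the assumption the PR description relies on.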
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742753903
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132574/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729275927
**[Test build #131236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841178201
@dongjoon-hyun @cloud-fan Please see the latest benchmark result: https://github.com/apache/spark/pull/29642/commits/27a2bf615eb158c7c25aa5bfaa04caa939c237da
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199
**[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628964580
##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
Pushdown for many distinct value case
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Select 0 string row (value IS NULL): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
-Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
-Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
-Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
Review comment:
No. GitHub Actions runs on different machines, so there is a performance difference between them.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409564
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42685/
[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809862832
**[Test build #136682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042
> No. GitHub Actions runs on different machines, so there is a performance difference between them.
No, @wangyum. I mean the **ratio** between ORC and Parquet within the same run on the same machine. Previously, ORC and Parquet showed similar performance, but after this PR Parquet looks slower relative to ORC, and the gap has widened. For example:
```
- Parquet Vectorized 10512 10572 58 1.5 668.4 1.0X
- Parquet Vectorized (Pushdown) 596 621 19 26.4 37.9 17.6X
- Native ORC Vectorized 8555 8723 97 1.8 543.9 1.2X
- Native ORC Vectorized (Pushdown) 592 609 11 26.6 37.7 17.8X
+ Parquet Vectorized 9788 10231 259 1.6 622.3 1.0X
+ Parquet Vectorized (Pushdown) 493 536 29 31.9 31.3 19.9X
+ Native ORC Vectorized 6487 6575 137 2.4 412.4 1.5X
+ Native ORC Vectorized (Pushdown) 436 447 14 36.1 27.7 22.4X
```
Although the absolute values are small, the generated result shows Parquet slowing down relative to ORC. That was [my question](https://github.com/apache/spark/pull/29642#discussion_r628899597).
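The ratio being discussed can be recomputed from the quoted rows. The sketch below uses only the Best Time (ms) column from the diff above; the object and value names are illustrative. Parquet time divided by ORC time comes out at roughly 1.23 (plain) and 1.01 (pushdown) before, versus roughly 1.51 and 1.13 after.

```scala
object RatioSketch {
  // Parquet best time divided by ORC best time; values above 1.0 mean Parquet is slower.
  def ratio(parquetMs: Double, orcMs: Double): Double = parquetMs / orcMs

  // Best Time (ms) figures copied from the "-" rows (before).
  val beforePlain: Double = ratio(10512, 8555)
  val beforePushdown: Double = ratio(596, 592)

  // Best Time (ms) figures copied from the "+" rows (after).
  val afterPlain: Double = ratio(9788, 6487)
  val afterPushdown: Double = ratio(493, 436)

  def main(args: Array[String]): Unit = {
    println(f"before: plain $beforePlain%.2f, pushdown $beforePushdown%.2f")
    println(f"after:  plain $afterPlain%.2f, pushdown $afterPushdown%.2f")
  }
}
```

Since the two result sets were produced on different CPUs, the ratios alone do not show which change, if any, caused the shift.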
[GitHub] [spark] github-actions[bot] closed pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642
[GitHub] [spark] gengliangwang edited a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731904386
@wangyum @cloud-fan @HyukjinKwon
I have some concerns about this optimization. What if the range is huge and the filter becomes much less selective? E.g.
```
SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, Int.Max)
```
=>
```
SELECT * FROM t WHERE id >= 1 and id <= ${Int.Max}
```
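The concern can be illustrated with a minimal sketch in plain Scala, without Spark. It assumes a hypothetical column holding the ids 1 to 1,000,000 and uses `Int.MaxValue` for the `Int.Max` placeholder in the query; `RangeSelectivity` and its members are made-up names for this illustration.

```scala
object RangeSelectivity {
  // Values from the IN list in the example above.
  val inValues: Seq[Int] = (1 to 9) :+ Int.MaxValue

  // Hypothetical column contents: ids 1 to 1,000,000 (an assumption for illustration).
  val ids: Range = 1 to 1000000

  // Rows accepted by the exact IN predicate.
  def exactMatches: Int = {
    val inSet = inValues.toSet
    ids.count(inSet.contains)
  }

  // Rows accepted by the rewritten predicate: id >= min AND id <= max.
  def rangeMatches: Int = {
    val (lo, hi) = (inValues.min, inValues.max)
    ids.count(id => id >= lo && id <= hi)
  }

  def main(args: Array[String]): Unit = {
    println(s"IN matches:    $exactMatches")
    println(s"range matches: $rangeMatches")
  }
}
```

With this data the exact predicate accepts 9 ids while the range accepts every id, so a single outlier in the list leaves the range filter with nothing to prune.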
[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
.doc("The maximum number of values to filter push-down optimization for IN predicate. " +
- "Large threshold won't necessarily provide much better performance. " +
- "The experiment argued that 300 is the limit threshold. " +
+ "Spark will push-down a value greater than or equal to its minimum value and " +
Review comment:
I think the default value `10` is too small here. What is the default threshold in Impala?
[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r526165602
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
createFilterHelper(pred, canPartialPushDownConjuncts = false)
.map(FilterApi.not)
- case sources.In(name, values) if canMakeFilterOn(name, values.head)
- && values.distinct.length <= pushDownInFilterThreshold =>
- values.distinct.flatMap { v =>
- makeEq.lift(nameToParquetField(name).fieldType)
- .map(_(nameToParquetField(name).fieldNames, v))
- }.reduceLeftOption(FilterApi.or)
+ case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+ values.nonEmpty && canMakeFilterOn(name, values.head) =>
+ if (values.length <= pushDownInFilterThreshold) {
+ values.flatMap { v =>
+ makeEq.lift(nameToParquetField(name).fieldType)
+ .map(_(nameToParquetField(name).fieldNames, v))
+ }.reduceLeftOption(FilterApi.or)
+ } else {
+ sparkSchema.find { f =>
+ if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+ }.map(_.dataType) match {
+ case Some(dataType) =>
+ val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+ createFilterHelper(
+ sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+ sources.LessThanOrEqual(name, sortedValues.last)),
+ canPartialPushDownConjuncts)
Review comment:
OK, added a new function to `TypeUtils`.
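For readers following the hunk above, here is a simplified sketch of the branching it introduces, restricted to `Int` values. The `Pred`, `Eq`, `Or` and `Between` types and the `rewriteIn` function are hypothetical stand-ins, not Spark or Parquet classes; the real change goes through `createFilterHelper` and a type-specific ordering.

```scala
// Hypothetical, simplified predicate types (not Spark or Parquet classes).
sealed trait Pred
final case class Eq(v: Int) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
// Inclusive on both ends, mirroring GreaterThanOrEqual AND LessThanOrEqual.
final case class Between(lo: Int, hi: Int) extends Pred

object InRewrite {
  // Mirrors the shape of the diff: nothing is pushed for a non-positive
  // threshold or an empty list; short lists become an OR of equalities;
  // longer lists collapse to a [min, max] range.
  def rewriteIn(values: Seq[Int], threshold: Int): Option[Pred] =
    if (threshold <= 0 || values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      values.map(v => (Eq(v): Pred)).reduceLeftOption((a, b) => Or(a, b))
    } else {
      Some(Between(values.min, values.max))
    }
}
```

For example, `InRewrite.rewriteIn(Seq(5, 1, 9), threshold = 2)` yields `Some(Between(1, 9))`, while the same list with `threshold = 10` yields a chain of `Or(Eq(...), ...)` nodes.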
[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735515368
retest this please
[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834270229
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42769/
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188129
**[Test build #130244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130244/testReport)** for PR 29642 at commit [`c5ab656`](https://github.com/apache/spark/commit/c5ab6569f4b175066613d02b787dc8aaa83ca8d9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729927486
[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621178
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36676/