Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/03 16:59:15 UTC

[GitHub] [spark] wangyum opened a new pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

wangyum opened a new pull request #29642:
URL: https://github.com/apache/spark/pull/29642


   ### What changes were proposed in this pull request?
   
   Improve `In` filter pushdown for `ParquetFilters`:
   - Remove the `distinct` operation because it is expensive, and duplicate values should already have been removed by `OptimizeIn`.
      ```scala
      import org.apache.spark.benchmark.Benchmark
      val N = 5000000
      val array = Range(1, N).toArray
      val benchmark = new Benchmark(s"Benchmark distinct", valuesPerIteration = N, minNumIters = 30)
      benchmark.addCase("array.length") { _ =>
        array.length
      }
      benchmark.addCase("array.distinct.length") { _ =>
        array.distinct.length
      }
      benchmark.run()
      ```
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.6
      Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
      Benchmark distinct:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------------------------------
      array.length                                          0              0           0   48076923.1           0.0       1.0X
      array.distinct.length                               711           1389         498          7.0         142.2       0.0X
      ```
   - Add an empty check because `values` may be empty.
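
   The two bullets above can be illustrated with a minimal sketch. It is not the actual `ParquetFilters` code: the `Pred`, `Eq`, and `Or` types and the `buildInPredicate` helper are hypothetical names used only to show the shape of the change, assuming duplicates have already been removed upstream.

   ```scala
   // Hypothetical, simplified predicate model; not Spark's or Parquet's API.
   sealed trait Pred
   final case class Eq(column: String, value: Any) extends Pred
   final case class Or(left: Pred, right: Pred) extends Pred

   object InPushdownSketch {
     // Returns None when `values` is empty, so the caller can skip pushdown
     // instead of reducing an empty collection (reduceLeft would throw).
     // There is no `distinct` call: duplicates are assumed to be gone already.
     def buildInPredicate(column: String, values: Array[Any]): Option[Pred] =
       if (values.isEmpty) None
       else {
         val equalities: Array[Pred] = values.map(v => Eq(column, v))
         Some(equalities.reduceLeft[Pred]((l, r) => Or(l, r)))
       }
   }
   ```

   With this shape, `InPushdownSketch.buildInPredicate("a", Array.empty[Any])` yields `None`, while a non-empty array yields a left-nested chain of `Or` nodes over one `Eq` per value.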
   
   ### Why are the changes needed?
   
   Enhance code robustness.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742603252


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37178/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735503695


   **[Test build #131941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738973065


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132235/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736322


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36821/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737596383


   **[Test build #132077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738804876


   retest this please.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737610313


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36676/
   




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287


   Thank you for updating, @wangyum. As of the last commit, yes, I agree that this PR does not appear to introduce a regression.
   
   One last question: could you point out what the improvement in the last commit of this PR is? It's not clear to me from the last commit. Do we need to add a specific benchmark case for your contribution?




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312379








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739511388


   **[Test build #132295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628965380



##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
-Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                           10406          10461          50          1.5         661.6       1.0X
-Parquet Vectorized (Pushdown)                                  619            641          22         25.4          39.4      16.8X
-Native ORC Vectorized                                         8787           8834          57          1.8         558.6       1.2X
-Native ORC Vectorized (Pushdown)                               592            608          11         26.6          37.6      17.6X
+Parquet Vectorized                                            9861           9880          16          1.6         626.9       1.0X
+Parquet Vectorized (Pushdown)                                  507            529          21         31.0          32.3      19.4X
+Native ORC Vectorized                                         6871           6938          63          2.3         436.8       1.4X
+Native ORC Vectorized (Pushdown)                               453            470          13         34.7          28.8      21.8X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 1 string row (value = '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10632          10694          60          1.5         676.0       1.0X
-Parquet Vectorized (Pushdown)                       608            635          22         25.9          38.6      17.5X
-Native ORC Vectorized                              8790           8838          37          1.8         558.9       1.2X
-Native ORC Vectorized (Pushdown)                    559            584          22         28.1          35.5      19.0X
+Parquet Vectorized                                10228          10471         167          1.5         650.3       1.0X
+Parquet Vectorized (Pushdown)                       511            519           5         30.8          32.5      20.0X
+Native ORC Vectorized                              6700           6865         119          2.3         426.0       1.5X
+Native ORC Vectorized (Pushdown)                    436            454          12         36.1          27.7      23.5X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 1 string row (value <=> '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                 10529          10624          74          1.5         669.4       1.0X
-Parquet Vectorized (Pushdown)                        613            631          16         25.7          39.0      17.2X
-Native ORC Vectorized                               8746           8816          63          1.8         556.1       1.2X
-Native ORC Vectorized (Pushdown)                     589            600          11         26.7          37.5      17.9X
+Parquet Vectorized                                 10287          10449         144          1.5         654.0       1.0X
+Parquet Vectorized (Pushdown)                        467            494          20         33.7          29.7      22.0X
+Native ORC Vectorized                               6781           6848          58          2.3         431.1       1.5X
+Native ORC Vectorized (Pushdown)                     428            440          10         36.8          27.2      24.1X

Review comment:
       No. GitHub Actions runs on different machines, and there is a performance difference between them.






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042


   > No. GitHub Actions runs on different machines, and there is a performance difference between them.
   
   No, @wangyum. I mean the **ratio** between ORC and Parquet within the same run on the same machine. Previously, ORC and Parquet showed similar performance, but now Parquet has become slower than ORC after this PR. For example, see the following.
   
   ```
   - Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
   - Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
   - Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
   - Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
   + Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
   + Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
   + Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
   + Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
   ```
   
   Although the values are small, this generated result shows a slowdown of Parquet compared with ORC. That was my question.
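
   To make the ratio in question concrete, here is a small sketch that only redoes the arithmetic on the Best Time column of the pushdown rows quoted above. The `Run` case class and the `RatioSketch` object are hypothetical names introduced for illustration; the numbers are copied from the quoted benchmark output, not measured again.

   ```scala
   object RatioSketch {
     // Best Time (ms) for the pushdown rows of one benchmark run.
     final case class Run(parquet: Double, orc: Double) {
       // A value above 1.0 means Parquet took longer than ORC in that run.
       def parquetOverOrc: Double = parquet / orc
     }

     // Values copied from the "-" (before) and "+" (after) rows quoted above.
     val pushdownBefore: Run = Run(parquet = 596, orc = 592)
     val pushdownAfter: Run = Run(parquet = 493, orc = 436)

     def main(args: Array[String]): Unit = {
       println(s"before: ${pushdownBefore.parquetOverOrc}")
       println(s"after:  ${pushdownAfter.parquetOverOrc}")
     }
   }
   ```

   The ratio is about 1.01 before (596 / 592) and about 1.13 after (493 / 436). Because both engines are measured in the same run, a difference between machines alone would not be expected to change this ratio.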




[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803818662


   Shall we do this at an earlier place so that it applies to both Hive partition pruning and data predicates?




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809862832


   **[Test build #136682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736356


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36821/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841186474


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43078/
   




[GitHub] [spark] dongjoon-hyun closed pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642


   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199


   **[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300576


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835877675


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138307/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729927486








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835833242








[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738811606


   **[Test build #132235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841160451


   **[Test build #138557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834270229


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42769/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737596383


   **[Test build #132077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840259632


   @dongjoon-hyun Do you have more comments?




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199


   **[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738973065


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132235/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835836092


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42829/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809833487


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41258/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735523298


   **[Test build #131953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729275927


   **[Test build #131236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742603252


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37178/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834234834


   **[Test build #138247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841781338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138582/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809925233


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136682/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686782831








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r483336491



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>

Review comment:
       Hm, it's actually better not to rely on the optimizer. N = 5000000 in your benchmark looks a bit unrealistic, and the perf degradation doesn't seem very serious compared to the execution time.
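
For illustration only, the alternative suggested in this comment (keep the de-duplication local instead of trusting the optimizer, while still guarding against an empty list) could look roughly like the sketch below. `InPushdownSketch` and `valuesToPush` are hypothetical names, not part of `ParquetFilters`, and the real code works on `sources.In` rather than a plain `Seq`.

```scala
// Hypothetical helper; a sketch of the idea, not the actual ParquetFilters code.
object InPushdownSketch {
  /**
   * Returns the distinct values that could be pushed down, or None when the
   * list is empty or still exceeds the threshold after de-duplication.
   */
  def valuesToPush[T](values: Seq[T], threshold: Int): Option[Seq[T]] = {
    if (values.isEmpty) {
      // Nothing to inspect, so there is no `values.head` to type-check against.
      None
    } else {
      // De-duplicate once and reuse the result for both the check and the filter.
      val distinctValues = values.distinct
      if (distinctValues.length <= threshold) Some(distinctValues) else None
    }
  }
}
```

Computing `distinct` once and reusing it avoids the double call visible in the removed lines of the diff, which is a smaller change than dropping the de-duplication entirely.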






[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742566759


   **[Test build #132574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409169








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809917727


   **[Test build #136682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
    * This patch **fails Spark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742597635








[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628066242



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##########
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
       checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
       checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
         Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+      Seq(3, 20).foreach { threshold =>
+        withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {

Review comment:
       Done






[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982214


   Uh-oh, it seems like there's a logical conflict with https://github.com/apache/spark/pull/31776:
   
   ```
   [error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:489:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
   [error]     case ParquetSchemaType(DECIMAL, INT32, _, _) if pushDownDecimal =>
   [error]                           ^
   [error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:496:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
   [error]     case ParquetSchemaType(DECIMAL, INT64, _, _) if pushDownDecimal =>
   [error]                           ^
   [error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:503:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
   [error]     case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, length, _) if pushDownDecimal =>
   [error]                           ^
   [warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:127:39: [deprecation @ org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
   [warn]       && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
   [warn]                                       ^
   [warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWrite.scala:89:39: [deprecation @ org.apache.spark.sql.execution.datasources.v2.parquet.ParquetWrite.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
   [warn]       && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
   ```
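
The "wrong number of arguments for pattern" errors above arise when a constructor pattern lists a different number of sub-patterns than the case class has fields: here the class shown in the message has three fields, while the patterns supply four. A minimal, self-contained sketch of the arity rule follows. `PatternAritySketch` and `SchemaType` are hypothetical stand-ins with simplified `String`/`Int` fields, not the real `ParquetSchemaType`.

```scala
object PatternAritySketch {
  // Stand-in for the three-field shape reported in the error message.
  final case class SchemaType(logicalType: String, primitiveType: String, length: Int)

  def describe(t: SchemaType): String = t match {
    // Exactly three sub-patterns, one per constructor field.
    // Writing SchemaType("DECIMAL", "INT32", _, _) would not compile.
    case SchemaType("DECIMAL", "INT32", _) => "decimal stored as int32"
    case SchemaType("DECIMAL", "FIXED_LEN_BYTE_ARRAY", length) =>
      s"decimal stored in $length bytes"
    case _ => "other"
  }
}
```

Because each pull request compiles on its own, a mismatch like this only surfaces once both changes are on the same branch, which is why it shows up as a logical conflict rather than a textual merge conflict.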




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729777039


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300592








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43103/
   




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628899597



##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
-Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X

Review comment:
       Just a question. Do these results show a performance regression in Parquet? This PR touches Parquet only, and the ratios for Parquet and ORC were 17.6x and 17.8x respectively before. Now they are 19.9x and 22.4x. I expected them to stay similar, e.g. `19.x and 19.x` or `22.x and 22.x`.
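
As a reading aid for the figures quoted in this comment: assuming the `Relative` column is the first row's best time divided by each row's best time (an assumption, though it reproduces every ratio in the table above), both pushdown ratios are measured against the Parquet non-pushdown baseline of the same run. The sketch below recomputes them; `RelativeColumnSketch` and its methods are hypothetical names.

```scala
object RelativeColumnSketch {
  // Assumed definition: speedup of a case over the first (baseline) row.
  def relative(baselineBestMs: Double, caseBestMs: Double): Double =
    baselineBestMs / caseBestMs

  // Round to one decimal place, matching how the table prints the ratio.
  def oneDecimal(x: Double): Double = math.round(x * 10) / 10.0

  def main(args: Array[String]): Unit = {
    // Best times (ms) copied from the diff above.
    val beforeBaseline = 10512.0 // Parquet Vectorized, old run
    val afterBaseline = 9788.0   // Parquet Vectorized, new run

    println(s"before, Parquet pushdown: ${oneDecimal(relative(beforeBaseline, 596))}X")
    println(s"before, ORC pushdown:     ${oneDecimal(relative(beforeBaseline, 592))}X")
    println(s"after,  Parquet pushdown: ${oneDecimal(relative(afterBaseline, 493))}X")
    println(s"after,  ORC pushdown:     ${oneDecimal(relative(afterBaseline, 436))}X")
  }
}
```

Under that assumption, the gap between 19.9x and 22.4x reflects the ORC pushdown case improving more (592 ms to 436 ms) than the Parquet one (596 ms to 493 ms) between the two runs. The two runs also report different CPUs, so the sketch alone cannot attribute the difference to this PR.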






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729746475


   **[Test build #131289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559980








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716153233


   **[Test build #130244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130244/testReport)** for PR 29642 at commit [`c5ab656`](https://github.com/apache/spark/commit/c5ab6569f4b175066613d02b787dc8aaa83ca8d9).




[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803808606


   Is this patch still needed? IIRC, we already have this in Hive partition pruning.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882190


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41264/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737640996








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535237


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132295/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742735942


   **[Test build #132574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626143


   **[Test build #128266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735553226








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852315


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36835/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809818374


   **[Test build #136676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287


   Thank you for updating, @wangyum. As of the last commit, I agree that this PR does not appear to introduce a regression.
   
   One last question: could you point out what this PR improves as of the last commit? It is not clear to me. Do we need to add a specific benchmark case for your contribution?




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628899709



##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
-Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                           10406          10461          50          1.5         661.6       1.0X
-Parquet Vectorized (Pushdown)                                  619            641          22         25.4          39.4      16.8X
-Native ORC Vectorized                                         8787           8834          57          1.8         558.6       1.2X
-Native ORC Vectorized (Pushdown)                               592            608          11         26.6          37.6      17.6X
+Parquet Vectorized                                            9861           9880          16          1.6         626.9       1.0X
+Parquet Vectorized (Pushdown)                                  507            529          21         31.0          32.3      19.4X
+Native ORC Vectorized                                         6871           6938          63          2.3         436.8       1.4X
+Native ORC Vectorized (Pushdown)                               453            470          13         34.7          28.8      21.8X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 1 string row (value = '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10632          10694          60          1.5         676.0       1.0X
-Parquet Vectorized (Pushdown)                       608            635          22         25.9          38.6      17.5X
-Native ORC Vectorized                              8790           8838          37          1.8         558.9       1.2X
-Native ORC Vectorized (Pushdown)                    559            584          22         28.1          35.5      19.0X
+Parquet Vectorized                                10228          10471         167          1.5         650.3       1.0X
+Parquet Vectorized (Pushdown)                       511            519           5         30.8          32.5      20.0X
+Native ORC Vectorized                              6700           6865         119          2.3         426.0       1.5X
+Native ORC Vectorized (Pushdown)                    436            454          12         36.1          27.7      23.5X
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 1 string row (value <=> '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                 10529          10624          74          1.5         669.4       1.0X
-Parquet Vectorized (Pushdown)                        613            631          16         25.7          39.0      17.2X
-Native ORC Vectorized                               8746           8816          63          1.8         556.1       1.2X
-Native ORC Vectorized (Pushdown)                     589            600          11         26.7          37.5      17.9X
+Parquet Vectorized                                 10287          10449         144          1.5         654.0       1.0X
+Parquet Vectorized (Pushdown)                        467            494          20         33.7          29.7      22.0X
+Native ORC Vectorized                               6781           6848          58          2.3         431.1       1.5X
+Native ORC Vectorized (Pushdown)                     428            440          10         36.8          27.2      24.1X

Review comment:
       ditto. `17 vs 17` -> `22 vs 24`.
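
The `17 vs 17` and `22 vs 24` figures refer to the `Relative` column, which appears to be the baseline's time divided by each case's time. A rough check, using the Best Time(ms) values from the `value <=> '7864320'` tables above, might look like the sketch below. The object and helper names are hypothetical, and the benchmark harness presumably divides unrounded timings, so the last digit can differ from the reported column.

```scala
object RelativeSpeedup {
  // Hypothetical helper: baseline best time divided by a case's best time.
  def relative(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    // Best Time(ms) values copied from the removed (-) and added (+) rows above.
    val rows = Seq(
      "before: Parquet (Pushdown)" -> relative(10529, 613),
      "before: ORC (Pushdown)"     -> relative(10529, 589),
      "after:  Parquet (Pushdown)" -> relative(10287, 467),
      "after:  ORC (Pushdown)"     -> relative(10287, 428)
    )
    rows.foreach { case (name, r) => println(f"$name%-28s $r%.1fX") }
  }
}
```

Note that the CPU model also changed between the two runs (8171M to 8272CL), so the shift in these ratios cannot be attributed to the patch alone.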






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834491086


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138247/
   




[GitHub] [spark] gengliangwang commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731904386


   @wangyum @cloud-fan @HyukjinKwon 
   I have some concerns about this optimization. What if the range is huge and the filter becomes less selective? E.g.
   ```
   SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, Int.Max)
   ```
   =>
   ```
   SELECT * FROM t WHERE id >= 1 AND id <= ${Int.Max}
   ```
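
The selectivity concern can be illustrated with a minimal sketch. It assumes the rewrite collapses the `IN` list into its inclusive min/max bounds; `Bounds` and `toBounds` are hypothetical names used only for illustration and are not Spark or Parquet APIs.

```scala
object InToRange {
  // Hypothetical stand-in for a pushed-down range filter with inclusive bounds.
  final case class Bounds(min: Int, max: Int) {
    def contains(v: Int): Boolean = v >= min && v <= max
  }

  // Collapse an IN list into its bounding range; None for an empty list.
  def toBounds(values: Seq[Int]): Option[Bounds] =
    if (values.isEmpty) None else Some(Bounds(values.min, values.max))

  def main(args: Array[String]): Unit = {
    val inList = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, Int.MaxValue)
    val bounds = toBounds(inList).get
    // 10 is not in the IN list, yet the range filter cannot rule it out.
    println(s"IN matches 10: ${inList.contains(10)}")
    println(s"range matches 10: ${bounds.contains(10)}")
  }
}
```

With a single outlier such as `Int.MaxValue`, the range admits nearly every positive value, so row groups that the exact `IN` predicate could skip may still have to be read.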




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735560173








[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716246936


   cc @cloud-fan @HyukjinKwon  @gengliangwang




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626800








[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729746475


   **[Test build #131289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729300592








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809880752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136676/
   




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528436623



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&

Review comment:
       It seems that Parquet is the only format without good support for `In` predicate pushdown.
   Parquet vs ORC: https://github.com/apache/spark/blob/f5118f81e395bde0cd8253dbef6a9e6455c3958a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L439-L482
   CSV:
   https://github.com/apache/spark/pull/29642#issuecomment-730869008
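
For orientation, the removed branch in the hunk above turns an `In` predicate into an OR of equality filters when the value count is within a threshold. A self-contained sketch of that shape is shown below, without the `distinct` call and with an empty-list guard as mentioned in the PR description. The `Pred`, `Eq`, and `Or` types and the `pushDownIn` helper are hypothetical stand-ins, not Parquet's `FilterApi` or the actual patch (the added lines are truncated in the hunk and are not reproduced here).

```scala
object InPushdownSketch {
  // Hypothetical, minimal predicate model used only for illustration.
  sealed trait Pred
  final case class Eq(column: String, value: Any) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Build `col = v1 OR col = v2 OR ...` when the list is non-empty and small enough.
  def pushDownIn(column: String, values: Seq[Any], threshold: Int): Option[Pred] =
    if (values.isEmpty || values.length > threshold) {
      None // nothing to push down, or too many values to expand into ORs
    } else {
      val eqs: Seq[Pred] = values.map(v => Eq(column, v))
      eqs.reduceLeftOption((l: Pred, r: Pred) => Or(l, r))
    }

  def main(args: Array[String]): Unit = {
    println(pushDownIn("id", Seq(1, 2, 3), threshold = 10))
    println(pushDownIn("id", Seq.empty, threshold = 10))
    println(pushDownIn("id", Seq(1, 2, 3), threshold = 2))
  }
}
```

Checking `values.isEmpty` before touching the list avoids the failure mode of a guard such as `values.head`, which throws on an empty collection.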






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809836947


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41258/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735553226








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735560173








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716162655








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882190


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41264/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738672069


   **[Test build #132221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738736356


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36821/
   




[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841002294


   Yeah, these benchmark results were not updated in time. Let's post the benchmark results from before and after this PR in the PR description.




[GitHub] [spark] dongjoon-hyun closed pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642


   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835826320


   **[Test build #138307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841160451


   **[Test build #138557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832411899


   **[Test build #138164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841296880


   **[Test build #138557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138557/testReport)** for PR 29642 at commit [`27a2bf6`](https://github.com/apache/spark/commit/27a2bf615eb158c7c25aa5bfaa04caa939c237da).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
   val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
     buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
       .doc("The maximum number of values to filter push-down optimization for IN predicate. " +
-        "Large threshold won't necessarily provide much better performance. " +
-        "The experiment argued that 300 is the limit threshold. " +
+        "Spark will push-down a value greater than or equal to its minimum value and " +

Review comment:
       I think the default value `10` is too small here. What is the default threshold in Impala?






[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
   val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
     buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
       .doc("The maximum number of values to filter push-down optimization for IN predicate. " +
-        "Large threshold won't necessarily provide much better performance. " +
-        "The experiment argued that 300 is the limit threshold. " +
+        "Spark will push-down a value greater than or equal to its minimum value and " +

Review comment:
       I think the default value `10` is small here. What is the default threshold in Impala?
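
To make the role of the threshold concrete, here is a minimal sketch of the decision the config appears to control. It is an illustration under stated assumptions, not the code in this PR: `Pred` and its case classes are a hypothetical, simplified filter model (not Spark's or Parquet's classes), `pushDownIn` is a made-up helper, and the upper bound on the maximum value is assumed because the doc string in the diff above is cut off after "minimum value and".

```scala
// Hypothetical, simplified filter model; these are NOT Spark's or Parquet's real classes.
sealed trait Pred
case class Eq(col: String, v: Int) extends Pred
case class GtEq(col: String, v: Int) extends Pred
case class LtEq(col: String, v: Int) extends Pred
case class Or(left: Pred, right: Pred) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object InPushdownSketch {
  // Decide what to push down for `col IN (values)`.
  // Returns None for an empty IN list, so nothing is pushed down in that case.
  def pushDownIn(col: String, values: Seq[Int], threshold: Int): Option[Pred] = {
    if (values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      // Few values: push down an OR chain of equality predicates.
      Some(values.map(v => Eq(col, v): Pred).reduceLeft(Or(_, _)))
    } else {
      // Many values (assumed behavior): push down only the enclosing range.
      Some(And(GtEq(col, values.min), LtEq(col, values.max)))
    }
  }
}
```

With a threshold of `10`, an IN list of 11 values would already fall back to the coarser range predicate in this sketch, which is why the choice of default matters.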






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809880752


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136676/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535237


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132295/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735495202








[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008


   It seems that only Parquet does not support `In` predicate pushdown well. @MaxGekk What do you think?
   
   This is the benchmark of CSV:
   ```scala
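   // Snippet meant to run inside a SqlBasedBenchmark such as CSVBenchmark, which is
   // assumed to supply spark, output, withTempPath, withSQLConf, noop() and $"...".
   import java.time.Instant
   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.lit
   import org.apache.spark.sql.internal.SQLConf
   import org.apache.spark.sql.types.{IntegerType, StructField, StructType, TimestampType}

   val rowsNum = 100 * 1000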
   val numIters = 3
   val colsNum = 100
   val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
   val schema = StructType(StructField("key", IntegerType) +: fields)
   def columns(): Seq[Column] = {
     val ts = Seq.tabulate(colsNum) { i =>
       lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
     }
     ($"id" % 1000).as("key") +: ts
   }
   withTempPath { path =>
     spark.range(rowsNum).select(columns(): _*)
       .write.option("header", true)
       .csv(path.getAbsolutePath)
     def readback = {
       spark.read
         .option("header", true)
         .schema(schema)
         .csv(path.getAbsolutePath)
     }
   
     def withFilter(filter: String, configEnabled: Boolean): Unit = {
       withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
         readback.filter(filter).noop()
       }
     }
   
     Seq(5, 10, 50, 100, 500).foreach { count =>
       Seq(10, 50).foreach { distribution =>
         val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
         val benchmark = new Benchmark(title, rowsNum, output = output)
         Seq(false, true).foreach { pushDownEnabled =>
           val name = s"Native CSV Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
           benchmark.addCase(name, numIters) { _ =>
             val filter =
               Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
             val whereExpr = s"key in(${filter.mkString(",")})"
             withFilter(whereExpr, configEnabled = pushDownEnabled)
           }
         }
         benchmark.run()
       }
     }
   }
   ```
   
   Result:
   ```
   ================================================================================================
   Benchmark to measure CSV read performance
   ================================================================================================
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           13082          17077        1674          0.0      130815.6       1.0X
   Native CSV Vectorized (Pushdown)                                 1172           1192          35          0.1       11719.5      11.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           11858          12028         237          0.0      118576.9       1.0X
   Native CSV Vectorized (Pushdown)                                 1165           1172           6          0.1       11652.4      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11883          12180         494          0.0      118834.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1142           1156          21          0.1       11418.6      10.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11857          11878          19          0.0      118570.4       1.0X
   Native CSV Vectorized (Pushdown)                                  1169           1174           7          0.1       11692.9      10.1X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11923          11962          66          0.0      119228.0       1.0X
   Native CSV Vectorized (Pushdown)                                  1196           1225          26          0.1       11960.7      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11910          11917           7          0.0      119095.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1191           1194           5          0.1       11908.0      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11948          12097         201          0.0      119484.5       1.0X
   Native CSV Vectorized (Pushdown)                                   1250           1284          32          0.1       12501.4       9.6X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11938          11978          39          0.0      119378.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1176           1188          11          0.1       11756.0      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11954          12051         124          0.0      119542.9       1.0X
   Native CSV Vectorized (Pushdown)                                   1762           1833         104          0.1       17620.6       6.8X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11860          12166         484          0.0      118597.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1417           1434          15          0.1       14171.7       8.4X
   ```
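
For readers checking the tables above, the derived columns can be reproduced from the best times. The following is a small sketch, assuming that `Relative` is the first case's best time divided by each case's best time and that `Per Row(ns)` is the best time divided by the number of rows per iteration. `BenchmarkColumns`, `relative` and `perRowNs` are hypothetical names introduced for illustration, not part of Spark's `Benchmark` API.

```scala
import java.util.Locale

object BenchmarkColumns {
  // "Relative": baseline (first case) best time over this case's best time.
  def relative(baselineBestMs: Double, bestMs: Double): String =
    String.format(Locale.ROOT, "%.1fX", Double.box(baselineBestMs / bestMs))

  // "Per Row(ns)": best time converted to nanoseconds, spread over the rows processed.
  def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows
}
```

For the first table, `BenchmarkColumns.relative(13082, 1172)` gives `11.2X`, matching the reported value. `perRowNs` only approximates the printed column, because the best times shown in the table are rounded to whole milliseconds.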




[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287


   Thank you for updating, @wangyum. As of the last commit, yes, I agree that this PR does not appear to introduce a regression.
   
   One last question: could you point out what improvement the last commit of this PR brings? It's not clear to me from that commit.




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621198








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43103/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832328455


   **[Test build #138164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714








[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982214








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738672069


   **[Test build #132221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841183742








[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735483068


   **[Test build #131941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735495202








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735559996


   **[Test build #131953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738779564


   **[Test build #132221 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132221/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735508307








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166190


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34845/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716162655








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835877675


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138307/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312386


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131236/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737640409


   **[Test build #132077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132077/testReport)** for PR 29642 at commit [`af9d7d6`](https://github.com/apache/spark/commit/af9d7d66d1a3c221163f56ba322ec277b4498fed).
    * This patch **fails PySpark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841986529


   Oops.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312310


   **[Test build #131236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738661573


   ```scala
   package org.apache.spark.sql.execution.benchmark
   
   import java.io.File
   
   import scala.util.Random
   
   import org.apache.spark.SparkConf
   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.monotonically_increasing_id
   
   /**
    * Benchmark to measure read performance with InSet filter pushdown.
    * To run this benchmark:
    * {{{
    *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>
    *   2. build/sbt "sql/test:runMain <this class>"
    *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    *      Results will be written to "benchmarks/InSetFilterPushdownBenchmark-results.txt".
    * }}}
    */
   object InSetFilterPushdownBenchmark extends SqlBasedBenchmark {
   
     override def getSparkSession: SparkSession = {
       val conf = new SparkConf()
         .setAppName(this.getClass.getSimpleName)
         // `spark.master` is always set, so override it explicitly
         .set("spark.master", "local[1]")
         .setIfMissing("spark.driver.memory", "3g")
         .setIfMissing("orc.compression", "snappy")
         .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
   
       SparkSession.builder().config(conf).getOrCreate()
     }
   
     private val numRows = 1024 * 1024 * 15
     private val width = 5
     // For Parquet/ORC, we will use the same value for block size and compression size
     private val blockSize = org.apache.parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE
   
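     // The tables are created with saveAsTable, so drop them as tables (not temp views) on exit.
     def withTempTable(tableNames: String*)(f: => Unit): Unit = {
       try f finally tableNames.foreach(name => spark.sql(s"DROP TABLE IF EXISTS $name"))
     }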
   
     private def prepareTable(dir: File, numRows: Int): Unit = {
       import spark.implicits._
       val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
       val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
         .withColumn("value", monotonically_increasing_id())
         .sort("value")
   
       df.write.mode("overwrite")
         .option("orc.compress.size", blockSize)
         .option("orc.stripe.size", blockSize).format("orc").saveAsTable("orcTable")
   
       df.write.mode("overwrite")
         .option("parquet.block.size", blockSize).format("parquet").saveAsTable("parquetTable")
   
       df.write.mode("overwrite").format("csv").saveAsTable("csvTable")
     }
   
     def filterPushDownBenchmark(
          values: Int,
          title: String,
          whereExpr: String,
          selectExpr: String = "*"): Unit = {
       val benchmark = new Benchmark(title, values, minNumIters = 5, output = output)
   
       // Int.MaxValue effectively disables the rewrite; 10 enables it for every IN list used here.
       Seq(Int.MaxValue, 10).foreach { threshold =>
         val name = s"Parquet ${if (threshold == 10) "(Rewrite InSet)" else ""}"
         benchmark.addCase(name) { _ =>
           withSQLConf("spark.sql.optimizer.inSetRewriteMinMaxThreshold" -> s"$threshold") {
             spark.sql(s"SELECT $selectExpr FROM parquetTable WHERE $whereExpr").noop()
           }
         }
       }
   
       Seq(Int.MaxValue, 10).foreach { threshold =>
         val name = s"ORC ${if (threshold == 10) "(Rewrite InSet)" else ""}"
         benchmark.addCase(name) { _ =>
           withSQLConf("spark.sql.optimizer.inSetRewriteMinMaxThreshold" -> s"$threshold") {
             spark.sql(s"SELECT $selectExpr FROM orcTable WHERE $whereExpr").noop()
           }
         }
       }
   
       Seq(Int.MaxValue, 10).foreach { threshold =>
         val name = s"CSV ${if (threshold == 10) "(Rewrite InSet)" else ""}"
         benchmark.addCase(name) { _ =>
           withSQLConf("spark.sql.optimizer.inSetRewriteMinMaxThreshold" -> s"$threshold") {
             spark.sql(s"SELECT $selectExpr FROM csvTable WHERE $whereExpr").noop()
           }
         }
       }
   
       benchmark.run()
     }
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
   
       runBenchmark("Pushdown benchmark for rewrite InSet") {
         withTempPath { dir =>
           withTempTable("orcTable", "parquetTable", "csvTable") {
             prepareTable(dir, numRows)
             Seq(50, 1000, 5000, 20000).foreach { count =>
               Seq(1, 10, 50, 90).foreach { distribution =>
                 val filter =
                   Range(0, count).map(r => scala.util.Random.nextInt(numRows * distribution / 100))
                 val whereExpr = s"value in(${filter.mkString(",")})"
                 val title = s"Rewrite InSet (values count: $count, distribution: $distribution)"
                 filterPushDownBenchmark(numRows, title, whereExpr)
               }
             }
           }
         }
       }
     }
   }
   ```
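   
   To make the `distribution` parameter concrete, here is a rough sketch of how much of the `value` column a min/max range would cover for each setting. It assumes the rewrite replaces a large `InSet` with something like `value >= min AND value <= max` (inferred from the name `spark.sql.optimizer.inSetRewriteMinMaxThreshold`, not stated in this comment) and that `value` spans roughly `0 until numRows`. `InSetRangeSketch`, `minMax` and `coveredFraction` are hypothetical names used only for illustration.
   
   ```scala
   import scala.util.Random
   
   object InSetRangeSketch {
     // Hypothetical helper: bounds of the IN-list values, or None for an empty list.
     def minMax(values: Seq[Int]): Option[(Int, Int)] =
       if (values.isEmpty) None else Some((values.min, values.max))
   
     // Fraction of a table with ids in [0, numRows) that falls inside [lo, hi].
     def coveredFraction(lo: Int, hi: Int, numRows: Int): Double =
       (hi - lo + 1).toDouble / numRows
   
     def main(args: Array[String]): Unit = {
       val numRows = 1024 * 1024 * 15 // same row count as the benchmark above
       val rng = new Random(42)       // fixed seed so repeated runs are comparable
       Seq(1, 10, 50, 90).foreach { distribution =>
         // Same value generation as the benchmark: ids drawn from the first `distribution` percent.
         val upper = numRows * distribution / 100
         val values = Seq.fill(50)(rng.nextInt(upper))
         minMax(values).foreach { case (lo, hi) =>
           val covered = coveredFraction(lo, hi, numRows) * 100
           println(f"distribution=$distribution%2d min=$lo max=$hi covered=$covered%.1f%%")
         }
       }
     }
   }
   ```
   
   Under these assumptions the covered range can never exceed `distribution` percent of the table, so a smaller `distribution` should leave more data that min/max statistics can skip.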
   
   Result:
   ```
   ================================================================================================
   Pushdown benchmark for rewrite InSet
   ================================================================================================
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 50, distribution: 1):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                     8289           8371          68          1.9         527.0       1.0X
   Parquet (Rewrite InSet)                                      598            614          14         26.3          38.0      13.9X
   ORC                                                          442            454          20         35.6          28.1      18.8X
   ORC (Rewrite InSet)                                          411            431          20         38.2          26.1      20.2X
   CSV                                                        23399          23618         154          0.7        1487.7       0.4X
   CSV (Rewrite InSet)                                        23437          24070         744          0.7        1490.1       0.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                      8191           8244          50          1.9         520.8       1.0X
   Parquet (Rewrite InSet)                                      1166           1178          13         13.5          74.1       7.0X
   ORC                                                           500            521          16         31.5          31.8      16.4X
   ORC (Rewrite InSet)                                           514            526           8         30.6          32.7      15.9X
   CSV                                                         23447          23704         316          0.7        1490.7       0.3X
   CSV (Rewrite InSet)                                         23639          23821         153          0.7        1502.9       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                      8157           8233          56          1.9         518.6       1.0X
   Parquet (Rewrite InSet)                                      4224           4257          42          3.7         268.6       1.9X
   ORC                                                           513            536          25         30.7          32.6      15.9X
   ORC (Rewrite InSet)                                           511            530          18         30.8          32.5      16.0X
   CSV                                                         23665          24270         795          0.7        1504.6       0.3X
   CSV (Rewrite InSet)                                         23321          23596         221          0.7        1482.7       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 50, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                      8225           8335          84          1.9         522.9       1.0X
   Parquet (Rewrite InSet)                                      7138           7218         115          2.2         453.8       1.2X
   ORC                                                           526            559          36         29.9          33.4      15.6X
   ORC (Rewrite InSet)                                           507            538          24         31.1          32.2      16.2X
   CSV                                                         23411          23731         496          0.7        1488.4       0.4X
   CSV (Rewrite InSet)                                         23470          23546          82          0.7        1492.2       0.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 1000, distribution: 1):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                       8744           8845          90          1.8         555.9       1.0X
   Parquet (Rewrite InSet)                                        650            656           4         24.2          41.3      13.5X
   ORC                                                            535            559          16         29.4          34.0      16.4X
   ORC (Rewrite InSet)                                            532            551          16         29.5          33.9      16.4X
   CSV                                                          30467          32289        1496          0.5        1937.0       0.3X
   CSV (Rewrite InSet)                                          23981          24614         596          0.7        1524.7       0.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 1000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        8383           8468          82          1.9         533.0       1.0X
   Parquet (Rewrite InSet)                                        1351           1362           9         11.6          85.9       6.2X
   ORC                                                            1048           1069          19         15.0          66.6       8.0X
   ORC (Rewrite InSet)                                            1052           1071          28         15.0          66.9       8.0X
   CSV                                                           30950          32767        1238          0.5        1967.7       0.3X
   CSV (Rewrite InSet)                                           24209          24513         396          0.6        1539.2       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 1000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        8402           8481          55          1.9         534.2       1.0X
   Parquet (Rewrite InSet)                                        4532           4677         186          3.5         288.1       1.9X
   ORC                                                            2621           2659          46          6.0         166.6       3.2X
   ORC (Rewrite InSet)                                            2631           2738         193          6.0         167.2       3.2X
   CSV                                                           30098          30226          79          0.5        1913.6       0.3X
   CSV (Rewrite InSet)                                           27913          28481         693          0.6        1774.7       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 1000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        8420           8468          65          1.9         535.4       1.0X
   Parquet (Rewrite InSet)                                        7621           7781         191          2.1         484.5       1.1X
   ORC                                                            3108           3167          53          5.1         197.6       2.7X
   ORC (Rewrite InSet)                                            3089           3175          59          5.1         196.4       2.7X
   CSV                                                           30555          32254        1187          0.5        1942.6       0.3X
   CSV (Rewrite InSet)                                           31091          31607         480          0.5        1976.7       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 5000, distribution: 1):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                       9125           9170          55          1.7         580.2       1.0X
   Parquet (Rewrite InSet)                                       1206           1234          18         13.0          76.6       7.6X
   ORC                                                           1244           1254           7         12.6          79.1       7.3X
   ORC (Rewrite InSet)                                           1236           1250          12         12.7          78.6       7.4X
   CSV                                                         350424         355583        1016          0.0       22279.3       0.0X
   CSV (Rewrite InSet)                                          28577          28875         458          0.6        1816.9       0.3X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 5000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        9162           9408         253          1.7         582.5       1.0X
   Parquet (Rewrite InSet)                                        1911           1930          13          8.2         121.5       4.8X
   ORC                                                            1774           1809          41          8.9         112.8       5.2X
   ORC (Rewrite InSet)                                            1769           1785          24          8.9         112.5       5.2X
   CSV                                                          364909         368618         NaN          0.0       23200.3       0.0X
   CSV (Rewrite InSet)                                           58985          59425         287          0.3        3750.1       0.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 5000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        9218           9499         173          1.7         586.1       1.0X
   Parquet (Rewrite InSet)                                        5109           5139          32          3.1         324.8       1.8X
   ORC                                                            4089           4137          72          3.8         260.0       2.3X
   ORC (Rewrite InSet)                                            4056           4121          93          3.9         257.9       2.3X
   CSV                                                          359994         364490         790          0.0       22887.8       0.0X
   CSV (Rewrite InSet)                                          196472         202225         721          0.1       12491.4       0.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 5000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        9147           9247          64          1.7         581.6       1.0X
   Parquet (Rewrite InSet)                                        8369           8520         179          1.9         532.1       1.1X
   ORC                                                            6267           6305          47          2.5         398.4       1.5X
   ORC (Rewrite InSet)                                            6289           6435         199          2.5         399.8       1.5X
   CSV                                                          369254         371915         697          0.0       23476.6       0.0X
   CSV (Rewrite InSet)                                          326837         329082         NaN          0.0       20779.7       0.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 20000, distribution: 1):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                       11866          11944         105          1.3         754.4       1.0X
   Parquet (Rewrite InSet)                                        3578           3670          81          4.4         227.5       3.3X
   ORC                                                            4119           4152          33          3.8         261.9       2.9X
   ORC (Rewrite InSet)                                            4054           4181          84          3.9         257.7       2.9X
   CSV                                                         2319345        2350577         153          0.0      147460.0       0.0X
   CSV (Rewrite InSet)                                           55273          56287         821          0.3        3514.2       0.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 20000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        12194          12287          91          1.3         775.3       1.0X
   Parquet (Rewrite InSet)                                         4442           4479          42          3.5         282.4       2.7X
   ORC                                                             4805           4847          53          3.3         305.5       2.5X
   ORC (Rewrite InSet)                                             4746           4838          94          3.3         301.7       2.6X
   CSV                                                          2958262        2979920         967          0.0      188081.2       0.0X
   CSV (Rewrite InSet)                                           322782         329114        1177          0.0       20521.9       0.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 20000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        12138          12205          69          1.3         771.7       1.0X
   Parquet (Rewrite InSet)                                         7760           7901         160          2.0         493.3       1.6X
   ORC                                                             7072           7263         148          2.2         449.6       1.7X
   ORC (Rewrite InSet)                                             7094           7225          87          2.2         451.0       1.7X
   CSV                                                          2906664        2948342         220          0.0      184800.7       0.0X
   CSV (Rewrite InSet)                                          1367893        1393413        1348          0.0       86968.3       0.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_221-b11 on Linux 3.10.0-957.10.1.el7.x86_64
   Intel Core Processor (Broadwell, IBRS)
   Rewrite InSet (values count: 20000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------
   Parquet                                                        12230          12593         387          1.3         777.5       1.0X
   Parquet (Rewrite InSet)                                        11262          11580         263          1.4         716.0       1.1X
   ORC                                                             9712           9794          75          1.6         617.5       1.3X
   ORC (Rewrite InSet)                                             9658           9763         109          1.6         614.1       1.3X
   CSV                                                          2776344        2807999        1140          0.0      176515.2       0.0X
   CSV (Rewrite InSet)                                          2506408        2519162         802          0.0      159353.1       0.0X
   
   ```
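For reading the tables above: the `Relative` column appears to be the first row's best time divided by each row's best time. The sketch below recomputes it for the `distribution: 10` table. The object and method names are made up for illustration, and the formula is inferred from the numbers shown rather than taken from the benchmark source.

```scala
object RelativeColumn {
  // Best Time(ms) values copied from the "values count: 20000, distribution: 10" table.
  val bestTimesMs: Seq[(String, Double)] = Seq(
    "Parquet" -> 12194.0,
    "Parquet (Rewrite InSet)" -> 4442.0,
    "ORC" -> 4805.0,
    "ORC (Rewrite InSet)" -> 4746.0
  )

  // Assumed formula: baseline best time divided by this row's best time.
  def relative(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    val baseline = bestTimesMs.head._2
    bestTimesMs.foreach { case (name, ms) =>
      println(f"$name%-26s ${relative(baseline, ms)}%.1fX")
    }
  }
}
```

Under that assumption, the Parquet rewrite row works out to about 2.7 times the baseline, matching the `2.7X` reported in the table.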


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166198








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686782135


   **[Test build #128266 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r526092610



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+        values.nonEmpty && canMakeFilterOn(name, values.head) =>
+        if (values.length <= pushDownInFilterThreshold) {
+          values.flatMap { v =>
+            makeEq.lift(nameToParquetField(name).fieldType)
+              .map(_(nameToParquetField(name).fieldNames, v))
+          }.reduceLeftOption(FilterApi.or)
+        } else {
+          sparkSchema.find { f =>
+            if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+          }.map(_.dataType) match {
+            case Some(dataType) =>
+              val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+              createFilterHelper(
+                sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+                  sources.LessThanOrEqual(name, sortedValues.last)),
+                canPartialPushDownConjuncts)

Review comment:
       Ah, then can we turn it into a util method and use it in all the filter pushdown places?
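A minimal sketch of what such a util method might look like, modeled on the diff above. The filter case classes here are simplified stand-ins, not Spark's `sources.*` classes, and `rewriteIn` is a hypothetical name. It assumes integer values and assumes the exact IN predicate is still evaluated on the rows that survive the pushed-down filter, since the range is only a superset of the IN list.

```scala
object InToRange {
  // Simplified stand-ins for data source filters (hypothetical, not Spark's classes).
  sealed trait Filter
  final case class EqualTo(name: String, value: Int) extends Filter
  final case class GreaterThanOrEqual(name: String, value: Int) extends Filter
  final case class LessThanOrEqual(name: String, value: Int) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter
  final case class Or(left: Filter, right: Filter) extends Filter

  // Rewrites `name IN (values)` into a filter that is cheap to push down:
  //  - no filter when pushdown is disabled or the list is empty,
  //  - a chain of equality checks when the list fits within the threshold,
  //  - otherwise a [min, max] range that covers every value in the list.
  def rewriteIn(name: String, values: Seq[Int], threshold: Int): Option[Filter] = {
    if (threshold <= 0 || values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      values.map(v => EqualTo(name, v): Filter).reduceLeftOption(Or(_, _))
    } else {
      Some(And(GreaterThanOrEqual(name, values.min), LessThanOrEqual(name, values.max)))
    }
  }
}
```

For example, `InToRange.rewriteIn("a", Seq(5, 1, 9), threshold = 2)` falls into the last branch and yields a range from 1 to 9 on column `a`.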






[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809839180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41258/
   




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528457604



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
   val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
     buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
       .doc("The maximum number of values to filter push-down optimization for IN predicate. " +
-        "Large threshold won't necessarily provide much better performance. " +
-        "The experiment argued that 300 is the limit threshold. " +
+        "Spark will push-down a value greater than or equal to its minimum value and " +

Review comment:
       Impala only optimizes it to `>= minimum value` and `<= maximum value`: https://github.com/apache/impala/commit/aa05c6493b0ff8bbf422a4c38cf780bde34d51c7
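To illustrate why a min/max range can still help, here is a toy model of row-group skipping based on column statistics. It is a sketch under stated assumptions, not Parquet's or Impala's actual implementation: `RowGroupStats` and `mightMatch` are hypothetical names, and the statistics are invented for the example.

```scala
object RowGroupPruning {
  // Hypothetical per-row-group statistics for one integer column.
  final case class RowGroupStats(min: Int, max: Int)

  // A row group may contain matches only when its [min, max] overlaps
  // the [lower, upper] range derived from the IN list.
  def mightMatch(stats: RowGroupStats, lower: Int, upper: Int): Boolean =
    stats.max >= lower && stats.min <= upper

  def main(args: Array[String]): Unit = {
    val inValues = Seq(120, 135, 180)
    val (lower, upper) = (inValues.min, inValues.max)
    val groups = Seq(RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299))
    // Only the 100..199 group overlaps the 120..180 range.
    println(groups.filter(mightMatch(_, lower, upper)))
  }
}
```

The narrower the IN list's range relative to the data, the more row groups are skipped, which is consistent with the larger speedups reported for the narrower distributions in the benchmarks.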






[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835819039


   Retest this please.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716155498


   **[Test build #130245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130245/testReport)** for PR 29642 at commit [`0169114`](https://github.com/apache/spark/commit/0169114d7f71d3a1fc63cf9faa114cff4b415077).




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743109098


   Real production test case:
   Before this PR | After this PR
   --- | ---
   ![image](https://user-images.githubusercontent.com/5399861/101891559-2d5fdb00-3bdd-11eb-8dc3-8e5854654660.png) | ![image](https://user-images.githubusercontent.com/5399861/101891620-436d9b80-3bdd-11eb-9290-c6226e76b7c2.png)
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188402








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188402








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042


   > No. Github action runs on different machines, there is a performance difference between them.
   
   No, @wangyum. I mean the **ratio** between ORC and Parquet within the same machine run. Previously, ORC and Parquet showed similar performance, but now Parquet has become slower than ORC after this PR. For example:
   
   ```
   - Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
   - Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
   - Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
   - Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
   + Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
   + Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
   + Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
   + Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
   ```
   
   Although the values are small, this generated result shows a slowdown of Parquet relative to ORC. That was [my question](https://github.com/apache/spark/pull/29642#discussion_r628899597).
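The ratio in question can be recomputed from the best-time column of the diff above. The following is a small illustrative sketch. The object and method names are hypothetical, and the only inputs are the numbers quoted in this comment.

```scala
object OrcParquetRatio {
  // Best Time(ms) values copied from the diff above, as (Parquet, ORC) pairs.
  val beforeNoPushdown: (Double, Double) = (10512.0, 8555.0)
  val afterNoPushdown: (Double, Double) = (9788.0, 6487.0)
  val beforePushdown: (Double, Double) = (596.0, 592.0)
  val afterPushdown: (Double, Double) = (493.0, 436.0)

  // How many times slower Parquet is than ORC within the same run.
  def parquetOverOrc(pair: (Double, Double)): Double = pair._1 / pair._2

  def main(args: Array[String]): Unit = {
    println(f"no pushdown: ${parquetOverOrc(beforeNoPushdown)}%.2f -> ${parquetOverOrc(afterNoPushdown)}%.2f")
    println(f"pushdown:    ${parquetOverOrc(beforePushdown)}%.2f -> ${parquetOverOrc(afterPushdown)}%.2f")
  }
}
```

With these numbers the Parquet-to-ORC ratio moves from roughly 1.23 to 1.51 without pushdown and from roughly 1.01 to 1.13 with pushdown, even though both formats got faster in absolute terms.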




[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042


   > No. Github action runs on different machines, there is a performance difference between them.
   
   No, @wangyum. I mean the **ratio** between ORC and Parquet within the same machine run. Previously, ORC and Parquet showed similar performance, but now Parquet has become slower than ORC after this PR. For example:
   
   ```
   - Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
   - Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
   - Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
   - Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
   + Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
   + Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
   + Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
   + Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
   ```




[GitHub] [spark] github-actions[bot] commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803695381


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-803815504


   This patch pushes down a min/max range filter on the data column when the number of `InSet` values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. The benchmark and its results are linked here:
   
   https://github.com/apache/spark/blob/3aa659ce29877f386a24da9d04e66069d04afaa8/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L281-L296
   
   Before:
   https://github.com/apache/spark/blob/f5118f81e395bde0cd8253dbef6a9e6455c3958a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L430-L482
   After:
   https://github.com/apache/spark/blob/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L439-L482




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739511388


   **[Test build #132295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852288


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36835/
   




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r493226163



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>

Review comment:
       @HyukjinKwon @gengliangwang If we do not rely on the optimizer, we should add an empty check. Otherwise, `values.head` will throw an exception.
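A small standalone sketch of the failure mode and the guard. `canMakeFilterOn` here is a simplified stand-in for the method of the same name in `ParquetFilters`, and the object and other method names are hypothetical.

```scala
import scala.util.Try

object EmptyValuesGuard {
  // Simplified stand-in for the `canMakeFilterOn(name, values.head)` check.
  def canMakeFilterOn(value: Any): Boolean = value != null

  // Unguarded: evaluates `values.head` unconditionally.
  def unguarded(values: Array[Any]): Boolean = canMakeFilterOn(values.head)

  // Guarded: `nonEmpty` short-circuits before `head` is evaluated.
  def guarded(values: Array[Any]): Boolean = values.nonEmpty && canMakeFilterOn(values.head)

  def main(args: Array[String]): Unit = {
    val empty = Array.empty[Any]
    println(Try(unguarded(empty)).isFailure) // the unguarded call throws on an empty array
    println(guarded(empty))                  // the guard returns false instead of throwing
  }
}
```

Placing `values.nonEmpty` before the `values.head` access, as the updated pattern guard in the PR does, relies on `&&` short-circuiting.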






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716190940


   **[Test build #130245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130245/testReport)** for PR 29642 at commit [`0169114`](https://github.com/apache/spark/commit/0169114d7f71d3a1fc63cf9faa114cff4b415077).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731937316


   It improves performance in most cases, based on the [benchmark result](https://github.com/apache/spark/blob/b8cb1f48f1b38d74475c067a28426502c4e4a87a/sql/core/benchmarks/FilterPushdownBenchmark-results.txt#L457-L482):
   
   100 values | Relative
   -- | --
   Top 10% of data | 6.6X
   Top 50% of data | 1.9X
   Top 90% of data | 1.1X
   
   
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735508307








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841759928


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43103/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834491086


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138247/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809879605


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41264/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809818374


   **[Test build #136676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716153233








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841186474


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43078/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832412736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138164/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621198








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729792609








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738561954


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36785/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521014


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36896/
   




[GitHub] [spark] dongjoon-hyun commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841741287








[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835972741


   @dongjoon-hyun This PR only improves the `In` predicate. I have added the improvement details to the PR description.
   
   




[GitHub] [spark] cloud-fan commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730359987


   Shall we implement the logic in `FileSourceStrategy`? Then it would not be Parquet-only.




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626800








[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716191230








[GitHub] [spark] wangyum closed pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642


   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-686626143


   **[Test build #128266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128266/testReport)** for PR 29642 at commit [`648e8e5`](https://github.com/apache/spark/commit/648e8e58a552ab2072123cdefbe5d106091ce293).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729312379


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832328455


   **[Test build #138164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138164/testReport)** for PR 29642 at commit [`f269f8d`](https://github.com/apache/spark/commit/f269f8d9d883e96182ff363276b589584a109aad).




[GitHub] [spark] wangyum edited a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743109098


   Real production case, `InSet` size = 1918:
   Before this PR | After this PR
   --- | ---
   ![image](https://user-images.githubusercontent.com/5399861/101891559-2d5fdb00-3bdd-11eb-8dc3-8e5854654660.png) | ![image](https://user-images.githubusercontent.com/5399861/101891620-436d9b80-3bdd-11eb-9290-c6226e76b7c2.png)
   
   Table statistics:
   ```
   +-------------+-----------------+-----------------+--+
   |  count(1)   | min(SELLER_ID)  | max(SELLER_ID)  |
   +-------------+-----------------+-----------------+--+
   | 8344448448  | 9               | 2234460898      |
   +-------------+-----------------+-----------------+--+
   ```
   Query statistics:
   ```
   +-----------+-----------------+-----------------+--+
   | count(1)  | min(SELLER_ID)  | max(SELLER_ID)  |
   +-----------+-----------------+-----------------+--+
   | 33978532  | 153377548       | 2180252014      |
   +-----------+-----------------+-----------------+--+
   ```




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528432010



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&

Review comment:
       If this is supposed to be beneficial for other sources too, I think it makes more sense to push it down for them as well.
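
For readers following the hunk above, here is a minimal, self-contained sketch of the general shape of an `In` pushdown: guard against an empty value list, compare the list length with a threshold, and fold per-value equality predicates into an OR chain. The `Pred`, `Eq`, `Or` types and the `inToOrChain` helper are hypothetical stand-ins for illustration only; they are not Parquet's `FilterApi` or Spark's `ParquetFilters`, and the sketch assumes duplicates were already removed upstream, as the PR description suggests.

```scala
object InPushdownSketch {
  // Hypothetical stand-ins for a filter predicate tree (not Parquet's classes).
  sealed trait Pred
  final case class Eq(column: String, value: Any) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Converts `column IN (values)` into a left-nested OR chain of equalities.
  // Returns None when the list is empty, the threshold is not positive,
  // or the list is longer than the threshold (nothing is pushed down then).
  def inToOrChain(column: String, values: Seq[Any], threshold: Int): Option[Pred] = {
    if (values.isEmpty || threshold <= 0 || values.length > threshold) {
      None
    } else {
      // No `distinct` here: duplicates are assumed to be removed earlier.
      val preds: Seq[Pred] = values.map(v => Eq(column, v))
      preds.reduceLeftOption[Pred]((l, r) => Or(l, r))
    }
  }
}
```

Checking `values.isEmpty` before touching the first element avoids the exception that `values.head` would throw on an empty array, which is the robustness concern the PR description raises.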






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739516555


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36896/
   




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042


   > No. Github action runs on different machines, there is a performance difference between them.
   
   No, @wangyum. I mean the **ratio** between ORC and Parquet within the same machine run. Previously, ORC and Parquet showed similar performance, but Parquet has become slower than ORC after this PR. For example, see the following.
   
   ```
   - Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
   - Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
   - Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
   - Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
   + Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
   + Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
   + Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
   + Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
   ```
   
   Although the difference is small, the generated result shows a slowdown of Parquet compared with ORC. That was my question.
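
   As an illustration of the ratio in question, the arithmetic can be sketched as below. The numbers are the first numeric column of the pushdown rows quoted above (assumed to be best time in ms, following the usual Spark benchmark layout); the `RatioSketch` object and its names are hypothetical.

   ```scala
   object RatioSketch {
     // First numeric column of the "(Pushdown)" rows quoted above, in ms.
     val parquetPushdownBefore = 596.0
     val orcPushdownBefore = 592.0
     val parquetPushdownAfter = 493.0
     val orcPushdownAfter = 436.0

     // Parquet time divided by ORC time; values above 1.0 mean Parquet is slower.
     def ratio(parquetMs: Double, orcMs: Double): Double = parquetMs / orcMs

     def main(args: Array[String]): Unit = {
       println(s"before: ${ratio(parquetPushdownBefore, orcPushdownBefore)}")
       println(s"after:  ${ratio(parquetPushdownAfter, orcPushdownAfter)}")
     }
   }
   ```

   With these inputs the ratio moves from roughly 1.007 (596 / 592) to roughly 1.131 (493 / 436), so both formats got faster in absolute terms while Parquet lost ground relative to ORC.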




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841760714








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729291752


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809882171


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41264/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809839180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41258/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841316402


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138557/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738780315


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132221/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785








[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509


   @dongjoon-hyun I think the [current benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297) is enough. I have added the benchmark results to the PR description.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834257115


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42769/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742753903


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132574/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738811606


   **[Test build #132235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559827


   **[Test build #132185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739535000


   **[Test build #132295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132295/testReport)** for PR 29642 at commit [`a98b354`](https://github.com/apache/spark/commit/a98b354a1ff18815cd6aa6f268e4a7959e961f26).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738559980


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132185/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738852315


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36835/
   




[GitHub] [spark] gengliangwang commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697151626


   @wangyum Do you have any further comments? If not, shall we close this one?




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521014


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36896/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834234834


   **[Test build #138247 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809925233


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136682/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409564


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42685/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r627978970



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##########
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
       checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
       checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
         Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+      Seq(3, 20).foreach { threshold =>
+        withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {

Review comment:
       Shall we update the conf doc of `PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD`? We have a new feature now.






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835871614


   **[Test build #138307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738555193


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36785/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729926433


   **[Test build #131289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131289/testReport)** for PR 29642 at commit [`b8cb1f4`](https://github.com/apache/spark/commit/b8cb1f48f1b38d74475c067a28426502c4e4a87a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835836092


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42829/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841779612


   **[Test build #138582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716166198








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742566759


   **[Test build #132574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132574/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738541400


   **[Test build #132185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-697053785








[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-743135912


   @cloud-fan @HyukjinKwon @gengliangwang Do you have more comments?




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r511606498



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>

Review comment:
       Sort performance:
   ```scala
   import org.apache.spark.benchmark.Benchmark
   val N = 20000000
   val array = Range(1, N).map(_.%(10000000)).toArray
   val benchmark = new Benchmark(s"Benchmark distinct", valuesPerIteration = N, minNumIters = 30)
   benchmark.addCase("array.sorted") { _ =>
     array.sorted
   }
   benchmark.addCase("array.distinct") { _ =>
     array.distinct
   }
   benchmark.run()
   ```
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.6
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark distinct:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   array.sorted                                        296            821         NaN         67.7          14.8       1.0X
   array.distinct                                     3005           3933         330          6.7         150.2       0.1X
   
   ```
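
One observation that may help when reading these numbers: the range rewrite proposed later in this PR uses only the first and last element of the sorted values, and duplicates change neither bound, which is consistent with dropping `distinct`. The sketch below is a minimal illustration for `Int` values only; `BoundsSketch`, `boundsBySort` and `boundsByMinMax` are hypothetical names, not part of Spark, and the single-pass variant is not what the PR does.

```scala
object BoundsSketch {
  // Bounds obtained the way the PR diff does it: sort, then take head and last.
  // Throws on an empty array; the PR guards this with `values.nonEmpty`.
  def boundsBySort(values: Array[Int]): (Int, Int) = {
    val sorted = values.sorted
    (sorted.head, sorted.last)
  }

  // Same bounds via min/max, without sorting (an alternative, not the PR's code).
  // Also throws on an empty array.
  def boundsByMinMax(values: Array[Int]): (Int, Int) =
    (values.min, values.max)

  def main(args: Array[String]): Unit = {
    val withDuplicates = Array(7, 3, 3, 9, 1, 9)
    println(boundsBySort(withDuplicates))            // prints (1,9)
    println(boundsByMinMax(withDuplicates))          // prints (1,9)
    println(boundsBySort(withDuplicates.distinct))   // prints (1,9)
  }
}
```

Both helpers return the same pair with or without `distinct`, so deduplication affects only the length check against the threshold, not the bounds themselves.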






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716159240


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34844/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832412736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138164/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841781338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138582/
   




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841316402


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138557/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735483068


   **[Test build #131941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131941/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834456817


   **[Test build #138247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138247/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738780315


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132221/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738835079


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36835/
   




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r525929388



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+        values.nonEmpty && canMakeFilterOn(name, values.head) =>
+        if (values.length <= pushDownInFilterThreshold) {
+          values.flatMap { v =>
+            makeEq.lift(nameToParquetField(name).fieldType)
+              .map(_(nameToParquetField(name).fieldNames, v))
+          }.reduceLeftOption(FilterApi.or)
+        } else {
+          sparkSchema.find { f =>
+            if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+          }.map(_.dataType) match {
+            case Some(dataType) =>
+              val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+              createFilterHelper(
+                sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+                  sources.LessThanOrEqual(name, sortedValues.last)),
+                canPartialPushDownConjuncts)

Review comment:
       The logic is the same as in HiveShim.scala#L746-L750:
   https://github.com/apache/spark/blob/09bb9bedcd27e08b86d63a6aed90d42ca4c606be/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L746-L750
   
   @cloud-fan @dongjoon-hyun @HyukjinKwon With this change, the `InSet -> InFilters (values count: 100, distribution: 10)` benchmark case is 6.6X faster (first row: before, second row: after):
   ```
   Parquet Vectorized (Pushdown)                      9520           9560          27          1.7         605.3       1.0X
   Parquet Vectorized (Pushdown)                       873            885          11         18.0          55.5       6.6X
   ```
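
The branching in the diff above can be modelled without Spark or Parquet on the classpath. The following is a minimal sketch under stated assumptions: the `Pred` hierarchy and `rewriteIn` are hypothetical stand-ins for the pushed-down predicates and for the `sources.In` case, and only `Int` values are handled (the real code resolves an ordering from the column's data type). It also assumes the exact `In` predicate is still evaluated on the rows that survive, since the range is only a coarser superset.

```scala
object InFilterSketch {
  // Hypothetical stand-ins for the predicates handed to the Parquet reader.
  sealed trait Pred
  final case class EqualTo(column: String, value: Int) extends Pred
  final case class GtEq(column: String, value: Int) extends Pred
  final case class LtEq(column: String, value: Int) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred

  // Mirrors the shape of the `sources.In` case in the diff, for Int values only.
  def rewriteIn(column: String, values: Seq[Int], threshold: Int): Option[Pred] = {
    if (threshold <= 0 || values.isEmpty) {
      // Guard from the diff: nothing is pushed down.
      None
    } else if (values.length <= threshold) {
      // Short lists: one equality per value, OR-ed together from the left.
      values.map(v => EqualTo(column, v): Pred).reduceLeftOption[Pred](Or(_, _))
    } else {
      // Long lists: a single covering range [min, max].
      val sorted = values.sorted
      Some(And(GtEq(column, sorted.head), LtEq(column, sorted.last)))
    }
  }

  def main(args: Array[String]): Unit = {
    println(rewriteIn("a", Seq(3, 1, 2), threshold = 3))
    println(rewriteIn("a", Seq(5, 1, 9, 3), threshold = 3))
  }
}
```

With a threshold of 3, the three-value list stays a chain of equalities, while the four-value list collapses to `a >= 1 AND a <= 9`. The range lets row groups be skipped on min/max statistics at the cost of admitting values such as 2 or 7 that are not in the list.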






[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738663385


   retest this please.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729792557


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738541400


   **[Test build #132185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132185/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809869327


   **[Test build #136676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136676/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).
    * This patch **fails PySpark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.




[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841982543


   @wangyum are you online? Can you take a quick look and either fix or revert?




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-739521006


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36896/
   






[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735523298


   **[Test build #131953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131953/testReport)** for PR 29642 at commit [`5c3c8ea`](https://github.com/apache/spark/commit/5c3c8ea1b917f4fd252abbf72abb0c533679f871).




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738571117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/36785/
   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528431912



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&

Review comment:
       @wangyum, the Impala reference sounds good. Can we make it general and push the range filter down to other data sources as well?






[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835826320


   **[Test build #138307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138307/testReport)** for PR 29642 at commit [`f0bfb06`](https://github.com/apache/spark/commit/f0bfb06ab9e6569c77a70649bf0ca7af28a05ac5).




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840985630


   @dongjoon-hyun I think this performance issue is not caused by this change. This PR only changes the `In` predicate. It is also slow without this change:
   
   ```
   OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1047-azure
   Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
   Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized                                10623          10994         272          1.5         675.4       1.0X
   Parquet Vectorized (Pushdown)                       627            657          24         25.1          39.9      16.9X
   Native ORC Vectorized                              7490           7653         203          2.1         476.2       1.4X
   Native ORC Vectorized (Pushdown)                    553            606          34         28.4          35.2      19.2X
   ```
   https://github.com/wangyum/spark/runs/2580852093




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509


   @dongjoon-hyun I think the [current benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297) is enough. I have added the updated benchmark results to the PR description.




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738722940


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36821/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-738971991


   **[Test build #132235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132235/testReport)** for PR 29642 at commit [`869e37f`](https://github.com/apache/spark/commit/869e37f357d222820580a7868c45b1ef4d48a77f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r483633146



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,9 +597,9 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>

Review comment:
       +1 with @HyukjinKwon 
   @wangyum I think this PR could cause a performance regression in Parquet filter pushdown. After the changes, `In` filters with redundant elements might no longer be pushed down.
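
Editorial sketch of the concern above, not Spark code: the two helpers below are hypothetical stand-ins for the threshold guard before and after `distinct` is removed, and the threshold of 10 is only an assumed value. A list with 12 elements but 3 distinct values passes the first guard and fails the second.

```scala
object InDedupSketch {
  // Hypothetical stand-in for the old guard: duplicates are removed before counting.
  def pushableWithDistinct[A](values: Seq[A], threshold: Int): Boolean =
    values.nonEmpty && values.distinct.length <= threshold

  // Hypothetical stand-in for the new guard: the raw length is compared.
  def pushableWithoutDistinct[A](values: Seq[A], threshold: Int): Boolean =
    values.nonEmpty && values.length <= threshold

  def main(args: Array[String]): Unit = {
    val threshold = 10 // assumed value for illustration
    val values = Seq(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3) // 12 elements, 3 distinct
    println(s"with distinct:    ${pushableWithDistinct(values, threshold)}")
    println(s"without distinct: ${pushableWithoutDistinct(values, threshold)}")
  }
}
```

Whether this matters in practice depends on whether duplicates are already removed upstream, which is the assumption the PR description makes about `OptimizeIn`.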






[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-742753903


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132574/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729275927


   **[Test build #131236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131236/testReport)** for PR 29642 at commit [`ebb13cc`](https://github.com/apache/spark/commit/ebb13cceb5b6840d4c15ec488ef350c23a5daa6c).




[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841178201


   @dongjoon-hyun @cloud-fan Please see the latest benchmark result: https://github.com/apache/spark/pull/29642/commits/27a2bf615eb158c7c25aa5bfaa04caa939c237da




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841757199


   **[Test build #138582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138582/testReport)** for PR 29642 at commit [`2545c1e`](https://github.com/apache/spark/commit/2545c1e28534a2be777915dd63d4e5476c9ff414).




[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628964580



##########
File path: sql/core/benchmarks/FilterPushdownBenchmark-jdk11-results.txt
##########
@@ -2,669 +2,669 @@
 Pushdown for many distinct value case
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1043-azure
-Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1046-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
-Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
-Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
-Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X

Review comment:
       No. GitHub Actions runs on different machines, so there is a performance difference between them.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-832409564


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42685/
   




[GitHub] [spark] SparkQA removed a comment on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-809862832


   **[Test build #136682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136682/testReport)** for PR 29642 at commit [`2310a69`](https://github.com/apache/spark/commit/2310a69cc30dda338e4c5c7f4d1ca2ca03371c30).




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840817042


   > No. GitHub Actions runs on different machines, so there is a performance difference between them.
   
   No, @wangyum. I mean the **ratio** between ORC and Parquet within the same machine run. Previously, ORC and Parquet showed similar performance, but after this PR the gap has widened and Parquet looks slower relative to ORC. For example:
   
   ```
   - Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
   - Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
   - Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
   - Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
   + Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
   + Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
   + Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
   + Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X
   ```
   
   Although the values are small, the generated result shows a slowdown of Parquet relative to ORC. That was [my question](https://github.com/apache/spark/pull/29642#discussion_r628899597).
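
Editorial sketch of the ratio being discussed, using only the "Best Time(ms)" figures quoted in the diff above. The `Run` case class and the object name are illustrative. By this arithmetic the Parquet/ORC ratio moves from roughly 1.23 to 1.51 without pushdown, and from roughly 1.01 to 1.13 with pushdown. Note that the two runs were on different CPUs, so the sketch only restates the comparison and does not attribute a cause.

```scala
object RatioSketch {
  // Best Time (ms) for Parquet and ORC from one benchmark run.
  final case class Run(parquetMs: Int, orcMs: Int) {
    // Values above 1.0 mean Parquet took longer than ORC in that run.
    def parquetOverOrc: Double = parquetMs.toDouble / orcMs.toDouble
  }

  // Figures copied from the "-" (before) and "+" (after) rows quoted above.
  val beforeNoPushdown: Run = Run(10512, 8555)
  val afterNoPushdown: Run = Run(9788, 6487)
  val beforePushdown: Run = Run(596, 592)
  val afterPushdown: Run = Run(493, 436)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "no pushdown, before" -> beforeNoPushdown,
      "no pushdown, after" -> afterNoPushdown,
      "pushdown, before" -> beforePushdown,
      "pushdown, after" -> afterPushdown)
    rows.foreach { case (label, run) =>
      println(f"$label%-22s Parquet/ORC = ${run.parquetOverOrc}%.2f")
    }
  }
}
```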




[GitHub] [spark] github-actions[bot] closed pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #29642:
URL: https://github.com/apache/spark/pull/29642


   




[GitHub] [spark] gengliangwang edited a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang edited a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-731904386


   @wangyum @cloud-fan @HyukjinKwon 
   I have some concerns about this optimization. What if the range is huge and the filter becomes much less selective? For example:
   ```
   SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, Int.Max)
   ```
   =>
   ```
   SELECT * FROM t WHERE id >= 1 and id <= ${Int.Max}
   ```
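
Editorial sketch of the selectivity concern, with illustrative names and data that are not from the PR. It compares how many ids satisfy the original `IN` list with how many satisfy the min/max range that would replace it.

```scala
object RangeSelectivitySketch {
  // Exact membership test, as the original IN predicate would evaluate it.
  def matchesIn(id: Int, values: Set[Int]): Boolean = values.contains(id)

  // Range test derived from the smallest and largest value of the IN list.
  // Assumes a non-empty set.
  def matchesRange(id: Int, values: Set[Int]): Boolean =
    id >= values.min && id <= values.max

  def main(args: Array[String]): Unit = {
    val values = (1 to 9).toSet + Int.MaxValue
    val ids = Seq(0, 5, 100, 1000000, Int.MaxValue) // made-up sample ids
    println(s"IN matches:    ${ids.count(matchesIn(_, values))}")
    println(s"range matches: ${ids.count(matchesRange(_, values))}")
  }
}
```

The range is always a superset of the `IN` list, so it is safe as a pruning filter, but a single outlier such as the maximum `Int` makes it prune almost nothing.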




[GitHub] [spark] gengliangwang commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r528452609



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -704,8 +704,8 @@ object SQLConf {
   val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
     buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
       .doc("The maximum number of values to filter push-down optimization for IN predicate. " +
-        "Large threshold won't necessarily provide much better performance. " +
-        "The experiment argued that 300 is the limit threshold. " +
+        "Spark will push-down a value greater than or equal to its minimum value and " +

Review comment:
       I think the default value `10` is too small here. What is the default threshold in Impala?






[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r526165602



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
##########
@@ -597,12 +599,26 @@ class ParquetFilters(
         createFilterHelper(pred, canPartialPushDownConjuncts = false)
           .map(FilterApi.not)
 
-      case sources.In(name, values) if canMakeFilterOn(name, values.head)
-        && values.distinct.length <= pushDownInFilterThreshold =>
-        values.distinct.flatMap { v =>
-          makeEq.lift(nameToParquetField(name).fieldType)
-            .map(_(nameToParquetField(name).fieldNames, v))
-        }.reduceLeftOption(FilterApi.or)
+      case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
+        values.nonEmpty && canMakeFilterOn(name, values.head) =>
+        if (values.length <= pushDownInFilterThreshold) {
+          values.flatMap { v =>
+            makeEq.lift(nameToParquetField(name).fieldType)
+              .map(_(nameToParquetField(name).fieldNames, v))
+          }.reduceLeftOption(FilterApi.or)
+        } else {
+          sparkSchema.find { f =>
+            if (caseSensitive) f.name.equals(name) else f.name.equalsIgnoreCase(name)
+          }.map(_.dataType) match {
+            case Some(dataType) =>
+              val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
+              createFilterHelper(
+                sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
+                  sources.LessThanOrEqual(name, sortedValues.last)),
+                canPartialPushDownConjuncts)

Review comment:
       OK, added a new function to `TypeUtils`.
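
For readers following the hunk above, here is a minimal, self-contained sketch of the two-branch rewrite it introduces: an `IN` list within the threshold becomes an OR of equality predicates, and a longer list is relaxed to a min/max range. The `Pred` types and `rewriteIn` are hypothetical stand-ins restricted to `Int` values; they are not the real `FilterApi` or `ParquetFilters` API, and the sketch assumes plain integer ordering instead of `TypeUtils.getInterpretedOrdering`.

```scala
// Hypothetical, simplified predicate types; NOT the real Parquet FilterApi classes.
sealed trait Pred
final case class Eq(name: String, value: Int) extends Pred
final case class GtEq(name: String, value: Int) extends Pred
final case class LtEq(name: String, value: Int) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
final case class And(left: Pred, right: Pred) extends Pred

object InPushdownSketch {
  // Mirrors the shape of the patched `sources.In` case, for Int values only.
  def rewriteIn(name: String, values: Seq[Int], threshold: Int): Option[Pred] = {
    if (threshold <= 0 || values.isEmpty) {
      // Guard corresponding to `pushDownInFilterThreshold > 0 && values.nonEmpty`.
      None
    } else if (values.length <= threshold) {
      // Short list: name = v1 OR name = v2 OR ...
      values.map(v => Eq(name, v): Pred).reduceLeftOption[Pred](Or(_, _))
    } else {
      // Long list: name >= min AND name <= max (a superset of the IN list).
      Some(And(GtEq(name, values.min), LtEq(name, values.max)))
    }
  }
}
```

Note that the range branch matches more values than the original `IN` list, so in this sketch it is only suitable as a coarse filter for skipping data; the exact predicate would presumably still need to be evaluated on the rows that are read.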






[GitHub] [spark] HyukjinKwon commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-735515368


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-834270229


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42769/
   




[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-716188129


   **[Test build #130244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130244/testReport)** for PR 29642 at commit [`c5ab656`](https://github.com/apache/spark/commit/c5ab6569f4b175066613d02b787dc8aaa83ca8d9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-729927486








[GitHub] [spark] SparkQA commented on pull request #29642: [SPARK-32792][SQL] Improve InSet filter pushdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-737621178


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36676/
   

