Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/03 13:53:13 UTC
[GitHub] [spark] LuciferYang commented on pull request #35669: [SPARK-38041][SQL]DataFilter pushed down with PartitionFilter

LuciferYang commented on pull request #35669:
URL: https://github.com/apache/spark/pull/35669#issuecomment-1058062770


   > > Could we add the evidence of Parquet skipping files/row-groups (either a micro benchmark or some logs during execution or code pointers), when we push down partition filter here?
   > 
   > @c21 I have added some benchmark tests to FilterPushdownBenchmark and run them in GitHub Actions. The test code can be found [here](https://github.com/stczwd/spark/blob/SPARK-38041-2/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L81).
   > 
   > Test result
   > 
   > ```
   > OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
   > Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
   > Data filter with partitions: ((a = 10 and part = 0) or (a = 10240 and part = 1) or (part = 2)):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   > Parquet Vectorized with partition                                                                        3039           3157         122          5.2         193.2       1.0X
   > Parquet Vectorized with partition (Pushdown)                                                             1548           1568          15         10.2          98.4       2.0X
   > 
   > OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
   > Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
   > Data filter with partitions: ((a > 10 and part = 0) or (a <= 10 and part >=1 and part < 3)):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   > Parquet Vectorized with partition                                                                     2942           2997          40          5.3         187.1       1.0X
   > Parquet Vectorized with partition (Pushdown)                                                          1497           1513          15         10.5          95.2       2.0X
   > ```
   
   @stczwd Can you add the benchmark code to this PR and use GitHub Actions (GA) to produce the benchmark results?
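
As a quick sanity check on the quoted numbers, the `Relative` column can be recomputed from the `Best Time(ms)` column. The sketch below assumes that `Relative` is the baseline's best time divided by each case's best time, rounded to one decimal place. That is an assumption made for illustration, and the helper name `format_relative` is hypothetical, not part of Spark.

```python
# Recompute the "Relative" column from the quoted "Best Time(ms)" values.
# Assumption: Relative = baseline best time / case best time, one decimal place.


def format_relative(baseline_best_ms: float, case_best_ms: float) -> str:
    """Return the speedup of a case over the baseline, e.g. '2.0X'."""
    if case_best_ms <= 0:
        raise ValueError("best time must be positive")
    return f"{baseline_best_ms / case_best_ms:.1f}X"


# Best times (ms) copied from the two quoted benchmark tables:
# (without pushdown, with pushdown)
QUOTED_BEST_MS = {
    "equality filters": (3039, 1548),
    "range filters": (2942, 1497),
}

if __name__ == "__main__":
    for name, (baseline, pushdown) in QUOTED_BEST_MS.items():
        print(name, format_relative(baseline, pushdown))
```

Under that assumption, both cases come out at about 2.0X, which agrees with the quoted tables.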
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org