Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/20 05:54:17 UTC

[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve in filter pushdown for ParquetFilters

wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008


   It seems that only Parquet lacks good support for `In` predicate pushdown. @MaxGekk What do you think?
   
   This is the benchmark for CSV:
   ```scala
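   // Assumed context: this runs inside a Spark SQL benchmark class such as CSVBenchmark,
   // which provides spark, output, withTempPath, withSQLConf, noop() and spark.implicits._.
   import java.time.Instant
   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.lit
   import org.apache.spark.sql.internal.SQLConf
   import org.apache.spark.sql.types._

   val rowsNum = 100 * 1000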
   val numIters = 3
   val colsNum = 100
   val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
   val schema = StructType(StructField("key", IntegerType) +: fields)
   def columns(): Seq[Column] = {
     val ts = Seq.tabulate(colsNum) { i =>
       lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
     }
     ($"id" % 1000).as("key") +: ts
   }
   withTempPath { path =>
     spark.range(rowsNum).select(columns(): _*)
       .write.option("header", true)
       .csv(path.getAbsolutePath)
     def readback = {
       spark.read
         .option("header", true)
         .schema(schema)
         .csv(path.getAbsolutePath)
     }
   
     // Reads the CSV back with the given filter, with CSV filter pushdown on or off.
     def withFilter(filter: String, configEnabled: Boolean): Unit = {
       withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
         readback.filter(filter).noop()
       }
     }
   
     Seq(5, 10, 50, 100, 500).foreach { count =>
       Seq(10, 50).foreach { distribution =>
         val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
         val benchmark = new Benchmark(title, rowsNum, output = output)
         Seq(false, true).foreach { pushDownEnabled =>
           val name = s"Native CSV Vectorized ${if (pushDownEnabled) "(Pushdown)" else ""}"
           benchmark.addCase(name, numIters) { _ =>
             val filter =
               Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
             val whereExpr = s"key in(${filter.mkString(",")})"
             withFilter(whereExpr, configEnabled = pushDownEnabled)
           }
         }
         benchmark.run()
       }
     }
   }
   ```
   
   Result:
   ```
   ================================================================================================
   Benchmark to measure CSV read performance
   ================================================================================================
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           13082          17077        1674          0.0      130815.6       1.0X
   Native CSV Vectorized (Pushdown)                                 1172           1192          35          0.1       11719.5      11.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           11858          12028         237          0.0      118576.9       1.0X
   Native CSV Vectorized (Pushdown)                                 1165           1172           6          0.1       11652.4      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11883          12180         494          0.0      118834.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1142           1156          21          0.1       11418.6      10.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11857          11878          19          0.0      118570.4       1.0X
   Native CSV Vectorized (Pushdown)                                  1169           1174           7          0.1       11692.9      10.1X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11923          11962          66          0.0      119228.0       1.0X
   Native CSV Vectorized (Pushdown)                                  1196           1225          26          0.1       11960.7      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11910          11917           7          0.0      119095.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1191           1194           5          0.1       11908.0      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11948          12097         201          0.0      119484.5       1.0X
   Native CSV Vectorized (Pushdown)                                   1250           1284          32          0.1       12501.4       9.6X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11938          11978          39          0.0      119378.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1176           1188          11          0.1       11756.0      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11954          12051         124          0.0      119542.9       1.0X
   Native CSV Vectorized (Pushdown)                                   1762           1833         104          0.1       17620.6       6.8X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11860          12166         484          0.0      118597.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1417           1434          15          0.1       14171.7       8.4X
   ```
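
   As a quick sanity check on how to read the tables, here is a minimal sketch that recomputes the `Relative` and `Per Row(ns)` values for the first table from its printed best times. It assumes that `Relative` is the baseline best time divided by the case's best time, and that `Per Row(ns)` is the best time divided by the row count; both assumptions are consistent with the numbers above, but the object and value names are made up for illustration.

   ```scala
   object RelativeSpeedup {
     // Best times (ms) copied from the first result table (values count: 5, distribution: 10).
     val baselineBestMs = 13082.0 // Native CSV Vectorized
     val pushdownBestMs = 1172.0  // Native CSV Vectorized (Pushdown)
     val rowsNum = 100 * 1000     // same row count as the benchmark

     // Assumed definition: baseline best time divided by this case's best time.
     val relative: Double = baselineBestMs / pushdownBestMs

     // Assumed definition: best time converted from ms to ns, divided by the row count.
     val perRowNs: Double = pushdownBestMs * 1e6 / rowsNum

     def main(args: Array[String]): Unit = {
       println(f"Relative: $relative%.1fX, per row: $perRowNs%.0f ns")
     }
   }
   ```

   The recomputed per-row cost differs slightly from the table (11719.5) because the printed best time is rounded to whole milliseconds.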

