You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2018/09/01 22:35:13 UTC

[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22313

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

    ## What changes were proposed in this pull request?
    
    In ORC data source, `createFilter` function has exponential time complexity due to lack of memoization like the following. This issue aims to improve it.
    
    **REPRODUCE**
    ```
    // Create and read 1 row table with 1000 columns
    sql("set spark.sql.orc.filterPushdown=true")
    val selectExpr = (1 to 1000).map(i => s"id c$i")
    spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
    print(s"With 0 filters, ")
    spark.time(spark.read.orc("/tmp/orc").count)
    
    // Increase the number of filters
    (20 to 30).foreach { width =>
      val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
      print(s"With $width filters, ")
      spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
    }
    ```
    
    **RESULT**
    ```
    With 0 filters, Time taken: 653 ms                                              
    With 20 filters, Time taken: 962 ms
    With 21 filters, Time taken: 1282 ms
    With 22 filters, Time taken: 1982 ms
    With 23 filters, Time taken: 3855 ms
    With 24 filters, Time taken: 6719 ms
    With 25 filters, Time taken: 12669 ms
    With 26 filters, Time taken: 25032 ms
    With 27 filters, Time taken: 49585 ms
    With 28 filters, Time taken: 98980 ms     // over 1 min 38 seconds
    With 29 filters, Time taken: 198368 ms   // over 3 mins
    With 30 filters, Time taken: 393744 ms   // over 6 mins
    ```
    
    **AFTER THIS PR**
    ```
    With 0 filters, Time taken: 644 ms                                              
    With 20 filters, Time taken: 638 ms
    With 21 filters, Time taken: 360 ms
    With 22 filters, Time taken: 590 ms
    With 23 filters, Time taken: 318 ms
    With 24 filters, Time taken: 315 ms
    With 25 filters, Time taken: 381 ms
    With 26 filters, Time taken: 304 ms
    With 27 filters, Time taken: 294 ms
    With 28 filters, Time taken: 319 ms
    With 29 filters, Time taken: 288 ms
    With 30 filters, Time taken: 285 ms
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25306

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22313
    
----
commit ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-01T22:22:10Z

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95652/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2843/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214778262
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25306 createFilter should not hang") {
    +    import org.apache.spark.sql.sources._
    +    val schema = new StructType(Array(StructField("a", IntegerType, nullable = true)))
    +    val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
    +    failAfter(2 seconds) {
    +      OrcFilters.createFilter(schema, filters)
    --- End diff --
    
    I'll choose (2), @cloud-fan .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2836/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95685/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214810021
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -54,27 +55,27 @@ import org.apache.spark.sql.types._
      * builder methods mentioned above can only be found in test code, where all tested filters are
      * known to be convertible.
      */
    -private[orc] object OrcFilters {
    +private[sql] object OrcFilters {
    +  private[sql] def buildTree(filters: Seq[Filter]): Option[Filter] = {
    +    filters match {
    +      case Seq() => None
    +      case Seq(filter) => Some(filter)
    +      case Seq(filter1, filter2) => Some(And(filter1, filter2))
    +      case _ => // length > 2
    +        val (left, right) = filters.splitAt(filters.length / 2)
    +        Some(And(buildTree(left).get, buildTree(right).get))
    +    }
    +  }
     
       /**
        * Create ORC filter as a SearchArgument instance.
        */
       def createFilter(schema: StructType, filters: Seq[Filter]): Option[SearchArgument] = {
         val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
     
    -    // First, tries to convert each filter individually to see whether it's convertible, and then
    -    // collect all convertible ones to build the final `SearchArgument`.
    -    val convertibleFilters = for {
    -      filter <- filters
    -      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    -    } yield filter
    -
    -    for {
    -      // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    -      // Then tries to build a single ORC `SearchArgument` for the conjunction predicate
    -      builder <- buildSearchArgument(dataTypeMap, conjunction, SearchArgumentFactory.newBuilder())
    -    } yield builder.build()
    +    buildTree(filters.filter(buildSearchArgument(dataTypeMap, _, newBuilder).isDefined))
    +      .flatMap(buildSearchArgument(dataTypeMap, _, newBuilder))
    +      .map(_.build)
    --- End diff --
    
    ah i see what you mean now. Can we restore to the previous version? That seems better. Sorry for the back and forth!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Also, thank you for review, @xuanyuanking, @kiszk , @viirya , @HyukjinKwon .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95669/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)** for PR 22313 at commit [`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95658/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2847/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2818/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95652/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214556040
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    +      dataTypeMap: Map[String, DataType],
    +      expression: Filter): Option[Builder] = {
    +    // When `spark.sql.orc.cache.sarg.timeout` is 0, cache is disabled.
    +    if (cacheExpireTimeout > 0) {
    +      searchArgumentCache.get(FilterWithTypeMap(expression, dataTypeMap))
    +    } else {
    +      buildSearchArgument(dataTypeMap, expression, SearchArgumentFactory.newBuilder())
    --- End diff --
    
    When we set timeout to zero on the cache, the loaded element can be removed immediately. Maybe we don't need to check timeout like this and we can simplify the code.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214765115
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    +      conjunction <- buildTree(convertibleFilters)
    --- End diff --
    
    In parquet, this is done as
    ```
    filters
      .flatMap(ParquetFilters.createFilter(requiredSchema, _))
      .reduceOption(FilterApi.and)
    ```
    
    can we follow it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95651/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214528435
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    --- End diff --
    
    Just a little question about is any possible to reuse code with https://github.com/apache/spark/pull/22313/files#diff-224b8cbedf286ecbfdd092d1e2e2f237R61?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95680/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)** for PR 22313 at commit [`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95665/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95680/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214743988
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    +      dataTypeMap: Map[String, DataType],
    +      expression: Filter): Option[Builder] = {
    +    // When `spark.sql.orc.cache.sarg.timeout` is 0, cache is disabled.
    +    if (cacheExpireTimeout > 0) {
    +      searchArgumentCache.get(FilterWithTypeMap(expression, dataTypeMap))
    +    } else {
    +      buildSearchArgument(dataTypeMap, expression, SearchArgumentFactory.newBuilder())
    --- End diff --
    
    Ya. It's possible. But, if we create a Guava loading cache and pass through all the cache management logic in Guava, it means a more overhead than this PR. In this PR, `spark.sql.orc.cache.sarg.timeout=0` means not creating the loading cache at all.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95665/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95669/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Do you know why `createFilter` function has exponential time complexity? Let's make sure the algorithm is good before adding the cache.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214744306
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    +      dataTypeMap: Map[String, DataType],
    +      expression: Filter): Option[Builder] = {
    +    // When `spark.sql.orc.cache.sarg.timeout` is 0, cache is disabled.
    +    if (cacheExpireTimeout > 0) {
    +      searchArgumentCache.get(FilterWithTypeMap(expression, dataTypeMap))
    +    } else {
    +      buildSearchArgument(dataTypeMap, expression, SearchArgumentFactory.newBuilder())
    +    }
    +  }
    +
       def createFilter(schema: StructType, filters: Array[Filter]): Option[SearchArgument] = {
         val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
     
         // First, tries to convert each filter individually to see whether it's convertible, and then
         // collect all convertible ones to build the final `SearchArgument`.
         val convertibleFilters = for {
           filter <- filters
    -      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    +      _ <- getOrBuildSearchArgumentWithNewBuilder(dataTypeMap, filter)
         } yield filter
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(And)
    +      conjunction <- convertibleFilters.reduceOption { (x, y) =>
    +        val newFilter = org.apache.spark.sql.sources.And(x, y)
    +        if (cacheExpireTimeout > 0) {
    +          // Build in a bottom-up manner
    +          getOrBuildSearchArgumentWithNewBuilder(dataTypeMap, newFilter)
    +        }
    --- End diff --
    
    Final conjunction? All sub function results will be cached in the end.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95651 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95651/testReport)** for PR 22313 at commit [`5c46693`](https://github.com/apache/spark/commit/5c46693e58e0f71fe8e67dce16f4b8c783c80aa6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2817/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2804/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2762/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    At this time, R failure.
    ```
    DONE ===========================================================================
    Had test warnings or failures; see logs.
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95680/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95685/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214765026
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    +      conjunction <- buildTree(convertibleFilters)
    --- End diff --
    
    does parquet has the same problem?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214775155
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    +      conjunction <- buildTree(convertibleFilters)
    --- End diff --
    
    For the first question, I don't think Parquet has the same issue because Parquet uses `canMakeFilterOn` while ORC is trying to build a full result (with a fresh builder) to check if it's okay or not.
    
    For the second question, in ORC, we already did the first half(`flatMap`) to compute `convertibleFilters`, but it can change it with `filters.filter`.
    ```scala
    val convertibleFilters = for {
        filter <- filters
        _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    } yield filter
    ```
    
    2. And, the second half `reduceOption(FilterApi.and)` was the original ORC code which generated a skewed tree having exponential time complexity. We need to use `buildTree`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2823/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)** for PR 22313 at commit [`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])`
      * `  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95651/testReport)** for PR 22313 at commit [`5c46693`](https://github.com/apache/spark/commit/5c46693e58e0f71fe8e67dce16f4b8c783c80aa6).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    General question: Why do we use `time` instead of `entry size` to control cache? I am neutral on this decision. I would like to hear the reason of this decision.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22313


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214556166
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    +      dataTypeMap: Map[String, DataType],
    +      expression: Filter): Option[Builder] = {
    +    // When `spark.sql.orc.cache.sarg.timeout` is 0, cache is disabled.
    +    if (cacheExpireTimeout > 0) {
    +      searchArgumentCache.get(FilterWithTypeMap(expression, dataTypeMap))
    +    } else {
    +      buildSearchArgument(dataTypeMap, expression, SearchArgumentFactory.newBuilder())
    +    }
    +  }
    +
       def createFilter(schema: StructType, filters: Array[Filter]): Option[SearchArgument] = {
         val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
     
         // First, tries to convert each filter individually to see whether it's convertible, and then
         // collect all convertible ones to build the final `SearchArgument`.
         val convertibleFilters = for {
           filter <- filters
    -      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    +      _ <- getOrBuildSearchArgumentWithNewBuilder(dataTypeMap, filter)
         } yield filter
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(And)
    +      conjunction <- convertibleFilters.reduceOption { (x, y) =>
    +        val newFilter = org.apache.spark.sql.sources.And(x, y)
    +        if (cacheExpireTimeout > 0) {
    +          // Build in a bottom-up manner
    +          getOrBuildSearchArgumentWithNewBuilder(dataTypeMap, newFilter)
    +        }
    --- End diff --
    
    Why we need to cache all sub filters? Don't we just need to cache the final conjunction?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Thank you, @cloud-fan . Sure. I'll update them.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214764993
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25306 createFilter should not hang") {
    +    import org.apache.spark.sql.sources._
    +    val schema = new StructType(Array(StructField("a", IntegerType, nullable = true)))
    +    val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
    +    failAfter(2 seconds) {
    +      OrcFilters.createFilter(schema, filters)
    --- End diff --
    
    This test looks tricky... It's a bad practice to assume some code will return in a certain time. Can we just add a microbenchmark for it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95658/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214787227
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
    @@ -398,6 +398,24 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter
           }
         }
       }
    +
    +  test(s"Pushdown benchmark with many filters") {
    +    val numRows = 1
    +    val width = 500
    +
    +    withTempPath { dir =>
    +      val columns = (1 to width).map(i => s"id c$i")
    +      val df = spark.range(1).selectExpr(columns: _*)
    +      withTempTable("orcTable", "patquetTable") {
    --- End diff --
    
    nit: a typo, `patquetTable`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214798722
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
    @@ -398,6 +398,24 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter
           }
         }
       }
    +
    +  test(s"Pushdown benchmark with many filters") {
    +    val numRows = 1
    +    val width = 500
    +
    +    withTempPath { dir =>
    +      val columns = (1 to width).map(i => s"id c$i")
    +      val df = spark.range(1).selectExpr(columns: _*)
    +      withTempTable("orcTable", "patquetTable") {
    --- End diff --
    
    Oh, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95652/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95658/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95685/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2825/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95665/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214769029
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25306 createFilter should not hang") {
    +    import org.apache.spark.sql.sources._
    +    val schema = new StructType(Array(StructField("a", IntegerType, nullable = true)))
    +    val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
    +    failAfter(2 seconds) {
    +      OrcFilters.createFilter(schema, filters)
    --- End diff --
    
    Sure. Something like the test code in the PR description? And marked as `ignore(...)` instead of `test(...)`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Thank you for review and advice, @cloud-fan . It turns out that my initial assessment is not enough.
    
    First of all, from the beginning, [SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R75) is designed as a recursive function like the following. Please see `tryLeft` and `tryRight`. It's a purely computation to check if it succeeds. There is no reuse here. So, I tried to cache the first two `tryLeft` and `tryRight` operation since they can be re-used.
    ```scala
    val tryLeft = buildSearchArgument(left, newBuilder)
    val tryRight = buildSearchArgument(right, newBuilder)
    val conjunction = for {
      _ <- tryLeft
      _ <- tryRight
      lhs <- buildSearchArgument(left, builder.startAnd())
      rhs <- buildSearchArgument(right, lhs)
    } yield rhs.end()
    ```
    
    However, before that, `createFilter` generates the target tree with [reduceOption(And)](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-6cac9bc2656e3782b0312dceb8c55d47R35) as a deeply skewed tree. That was the root cause. I'll update this PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95669/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95583/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2832/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    @dongjoon-hyun please also update the title of the JIRA ticket, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    The previous failures are irrelevant to this PR.
    - org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
    - org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a sbt.testing.SuiteSelector)
    - org.apache.spark.sql.hive.client.HiveClientSuites.(It is not a test it is a sbt.testing.SuiteSelector)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214528778
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
    @@ -55,19 +59,52 @@ import org.apache.spark.sql.types._
      * known to be convertible.
      */
     private[orc] object OrcFilters extends Logging {
    +  case class FilterWithTypeMap(filter: Filter, typeMap: Map[String, DataType])
    +
    +  private lazy val cacheExpireTimeout =
    +    org.apache.spark.sql.execution.datasources.orc.OrcFilters.cacheExpireTimeout
    +
    +  private lazy val searchArgumentCache = CacheBuilder.newBuilder()
    +    .expireAfterAccess(cacheExpireTimeout, TimeUnit.SECONDS)
    +    .build(
    +      new CacheLoader[FilterWithTypeMap, Option[Builder]]() {
    +        override def load(typeMapAndFilter: FilterWithTypeMap): Option[Builder] = {
    +          buildSearchArgument(
    +            typeMapAndFilter.typeMap, typeMapAndFilter.filter, SearchArgumentFactory.newBuilder())
    +        }
    +      })
    +
    +  private def getOrBuildSearchArgumentWithNewBuilder(
    --- End diff --
    
    @xuanyuanking . This already reuses `cacheExpireTimeout`.
    
    For the cache value, `SearchArgument`, `SearchArgumentFactory` and `Builder` are different. 
    - Here, they comes from `org.apache.hadoop.hive.ql.io.sarg.*`.
    - There, they comes from `org.apache.orc.storage.ql.io.sarg.*`.
    
    The only exception I made is `FilterWithTypeMap`. I wanted to keep them separately since it's also related to cache key.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214832531
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -54,27 +55,27 @@ import org.apache.spark.sql.types._
      * builder methods mentioned above can only be found in test code, where all tested filters are
      * known to be convertible.
      */
    -private[orc] object OrcFilters {
    +private[sql] object OrcFilters {
    +  private[sql] def buildTree(filters: Seq[Filter]): Option[Filter] = {
    +    filters match {
    +      case Seq() => None
    +      case Seq(filter) => Some(filter)
    +      case Seq(filter1, filter2) => Some(And(filter1, filter2))
    +      case _ => // length > 2
    +        val (left, right) = filters.splitAt(filters.length / 2)
    +        Some(And(buildTree(left).get, buildTree(right).get))
    +    }
    +  }
     
       /**
        * Create ORC filter as a SearchArgument instance.
        */
       def createFilter(schema: StructType, filters: Seq[Filter]): Option[SearchArgument] = {
         val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
     
    -    // First, tries to convert each filter individually to see whether it's convertible, and then
    -    // collect all convertible ones to build the final `SearchArgument`.
    -    val convertibleFilters = for {
    -      filter <- filters
    -      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    -    } yield filter
    -
    -    for {
    -      // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    -      // Then tries to build a single ORC `SearchArgument` for the conjunction predicate
    -      builder <- buildSearchArgument(dataTypeMap, conjunction, SearchArgumentFactory.newBuilder())
    -    } yield builder.build()
    +    buildTree(filters.filter(buildSearchArgument(dataTypeMap, _, newBuilder).isDefined))
    +      .flatMap(buildSearchArgument(dataTypeMap, _, newBuilder))
    +      .map(_.build)
    --- End diff --
    
    Sure. No problem, @cloud-fan . :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95637/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214778954
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -71,12 +71,24 @@ private[orc] object OrcFilters {
     
         for {
           // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    +      conjunction <- buildTree(convertibleFilters)
    --- End diff --
    
    BTW, Parquet has another issue here due to `.reduceOption(FilterApi.and)`. When I make a benchmark, Parquet seems to be unable to handle 1000 filters, @cloud-fan .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Thank you for review, @kiszk .
    
    First, I don't want to hold the memory up after query completion. If we do, it will be a regression. So, I wanted `time` first.
    
    Second, It's difficult to estimate the enough limit for the number of filters.
      - As we know codegen JVM limit issue. There are several attempts to generate a single complex query for wide tables (thousands of columns).
      - Spark's optimizer like `InferFiltersFromConstraints` adds more constraints like 'NotNull(col1)`. Usually, the number of filters becomes double here.
        - Also, it's not a good design if we need to increase this limitation whenever we add a new optimizer like `InferFiltersFromConstraints`.
      - If the limit is too high, we waste the memory. If the limit is small, the eviction will bite us again.
    
    In short, `time` was enough and the simplest for this purpose.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Could you review this PR, @gatorsmile and @cloud-fan ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)** for PR 22313 at commit [`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org