Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2018/09/01 22:35:13 UTC

[GitHub] spark pull request #22313: [SPARK-25306][SQL] Use cache to speed up `createF...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22313

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

    ## What changes were proposed in this pull request?
    
    In the ORC data source, the `createFilter` function has exponential time complexity due to a lack of memoization, as the following reproduction shows. This PR aims to improve it.
    
    **REPRODUCE**
    ```
    // Create and read 1 row table with 1000 columns
    sql("set spark.sql.orc.filterPushdown=true")
    val selectExpr = (1 to 1000).map(i => s"id c$i")
    spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
    print(s"With 0 filters, ")
    spark.time(spark.read.orc("/tmp/orc").count)
    
    // Increase the number of filters
    (20 to 30).foreach { width =>
      val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
      print(s"With $width filters, ")
      spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
    }
    ```
    
    **RESULT**
    ```
    With 0 filters, Time taken: 653 ms                                              
    With 20 filters, Time taken: 962 ms
    With 21 filters, Time taken: 1282 ms
    With 22 filters, Time taken: 1982 ms
    With 23 filters, Time taken: 3855 ms
    With 24 filters, Time taken: 6719 ms
    With 25 filters, Time taken: 12669 ms
    With 26 filters, Time taken: 25032 ms
    With 27 filters, Time taken: 49585 ms
    With 28 filters, Time taken: 98980 ms     // over 1 min 38 seconds
    With 29 filters, Time taken: 198368 ms   // over 3 mins
    With 30 filters, Time taken: 393744 ms   // over 6 mins
    ```
    
    **AFTER THIS PR**
    ```
    With 0 filters, Time taken: 644 ms                                              
    With 20 filters, Time taken: 638 ms
    With 21 filters, Time taken: 360 ms
    With 22 filters, Time taken: 590 ms
    With 23 filters, Time taken: 318 ms
    With 24 filters, Time taken: 315 ms
    With 25 filters, Time taken: 381 ms
    With 26 filters, Time taken: 304 ms
    With 27 filters, Time taken: 294 ms
    With 28 filters, Time taken: 319 ms
    With 29 filters, Time taken: 288 ms
    With 30 filters, Time taken: 285 ms
    ```
    
    ## How was this patch tested?
    
    Pass Jenkins with the newly added test cases.
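
The roughly doubling timings in the RESULT block above are what one would expect if converting a conjunction node processes each child twice. The following is a minimal sketch of that cost model, not Spark's code: `Filter`, `Leaf`, `And`, and `VisitCount` are hypothetical stand-ins defined here, and the "each child is visited twice" rule is an assumption made for illustration.

```scala
// Toy filter AST. These are hypothetical stand-ins, not Spark's
// org.apache.spark.sql.sources classes.
sealed trait Filter
final case class Leaf(name: String) extends Filter
final case class And(left: Filter, right: Filter) extends Filter

object VisitCount {
  // Assumed cost model: converting an And node visits each child twice,
  // once to check convertibility and once to build the result.
  def visits(f: Filter): BigInt = f match {
    case Leaf(_)   => BigInt(1)
    case And(l, r) => BigInt(1) + 2 * visits(l) + 2 * visits(r)
  }

  // Left-deep conjunction of n leaves (n >= 1), the shape a plain
  // left-to-right reduce over And produces.
  def leftDeep(n: Int): Filter =
    (1 to n).map(i => Leaf(s"c$i"): Filter).reduceLeft[Filter](And(_, _))
}
```

Under this assumption the count satisfies T(1) = 1 and T(n) = 2 * T(n - 1) + 3, so T(n) = 2^(n+1) - 3: each additional filter roughly doubles the work.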

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25306

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22313
    
----
commit ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec
Author: Dongjoon Hyun <do...@...>
Date:   2018-09-01T22:22:10Z

    [SPARK-25306][SQL] Use cache to speed up `createFilter`

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95652/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2843/
    Test PASSed.


---


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214778262
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25306 createFilter should not hang") {
    +    import org.apache.spark.sql.sources._
    +    val schema = new StructType(Array(StructField("a", IntegerType, nullable = true)))
    +    val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
    +    failAfter(2 seconds) {
    +      OrcFilters.createFilter(schema, filters)
    --- End diff --
    
    I'll choose (2), @cloud-fan.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2836/
    Test PASSed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95685/testReport)** for PR 22313 at commit [`3cd4443`](https://github.com/apache/spark/commit/3cd444306c3b8b6c42a74b7cfb0755b8ce209c84).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22313#discussion_r214810021
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -54,27 +55,27 @@ import org.apache.spark.sql.types._
      * builder methods mentioned above can only be found in test code, where all tested filters are
      * known to be convertible.
      */
    -private[orc] object OrcFilters {
    +private[sql] object OrcFilters {
    +  private[sql] def buildTree(filters: Seq[Filter]): Option[Filter] = {
    +    filters match {
    +      case Seq() => None
    +      case Seq(filter) => Some(filter)
    +      case Seq(filter1, filter2) => Some(And(filter1, filter2))
    +      case _ => // length > 2
    +        val (left, right) = filters.splitAt(filters.length / 2)
    +        Some(And(buildTree(left).get, buildTree(right).get))
    +    }
    +  }
     
       /**
        * Create ORC filter as a SearchArgument instance.
        */
       def createFilter(schema: StructType, filters: Seq[Filter]): Option[SearchArgument] = {
         val dataTypeMap = schema.map(f => f.name -> f.dataType).toMap
     
    -    // First, tries to convert each filter individually to see whether it's convertible, and then
    -    // collect all convertible ones to build the final `SearchArgument`.
    -    val convertibleFilters = for {
    -      filter <- filters
    -      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    -    } yield filter
    -
    -    for {
    -      // Combines all convertible filters using `And` to produce a single conjunction
    -      conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
    -      // Then tries to build a single ORC `SearchArgument` for the conjunction predicate
    -      builder <- buildSearchArgument(dataTypeMap, conjunction, SearchArgumentFactory.newBuilder())
    -    } yield builder.build()
    +    buildTree(filters.filter(buildSearchArgument(dataTypeMap, _, newBuilder).isDefined))
    +      .flatMap(buildSearchArgument(dataTypeMap, _, newBuilder))
    +      .map(_.build)
    --- End diff --
    
    Ah, I see what you mean now. Can we restore the previous version? That seems better. Sorry for the back and forth!
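
The idea behind `buildTree` in the diff above, splitting the filter list in half instead of folding it left to right, can be sketched in isolation. This is an illustrative sketch only: `Pred`, `Atom`, `Both`, and `TreeShape` are hypothetical names defined here rather than Spark classes, and tree depth is used as a simple proxy for how skewed the resulting conjunction is.

```scala
// Toy predicate AST. Hypothetical stand-ins for Spark's Filter and And.
sealed trait Pred
final case class Atom(id: Int) extends Pred
final case class Both(left: Pred, right: Pred) extends Pred

object TreeShape {
  // Balanced conjunction: split the list in half and recurse on each side.
  def balanced(ps: Seq[Pred]): Option[Pred] = ps match {
    case Seq()  => None
    case Seq(p) => Some(p)
    case _ =>
      val (l, r) = ps.splitAt(ps.length / 2)
      for { a <- balanced(l); b <- balanced(r) } yield Both(a, b)
  }

  // Skewed (left-deep) conjunction, as a plain reduce would build it.
  def skewed(ps: Seq[Pred]): Option[Pred] =
    ps.reduceOption[Pred](Both(_, _))

  // Number of nodes on the longest root-to-leaf path.
  def depth(p: Pred): Int = p match {
    case Atom(_)    => 1
    case Both(l, r) => 1 + math.max(depth(l), depth(r))
  }
}
```

For 1000 atoms the skewed tree has depth 1000 while the balanced one has depth 11, so any per-level repeated work compounds over far fewer levels in the balanced shape.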


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Also, thank you for the review, @xuanyuanking, @kiszk, @viirya, @HyukjinKwon.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter` ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95669/
    Test FAILed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Use cache to speed up `createFilter`

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95583/testReport)** for PR 22313 at commit [`ac06b0c`](https://github.com/apache/spark/commit/ac06b0ca28d1da81fadbe0742a199b5e7b0de1ec).


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95658/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2847/
    Test PASSed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2818/
    Test PASSed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22313
  
    **[Test build #95652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95652/testReport)** for PR 22313 at commit [`4a372a3`](https://github.com/apache/spark/commit/4a372a328b33961a16ae6ad69bb58ba0720e9b63).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
