You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/09/28 08:12:56 UTC

[GitHub] spark pull request #22579: [SPARK-25429][SQL] Use Set instead of Array to im...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/22579

    [SPARK-25429][SQL] Use Set instead of Array to improve lookup performance

    ## What changes were proposed in this pull request?
    
    Use `Set` instead of `Array` to improve `accumulatorIds.contains(acc.id)` performance.
    
    This PR close https://github.com/apache/spark/pull/22420
    
    ## How was this patch tested?
    
    manual tests.
    Benchmark code:
    ```scala
    def benchmark(func: () => Unit): Long = {
      val start = System.currentTimeMillis()
      func()
      val end = System.currentTimeMillis()
      end - start
    }
    
    val range = Range(1, 1000000)
    val set = range.toSet
    val array = range.toArray
    
    for (i <- 0 until 5) {
      val setExecutionTime =
        benchmark(() => for (i <- 0 until 500) { set.contains(scala.util.Random.nextInt()) })
      val arrayExecutionTime =
        benchmark(() => for (i <- 0 until 500) { array.contains(scala.util.Random.nextInt()) })
      println(s"set execution time: $setExecutionTime, array execution time: $arrayExecutionTime")
    }
    ```
    
    Benchmark result:
    ```
    set execution time: 4, array execution time: 2760
    set execution time: 1, array execution time: 1911
    set execution time: 3, array execution time: 2043
    set execution time: 12, array execution time: 2214
    set execution time: 6, array execution time: 1770
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25429

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22579
    
----
commit 5a92eeb636257ff5079f763405cfc6446dcffe09
Author: Yuming Wang <yu...@...>
Date:   2018-09-28T08:04:47Z

    Use Set instead of Array to provide lookup performance

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22579: [SPARK-25429][SQL] Use Set instead of Array to im...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22579#discussion_r221296477
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala ---
    @@ -83,7 +83,7 @@ class SQLAppStatusListener(
         // track of the metrics knows which accumulators to look at.
         val accumIds = exec.metrics.map(_.accumulatorId).sorted.toList
    --- End diff --
    
    Thanks @HeartSaVioR @kiszk


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22579: [SPARK-25429][SQL] Use Set instead of Array to im...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22579#discussion_r221202958
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala ---
    @@ -83,7 +83,7 @@ class SQLAppStatusListener(
         // track of the metrics knows which accumulators to look at.
         val accumIds = exec.metrics.map(_.accumulatorId).sorted.toList
    --- End diff --
    
    Good point. If we use this as `set`, `sort` is not necessary.
    When I have just looked at the code, `accumulatorIds` does not take care of the order of elements. I would appreciate it if someone double-checks it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    **[Test build #96740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96740/testReport)** for PR 22579 at commit [`5a92eeb`](https://github.com/apache/spark/commit/5a92eeb636257ff5079f763405cfc6446dcffe09).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    **[Test build #96740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96740/testReport)** for PR 22579 at commit [`5a92eeb`](https://github.com/apache/spark/commit/5a92eeb636257ff5079f763405cfc6446dcffe09).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    **[Test build #96764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96764/testReport)** for PR 22579 at commit [`9b3b164`](https://github.com/apache/spark/commit/9b3b164af3e8e43caec7e1cedfcaf3a8b1c6d8f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22579: [SPARK-25429][SQL] Use Set instead of Array to im...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22579


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3573/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    **[Test build #96764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96764/testReport)** for PR 22579 at commit [`9b3b164`](https://github.com/apache/spark/commit/9b3b164af3e8e43caec7e1cedfcaf3a8b1c6d8f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22579: [SPARK-25429][SQL] Use Set instead of Array to im...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22579#discussion_r221177626
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala ---
    @@ -83,7 +83,7 @@ class SQLAppStatusListener(
         // track of the metrics knows which accumulators to look at.
         val accumIds = exec.metrics.map(_.accumulatorId).sorted.toList
    --- End diff --
    
    If we are going to make Set, I guess we don't need to sort it, and may also don't need to convert to List and Set again.
    
    Does changing List to Set affect any content on UI page? Just wanted to double check why we have been sorting the accumulator ids, and whether this patch breaks the intention or not.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96740/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96764/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22579: [SPARK-25429][SQL] Use Set instead of Array to improve l...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3561/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org