Posted to reviews@spark.apache.org by Koraseg <gi...@git.apache.org> on 2018/10/30 12:41:33 UTC

[GitHub] spark pull request #22894: [SPARK-25885] HighlyCompressedMapStatus deseriali...

GitHub user Koraseg opened a pull request:

    https://github.com/apache/spark/pull/22894

    [SPARK-25885] HighlyCompressedMapStatus deserialization/construction optimization

    
    
    ## What changes were proposed in this pull request?
    https://issues.apache.org/jira/browse/SPARK-25885
    
    ## How was this patch tested?
    
    Additional tests are not necessary for the patch.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Koraseg/spark mapStatusesOptimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22894.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22894
    
----
commit f6ab4750037876960a172365b6e4b7a8dd3d9677
Author: koraseg <ak...@...>
Date:   2018-10-30T12:12:55Z

    [SPARK-25885] HighlyCompressedMapStatus deserialization/construction optimization

----


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231121793
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = mutable.Map.empty[Int, Byte]
         (0 until count).foreach { _ =>
           val block = in.readInt()
           val size = in.readByte()
    -      hugeBlockSizesArray += Tuple2(block, size)
    +      hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)
    --- End diff --
    
    Yes, it is a mutable map and used as a mutable map. Its type must reflect that.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    I have also updated benchmarks here: https://github.com/Koraseg/mapstatus-benchmark.
    
    By the way, the best performance was shown by gnu.trove.map.hash.TIntByteHashMap, but the gain over java.util.HashMap does not seem big enough to justify a new library dependency.
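    
    For reference, a minimal sketch of the Trove variant that was benchmarked (assumes the trove4j library is on the classpath and uses toy data; it is not proposed for the PR):
    
        import gnu.trove.map.hash.TIntByteHashMap
        
        // Primitive-specialized map: no boxing of the Int keys or Byte values.
        val count = 1000
        val hugeBlockSizes = new TIntByteHashMap(count)  // pre-sized to count entries
        (0 until count).foreach { block => hugeBlockSizes.put(block, 1.toByte) }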


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    **[Test build #4414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4414/testReport)** for PR 22894 at commit [`57bdd75`](https://github.com/apache/spark/commit/57bdd7525f3353a6d59772b2a86abbe6a0d5f4ba).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r230556818
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = mutable.Map.empty[Int, Byte]
         (0 until count).foreach { _ =>
           val block = in.readInt()
           val size = in.readByte()
    -      hugeBlockSizesArray += Tuple2(block, size)
    +      hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)
    --- End diff --
    
    Why cast it? It is used as a mutable map and its type is a mutable map, so the type on line 151 is wrong. Also, just `hugeBlockSizes(block) = size`, no?
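    
    For illustration, a minimal sketch of that simplification (toy values; variable names follow the diff above):
    
        import scala.collection.mutable
        
        // With the field typed as mutable.Map, no cast is needed and the
        // apply-style update works directly.
        val hugeBlockSizes = mutable.Map.empty[Int, Byte]
        val (block, size) = (42, 7.toByte)
        hugeBlockSizes(block) = size  // same as hugeBlockSizes.update(block, size)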


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231125168
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    I like that possibility. If the cost of wrapping/building an immutable map isn't higher, it's better.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    retest this please



---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    > Practically, though, it generates a whole copy of the map at every update, so for 10 items, the implementation in the PR generates 9 copies of 1, 2, 3, ... elements, while the current one generates only 1 copy, at the end of size 10. So the proposed change is worse than the current solution. If you create a benchmark, you can see this.
    
    That is not how immutable persistent data structures, and the Scala map in particular, handle updates. Moreover, as I mentioned above, it is exactly the same logic that lies under the hood of the ArrayBuffer -> Map conversion in the current implementation. I have only removed an intermediate layer.
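    
    For illustration, a minimal sketch of what the ArrayBuffer -> Map conversion does under the hood in Scala 2.x (paraphrased with toy data; this is not the PR code itself):
    
        import scala.collection.mutable
        
        val hugeBlockSizesArray = mutable.ArrayBuffer[(Int, Byte)]((1, 2.toByte), (7, 3.toByte))
        
        // Conceptually what hugeBlockSizesArray.toMap does: re-assign an
        // immutable map reference once per element, which is the same pattern
        // as updating an immutable map variable inside the deserialization loop.
        var result = Map.empty[Int, Byte]
        hugeBlockSizesArray.foreach { kv => result = result + kv }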
    
    I created a benchmark with cut-down versions of HighlyCompressedMapStatus (with the empty-blocks bitmap and the huge-blocks map only) and measured deserialization performance. The proposed version showed about a 10% performance boost across different block configurations.
    
    You can check out the results on the repo below and repeat the test. 
    https://github.com/Koraseg/mapstatus-benchmark


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231145409
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    Basically, it means that we would use an immutable Map instead of a mutable one, with worse performance characteristics, since MapBuilder just updates an inner reference to an immutable Map after each insert. To enforce correctness, what about using scala.collection.Map[Int, Byte]? It doesn't allow dangerous mutation operations, and mutable.Map is its subtype.
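    
    For illustration, a minimal sketch of that idea (class and method names are made up; only the field typing mirrors the proposal):
    
        import scala.collection.mutable
        
        class StatusSketch {
          // Read-only supertype as the field type: callers cannot mutate it,
          // but a mutable.Map instance can still be assigned to it.
          private var hugeBlockSizes: scala.collection.Map[Int, Byte] = Map.empty
        
          def readSizes(entries: Seq[(Int, Byte)]): Unit = {
            // The mutable map is used only while deserializing, then assigned to the field.
            val sizes = mutable.HashMap.empty[Int, Byte]
            entries.foreach { case (block, size) => sizes(block) = size }
            hugeBlockSizes = sizes
          }
        }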


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231118740
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +188,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = mutable.Map.empty[Int, Byte]
         (0 until count).foreach { _ =>
           val block = in.readInt()
           val size = in.readByte()
    -      hugeBlockSizesArray += Tuple2(block, size)
    +      hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)
    --- End diff --
    
    @srowen I used a more generic map type in HighlyCompressedMapStatus, so the cast was needed to make update operations on it possible. However, it does not seem necessary because HighlyCompressedMapStatus fields are not exposed outside the class, so we can simply change the type of the map right there.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    **[Test build #4416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4416/testReport)** for PR 22894 at commit [`57bdd75`](https://github.com/apache/spark/commit/57bdd7525f3353a6d59772b2a86abbe6a0d5f4ba).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    What if you use a mutable map instead of an immutable one? What is the perf comparison?


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229789518
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    What about using the more generic scala.collection.Map[Int, Byte] type here?


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231122195
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    It's mutated though, and needs to be mutable. If it were exposed outside the class, or there was significant danger of accidentally mutating it elsewhere, I think it might be necessary to wrap the result in an immutable wrapper, but here this seems OK to me.


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229774707
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
    --- End diff --
    
    we should use a `mutable.Map` instead of the java implementation wrapped by scala, this is not a clean solution IMHO


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Practically, though, it generates a whole copy of the map at every update, so for 10 items, the implementation in the PR generates 9 copies of 1, 2, 3, ... elements, while the current one generates only 1 copy, at the end of size 10. So the proposed change is worse than the current solution. If you create a benchmark, you can see this.


---



[GitHub] spark issue #22894: [SPARK-25885] HighlyCompressedMapStatus deserialization/...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231123586
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    I don't think we should change it. Even though we are using a mutable map in order to build it, the result should be an immutable map, as that enforces correctness and avoids potential bad updates. You can just call `toMap` on the mutable one, but in that case I think the performance would fall back to the original, as there is no override of `toMap` in Scala. A better option would then probably be to use an `immutable.Map` builder. Could you check it please?
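    
    For illustration, a minimal sketch of the builder idea (toy data; this is the suggestion being discussed, not the code that was eventually merged):
    
        // Accumulate the entries in a builder and materialize the immutable map once.
        val builder = Map.newBuilder[Int, Byte]
        (0 until 10).foreach { block => builder += (block -> 1.toByte) }
        val hugeBlockSizes: Map[Int, Byte] = builder.result()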


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    @Koraseg please check the contribution guide and update this PR accordingly:
     - first, please fill the PR description properly;
     - second, provide a demonstration of the improved performance with a benchmark;
     - third, IMO this is actually worsening performance, since you are using an immutable data structure, so the map is copied every time it is modified....


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22894


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229982100
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
    --- End diff --
    
    Unfortunately, its default implementation in the Builder trait is empty, and it is not overridden for the mutable map.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Regarding the third one, the proposed implementation is conceptually the same as the current one. Under the hood, hugeBlockSizesArray.toMap just updates an internal immutable map reference with the tuples from the array and eventually returns it.



---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Shall we trigger the tests for this, @srowen?


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229774349
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    This shouldn't be changed; we should still have an immutable map here.


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229812240
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
    --- End diff --
    
    Does calling `HashMap.sizeHint(...)` here actually help?
    I'd stick to the Scala collection for now.
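    
    For reference, a minimal sketch of the sizeHint call being asked about (toy count; as noted elsewhere in this thread, the default Builder implementation of sizeHint is empty, so mutable.HashMap most likely ignores it):
    
        import scala.collection.mutable
        
        val count = 1000
        val hugeBlockSizes = mutable.HashMap.empty[Int, Byte]
        hugeBlockSizes.sizeHint(count)  // inherited from Builder; a no-op unless overridden
        (0 until count).foreach { block => hugeBlockSizes(block) = 1.toByte }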


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    That sounds more reasonable and a better implementation, thanks.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    I also would not expect updating an immutable data structure to be faster. Building a map once from tuples at the end seems better than rebuilding a map each time. Under the hood the immutable map is going to be a HashTrieMap (a map of smaller optimized immutable maps) and its updated0 method does some clever stuff to avoid recreating the whole map.
    
    But, yeah, why immutable here to begin with? It ought to be better still to update a mutable Map. And then I am still not sure why it would be faster to maintain the map invariants over this loop rather than build the map, with its size known ahead of time, at the end.
    
    Benchmarks are good evidence but we just need to make sure that the difference is material as used in Spark. It may well be.


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229776086
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
    --- End diff --
    
    How about just Scala's mutable Map? I'd expect it's no slower than Java's, given it might specialize for primitives (not sure about this) and sometimes has smarter implementations internally.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Merged to master


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    **[Test build #4416 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4416/testReport)** for PR 22894 at commit [`57bdd75`](https://github.com/apache/spark/commit/57bdd7525f3353a6d59772b2a86abbe6a0d5f4ba).


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r229790796
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
         emptyBlocks.readExternal(in)
         avgSize = in.readLong()
         val count = in.readInt()
    -    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
    +    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala
    --- End diff --
    
    The scala.mutable.HashMap implementation does not have a way to set the initial capacity out of the box. Performance gets worse, probably because of hash table resizing.
    
    The scala.mutable.OpenHashMap implementation does, but it is still slower than java.util.HashMap.
    
    However, if a tradeoff between code cleanliness and performance is needed, I would use one of the variants above.
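    
    For illustration, a minimal sketch of the two pre-sized variants mentioned (toy count; not the final PR code):
    
        import java.util
        import scala.collection.JavaConverters._
        import scala.collection.mutable
        
        val count = 1000
        
        // java.util.HashMap accepts an initial capacity and can be wrapped as a mutable.Map.
        val javaBacked: mutable.Map[Int, Byte] = new util.HashMap[Int, Byte](count).asScala
        
        // scala.collection.mutable.OpenHashMap also accepts an initial size.
        val openHash = new mutable.OpenHashMap[Int, Byte](count)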


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    **[Test build #4414 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4414/testReport)** for PR 22894 at commit [`57bdd75`](https://github.com/apache/spark/commit/57bdd7525f3353a6d59772b2a86abbe6a0d5f4ba).


---



[GitHub] spark pull request #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapSta...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22894#discussion_r231155000
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
         private[this] var numNonEmptyBlocks: Int,
         private[this] var emptyBlocks: RoaringBitmap,
         private[this] var avgSize: Long,
    -    private var hugeBlockSizes: Map[Int, Byte])
    +    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])
    --- End diff --
    
    Yes, you're right. I just checked the implementation of `MapBuilder`... Seems like there is no efficient way to build an immutable map... I am fine with the approach you're proposing using `scala.collection.Map`, but it still seems a bit hacky to me.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by Koraseg <gi...@git.apache.org>.
Github user Koraseg commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Thanks for the remark above. I have checked scala.mutable.Map performance; it is substantially better, with speedups of up to 2x in some cases. I will update the benchmark and the PR soon.


---



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22894
  
    Can one of the admins verify this patch?


---
