You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yaooqinn <gi...@git.apache.org> on 2015/11/09 07:37:12 UTC

[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

GitHub user yaooqinn opened a pull request:

    https://github.com/apache/spark/pull/9559

    [SPARK-11583] Make MapStatus use less memory uage

    In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using BitSet can save ≈20% memory usage compared to RoaringBitMap.
    For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean.
    Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
    So if we use a HashSet[Int] to store reduceId (when non-empty blocks are dense,use reduceId of empty blocks; when sparse, use non-empty ones).
    For dense cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
    For sparse cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
    sparse case, 299/300 are empty
    sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
    dense case, no block is empty
    sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yaooqinn/spark mapstatus-smaller

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9559.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9559
    
----
commit 79353eb4e1c8ef69278681a9c6eaad7b359c4270
Author: YAOQIN <ya...@huawei.com>
Date:   2015-11-07T07:04:13Z

    map status smaller

commit 492adeb58871c179bf5f6b553111c6f2a2b3ee3b
Author: Kent Yao <ya...@hotmail.com>
Date:   2015-11-09T02:21:51Z

    Let MapStatus be smaller in sparse/tense cases

commit cb4bce57fdfad98a64821870025b8173447cf882
Author: Kent Yao <ya...@hotmail.com>
Date:   2015-11-09T03:22:03Z

    Let MapStatus be smaller in sparse/tense cases 1

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247009
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247427
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-154970758
  
    **[Test build #2020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2020/consoleFull)** for PR 9559 at commit [`cb4bce5`](https://github.com/apache/spark/commit/cb4bce57fdfad98a64821870025b8173447cf882).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357467
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,6 +258,16 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    var isSparse = true
    +    if(numNonEmptyBlocks * 32 < totalNumBlocks){
    +      MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse)
    +    }
    +    else if ((totalNumBlocks - numNonEmptyBlocks) * 32 < totalNumBlocks){
    +      isSparse = false
    --- End diff --
    
    this use of a `var` is unnecessarily complicated. Just do
    ```
    if (numNonEmptyBlocks * 32 < totalNumBlocks) {
      MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = true)
    } else if (...) {
      MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = false)
    } else {
      ...
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-154969438
  
    **[Test build #2020 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2020/consoleFull)** for PR 9559 at commit [`cb4bce5`](https://github.com/apache/spark/commit/cb4bce57fdfad98a64821870025b8173447cf882).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357664
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
     
     /**
      * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    + * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
    + * using a HashSet[Int] can save more memory usage than BitSet
    + *
    + * @param loc location where the task is being executed
    + * @param numNonEmptyBlocks the number of non-empty blocks
    + * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
    + * @param avgSize average size of the non-empty blocks
    + */
    +private[spark] class MapStatusTrackingEmptyBlocks private (
    +    private[this] var loc: BlockManagerId,
    +    private[this] var numNonEmptyBlocks: Int,
    +    private[this] var markedBlocks: mutable.HashSet[Int],
    +    private[this] var avgSize: Long,
    +    private[this] var isSparse: Boolean)
    +  extends MapStatus with Externalizable {
    +
    +  // loc could be null when the default constructor is called during deserialization
    +  require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
    +    "Average size can only be zero for map stages that produced no output")
    +
    +  protected def this() = this(null, -1, null, -1, false)  // For deserialization only
    +
    +  override def location: BlockManagerId = loc
    +
    +  override def getSizeForBlock(reduceId: Int): Long = {
    +    if (isSparse ^ markedBlocks.contains(reduceId)) {
    +      0
    +    } else {
    +      avgSize
    +    }
    +  }
    +
    +  override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
    +    loc.writeExternal(out)
    +    out.writeObject(markedBlocks)
    +    out.writeLong(avgSize)
    +  }
    +
    +  override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
    +    loc = BlockManagerId(in)
    +    markedBlocks = new mutable.HashSet[Int]
    +    markedBlocks = in.readObject().asInstanceOf[mutable.HashSet[Int]]
    --- End diff --
    
    did you mean to immediately override the assignment on the previous line?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155252219
  
    @rxin OpenHashSet replace HashSet


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357262
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -141,7 +202,7 @@ private[spark] class HighlyCompressedMapStatus private (
         "Average size can only be zero for map stages that produced no output")
     
       protected def this() = this(null, -1, null, -1)  // For deserialization only
    -
    +  
    --- End diff --
    
    revert


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247598
  
    **[Test build #45450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45450/consoleFull)** for PR 9559 at commit [`e1d1106`](https://github.com/apache/spark/commit/e1d11067e304f3f2131848d568034ec4cae4e3af).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44435230
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -193,6 +254,12 @@ private[spark] object HighlyCompressedMapStatus {
         } else {
           0
         }
    -    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
    +    if(numNonEmptyBlocks * 32 < totalNumBlocks){
    +      MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = true)
    +    } else if ((totalNumBlocks - numNonEmptyBlocks) * 32 < totalNumBlocks){
    +      MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, emptyBlocksHashSet, avgSize, isSparse = false)
    --- End diff --
    
    `isSparse` is the wrong name, I think -- both cases are sparse, its a question of whether or not you are storing the empty blocks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357246
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
     
     /**
      * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    + * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
    + * using a HashSet[Int] can save more memory usage than BitSet
    + *
    + * @param loc location where the task is being executed
    + * @param numNonEmptyBlocks the number of non-empty blocks
    + * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
    + * @param avgSize average size of the non-empty blocks
    + */
    +private[spark] class MapStatusTrackingEmptyBlocks private (
    +    private[this] var loc: BlockManagerId,
    +    private[this] var numNonEmptyBlocks: Int,
    +    private[this] var markedBlocks: mutable.HashSet[Int],
    +    private[this] var avgSize: Long,
    +    private[this] var isSparse: Boolean)
    --- End diff --
    
    you can just do
    ```
    private[spark] class MapStatusTrackingEmptyBlocks private (
        loc: BlockManagerId,
        ...)
      extends MapStatus with ...
    ```
    by default that's a private val, which is what you want


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn closed the pull request at:

    https://github.com/apache/spark/pull/9559


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-154983945
  
    retest plz


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247985
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155138622
  
    Can you use the OpenHashSet in Spark? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155495152
  
    Sean has raised some important higher level questions on the jira -- I'd like us to resolve the discussion there before moving forward on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-156227663
  
    @yaooqinn Can you close this now that you have the new patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357257
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
     
     /**
      * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    + * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
    + * using a HashSet[Int] can save more memory usage than BitSet
    + *
    + * @param loc location where the task is being executed
    + * @param numNonEmptyBlocks the number of non-empty blocks
    + * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
    + * @param avgSize average size of the non-empty blocks
    + */
    +private[spark] class MapStatusTrackingEmptyBlocks private (
    +    private[this] var loc: BlockManagerId,
    +    private[this] var numNonEmptyBlocks: Int,
    +    private[this] var markedBlocks: mutable.HashSet[Int],
    +    private[this] var avgSize: Long,
    +    private[this] var isSparse: Boolean)
    +  extends MapStatus with Externalizable {
    +
    +  // loc could be null when the default constructor is called during deserialization
    +  require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
    +    "Average size can only be zero for map stages that produced no output")
    +
    +  protected def this() = this(null, -1, null, -1, false)  // For deserialization only
    +
    +  override def location: BlockManagerId = loc
    +
    +  override def getSizeForBlock(reduceId: Int): Long = {
    +    if (isSparse ^ markedBlocks.contains(reduceId)) {
    +      0
    +    } else {
    +      avgSize
    +    }
    +  }
    +
    +  override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
    +    loc.writeExternal(out)
    +    out.writeObject(markedBlocks)
    +    out.writeLong(avgSize)
    +  }
    +
    +  override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
    +    loc = BlockManagerId(in)
    +    markedBlocks = new mutable.HashSet[Int]
    +    markedBlocks = in.readObject().asInstanceOf[mutable.HashSet[Int]]
    +    avgSize = in.readLong()
    +  }
    +}
    +
    +private[spark] object MapStatusTrackingEmptyBlocks {
    +  def apply(
    +    loc: BlockManagerId,
    +    numNonEmptyBlocks: Int ,
    +    markedBlocks: mutable.HashSet[Int],
    +    avgSize: Long,
    +    flag: Boolean): MapStatusTrackingEmptyBlocks = {
    --- End diff --
    
    indent these by 2 more spaces


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247982
  
    **[Test build #45450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45450/consoleFull)** for PR 9559 at commit [`e1d1106`](https://github.com/apache/spark/commit/e1d11067e304f3f2131848d568034ec4cae4e3af).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-156309802
  
    OK, close this pr and  see #9661 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9559#discussion_r44357152
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
     
     /**
      * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
    + * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
    + * using a HashSet[Int] can save more memory usage than BitSet
    + *
    + * @param loc location where the task is being executed
    + * @param numNonEmptyBlocks the number of non-empty blocks
    + * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
    + * @param avgSize average size of the non-empty blocks
    + */
    +private[spark] class MapStatusTrackingEmptyBlocks private (
    +    private[this] var loc: BlockManagerId,
    +    private[this] var numNonEmptyBlocks: Int,
    +    private[this] var markedBlocks: mutable.HashSet[Int],
    +    private[this] var avgSize: Long,
    +    private[this] var isSparse: Boolean)
    --- End diff --
    
    why are these `var`s? You never set them to anything


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155252403
  
    @andrewor14 Thanks for your advices


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-154963061
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...

Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-154998766
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9559#issuecomment-155247407
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org