You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yaooqinn <gi...@git.apache.org> on 2015/11/09 07:37:12 UTC
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
GitHub user yaooqinn opened a pull request:
https://github.com/apache/spark/pull/9559
[SPARK-11583] Make MapStatus use less memory uage
In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I said, using BitSet can save ≈20% memory usage compared to RoaringBitMap.
For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean.
Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
So if we use a HashSet[Int] to store reduceId (when non-empty blocks are dense,use reduceId of empty blocks; when sparse, use non-empty ones).
For dense cases: if HashSet[Int](numNonEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
For sparse cases: if HashSet[Int](numEmptyBlocks).size < BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
sparse case, 299/300 are empty
sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
dense case, no block is empty
sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yaooqinn/spark mapstatus-smaller
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9559.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9559
----
commit 79353eb4e1c8ef69278681a9c6eaad7b359c4270
Author: YAOQIN <ya...@huawei.com>
Date: 2015-11-07T07:04:13Z
map status smaller
commit 492adeb58871c179bf5f6b553111c6f2a2b3ee3b
Author: Kent Yao <ya...@hotmail.com>
Date: 2015-11-09T02:21:51Z
Let MapStatus be smaller in sparse/tense cases
commit cb4bce57fdfad98a64821870025b8173447cf882
Author: Kent Yao <ya...@hotmail.com>
Date: 2015-11-09T03:22:03Z
Let MapStatus be smaller in sparse/tense cases 1
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247009
retest this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247427
Merged build started.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-154970758
**[Test build #2020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2020/consoleFull)** for PR 9559 at commit [`cb4bce5`](https://github.com/apache/spark/commit/cb4bce57fdfad98a64821870025b8173447cf882).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357467
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,6 +258,16 @@ private[spark] object HighlyCompressedMapStatus {
} else {
0
}
- new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+ var isSparse = true
+ if(numNonEmptyBlocks * 32 < totalNumBlocks){
+ MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse)
+ }
+ else if ((totalNumBlocks - numNonEmptyBlocks) * 32 < totalNumBlocks){
+ isSparse = false
--- End diff --
this use of a `var` is unnecessarily complicated. Just do
```
if (numNonEmptyBlocks * 32 < totalNumBlocks) {
MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = true)
} else if (...) {
MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = false)
} else {
...
}
```
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-154969438
**[Test build #2020 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2020/consoleFull)** for PR 9559 at commit [`cb4bce5`](https://github.com/apache/spark/commit/cb4bce57fdfad98a64821870025b8173447cf882).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357664
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
/**
* A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
+ * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
+ * using a HashSet[Int] can save more memory usage than BitSet
+ *
+ * @param loc location where the task is being executed
+ * @param numNonEmptyBlocks the number of non-empty blocks
+ * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
+ * @param avgSize average size of the non-empty blocks
+ */
+private[spark] class MapStatusTrackingEmptyBlocks private (
+ private[this] var loc: BlockManagerId,
+ private[this] var numNonEmptyBlocks: Int,
+ private[this] var markedBlocks: mutable.HashSet[Int],
+ private[this] var avgSize: Long,
+ private[this] var isSparse: Boolean)
+ extends MapStatus with Externalizable {
+
+ // loc could be null when the default constructor is called during deserialization
+ require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
+ "Average size can only be zero for map stages that produced no output")
+
+ protected def this() = this(null, -1, null, -1, false) // For deserialization only
+
+ override def location: BlockManagerId = loc
+
+ override def getSizeForBlock(reduceId: Int): Long = {
+ if (isSparse ^ markedBlocks.contains(reduceId)) {
+ 0
+ } else {
+ avgSize
+ }
+ }
+
+ override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
+ loc.writeExternal(out)
+ out.writeObject(markedBlocks)
+ out.writeLong(avgSize)
+ }
+
+ override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
+ loc = BlockManagerId(in)
+ markedBlocks = new mutable.HashSet[Int]
+ markedBlocks = in.readObject().asInstanceOf[mutable.HashSet[Int]]
--- End diff --
did you mean to immediately override the assignment on the previous line?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155252219
@rxin OpenHashSet replace HashSet
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357262
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +202,7 @@ private[spark] class HighlyCompressedMapStatus private (
"Average size can only be zero for map stages that produced no output")
protected def this() = this(null, -1, null, -1) // For deserialization only
-
+
--- End diff --
revert
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247598
**[Test build #45450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45450/consoleFull)** for PR 9559 at commit [`e1d1106`](https://github.com/apache/spark/commit/e1d11067e304f3f2131848d568034ec4cae4e3af).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44435230
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,6 +254,12 @@ private[spark] object HighlyCompressedMapStatus {
} else {
0
}
- new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+ if(numNonEmptyBlocks * 32 < totalNumBlocks){
+ MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, nonEmptyBlocks, avgSize, isSparse = true)
+ } else if ((totalNumBlocks - numNonEmptyBlocks) * 32 < totalNumBlocks){
+ MapStatusTrackingEmptyBlocks(loc, numNonEmptyBlocks, emptyBlocksHashSet, avgSize, isSparse = false)
--- End diff --
`isSparse` is the wrong name, I think -- both cases are sparse, its a question of whether or not you are storing the empty blocks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357246
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
/**
* A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
+ * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
+ * using a HashSet[Int] can save more memory usage than BitSet
+ *
+ * @param loc location where the task is being executed
+ * @param numNonEmptyBlocks the number of non-empty blocks
+ * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
+ * @param avgSize average size of the non-empty blocks
+ */
+private[spark] class MapStatusTrackingEmptyBlocks private (
+ private[this] var loc: BlockManagerId,
+ private[this] var numNonEmptyBlocks: Int,
+ private[this] var markedBlocks: mutable.HashSet[Int],
+ private[this] var avgSize: Long,
+ private[this] var isSparse: Boolean)
--- End diff --
you can just do
```
private[spark] class MapStatusTrackingEmptyBlocks private (
loc: BlockManagerId,
...)
extends MapStatus with ...
```
by default that's a private val, which is what you want
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn closed the pull request at:
https://github.com/apache/spark/pull/9559
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-154983945
retest plz
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247985
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155138622
Can you use the OpenHashSet in Spark?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by squito <gi...@git.apache.org>.
Github user squito commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155495152
Sean has raised some important higher level questions on the jira -- I'd like us to resolve the discussion there before moving forward on this.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-156227663
@yaooqinn Can you close this now that you have the new patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357257
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
/**
* A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
+ * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
+ * using a HashSet[Int] can save more memory usage than BitSet
+ *
+ * @param loc location where the task is being executed
+ * @param numNonEmptyBlocks the number of non-empty blocks
+ * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
+ * @param avgSize average size of the non-empty blocks
+ */
+private[spark] class MapStatusTrackingEmptyBlocks private (
+ private[this] var loc: BlockManagerId,
+ private[this] var numNonEmptyBlocks: Int,
+ private[this] var markedBlocks: mutable.HashSet[Int],
+ private[this] var avgSize: Long,
+ private[this] var isSparse: Boolean)
+ extends MapStatus with Externalizable {
+
+ // loc could be null when the default constructor is called during deserialization
+ require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
+ "Average size can only be zero for map stages that produced no output")
+
+ protected def this() = this(null, -1, null, -1, false) // For deserialization only
+
+ override def location: BlockManagerId = loc
+
+ override def getSizeForBlock(reduceId: Int): Long = {
+ if (isSparse ^ markedBlocks.contains(reduceId)) {
+ 0
+ } else {
+ avgSize
+ }
+ }
+
+ override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
+ loc.writeExternal(out)
+ out.writeObject(markedBlocks)
+ out.writeLong(avgSize)
+ }
+
+ override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
+ loc = BlockManagerId(in)
+ markedBlocks = new mutable.HashSet[Int]
+ markedBlocks = in.readObject().asInstanceOf[mutable.HashSet[Int]]
+ avgSize = in.readLong()
+ }
+}
+
+private[spark] object MapStatusTrackingEmptyBlocks {
+ def apply(
+ loc: BlockManagerId,
+ numNonEmptyBlocks: Int ,
+ markedBlocks: mutable.HashSet[Int],
+ avgSize: Long,
+ flag: Boolean): MapStatusTrackingEmptyBlocks = {
--- End diff --
indent these by 2 more spaces
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247982
**[Test build #45450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45450/consoleFull)** for PR 9559 at commit [`e1d1106`](https://github.com/apache/spark/commit/e1d11067e304f3f2131848d568034ec4cae4e3af).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-156309802
OK, close this pr and see #9661
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/9559#discussion_r44357152
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,6 +123,65 @@ private[spark] class CompressedMapStatus(
/**
* A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
+ * plus a hashset for tracking which blocks are empty(dense) / non-empty(sparse).
+ * using a HashSet[Int] can save more memory usage than BitSet
+ *
+ * @param loc location where the task is being executed
+ * @param numNonEmptyBlocks the number of non-empty blocks
+ * @param markedBlocks a HashSet tracking which blocks are empty(dense) / non-empty(sparse)
+ * @param avgSize average size of the non-empty blocks
+ */
+private[spark] class MapStatusTrackingEmptyBlocks private (
+ private[this] var loc: BlockManagerId,
+ private[this] var numNonEmptyBlocks: Int,
+ private[this] var markedBlocks: mutable.HashSet[Int],
+ private[this] var avgSize: Long,
+ private[this] var isSparse: Boolean)
--- End diff --
why are these `var`s? You never set them to anything
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by yaooqinn <gi...@git.apache.org>.
Github user yaooqinn commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155252403
@andrewor14 Thanks for your advices
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-154963061
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583] Make MapStatus use less memory u...
Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-154998766
retest this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-11583][Scheduler][Core] Make MapStatus ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/9559#issuecomment-155247407
Merged build triggered.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org