Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/09/29 22:41:20 UTC

[jira] [Comment Edited] (SYSTEMML-994) GC OOM: Binary Matrix to Frame Conversion

    [ https://issues.apache.org/jira/browse/SYSTEMML-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534269#comment-15534269 ] 

Matthias Boehm edited comment on SYSTEMML-994 at 9/29/16 10:41 PM:
-------------------------------------------------------------------

a couple of comments: 
(1) please do not specify the default parallelism - taking the number of input partitions or the total number of cores for parallelize (both are the defaults) usually works very well,
(2) in my experiments, the simple configuration of 1 executor per node with large memory actually works best (and has advantages in terms of resource consumption because, for example, broadcasts can be shared in memory),
(3) over-committing CPU is usually a good idea too, so there is no need to spare a core for the OS,
(4) since you are running standalone, you might want to explicitly specify 'spark.local.dir',
(5) make sure the driver runs on a single node with nothing else on it - I've seen huge scheduler delays (that affect the entire cluster) if an executor runs on the same node,
(6) regarding the GC issue, keep in mind that it is NOT an OOM but simply an exceeded GC overhead limit, which indicates unnecessary object allocations, etc. Please try setting -Xmn to 10% of the max heap; if this does not help, please change the GC limit.
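For reference, points (4) and (6) could be expressed as configuration along these lines - a sketch only, where the directory paths and the 2g young-generation size are illustrative placeholders (10% of a ~21g executor heap) and would need tuning to the actual cluster:

```
# spark-defaults.conf -- illustrative values, not a recommendation as-is

# (4) explicit local scratch directories for the standalone cluster
spark.local.dir                  /data1/spark-tmp,/data2/spark-tmp

# (6) size the young generation (-Xmn) to ~10% of the executor heap;
# -XX:-UseGCOverheadLimit disables the "GC overhead limit exceeded"
# check entirely, if tuning -Xmn alone does not help
spark.executor.extraJavaOptions  -Xmn2g -XX:-UseGCOverheadLimit
```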

Apart from the configuration issue, it is certainly useful to also do another pass over the matrix to frame converter to eliminate any remaining inefficiencies.


was (Author: mboehm7):
a couple of comments: 
(1) please do not specify the default parallelism - taking the number of input partitions or the total number of cores for parallelize (both are the defaults) usually works very well,
(2) in my experiments, the simple configuration of 1 executor per node with large memory actually works best (and has advantages in terms of resource consumption because, for example, broadcasts can be shared in memory),
(3) over-committing CPU is usually a good idea too, so there is no need to spare a core for the OS,
(4) since you are running standalone, you might want to explicitly specify 'spark.local.dir',
(5) make sure the driver runs on a single node with nothing else on it - I've seen huge scheduler delays (that affect the entire cluster) if an executor runs on the same node,
(6) I usually run both the driver and executors with the '-server' flag, which uses different JIT and GC configurations that really pay off for long-running executors.

> GC OOM: Binary Matrix to Frame Conversion
> -----------------------------------------
>
>                 Key: SYSTEMML-994
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-994
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Mike Dusenberry
>            Priority: Blocker
>
> I currently have a SystemML matrix saved to HDFS in binary block format, and am attempting to read it in, convert it to a {{frame}}, and then pass that to an algorithm so that I can pull batches out of it with minimal overhead.
> When attempting to run this, I am repeatedly hitting the following GC limit:
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 	at org.apache.sysml.runtime.matrix.data.FrameBlock.ensureAllocatedColumns(FrameBlock.java:281)
> 	at org.apache.sysml.runtime.matrix.data.FrameBlock.copy(FrameBlock.java:979)
> 	at org.apache.sysml.runtime.matrix.data.FrameBlock.copy(FrameBlock.java:965)
> 	at org.apache.sysml.runtime.matrix.data.FrameBlock.<init>(FrameBlock.java:91)
> 	at org.apache.sysml.runtime.instructions.spark.utils.FrameRDDAggregateUtils$CreateBlockCombinerFunction.call(FrameRDDAggregateUtils.java:57)
> 	at org.apache.sysml.runtime.instructions.spark.utils.FrameRDDAggregateUtils$CreateBlockCombinerFunction.call(FrameRDDAggregateUtils.java:48)
> 	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
> 	at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:187)
> 	at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:186)
> 	at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:148)
> 	at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
> 	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> 	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Script:
> {code}
> train = read("train")
> val = read("val")
> trainf = as.frame(train)
> valf = as.frame(val)
> // Rest of algorithm, which passes the frames to DML functions, and performs row indexing to pull out batches, convert to matrices, and train.
> {code}
> Cluster setup:
> * Spark Standalone
> * 1 Master, 9 Workers
> * 47 cores, 124 GB available to Spark on each Worker (1 core + 1GB saved for OS)
> * spark.driver.memory 80g
> * spark.executor.memory 21g
> * spark.executor.cores 3
> * spark.default.parallelism 20000
> * spark.driver.maxResultSize 0
> * spark.akka.frameSize 128
> * spark.network.timeout 1000s
> Note: This is using today's latest build as of 09.29.16 1:30PM PST.
> cc [~mboehm7], [~acs_s]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)