Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/07/25 22:39:20 UTC
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392779#comment-15392779 ]
Joseph K. Bradley commented on SPARK-13434:
-------------------------------------------
I agree it's very important. That JIRA had gotten lost for a while, but it is now linked from the umbrella: [SPARK-3162]
> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.6.0
> Environment: Linux
> Reporter: Ewan Higgs
> Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on github (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is hitting memory problems when training RandomForest on largish datasets on machines with 64G memory and the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}}, and captured a heap profile from a single machine by running {{jmap -histo:live <spark-pid>}}. I took a sample every 5 seconds; at the peak it looks like this:
> {code}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 5428073 8458773496 [D
> 2: 12293653 4124641992 [I
> 3: 32508964 1820501984 org.apache.spark.mllib.tree.model.Node
> 4: 53068426 1698189632 org.apache.spark.mllib.tree.model.Predict
> 5: 72853787 1165660592 scala.Some
> 6: 16263408 910750848 org.apache.spark.mllib.tree.model.InformationGainStats
> 7: 72969 390492744 [B
> 8: 3327008 133080320 org.apache.spark.mllib.tree.impl.DTStatsAggregator
> 9: 3754500 120144000 scala.collection.immutable.HashMap$HashMap1
> 10: 3318349 106187168 org.apache.spark.mllib.tree.model.Split
> 11: 3534946 84838704 org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
> 12: 3764745 60235920 java.lang.Integer
> 13: 3327008 53232128 org.apache.spark.mllib.tree.impurity.EntropyAggregator
> 14: 380804 45361144 [C
> 15: 268887 34877128 <constMethodKlass>
> 16: 268887 34431568 <methodKlass>
> 17: 908377 34042760 [Lscala.collection.immutable.HashMap;
> 18: 1100000 26400000 org.apache.spark.mllib.regression.LabeledPoint
> 19: 1100000 26400000 org.apache.spark.mllib.linalg.SparseVector
> 20: 20206 25979864 <constantPoolKlass>
> 21: 1000000 24000000 org.apache.spark.mllib.tree.impl.TreePoint
> 22: 1000000 24000000 org.apache.spark.mllib.tree.impl.BaggedPoint
> 23: 908332 21799968 scala.collection.immutable.HashMap$HashTrieMap
> 24: 20206 20158864 <instanceKlassKlass>
> 25: 17023 14380352 <constantPoolCacheKlass>
> 26: 16 13308288 [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
> 27: 445797 10699128 scala.Tuple2
> {code}
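The reproduction quoted above samples {{jmap -histo:live}} every 5 seconds. A minimal sketch of such a sampling loop (not from the report; {{jmap -histo:live}} is the standard JDK command, but the wrapper script, the pid argument, and the log file name are illustrative):

```shell
#!/bin/sh
# Append a live-object histogram for the target JVM to a log file
# every 5 seconds, until the JVM exits.
# $1: pid of the Spark driver/executor JVM (<spark-pid> in the report)
# $2: optional log file (defaults to heap-usage.log)
PID="$1"
LOG="${2:-heap-usage.log}"

while kill -0 "$PID" 2>/dev/null; do   # loop while the process is alive
  date >> "$LOG"                       # timestamp each sample
  jmap -histo:live "$PID" >> "$LOG"    # live-object histogram (forces a GC)
  sleep 5
done
```

Note that {{-histo:live}} triggers a full GC before counting, so only reachable objects appear in the histogram; the counts above therefore reflect genuinely retained memory rather than garbage awaiting collection.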
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org