Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/07/25 22:39:20 UTC

[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint

    [ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392779#comment-15392779 ] 

Joseph K. Bradley commented on SPARK-13434:
-------------------------------------------

I agree it's very important.  That JIRA had gotten lost for a while, but it is now linked from the umbrella: [SPARK-3162]

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on GitHub (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is hitting problems training a RandomForest on largish datasets on machines with 64G of memory and the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
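> For reproduction purposes, here is a minimal sketch of the kind of training call that triggers this (the dataset path and all parameter values are assumptions modelled on the benchmark, not taken from the user's job):
> {code}
> import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.util.MLUtils
>
> // Load the training data as an RDD[LabeledPoint]; the path is hypothetical.
> val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
>
> // Deep trees are what blow up the heap: each extra level roughly doubles
> // the number of Node/Predict/InformationGainStats objects retained.
> val model = RandomForest.trainClassifier(
>   data,
>   numClasses = 2,
>   categoricalFeaturesInfo = Map[Int, Int](),
>   numTrees = 100,
>   featureSubsetStrategy = "sqrt",
>   impurity = "entropy",  // consistent with the EntropyAggregator entries below
>   maxDepth = 20,
>   maxBins = 32,
>   seed = 42)
> {code}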
> I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in the Spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}}, and captured a heap profile from a single machine by running {{jmap -histo:live <spark-pid>}}. I took a sample every 5 seconds (a scripted version of that loop is sketched after the histogram); at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}
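> The 5-second sampling itself can be scripted; a minimal sketch as a Scala script taking the JVM pid as its argument (the loop, interval, and the {{heap-usage.log}} output path matching the attached log are assumptions):
> {code}
> import sys.process._
> import java.io.File
>
> // PID of the spark-shell / executor JVM to profile.
> val sparkPid = args(0)
>
> // Append one live-object histogram to heap-usage.log every 5 seconds
> // until the script is interrupted.
> while (true) {
>   (Seq("jmap", "-histo:live", sparkPid) #>> new File("heap-usage.log")).!
>   Thread.sleep(5000)
> }
> {code}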


