Posted to issues@spark.apache.org by "Enzo Bonnal (Jira)" <ji...@apache.org> on 2019/11/12 09:40:00 UTC
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

    [ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] 

Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM:
---------------------------------------------------------------

Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty_. Have you taken this into account?


was (Author: enzobnl):
Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ?

> Conditional unnecessary persist on RDDs in ML algorithms
> --------------------------------------------------------
>
>                 Key: SPARK-29856
>                 URL: https://issues.apache.org/jira/browse/SPARK-29856
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Dong Wang
>            Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD _baggedInput_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is only used once. So this persist operation is unnecessary.
> {code:scala}
>     val baggedInput = BaggedPoint
>       .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
>         (tp: TreePoint) => tp.weight, seed = seed)
>       .persist(StorageLevel.MEMORY_AND_DISK)
>       ...
>     while (nodeStack.nonEmpty) {
>       ...
>       timer.start("findBestSplits")
>       RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
>         treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>       timer.stop("findBestSplits")
>     }
>     baggedInput.unpersist()
> {code}
> However, the action on _baggedInput_ is inside a while loop.
> In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses _baggedInput_.
> In most ML applications, the loop executes many times, which means _baggedInput_ is used in many actions. In that case the persist is necessary.
> That is why the persist operation is only "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD _instances_ in ml.clustering.KMeans.fit() and the RDD _indices_ in mllib.clustering.BisectingKMeans.run().
> This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
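
The "conditional" argument above can be illustrated with a small sketch that does not depend on Spark. Everything in it is hypothetical and for illustration only: LazyDataset is a stand-in for an RDD that recomputes its data on every action unless it was persisted, and recomputations and shouldPersist are made-up helper names, not Spark APIs. It only counts how often the data is computed, and it ignores the memory and disk cost that a real persist() adds.

```scala
// Hypothetical stand-in for an RDD: every action recomputes the data
// unless persist() was requested beforehand.
final class LazyDataset[A](compute: () => Seq[A]) {
  private var cached: Option[Seq[A]] = None
  private var persistRequested = false
  var computations = 0 // how many times the data was actually computed

  def persist(): this.type = { persistRequested = true; this }

  def unpersist(): this.type = {
    persistRequested = false
    cached = None
    this
  }

  // An "action": materializes the data, reusing the cache when available.
  def collect(): Seq[A] = cached.getOrElse {
    computations += 1
    val data = compute()
    if (persistRequested) cached = Some(data)
    data
  }
}

object ConditionalPersist {
  // Runs `numActions` actions (like the iterations of the while loop)
  // and returns how many times the data had to be computed.
  def recomputations(numActions: Int, persist: Boolean): Int = {
    val ds = new LazyDataset(() => (1 to 5).map(_ * 2))
    if (persist) ds.persist()
    (1 to numActions).foreach(_ => ds.collect())
    ds.unpersist()
    ds.computations
  }

  // Hypothetical guard: persisting only pays off with more than one action.
  def shouldPersist(expectedActions: Int): Boolean = expectedActions > 1
}
```

With a single action, the data is computed once whether or not it is persisted, so persisting saves nothing. With many actions, persisting reduces the number of computations to one. In real code the number of loop iterations is usually not known before the loop starts, so a guard like shouldPersist would need an estimate, which is an assumption of this sketch rather than something the issue proposes.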


