Posted to issues@spark.apache.org by "Enzo Bonnal (Jira)" <ji...@apache.org> on 2019/11/12 09:40:00 UTC
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ]
Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM:
---------------------------------------------------------------
Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty_. Have you taken this into account?
was (Author: enzobnl):
Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ?
> Conditional unnecessary persist on RDDs in ML algorithms
> --------------------------------------------------------
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 3.0.0
> Reporter: Dong Wang
> Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is only used once. So this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
>     (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ...
> while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
>     treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is inside a while loop.
> In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used in many actions, so the persist is necessary in that case.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
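
For illustration, here is a minimal sketch of what a "conditional" persist could look like, assuming the caller can tell in advance whether the RDD will feed more than one action. The helper withConditionalPersist and its reused flag are hypothetical names and not part of Spark; only persist, unpersist, getStorageLevel and StorageLevel are existing Spark APIs. This is not the change proposed in the ticket.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ConditionalPersist {
  // Hypothetical helper (not a Spark API): persist `rdd` only when the caller
  // expects it to be used by more than one action, then clean up afterwards.
  def withConditionalPersist[T, R](rdd: RDD[T], reused: Boolean)(body: RDD[T] => R): R = {
    // Skip persisting if the RDD is used once or is already persisted elsewhere.
    val persistedHere = reused && rdd.getStorageLevel == StorageLevel.NONE
    if (persistedHere) rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try body(rdd)
    finally {
      // Only unpersist what this helper persisted itself.
      if (persistedHere) rdd.unpersist()
    }
  }
}
```

In RandomForest.run() the number of loop iterations is not known before the loop starts, so how the reused flag would be decided there is left open in this sketch.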
--
This message was sent by Atlassian Jira
(v8.3.4#803005)