Posted to issues@spark.apache.org by "Dong Wang (Jira)" <ji...@apache.org> on 2019/11/12 07:14:00 UTC
[jira] [Created] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
Dong Wang created SPARK-29856:
---------------------------------
Summary: Conditional unnecessary persist on RDDs in ML algorithms
Key: SPARK-29856
URL: https://issues.apache.org/jira/browse/SPARK-29856
Project: Spark
Issue Type: Improvement
Components: ML, MLlib
Affects Versions: 3.0.0
Reporter: Dong Wang
When I run examples.ml.GradientBoostedTreeRegressorExample, I find that RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is used by only one action. In this run, the persist operation is therefore unnecessary.
{code:scala}
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
    (tp: TreePoint) => tp.weight, seed = seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
...
while (nodeStack.nonEmpty) {
  ...
  timer.start("findBestSplits")
  RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
    treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
  timer.stop("findBestSplits")
}
baggedInput.unpersist()
{code}
However, the action on {color:#DE350B}_baggedInput_{color} is in a while loop.
In GradientBoostedTreeRegressorExample, this loop only executes once, so only one action uses {color:#DE350B}_baggedInput_{color}.
In most ML applications, the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used by many actions. In those cases the persist is necessary.
That is why the persist operation is only "conditionally" unnecessary.
The same situation exists in many other ML algorithms, e.g., RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
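To make the "conditionally unnecessary" case concrete, here is a minimal sketch of one way a caller could persist an RDD only when more than one action is expected to use it. This is not how RandomForest.run() is implemented and not a proposed patch. The object ConditionalPersist, the helpers shouldPersist and withConditionalPersist, and the expectedActions parameter are all hypothetical names introduced for illustration, and the sketch assumes the caller can estimate the number of actions up front, which may not hold for the node-stack loop above. Only persist(), unpersist(), getStorageLevel, and StorageLevel are existing Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object ConditionalPersist {

  // Decision rule assumed by this sketch: caching only pays off when the
  // RDD feeds more than one action and nobody has persisted it already.
  def shouldPersist(expectedActions: Int, alreadyPersisted: Boolean): Boolean =
    expectedActions > 1 && !alreadyPersisted

  // Runs `body` on `rdd`, persisting it beforehand only if the rule above
  // says so, and unpersisting afterwards only if this call persisted it.
  def withConditionalPersist[T, R](
      rdd: RDD[T],
      expectedActions: Int,
      level: StorageLevel = StorageLevel.MEMORY_AND_DISK)(body: RDD[T] => R): R = {
    val alreadyPersisted = rdd.getStorageLevel != StorageLevel.NONE
    val persistHere = shouldPersist(expectedActions, alreadyPersisted)
    if (persistHere) rdd.persist(level)
    try {
      body(rdd)
    } finally {
      // Leave the storage level untouched if the caller had set it.
      if (persistHere) rdd.unpersist()
    }
  }
}
```

With such a helper, a run that triggers a single action (as in GradientBoostedTreeRegressorExample) would skip the persist, while a run expected to loop many times would still cache the input. Whether a reliable estimate of the loop count is available in each algorithm is an open question.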