
Posted to issues@spark.apache.org by "Dong Wang (Jira)" <ji...@apache.org> on 2019/11/12 07:14:00 UTC

[jira] [Created] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

Dong Wang created SPARK-29856:
---------------------------------

             Summary: Conditional unnecessary persist on RDDs in ML algorithms
                 Key: SPARK-29856
                 URL: https://issues.apache.org/jira/browse/SPARK-29856
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Dong Wang


When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted but is used by only one action, so the persist operation is unnecessary in this run.

{code:scala}
    val baggedInput = BaggedPoint
      .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
        (tp: TreePoint) => tp.weight, seed = seed)
      .persist(StorageLevel.MEMORY_AND_DISK)
      ...
    while (nodeStack.nonEmpty) {
      ...
      timer.start("findBestSplits")
      RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
        treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
      timer.stop("findBestSplits")
    }
    baggedInput.unpersist()
{code}

However, the action on {color:#DE350B}_baggedInput_{color} is inside a while loop.
In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses {color:#DE350B}_baggedInput_{color}.
In most ML applications, the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used by many actions, and in that case the persist is necessary.
That is why the persist operation is only "conditionally" unnecessary.
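
One possible way to make the persist conditional is sketched below. It is only an illustration, not the RandomForest code: the object ConditionalPersistSketch, the helpers shouldPersist and withConditionalPersist, and the expectedActions parameter are hypothetical names introduced here, and the sketch assumes the caller can estimate in advance how many actions will read the RDD (in RandomForest.run() the number of loop iterations is not known up front, so a real fix would need some other signal).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ConditionalPersistSketch {

  // Hypothetical rule (not part of Spark): caching pays off only when more than
  // one action will read the RDD and nobody has already persisted it.
  def shouldPersist(expectedActions: Int, current: StorageLevel): Boolean =
    expectedActions > 1 && current == StorageLevel.NONE

  // Hypothetical helper: persist `rdd` around `body` only when the rule above
  // says so, and release the cache afterwards even if `body` throws.
  def withConditionalPersist[T, R](rdd: RDD[T], expectedActions: Int)(body: RDD[T] => R): R = {
    val persistedHere = shouldPersist(expectedActions, rdd.getStorageLevel)
    if (persistedHere) rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try body(rdd)
    finally if (persistedHere) rdd.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ConditionalPersistSketch")
      .getOrCreate()
    val input = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)

    // Single action: the RDD is computed once, so it is not persisted.
    val total = withConditionalPersist(input, expectedActions = 1)(rdd => rdd.reduce(_ + _))

    // Two actions: the RDD is persisted so the second action reuses the cache.
    val (count, largest) = withConditionalPersist(input, expectedActions = 2) { rdd =>
      (rdd.count(), rdd.reduce((a, b) => math.max(a, b)))
    }

    println(s"total=$total count=$count largest=$largest")
    spark.stop()
  }
}
```

The helper leaves an RDD that was already persisted by the caller untouched, so it never unpersists a cache it did not create.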

The same situation exists in many other ML algorithms, e.g., the RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().

This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.


