Posted to issues@spark.apache.org by "Aman Omer (Jira)" <ji...@apache.org> on 2019/11/09 18:17:00 UTC
[jira] [Commented] (SPARK-29810) Missing persist on retaggedInput in RandomForest.run()
[ https://issues.apache.org/jira/browse/SPARK-29810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970909#comment-16970909 ]
Aman Omer commented on SPARK-29810:
-----------------------------------
Thanks [~spark_cachecheck] for reporting. I will raise a PR for this.
> Missing persist on retaggedInput in RandomForest.run()
> ------------------------------------------------------
>
> Key: SPARK-29810
> URL: https://issues.apache.org/jira/browse/SPARK-29810
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Major
>
> The RDD retaggedInput should be persisted in ml.tree.impl.RandomForest.run(), because it is used by more than one action.
> {code:scala}
> def run(
>     input: RDD[LabeledPoint],
>     strategy: OldStrategy,
>     numTrees: Int,
>     featureSubsetStrategy: String,
>     seed: Long,
>     instr: Option[Instrumentation],
>     prune: Boolean = true, // exposed for testing only, real trees are always pruned
>     parentUID: Option[String] = None): Array[DecisionTreeModel] = {
>   val timer = new TimeTracker()
>   timer.start("total")
>   timer.start("init")
>   val retaggedInput = input.retag(classOf[LabeledPoint]) // it needs to be persisted
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
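
For readers unfamiliar with the pattern the report describes, here is a minimal sketch of persisting an RDD that is consumed by more than one action. It is not the code in RandomForest.run() and not the eventual patch. The object PersistSketch, the method mapEvaluations, the local[2] master, and the toy data are all hypothetical and chosen only for illustration. An accumulator counts how often the map function runs, so the effect of persist() is visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; not part of Spark or of any patch for SPARK-29810.
object PersistSketch {

  /** Runs two actions on a derived RDD and returns how often the map function ran. */
  def mapEvaluations(persist: Boolean): Long = {
    // Assumption: a local master is enough for a demonstration.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Incremented once per element each time the map function is evaluated.
      val evaluations = sc.longAccumulator("evaluations")

      val derived = sc.parallelize(1 to 100, 4).map { x =>
        evaluations.add(1L)
        x * 2
      }

      // Mark the RDD for caching before the first action touches it.
      if (persist) derived.persist(StorageLevel.MEMORY_AND_DISK)

      derived.count()        // first action: computes every partition
      derived.reduce(_ + _)  // second action: recomputes unless partitions are cached

      // Release cached blocks once no further action needs them.
      if (persist) derived.unpersist()

      evaluations.value
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"without persist: ${mapEvaluations(persist = false)} map evaluations")
    println(s"with persist:    ${mapEvaluations(persist = true)} map evaluations")
  }
}
```

In a local run with no task retries, the unpersisted variant should evaluate the map function once per element for each action, while the persisted variant should evaluate it once per element in total. Whether caching pays off for a given RDD depends on how expensive its lineage is to recompute relative to the memory or disk the cached blocks occupy.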
--
This message was sent by Atlassian Jira
(v8.3.4#803005)