Posted to issues@spark.apache.org by "Dong Wang (Jira)" <ji...@apache.org> on 2019/11/10 11:58:00 UTC
[jira] [Created] (SPARK-29823) Wrong persist strategy in mllib.clustering.KMeans.run()
Dong Wang created SPARK-29823:
---------------------------------
Summary: Wrong persist strategy in mllib.clustering.KMeans.run()
Key: SPARK-29823
URL: https://issues.apache.org/jira/browse/SPARK-29823
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.4.3
Reporter: Dong Wang
In mllib.clustering.KMeans.run(), the RDD norms is persisted, but it has only a single child RDD, zippedData, so persisting it is unnecessary. On the other hand, zippedData is used multiple times in runAlgorithm, so zippedData is the RDD that should be persisted.
{code:scala}
  private[spark] def run(
      data: RDD[Vector],
      instr: Option[Instrumentation]): KMeansModel = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    // Compute squared norms and cache them.
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist() // Unnecessary persist. Only used to generate zippedData.
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    } // needs to persist
    val model = runAlgorithm(zippedData, instr)
    norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
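For illustration, the suggested change could look roughly like the sketch below. It is not a patch against Spark itself: VecWithNorm and withCachedNorms are hypothetical names introduced here, because the real VectorWithNorm is internal to mllib, and the default persist() storage level is only an assumption. The point is that norms stays unpersisted, while zippedData is persisted before the iterative work and unpersisted afterwards.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for Spark's internal VectorWithNorm (not part of the public API).
final case class VecWithNorm(vector: Vector, norm: Double)

object PersistSketch {
  // Hypothetical helper: persists the zipped RDD only while `body` runs.
  def withCachedNorms[T](data: RDD[Vector])(body: RDD[VecWithNorm] => T): T = {
    // norms has a single child (zippedData), so it is deliberately not persisted.
    val norms = data.map(Vectors.norm(_, 2.0))
    val zippedData = data.zip(norms).map { case (v, norm) =>
      VecWithNorm(v, norm)
    }
    // zippedData is the RDD that gets reused, so this is the one to persist.
    zippedData.persist()
    try body(zippedData)
    finally zippedData.unpersist()
  }
}
```

A caller would pass the iterative computation as the body, for example PersistSketch.withCachedNorms(data)(zipped => runAlgorithm(zipped, instr)) inside KMeans, so that every pass over zippedData reads the cached partitions.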
--
This message was sent by Atlassian Jira
(v8.3.4#803005)