Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/08 16:04:58 UTC

[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

    [ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647970#comment-15647970 ] 

Sean Owen commented on SPARK-18356:
-----------------------------------

Yes, the warning is a little bit ominous, but I think it is safe to ignore. The immediate parent is in fact cached, which means that only the brief transformation at the outset needs to be recomputed regularly, and that's not that expensive.
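
To illustrate why a cached parent keeps the cost low, here is a minimal toy model in plain Python. It does not use Spark at all: LazyDataset, its methods, and the evaluations counter are hypothetical names invented for this sketch, and the three-pass loop only stands in for the iterative passes of k-means.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset. Not a Spark API."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # function that produces this dataset's rows
        self.parent = parent
        self.cached = False
        self._data = None
        self.evaluations = 0      # how many times the rows were actually computed

    def cache(self):
        self.cached = True
        return self

    def map(self, fn):
        # The child recomputes from its parent every time it is evaluated.
        return LazyDataset(lambda: [fn(x) for x in self.collect()], parent=self)

    def collect(self):
        if self.cached and self._data is not None:
            return self._data     # served from the cache, no recomputation
        self.evaluations += 1
        data = self._compute()
        if self.cached:
            self._data = data
        return data


source = LazyDataset(lambda: list(range(5)))
parent = source.map(lambda x: x * 10).cache()   # the expensive part, cached
child = parent.map(lambda x: [float(x)])        # brief conversion, not cached

for _ in range(3):   # stands in for repeated passes over the training data
    child.collect()

print(source.evaluations, parent.evaluations, child.evaluations)  # 1 1 3
```

In this model only the cheap conversion step runs on every pass; the cached parent and everything upstream of it are computed once.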

The problem with just calling cache() is that it forces another whole copy of the data set to be persisted, and always in memory. We don't necessarily want to force persistence of the input at all, even though persisting at least the parent RDD is pretty important.

It would be nice to avoid the warning if the parent RDD is cached, though that might be a little tricky. Otherwise I think this can be left as is.
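
As a rough sketch of what "avoid the warning if the parent is cached" could mean, the toy check below looks at a node and its direct parent. Node and should_warn are hypothetical names for illustration only, not Spark classes or methods, and the sketch assumes a single parent per node. Real lineage can have several parents per node, which is part of what makes the real check tricky.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical lineage node; each node has at most one parent here."""
    name: str
    cached: bool = False
    parent: Optional["Node"] = None


def should_warn(node: Node) -> bool:
    """Return True only when neither the node nor its direct parent is cached."""
    if node.cached:
        return False
    return node.parent is None or not node.parent.cached


dataframe_rows = Node("dataframe_rows", cached=True)
vectors = Node("vectors", parent=dataframe_rows)   # the uncached conversion step
raw_input = Node("raw_input")                      # nothing cached anywhere

print(should_warn(vectors))    # False
print(should_warn(raw_input))  # True
```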

> Issue + Resolution: Kmeans Spark Performances (ML package)
> ----------------------------------------------------------
>
>                 Key: SPARK-18356
>                 URL: https://issues.apache.org/jira/browse/SPARK-18356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: zakaria hili
>            Priority: Minor
>              Labels: easyfix
>
> Hello,
> I'm new to Spark, but I think I have found a small problem that can affect Spark KMeans performance.
> Before explaining the problem, I want to describe the warning that I encountered.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k = k + 1
> But when I run the code, I receive this warning:
> WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
> I searched the Spark source code for the origin of this warning and found that two classes are involved:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>  
> Although my DataFrame is cached, the fit method converts it into an internal RDD, which is not cached.
> DataFrame -> RDD -> run KMeans training algorithm (RDD)
> -> The first class (ml package) converts the DataFrame into an RDD and then calls the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm; here Spark checks whether the RDD is cached and, if it is not, logs a warning.
> So the solution to this problem is to cache the RDD before running the KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the DataFrame conversion, then unpersist it after training.
> I hope that was clear.
> If you think I am wrong, please let me know.
> Sincerely,
> Zakaria HILI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org