Posted to issues@spark.apache.org by "YuQiang Ye (Jira)" <ji...@apache.org> on 2020/08/20 02:49:00 UTC

[jira] [Commented] (SPARK-29967) KMeans support instance weighting

    [ https://issues.apache.org/jira/browse/SPARK-29967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180932#comment-17180932 ] 

YuQiang Ye commented on SPARK-29967:
------------------------------------

{code:scala}
  def run(data: RDD[Vector]): KMeansModel = {
-    run(data, None)
+    val instances: RDD[(Vector, Double)] = data.map {
+      case (point) => (point, 1.0)
+    }
+    runWithWeight(instances, None)
  }
{code}

Hi, I was testing KMeans performance when moving from Spark 2.4 to Spark 3.0, and performance is noticeably worse than in Spark 2.4. Will the above code from this PR cause the instances' storage level to always be NONE, so that runWithWeight caches the instances again even when the input data is already cached?
[~srowen] [~huaxingao]
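
For illustration, a minimal sketch of the concern (assuming an existing SparkContext sc; the toy points are made up, not from my benchmark): mapping a cached RDD yields a new RDD whose storage level is NONE, and that new RDD is what runWithWeight sees.

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// The caller caches the input, as one typically did with the 2.4 .mllib API.
val data: RDD[Vector] = sc
  .parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
  .cache()
data.count()  // materialize the cache

// run() now maps to (point, 1.0) pairs; the mapped RDD is a new RDD,
// so its storage level is NONE even though the input is already cached.
val instances: RDD[(Vector, Double)] = data.map(point => (point, 1.0))
println(data.getStorageLevel)                            // cached in memory
println(instances.getStorageLevel == StorageLevel.NONE)  // true -> runWithWeight caches it again
{code}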

> KMeans support instance weighting
> ---------------------------------
>
>                 Key: SPARK-29967
>                 URL: https://issues.apache.org/jira/browse/SPARK-29967
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Assignee: Huaxin Gao
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Since https://issues.apache.org/jira/browse/SPARK-9610, we have started to support instance weighting in ML.
> However, clustering and other implementations in features still do not support instance weighting.
> I think we need to start supporting weighting in KMeans, like scikit-learn does.
> It will contain three parts:
> 1, move the impl from .mllib to .ml
> 2, make .mllib.KMeans a wrapper of .ml.KMeans
> 3, support instance weighting in .ml.KMeans (see the usage sketch below)
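>
> A minimal usage sketch of how part 3 could look on the .ml side (the column names, the toy data, and an existing SparkSession spark are assumptions, not the actual implementation or tests):
> {code:scala}
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.linalg.Vectors
>
> // Toy DataFrame with a features column and a per-row weight column.
> val df = spark.createDataFrame(Seq(
>   (Vectors.dense(0.0, 0.0), 1.0),
>   (Vectors.dense(0.1, 0.1), 1.0),
>   (Vectors.dense(9.0, 9.0), 3.0)
> )).toDF("features", "weight")
>
> // KMeans with instance weighting: rows with larger weights pull their cluster center harder.
> val kmeans = new KMeans()
>   .setK(2)
>   .setFeaturesCol("features")
>   .setWeightCol("weight")
> val model = kmeans.fit(df)
> model.clusterCenters.foreach(println)
> {code}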



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org