You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by Kevin-Ferret <gi...@git.apache.org> on 2017/12/19 15:50:59 UTC

[GitHub] spark issue #19340: [SPARK-22119][ML] Add cosine distance to KMeans

Github user Kevin-Ferret commented on the issue:

    https://github.com/apache/spark/pull/19340
  
     @mgaido91 I actually needed something like that recently and I stumbled upon your PR (and JIRA, that I cannot update unfortunately). 
    Your approach looks good to me but I was wondering : are you keeping the existing code to find the new centers, in the sense "newCenter = compute the arithmetic mean of all points in the cluster" ? (couldn't find any git diff on this piece https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L291-L298)
    This will lead to misleading results (your centroids won't minimize the within-cluster cosine distance anymore) unless you make sure to normalize all your input vectors AND your newly computed centroid (ie spherical k-means). 
    I might be wrong/have misread the code, forgive me if that's the case. If you want to discuss further, what's the best place (assuming you don't want this PR derailed with my blabbering) ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org