Posted to issues@spark.apache.org by "Yu Ishikawa (JIRA)" <ji...@apache.org> on 2014/10/08 12:53:35 UTC

[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

    [ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163318#comment-14163318 ] 

Yu Ishikawa edited comment on SPARK-2429 at 10/8/14 10:52 AM:
--------------------------------------------------------------

Hi [~rnowling],

I'm sorry for the delay in my response. I implemented a hierarchical clustering algorithm and benchmarked it. Could you review it?

My implementation ran slower than I had expected, so I have two questions for you.

1. Do you think the performance is bad?
2. If so, is there a better approach that would improve it?

So far I have only checked training performance; I have not yet checked the accuracy of the trained model. If you have any good ideas for testing a hierarchical clustering, please tell me.

h3. Algorithm Matter

The user sets the desired number of clusters. However, if the data cannot be divided any further, the clustering stops at that point, even if fewer clusters have been produced.
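
The stopping rule can be illustrated with a minimal sketch. It assumes a top-down approach that repeatedly splits the largest divisible cluster in two with a small 2-means step. The function names (`split_in_two`, `hierarchical_clustering`), the deterministic seeding, and the "largest cluster first" order are illustrative assumptions in plain Python, not the linked Scala implementation.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty sequence of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def split_in_two(
    points: List[Point], iterations: int = 10
) -> Optional[Tuple[List[Point], List[Point]]]:
    """Split a cluster with 2-means; return None if it cannot be divided."""
    distinct = sorted(set(points))
    if len(distinct) < 2:
        # Fewer than two distinct points: nothing left to separate.
        return None
    # Assumption: seed with the lexicographically smallest and largest
    # points so that the sketch is deterministic.
    centers = [distinct[0], distinct[-1]]
    left: List[Point] = []
    right: List[Point] = []
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]):
                left.append(p)
            else:
                right.append(p)
        if not left or not right:
            return None
        centers = [mean(left), mean(right)]
    return left, right


def hierarchical_clustering(
    points: Sequence[Point], num_clusters: int
) -> List[List[Point]]:
    """Split top-down until num_clusters is reached or no split is possible."""
    clusters: List[List[Point]] = [list(points)]
    while len(clusters) < num_clusters:
        # Try the clusters from largest to smallest.
        order = sorted(range(len(clusters)), key=lambda i: -len(clusters[i]))
        for i in order:
            halves = split_in_two(clusters[i])
            if halves is not None:
                clusters[i:i + 1] = list(halves)
                break
        else:
            # No cluster can be divided any further: stop early.
            break
    return clusters
```

For example, `hierarchical_clustering([(1.0,), (1.0,), (2.0,)], 5)` stops with two clusters, because neither remaining cluster contains two distinct points.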

h3. Scala Code

https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/master/src/main/scala/org.apache.spark.mllib.clustering/HierarchicalClustering.scala

h3. The Benchmark Result

Please check the attached PDF file.


was (Author: yuu.ishikawa@gmail.com):
Hi [~rnowling],

I'm sorry for the delay in my response. I implemented a hierarchical clustering algorithm and benchmarked it. Could you review it?

My implementation ran slower than I had expected, so I have two questions for you.

1. Do you think the performance is bad?
2. If so, is there a better approach that would improve it?

I am also checking the accuracy of my implementation. If you have any good ideas for testing a hierarchical clustering, please tell me.

h3. Algorithm Matter

The user sets the desired number of clusters. However, if the data cannot be divided any further, the clustering stops at that point, even if fewer clusters have been produced.

h3. Scala Code

https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/master/src/main/scala/org.apache.spark.mllib.clustering/HierarchicalClustering.scala

h3. The Benchmark Result

Please check the attached PDF file.

> Hierarchical Implementation of KMeans
> -------------------------------------
>
>                 Key: SPARK-2429
>                 URL: https://issues.apache.org/jira/browse/SPARK-2429
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: Yu Ishikawa
>            Priority: Minor
>         Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf
>
>
> Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary.


