Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2016/10/31 18:14:59 UTC

[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

    [ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15622915#comment-15622915 ] 

Joseph K. Bradley commented on SPARK-15784:
-------------------------------------------

[~wangmiao1981] Sorry for the slow response here.  I do want us to add PIC to spark.ml, but we should have a design discussion before the PR.  Could you please close the PR for now but save the branch to re-open after discussion?

I agree that the big issue is that there isn't a clear way to make predictions on new data points.  In fact, I've never heard of people trying to do so.  Has anyone else?

If prediction is not meaningful for PIC, then I don't think the algorithm fits within the Pipeline framework, though that's debatable.  I see a few options:
* Put PIC in Pipelines as a Transformer, not an Estimator.  We would just need to document that it is a very expensive Transformer.
* Put PIC in spark.ml as a static method.  We may have to do this anyway to support all of spark.mllib's Statistics.
* Put PIC in GraphFrames (and push harder for GraphFrames to be merged back into Spark, which would bring in a much longer list of improvements).

My top choice is PIC as a Transformer.  What do you think?
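
To make the Transformer option concrete, here is a rough sketch in plain Python rather than Spark code.  Everything in it is an assumption for illustration: the class name PICSketch, the transform(edges) signature, the fixed iteration count (instead of a convergence test), the degree-based starting vector, and the tiny 1-D k-means are all hypothetical and are not the spark.mllib implementation.  The point is only the shape: the whole affinity graph goes in, cluster assignments for those same vertices come out, and there is no fit() and no model that could score an unseen point.

```python
from collections import defaultdict


class PICSketch:
    """Hypothetical Transformer-shaped PIC: transform() only, no fit(), no model."""

    def __init__(self, k=2, max_iter=5):
        self.k = k
        self.max_iter = max_iter  # fixed number of power iterations (a simplification)

    def transform(self, edges):
        """edges: iterable of (src, dst, weight) with positive weights, undirected.

        Returns {vertex_id: cluster_index} for the vertices that appear in `edges`.
        """
        # Build a symmetric adjacency structure from the edge list.
        nbrs = defaultdict(dict)
        for i, j, w in edges:
            nbrs[i][j] = w
            nbrs[j][i] = w
        ids = sorted(nbrs)
        degree = {i: sum(nbrs[i].values()) for i in ids}
        total = sum(degree.values())

        # Degree-based starting vector (an assumed choice of initialization).
        v = {i: degree[i] / total for i in ids}

        for _ in range(self.max_iter):
            # One multiplication by the row-normalized affinity matrix D^-1 A.
            nxt = {
                i: sum(w * v[j] for j, w in nbrs[i].items()) / degree[i]
                for i in ids
            }
            # Rescale so the entries sum to 1 in absolute value.
            norm = sum(abs(x) for x in nxt.values())
            v = {i: x / norm for i, x in nxt.items()}

        return self._kmeans_1d(v)

    def _kmeans_1d(self, v):
        """Tiny deterministic 1-D k-means over the entries of v."""
        lo, hi = min(v.values()), max(v.values())
        step = (hi - lo) / max(self.k - 1, 1)
        centers = [lo + c * step for c in range(self.k)]
        assign = {}
        for _ in range(20):
            new_assign = {
                i: min(range(self.k), key=lambda c: abs(x - centers[c]))
                for i, x in v.items()
            }
            if new_assign == assign:
                break
            assign = new_assign
            # Move each center to the mean of its current members.
            for c in range(self.k):
                members = [v[i] for i in assign if assign[i] == c]
                if members:
                    centers[c] = sum(members) / len(members)
        return assign


# Toy graph: a tight triangle {0, 1, 2}, a pair {3, 4}, and one weak link 2-3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (2, 3, 0.1)]
clusters = PICSketch(k=2).transform(edges)
print(clusters)
```

On this toy graph the triangle and the pair end up in different clusters.  Adding a new vertex means calling transform again on the enlarged graph, which is why the operation would need to be documented as a very expensive Transformer.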

CC [~yanboliang] [~sethah] [~mlnick] opinions?

> Add Power Iteration Clustering to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model parity for spark.ml. The review JIRA for clustering is SPARK-14380.


