Posted to issues@spark.apache.org by "Felix Cheung (JIRA)" <ji...@apache.org> on 2016/04/22 05:46:12 UTC

[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

    [ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253285#comment-15253285 ] 

Felix Cheung commented on SPARK-14831:
--------------------------------------

I'd argue it is more important that they resemble the existing R functions. Granted, they are not consistent and they don't always match what Spark supports, but I think we are expecting a large number of long-time R users, who are very familiar with how to call kmeans, to try to use Spark.

However, take kmeans as an example: these are S4 methods, so it should be possible to define them in such a way that they look like R's kmeans by default. For example
{code}
setMethod("kmeans", signature(x = "DataFrame"),
          function(x, centers, iter.max = 10, algorithm = c("random", "k-means||"))
{code}

could be changed, as you later suggested, to have the DataFrame followed by the formula:
{code}
setMethod("kmeans", signature(data = "DataFrame"),
          function(data, formula = NULL, centers, iter.max = 10, algorithm = c("random", "k-means||"))
{code}
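
To make the difference concrete at the call site, here is a rough sketch (assuming a DataFrame df with feature columns x and y; names and defaults are illustrative only, not tested):
{code}
# R-like signature, mirroring stats::kmeans:
model <- kmeans(df, centers = 2, iter.max = 20)

# DataFrame-then-formula signature:
model <- kmeans(df, formula = ~ x + y, centers = 2, iter.max = 20)

# the consistent algorithm(df, formula, ...) convention from the issue,
# using the ml.kmeans naming suggested there:
model <- ml.kmeans(df, ~ x + y, centers = 2)
{code}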


> Make ML APIs in SparkR consistent
> ---------------------------------
>
>                 Key: SPARK-14831
>                 URL: https://issues.apache.org/jira/browse/SPARK-14831
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we put them together, they are not consistent. One example is k-means, which doesn't accept a formula. Instead of looking at each method independently, we might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts `family` before `data` while `kmeans` puts `centers` after `data`. This is not consistent. And logically, the formula doesn't mean anything without being associated with a DataFrame. So it makes more sense to me to have the following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]


