Posted to issues@spark.apache.org by "Yanbo Liang (JIRA)" <ji...@apache.org> on 2015/12/03 07:52:11 UTC

[jira] [Commented] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

    [ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037387#comment-15037387 ] 

Yanbo Liang commented on SPARK-10523:
-------------------------------------

SPARK-11349 added support for labels of String type, so you can use SparkR to train a binary classification model as of the 1.6 release. If you want to train a multiclass classification model, you will need to wait for SPARK-7159.
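
For reference, a minimal sketch of how the call from the issue description should look on 1.6. It assumes a local SparkR 1.6 session started with sparkR.init() and sparkRSQL.init(); the master URL and app name are illustrative, and the output is omitted.

```r
library(SparkR)

# Assumption: SparkR 1.6 is installed and a local master is acceptable.
sc <- sparkR.init(master = "local[2]", appName = "string-label-glm")
sqlContext <- sparkRSQL.init(sc)

# Same toy data as in the issue description; the label stays a plain string column.
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6),
                 stringsAsFactors = FALSE)
ddf <- createDataFrame(sqlContext, df)

# With string labels handled by the formula, no manual boolean column is needed.
model <- glm(class ~ i, family = "binomial", data = ddf)
summary(model)

sparkR.stop()
```

This only covers the two-class case; a label column with more than two distinct values is what SPARK-7159 is meant to address.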

> SparkR formula syntax to turn strings/factors into numerics
> -----------------------------------------------------------
>
>                 Key: SPARK-10523
>                 URL: https://issues.apache.org/jira/browse/SPARK-10523
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>            Reporter: Vincent Warmerdam
>
> In normal (non-SparkR) R, the formula syntax turns strings or factors into dummy variables automatically when calling a classifier. As a result, the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method recognizes that `class` is a string/factor and handles it by encoding it as a 0/1 vector before fitting the model. SparkR doesn't do this.
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
> 	at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
> 	at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
> 	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
> 	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> 	at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> 	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
> 	at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
> 	at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models that classify multiple classes. This is perhaps less relevant for logistic regression (because it is a bit more like a one-off classification approach), but it certainly is relevant if you want to use a formula for a random forest and a column denotes, say, a type of flower from the iris dataset.
> Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which looks like it is addressing a similar theme but a different issue.


