Posted to issues@spark.apache.org by "Vincent Warmerdam (JIRA)" <ji...@apache.org> on 2015/09/10 01:13:45 UTC
[jira] [Created] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics
Vincent Warmerdam created SPARK-10523:
-----------------------------------------
Summary: SparkR formula syntax to turn strings/factors into numerics
Key: SPARK-10523
URL: https://issues.apache.org/jira/browse/SPARK-10523
Project: Spark
Issue Type: Bug
Reporter: Vincent Warmerdam
In normal (non-SparkR) R, the formula syntax lets strings or factors be turned into dummy variables automatically when calling a classifier. As a result, the following R code is legal and commonly used.
{code}
library(magrittr)
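# stringsAsFactors = TRUE keeps `class` a factor on R versions where strings are no longer converted by default
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6), stringsAsFactors = TRUE)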
glm(class ~ i, family = "binomial", data = df)
{code}
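For reference, the dummy coding that base R performs behind the formula can be inspected directly with model.matrix. The following is a minimal sketch in plain R (no SparkR involved) that reuses the data from the example above; the explicit factor() call is an addition so that the snippet behaves the same regardless of how the R version at hand treats strings.

```r
# Plain R only; no Spark session is needed for this illustration.
df <- data.frame(class = factor(c("a", "a", "b", "b")), i = c(1, 2, 5, 6))

# A factor used as a predictor is expanded into 0/1 dummy columns.
# With the default treatment contrasts the first level ("a") is the baseline.
mm <- model.matrix(~ class, data = df)
print(mm)

# A two-level factor used as a binomial response is read as
# first level = failure (0), any other level = success (1).
y <- as.integer(df$class != levels(df$class)[1])
print(y)
```

This is the implicit conversion that the report asks SparkR formulas to perform on string labels.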
SparkR doesn't allow this.
{code}
> ddf <- sqlContext %>%
createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: Unsupported type for label: StringType
at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.refl
{code}
This can be worked around with a bit of manual labor, because SparkR does accept a boolean label as if it were an integer here.
{code}
> ddf <- ddf %>%
withColumn("to_pred", .$class == "a")
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}
But this becomes quite tedious, especially when the models involve multiple classes that need to be classified.
Is there a good reason why this should not be a feature of formulas in Spark? I am aware of issue 8774, which appears to address a similar theme but a different problem.
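To illustrate the bookkeeping involved, here is a rough sketch in plain R of what the manual approach might look like for a label with three classes: one boolean column per class, each needing its own formula. The data, the third class "c", and the `is_` column prefix are made up for illustration; in SparkR each assignment would presumably correspond to a withColumn call like the one above.

```r
# Hypothetical data with three classes instead of two.
df <- data.frame(class = c("a", "a", "b", "b", "c", "c"),
                 i = c(1, 2, 5, 6, 9, 10),
                 stringsAsFactors = FALSE)

# One boolean indicator column per class (one-vs-rest encoding).
for (lvl in unique(df$class)) {
  df[[paste0("is_", lvl)]] <- df$class == lvl
}
print(df)

# Each indicator is a separate binary label, so three formulas are needed.
formulas <- lapply(unique(df$class), function(lvl) {
  as.formula(paste0("is_", lvl, " ~ i"))
})
print(formulas)
```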