Posted to issues@spark.apache.org by "Vincent Warmerdam (JIRA)" <ji...@apache.org> on 2015/09/10 01:13:45 UTC

[jira] [Created] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

Vincent Warmerdam created SPARK-10523:
-----------------------------------------

             Summary: SparkR formula syntax to turn strings/factors into numerics
                 Key: SPARK-10523
                 URL: https://issues.apache.org/jira/browse/SPARK-10523
             Project: Spark
          Issue Type: Bug
            Reporter: Vincent Warmerdam


In normal (non-SparkR) R, the formula syntax turns strings or factors into dummy variables automatically when a classifier is called. This makes the following R code legal and commonly used.

{code}
library(magrittr)   # provides the %>% pipe used further below
df <- data.frame(class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
glm(class ~ i, family = "binomial", data = df)   # the string/factor label is handled automatically
{code}

SparkR doesn't allow this. 

{code}
> ddf <- sqlContext %>% 
  createDataFrame(df)
> glm(class ~ i, family = "binomial", data = ddf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported type for label: StringType
	at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
	at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
	at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
	at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
	at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
	at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.refl
{code}

This can be worked around with a bit of manual labor, since SparkR does accept booleans as if they were integers here.

{code}
> ddf <- ddf %>% 
  withColumn("to_pred", .$class == "a")   # boolean column stands in for the string label
> glm(to_pred ~ i, family = "binomial", data = ddf)
{code}

But this can become quite tedious, especially for models that need to classify over multiple classes; a rough sketch of what that looks like follows below.
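
For illustration, here is roughly what the manual, one-vs-rest style workaround could look like if the label column had three levels. This is only a sketch: the extra level "c" and the is_a/is_b/is_c column names are invented for the example.

{code}
# Sketch only: assumes a hypothetical three-level label column "class"
# with values "a", "b" and "c"; the is_* column names are made up.
> ddf <- ddf %>% 
  withColumn("is_a", .$class == "a") %>% 
  withColumn("is_b", .$class == "b") %>% 
  withColumn("is_c", .$class == "c")
> m_a <- glm(is_a ~ i, family = "binomial", data = ddf)
> m_b <- glm(is_b ~ i, family = "binomial", data = ddf)
> m_c <- glm(is_c ~ i, family = "binomial", data = ddf)
{code}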

Is there a good reason why this should not be a feature of formulas in Spark? I am aware of SPARK-8774, which looks like it addresses a similar theme but a different issue. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org