Posted to issues@spark.apache.org by "Steffen Herbold (JIRA)" <ji...@apache.org> on 2017/01/02 10:13:58 UTC
[jira] [Commented] (SPARK-18301) VectorAssembler does not support StructTypes

    [ https://issues.apache.org/jira/browse/SPARK-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792615#comment-15792615 ] 

Steffen Herbold commented on SPARK-18301:
-----------------------------------------

I think whether this is a bug or a feature request depends on the point of view.

Since structured types are natively supported by Spark, the natural assumption is that they are supported by all features of Spark. If specific features (e.g., transformers) do not support them, then either there should be a good reason for this or it is a bug.

If there is a reason, it should be part of the documentation and this issue should be changed to a feature request. If there is none, this constitutes a bug, and if other transformers are also unable to work with structured types, its priority might actually be major instead of minor.

> VectorAssembler does not support StructTypes
> --------------------------------------------
>
>                 Key: SPARK-18301
>                 URL: https://issues.apache.org/jira/browse/SPARK-18301
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.1
>         Environment: Windows Standalone Mode, Java
>            Reporter: Steffen Herbold
>            Priority: Minor
>
> I tried to transform a structured type using the VectorAssembler as follows:
> {code:java}
> VectorAssembler va = new VectorAssembler()
>         .setInputCols(new String[] { "metrics.Line", "metrics.McCC" })
>         .setOutputCol("features");
> dataframe = va.transform(dataframe);
> {code}
> This yields the following exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Field "metrics.McCC" does not exist.
> 	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
> 	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
> 	at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> 	at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> 	at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
> 	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
> 	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> 	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> 	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:116)
> 	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
> 	at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
> 	at de.ugoe.cs.smartshark.jobs.DefectPredictionExample.main(DefectPredictionExample.java:53)
> {code}
> The schema of the dataframe is:
> {noformat}
>  |-- metrics: struct (nullable = true)
>  |    |-- Line: double (nullable = true)
>  |    |-- McCC: double (nullable = true)
> ...
> {noformat}
> The transformation works if I first use withColumn to turn "metrics.Line" and "metrics.McCC" into top-level columns of the dataframe:
> {code:java}
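> // Copy the nested fields into top-level columns; withColumn returns a new dataframe.
> dataframe = dataframe.withColumn("Line", dataframe.col("metrics.Line"));
> dataframe = dataframe.withColumn("McCC", dataframe.col("metrics.McCC"));
> // Assemble the new top-level columns instead of the nested paths.
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
>             { "Line", "McCC" }).setOutputCol("features");
> dataframe = va.transform(dataframe);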
> {code}
> However, this workaround is quite costly, and direct support for accessing nested values would be very helpful.
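
To illustrate the failure described above: the stack trace suggests that the field name is looked up as-is among the top-level field names of the schema, so the dotted path "metrics.McCC" is not found. The following is only a minimal sketch of that behaviour using plain Java maps; it is not Spark code. The class NestedFieldLookup and its methods topLevel, resolve and exampleSchema are hypothetical names made up for this illustration, and the resolve method merely shows what a dotted-path lookup could look like, not how Spark does or should implement it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a schema (an assumption for illustration, NOT Spark's StructType):
// a field maps either to a type name (String) or to a nested Map for a struct.
final class NestedFieldLookup {

    // Mimics a lookup that only knows the field names of one level.
    static Object topLevel(Map<String, Object> schema, String name) {
        if (!schema.containsKey(name)) {
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        }
        return schema.get(name);
    }

    // Hypothetical alternative: walk the dotted path one struct at a time.
    @SuppressWarnings("unchecked")
    static Object resolve(Map<String, Object> schema, String path) {
        Object current = schema;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                // A non-struct field has no children to descend into.
                throw new IllegalArgumentException("Field \"" + path + "\" does not exist.");
            }
            current = topLevel((Map<String, Object>) current, part);
        }
        return current;
    }

    // Schema from the issue: metrics is a struct with the fields Line and McCC.
    static Map<String, Object> exampleSchema() {
        Map<String, Object> metrics = new LinkedHashMap<>();
        metrics.put("Line", "double");
        metrics.put("McCC", "double");
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("metrics", metrics);
        return schema;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = exampleSchema();
        // Path-aware lookup finds the nested field.
        System.out.println(resolve(schema, "metrics.McCC"));
        try {
            // Name-only lookup treats the whole string as one field name and fails.
            topLevel(schema, "metrics.McCC");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Note that this sketch splits on every dot, so it would misread a top-level field whose name itself contains a dot; a real implementation would need a rule for that case.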


