Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/07/27 09:53:22 UTC
[jira] [Commented] (SPARK-16750) ML GaussianMixture training failed
due to feature column type mistake
[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395338#comment-15395338 ]
Apache Spark commented on SPARK-16750:
--------------------------------------
User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14378
> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
> Key: SPARK-16750
> URL: https://issues.apache.org/jira/browse/SPARK-16750
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Yanbo Liang
>
> ML GaussianMixture training fails because of a feature column type mistake: the feature column should be validated against {{ml.linalg.VectorUDT}}, but it is validated against {{mllib.linalg.VectorUDT}} by mistake.
> The bug is easy to reproduce with the following code:
> {code}
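> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.clustering.GaussianMixture
> import org.apache.spark.ml.feature.MinMaxScaler
> import org.apache.spark.ml.linalg.Vectors
>
> // Four rows of (id, features); features are ml.linalg vectors.
> val df = spark.createDataFrame(
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0)),
>     (2, Vectors.dense(1.0, 0.0, 4.0)),
>     (3, Vectors.dense(1.0, 0.0, 5.0)),
>     (4, Vectors.dense(0.0, 0.0, 5.0)))
> ).toDF("id", "features")
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(0.0)
>   .setMax(5.0)
> // GaussianMixture consumes the scaler's output column.
> val gmm = new GaussianMixture()
>   .setFeaturesCol("features_scaled")
>   .setK(2)
> val pipeline = new Pipeline().setStages(Array(scaler, gmm))
> // Pipeline.fit validates the schema of every stage and fails here:
> pipeline.fit(df)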
> requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
> at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
> at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
> at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
> at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
> at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
> at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
> at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
> {code}
> Why did the unit tests not catch this error? Because some estimators/transformers do not call {{transformSchema(dataset.schema)}} first in {{fit}} or {{transform}}, so the wrong type check is only exercised when the stage runs inside a {{Pipeline}}. I will also add this call to all estimators/transformers that are missing it.
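The following is a minimal sketch of why the missing call can hide such a bug. It is a toy model in plain Scala, not Spark code: `SchemaCheckSketch`, `ToyEstimator`, `checkColumnType`, the `Schema` alias and the `validateInFit` flag are all hypothetical names invented for illustration, and types are modelled as plain strings. It assumes only what the report describes: the schema check expects the wrong vector type, and `fit` may or may not run that check first.

```scala
// Toy model only; every name here is a hypothetical stand-in, not a Spark API.
object SchemaCheckSketch {
  // A schema is modelled as column name -> type name.
  type Schema = Map[String, String]

  // Throws IllegalArgumentException when the column type does not match.
  def checkColumnType(schema: Schema, col: String, expected: String): Unit = {
    val actual = schema.getOrElse(col, "<missing>")
    require(actual == expected,
      s"Column $col must be of type $expected but was actually $actual.")
  }

  // expectedType carries the (possibly wrong) type that the check enforces.
  class ToyEstimator(featuresCol: String, expectedType: String, validateInFit: Boolean) {
    // Schema validation, as a pipeline would invoke it before fitting.
    def transformSchema(schema: Schema): Schema = {
      checkColumnType(schema, featuresCol, expectedType)
      schema + ("prediction" -> "int")
    }

    // When validateInFit is false, fit never exercises the schema check.
    def fit(schema: Schema): String = {
      if (validateInFit) transformSchema(schema)
      "model"
    }
  }

  def main(args: Array[String]): Unit = {
    val data: Schema = Map("features" -> "ml.linalg.VectorUDT")
    val wrongType = "mllib.linalg.VectorUDT" // the mistaken expectation

    val unchecked = new ToyEstimator("features", wrongType, validateInFit = false)
    val checked = new ToyEstimator("features", wrongType, validateInFit = true)

    // A direct fit succeeds, so a test that only calls fit sees no problem.
    println(unchecked.fit(data))
    // Validating the schema explicitly exposes the wrong expectation.
    println(scala.util.Try(unchecked.transformSchema(data)).isFailure)
    // With the check inside fit, the same direct test would fail.
    println(scala.util.Try(checked.fit(data)).isFailure)
  }
}
```

In this model, a test that only calls `fit` on the unchecked estimator passes even though the schema check is wrong, while the checked variant fails immediately, which is the effect the proposed change is meant to achieve.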