Posted to issues@spark.apache.org by "Peng Meng (JIRA)" <ji...@apache.org> on 2017/10/18 01:59:02 UTC
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208708#comment-16208708 ]
Peng Meng commented on SPARK-22295:
-----------------------------------
Hi [~cheburakshu], thanks for reporting this bug and for the helpful code.
This is caused by a problem similar to, but not the same as, SPARK-22277.
The reason is that when a DataFrame is transformed, the field/attribute is not set correctly.
There may be other similar bugs in the code; we can fix them separately or together.
[~yanboliang] [~mlnick] [~srowen]
> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Cheburakshu
>
> ChiSquare selector is not recognizing the field 'class', which is present in the data frame, while fitting the model. I am using the Pima Indians diabetes dataset from UCI. The complete code and output are below for reference. However, when some rows of the input file are used to create a data frame manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                       outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Field "class" does not exist.', 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0B743BC0>)
> *The code below works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This works, but the frame loaded from the file does not.
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                       outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> print(css.selectedFeatures)
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> |preg| plas| pres| skin| test| mass| pedi| age|class|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age|class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> only showing top 1 row
> [0, 1, 2, 3, 5]
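One detail is visible in the two printed tables above. In the failing run the last column's border is six dashes wide, while in the working run it is five. That would be consistent with the CSV header yielding a column named ' class' (with a leading space, like ' plas' and ' age') rather than 'class'. This is only a reading of the printed output, not something confirmed in this thread, and it does not rule out the attribute problem described in the comment. A minimal sketch of normalizing such names follows; the helper name strip_column_names and the list of raw names are assumptions made for illustration.

```python
def strip_column_names(names):
    """Return the given column names with leading/trailing whitespace removed."""
    return [name.strip() for name in names]


# Column names as they appear to be read from the CSV header.
# Assumption: inferred from the printed table widths, not confirmed by the reporter.
raw = ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', ' class']

clean = strip_column_names(raw)
print(clean)
```

With a DataFrame, the cleaned names could be applied with df = df.toDF(*strip_column_names(df.columns)), after which the inputCols passed to VectorAssembler would also need the unpadded names. Alternatively, passing labelCol=' class' would match the padded name as read.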