Posted to issues@spark.apache.org by "Peng Meng (JIRA)" <ji...@apache.org> on 2017/10/18 01:59:02 UTC
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

    [ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208708#comment-16208708 ] 

Peng Meng commented on SPARK-22295:
-----------------------------------

Hi [~cheburakshu], thanks for reporting this bug and for the helpful code.
This is caused by a problem similar to, but not the same as, SPARK-22277.
The reason is that when a DataFrame is transformed, the field/attribute is not set correctly.

There may be other similar bugs in the code; we can fix them separately or all together.

[~yanboliang] [~mlnick] [~srowen]

> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
>                 Key: SPARK-22295
>                 URL: https://issues.apache.org/jira/browse/SPARK-22295
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> ChiSquare selector does not recognize the field 'class', which is present in the data frame, while fitting the model. I am using the Pima Indians Diabetes dataset from UCI. The complete code and output are below for reference. However, when some rows of the input file are used to create a data frame manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                           outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Field "class" does not exist.', 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0B743BC0>)
> *The code below works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This works, but the frame loaded from the file does not.
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                           outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> print(css.selectedFeatures)
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> |preg| plas| pres| skin| test| mass| pedi| age|class|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age|class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> only showing top 1 row
> [0, 1, 2, 3, 5]
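
A note on the two outputs quoted above: in the run that reads the CSV file, the last header cell is printed as `| class|` (six characters wide), while in the manually built frame it is `|class|`. That suggests the file's header yields a column named ' class' with a leading space, like ' plas' and ' age', in which case labelCol='class' would not match any column. Whether this fully explains the error, or is separate from the cause discussed in the comment, is not confirmed here. A minimal sketch for checking it in plain Python follows; the helper names find_padded_columns and strip_column_names are hypothetical and not part of PySpark, and the column list is assumed from the printed header because the CSV file itself is not shown in the report.

```python
def find_padded_columns(columns):
    """Return the column names that carry leading or trailing whitespace."""
    return [name for name in columns if name != name.strip()]


def strip_column_names(columns):
    """Return the names with surrounding whitespace removed.

    Raises ValueError if stripping would make two names collide.
    """
    cleaned = [name.strip() for name in columns]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("stripping whitespace produces duplicate column names")
    return cleaned


# Column names as they appear in the first quoted output (an assumption
# based on the printed header widths, not on the CSV file itself).
csv_columns = ['preg', ' plas', ' pres', ' skin', ' test',
               ' mass', ' pedi', ' age', ' class']

print(find_padded_columns(csv_columns))
print(strip_column_names(csv_columns))
```

With a real DataFrame, the same idea could be applied as `df = df.toDF(*strip_column_names(df.columns))` before assembling the features, in which case the inputCols passed to VectorAssembler would also need the stripped names. Passing labelCol=' class' (with the leading space) would be the other way to test this.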
