Posted to issues@spark.apache.org by "Cheburakshu (JIRA)" <ji...@apache.org> on 2017/10/17 18:21:00 UTC
[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame

     [ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheburakshu updated SPARK-22295:
--------------------------------
    Description: 
ChiSqSelector fails to recognize the field 'class' while fitting the model, even though the field is present in the DataFrame. I am using the Pima Indians diabetes dataset from UCI. The complete code and output are below for reference.

Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name='data/pima-indians-diabetes.data'

df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()

df.show(1)
assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],outputCol="features")
df=assembler.transform(df)
df.show(1)
try:
    css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
                          outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

+----+-----+-----+-----+-----+-----+-----+----+------+
|preg| plas| pres| skin| test| mass| pedi| age| class|
+----+-----+-----+-----+-----+-----+-----+----+------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
+----+-----+-----+-----+-----+-----+-----+----+------+
only showing top 1 row

+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|preg| plas| pres| skin| test| mass| pedi| age| class|            features|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
only showing top 1 row

(<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Field "class" does not exist.', 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0B743BC0>)
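
One possible explanation, read off the output above rather than confirmed in this report: the df.show() header prints the label column as ' class' with a leading space, just like ' plas' and the other names passed to VectorAssembler, while labelCol is given as 'class' without the space. The sketch below is a minimal illustration under that assumption. The helper names find_padded_columns and with_stripped_columns are hypothetical; the only DataFrame members they rely on are df.columns and df.toDF(*names).

```python
def find_padded_columns(columns):
    """Return the column names that carry leading or trailing whitespace."""
    return [name for name in columns if name != name.strip()]


def with_stripped_columns(df):
    """Return a DataFrame whose column names have surrounding whitespace removed.

    Assumes `df` exposes `columns` and `toDF(*names)`, as a PySpark
    DataFrame does. Nothing else about `df` is touched.
    """
    return df.toDF(*[name.strip() for name in df.columns])


# Column names as they appear in the df.show() header of the report.
header = ['preg', ' plas', ' pres', ' skin', ' test',
          ' mass', ' pedi', ' age', ' class']

# Lists every name that would not match its whitespace-free spelling.
print(find_padded_columns(header))

# Hypothetical usage against the DataFrame from the report, before assembling:
#   df = with_stripped_columns(df)
#   inputCols would then be 'preg', 'plas', ... and labelCol='class'.
```

If the assumption holds, passing labelCol=' class' (with the leading space) to ChiSqSelector should also resolve the lookup without renaming anything.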

  was:
The decision tree classifier behaves differently when it is given the output of ChiSqSelector than when it is given the features directly.
In the code below, I use ChiSqSelector as a pass-through, but the decision tree classifier is unable to process its output. It does work when the original features are used directly.

The example is taken directly from the Apache Spark Python documentation.

Kindly help.

{code:python}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
import sys

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])

# ChiSqSelector is just a pass-through here: all four input features will also be in the output.
selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

try:
    # Fails
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
    model = dt.fit(result)
except:
    print(sys.exc_info())
    # Works
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
    model = dt.fit(df)

# Make predictions. Using same dataset, not splitting!!
predictions = model.transform(result)

# Select example rows to display.
predictions.select("prediction", "clicked", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="clicked", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
{code}

Output:

ChiSqSelector output with top 4 features selected
(<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.', 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0A87D878>)
+----------+-------+------------------+
|prediction|clicked|          features|
+----------+-------+------------------+
|       1.0|    1.0|[0.0,0.0,18.0,1.0]|
|       0.0|    0.0|[0.0,1.0,12.0,0.0]|
|       0.0|    0.0|[1.0,0.0,15.0,0.1]|
+----------+-------+------------------+

Test Error = 0 


> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
>                 Key: SPARK-22295
>                 URL: https://issues.apache.org/jira/browse/SPARK-22295
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>


