Posted to issues@spark.apache.org by "Peng Meng (JIRA)" <ji...@apache.org> on 2017/10/17 08:01:00 UTC
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

    [ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207146#comment-16207146 ] 

Peng Meng commented on SPARK-22277:
-----------------------------------

This seems to be a bug. If no one is working on it, I can work on it.

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
>                 Key: SPARK-22277
>                 URL: https://issues.apache.org/jira/browse/SPARK-22277
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> There is a difference in behavior when the output of ChiSqSelector is used versus when the features are used directly in a decision tree classifier.
> In the code below, I have used ChiSqSelector as a pass-through, but the decision tree classifier is unable to process its output. It is, however, able to process the features when they are used directly.
> The example is taken directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
> # ChiSqSelector will just be a pass-through. All four features in the input will also be in the output.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                      outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> try:
>     # Fails: fit on the ChiSqSelector output column
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except Exception:
>     print(sys.exc_info())
>     # Works: fall back to fitting on the original features column
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
>     model = dt.fit(df)
>
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.', 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0A87D878>)
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
> Test Error = 0 
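
The error message in the output above says that feature 0 is marked as nominal but has no number of values specified. One way to narrow this down might be to look at the ML attribute metadata attached to the "selectedFeatures" column. The sketch below is only an illustration under stated assumptions: the helper name nominal_attrs_missing_values is hypothetical, the sample metadata is made up (it is not output from the reporter's run), and the key names ("ml_attr", "attrs", "nominal", "idx", "num_vals", "vals") are assumed to match how Spark serializes attribute metadata.

```python
# Hypothetical helper (not part of Spark): report nominal attributes that
# carry no value information in a column's ML attribute metadata.
def nominal_attrs_missing_values(column_metadata):
    """Return sorted indices of nominal attributes lacking 'num_vals' and 'vals'."""
    # Assumed layout: {"ml_attr": {"attrs": {"nominal": [{"idx": 0, ...}, ...]}}}
    attrs = column_metadata.get("ml_attr", {}).get("attrs", {})
    missing = []
    for attr in attrs.get("nominal", []):
        # An attribute with neither key has no number of values specified.
        if "num_vals" not in attr and "vals" not in attr:
            missing.append(attr["idx"])
    return sorted(missing)


# Made-up metadata for illustration: four nominal attributes with no values.
sample_metadata = {
    "ml_attr": {
        "attrs": {"nominal": [{"idx": i} for i in range(4)]},
        "num_attrs": 4,
    }
}

print(nominal_attrs_missing_values(sample_metadata))
```

With a live Spark session, the dictionary to inspect would come from result.schema["selectedFeatures"].metadata. Whether the metadata in the reporter's run actually looks like the sample above has not been verified here.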


