You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2017/10/24 09:59:00 UTC

[jira] [Comment Edited] (SPARK-22277) Chi Square selector garbling Vector content.

    [ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216623#comment-16216623 ] 

Weichen Xu edited comment on SPARK-22277 at 10/24/17 9:58 AM:
--------------------------------------------------------------

[~cheburakshu] Your post code has some issue:
{code}
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical features ?  
If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features.
If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features.
But your above code compare the two cases, they're contrary, so I am confused.

If these features are categorical, you can process them first by VectorIndexer, it will fill metadata for the features first, and then process by ChiSqSelector/DecisionTreeClassifier, then I think no error will occur.


was (Author: weichenxu123):
[~cheburakshu] Your post code has some issue:
{code}
df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
"clicked"])
{code}
Are these features categorical features ?  
If yes, then you cannot directly training them by DecisionTreeClassifier, because they lack of metadata(Binary / Nominal) and will be regarded as continuous features.
If no, then you cannot use ChiSqSelector to select topFeatures, because ChiSqSelector is only available for categorical features.
But your above code compare the two cases, they're contrary, so I am confused.

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
>                 Key: SPARK-22277
>                 URL: https://issues.apache.org/jira/browse/SPARK-22277
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> There is a difference in behavior when Chisquare selector is used v direct feature use in decision tree classifier. 
> In the below code, I have used chisquare selector as a thru' pass but the decision tree classifier is unable to process it. But, it is able to process when the features are used directly.
> The example is pulled out directly from Apache spark python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", 
> "clicked"])
> # ChiSq selector will just be a pass-through. All four featuresin the i/p will be in output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                      outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % 
> selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> try:
> # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
> #Works    
>     dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features")
>     model = dt.fit(df)
>     
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.', 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0A87D878>)
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
> Test Error = 0 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org