Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/10/11 19:02:20 UTC

[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

    [ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566284#comment-15566284 ] 

Sean Owen commented on SPARK-17870:
-----------------------------------

OK, I get it: the two implementations are really computing different things. The scikit-learn version computes the statistic for count-valued features against a categorical label, while the Spark version computes it for categorical features against a categorical label. Although the number of label classes is the same for every feature in both cases, the Spark computation also depends on the number of distinct feature values, so its degrees of freedom vary from feature to feature. Yes, this does need to be changed in Spark.
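
To make the distinction concrete, here is a minimal pure-Python sketch of the two computations as I understand them. It is not the Spark or scikit-learn code; the function names and the toy data are made up for illustration. The first function builds a feature-value by label contingency table (the categorical-vs-categorical test), the second compares per-class sums of the feature values with their expected share (the count-valued test).

```python
from collections import Counter


def chi2_contingency_stat(feature, labels):
    """Pearson chi-square for a categorical feature vs. a categorical label.

    Hypothetical helper: returns (statistic, degrees_of_freedom), where
    df = (distinct feature values - 1) * (label classes - 1).
    """
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            expected = f_count * y_count / n
            stat += (observed[(f, y)] - expected) ** 2 / expected
    df = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return stat, df


def chi2_count_stat(feature, labels):
    """Chi-square for a count-valued feature vs. a categorical label.

    Hypothetical helper: the observed value for a class is the sum of the
    feature over that class; the expected value is the feature total times
    the class frequency. df = label classes - 1, whatever the feature holds.
    """
    n = len(labels)
    label_totals = Counter(labels)
    total = sum(feature)
    stat = 0.0
    for y, y_count in label_totals.items():
        observed = sum(v for v, lab in zip(feature, labels) if lab == y)
        expected = total * y_count / n
        stat += (observed - expected) ** 2 / expected
    return stat, len(label_totals) - 1


# Toy data (made up): three label classes, one feature with 2 distinct
# values and one with 6 distinct values.
labels = [0, 0, 1, 1, 2, 2]
feature_a = [0, 0, 1, 1, 0, 1]
feature_b = [0, 1, 2, 3, 4, 5]

cont_a = chi2_contingency_stat(feature_a, labels)
cont_b = chi2_contingency_stat(feature_b, labels)
count_a = chi2_count_stat(feature_a, labels)
count_b = chi2_count_stat(feature_b, labels)

for name, result in [("contingency a", cont_a), ("contingency b", cont_b),
                     ("count a", count_a), ("count b", count_b)]:
    print(name, "statistic=%.3f df=%d" % result)
```

On this toy data the contingency-table version yields a different df for each feature, while the count-valued version yields the same df for both, which is why ranking by the raw statistic is only meaningful in the second case.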

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> ------------------------------------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>
> The way the ChiSquareTestResult is computed in mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector selects features based on ChiSquareTestResult.statistic (the chi-square value): it selects the features with the largest chi-square values. But the degrees of freedom (df) of the chi-square values returned by Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with different df cannot be compared directly to select features.
> Because the chi-square value is computed this way, the feature selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  
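
As a rough illustration of why chi-square values with different df cannot be ranked against each other, the sketch below evaluates the upper-tail p-value of the same statistic at two different df. It uses the closed-form survival function that holds for even df only; the helper name and the example numbers are assumptions chosen for illustration, not values taken from the Spark test suite.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Hypothetical helper using the closed form for df = 2m:
        sf(x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# The same statistic (made-up value) evaluated at two different df.
statistic = 12.0
p_df2 = chi2_sf_even_df(statistic, 2)
p_df10 = chi2_sf_even_df(statistic, 10)

print("df=2  p-value: %.4f" % p_df2)
print("df=10 p-value: %.4f" % p_df10)
```

The same statistic is far in the tail at df=2 but unremarkable at df=10, so selecting the largest raw statistic can favor features simply because they have more distinct values.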


