Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 05:35:26 UTC

[jira] [Updated] (SPARK-6312) ChiSqTest should check for too few counts

     [ https://issues.apache.org/jira/browse/SPARK-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-6312:
--------------------------------
    Labels: bulk-closed  (was: )

> ChiSqTest should check for too few counts
> -----------------------------------------
>
>                 Key: SPARK-6312
>                 URL: https://issues.apache.org/jira/browse/SPARK-6312
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>              Labels: bulk-closed
>
> ChiSqTest assumes that the elements of the contingency matrix are large enough (have enough counts) for the central limit theorem to apply. It would be reasonable to do one or more of the following:
> * Add a note in the docs advising users to make sure a reasonable number of instances is used (or, more precisely, that each contingency table entry has enough counts, which also accounts for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is insignificant
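
The check proposed in the list above could be sketched roughly as follows. This is a minimal standalone Python illustration, not Spark's ChiSqTest code. The function names `expected_counts` and `check_min_expected_counts` are hypothetical, and the threshold of 5 on expected cell counts is a common rule of thumb assumed here; the issue itself does not name a threshold.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger("chisq_check")


def expected_counts(observed: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expected cell counts under independence: row_total * col_total / total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("contingency table has no counts")
    return [[r * c / total for c in col_totals] for r in row_totals]


def check_min_expected_counts(
    observed: Sequence[Sequence[float]],
    min_expected: float = 5.0,  # assumed rule-of-thumb threshold, not from the issue
) -> bool:
    """Return True if every expected count reaches min_expected.

    Logs a warning and returns False otherwise.
    """
    smallest = min(min(row) for row in expected_counts(observed))
    if smallest < min_expected:
        logger.warning(
            "Smallest expected count %.2f is below %.2f; "
            "the chi-squared approximation may be unreliable.",
            smallest,
            min_expected,
        )
        return False
    return True
```

A caller could use the boolean result either to emit only the warning or, following the second sub-option in the issue, to report an insignificant p-value when the check fails.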



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
