Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/02 21:31:00 UTC

[jira] [Resolved] (SPARK-25911) [spark-ml] Hypothesis testing module

     [ https://issues.apache.org/jira/browse/SPARK-25911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-25911.
-------------------------------
    Resolution: Won't Fix

I don't think we'd add all of those. Some of these are already in JIRA as ideas. While I don't think we'll add much more like this to ML, if you have one you can argue is widely used, and you can implement it, then I'd create (or find) a JIRA for that one to discuss first.

> [spark-ml] Hypothesis testing module
> ------------------------------------
>
>                 Key: SPARK-25911
>                 URL: https://issues.apache.org/jira/browse/SPARK-25911
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Uday Babbar
>            Priority: Minor
>
> h2. Why this ticket was created
> To gauge the feasibility and value of adding some subset of a hypothesis testing module, and to get a preliminary opinion on the idea. I can work on a more comprehensive proposal if, say, it is generally agreed that including a DataFrame API for the t-test makes sense in the o.a.s.ml package.
> h2. Current state
> There are some streaming implementations in the o.a.s.mllib module, but there are no DataFrame APIs for some standard tests (e.g. the t-test).
> ||Test||Current state||Proposed state||
> |t-test (Welch's, Student's)|streaming only|DataFrame API|
> |chi-squared|streaming; DataFrame/RDD API present| - |
> |ANOVA| - |DataFrame API|
> |Mann-Whitney U test| - |RDD API (in maintenance mode, so it probably doesn't make sense to include this)|
> h2. Rationale 
> Experimentation platforms are widely used, and most of those that operate at scale (a large portion of which use Spark for offline computation) require distributed implementations of hypothesis tests to calculate p-values for different metrics/features. These APIs would enable distributed computation of the relevant statistics and avoid the overhead of moving data (or some downstream view of it) to a framework where such computation is available (R, SciPy).
>  
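
As an illustration of the rationale quoted above: Welch's t-test needs only the count, mean and sample variance of each group, which is why it lends itself to distributed computation. The sketch below is a minimal plain-Python example and is not part of Spark or of the proposal in this ticket; the names GroupStats, summarize and welch_t are hypothetical. In a Spark job the three per-group aggregates could presumably come from a DataFrame aggregation (for example count, avg and var_samp) instead of summarize.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class GroupStats(NamedTuple):
    """Per-group summary; hypothetical helper type, not a Spark class."""
    n: int        # number of observations
    mean: float   # sample mean
    var: float    # unbiased sample variance (divides by n - 1)


def summarize(values: Sequence[float]) -> GroupStats:
    """Local stand-in for a distributed aggregation over one group."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return GroupStats(n, mean, var)


def welch_t(a: GroupStats, b: GroupStats) -> Tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se2_a = a.var / a.n  # squared standard error of group a's mean
    se2_b = b.var / b.n  # squared standard error of group b's mean
    t = (a.mean - b.mean) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.n - 1) + se2_b ** 2 / (b.n - 1)
    )
    return t, df


if __name__ == "__main__":
    control = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
    treatment = summarize([2.0, 4.0, 6.0, 8.0, 10.0])
    t_stat, dof = welch_t(control, treatment)
    print(f"t = {t_stat:.4f}, df = {dof:.4f}")
```

Turning t and df into a p-value still requires the CDF of Student's t distribution, which the Python standard library does not provide, so that step is left out here.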


