Posted to dev@mahout.apache.org by "Dev Lakhani (Created) (JIRA)" <ji...@apache.org> on 2012/04/20 20:26:42 UTC

[jira] [Created] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

Implementation of Single Sample T-Test using Map Reduce/Mahout
--------------------------------------------------------------

                 Key: MAHOUT-1000
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1000
             Project: Mahout
          Issue Type: New Feature
          Components: Math
    Affects Versions: Backlog
         Environment: Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x
            Reporter: Dev Lakhani
             Fix For: Backlog


Implement a map/reduce version of the single-sample t-test, which tests whether a sample of n subjects comes from a population whose mean equals a particular value.

For a large dataset, say n in the millions of rows, one can test whether the sample (large as it is) comes from a population with the specified mean.

Input:
1) specified population mean to be tested against
2) hypothesis direction : i.e. "two.sided", "less", "greater".
3) confidence level or alpha
4) flag to indicate paired or not paired

The procedure is as follows:
1. Use Map/Reduce to calculate the mean of the sample.
2. Use Map/Reduce to calculate the standard error of the sample mean.
3. Use Map/Reduce to calculate the t statistic.
4. Determine the degrees of freedom (n - 1 for a single sample; the equal-variances question only arises in the two-sample case).
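
The steps above can be sketched in plain Java, without any Hadoop or Mahout classes. This is only an illustration under stated assumptions: the class name OneSampleTSketch, the Partial aggregate and the method names are hypothetical and are not part of Mahout. Each mapper or combiner would emit a Partial (count, sum, sum of squares); a single reducer merges them and computes t with n - 1 degrees of freedom. The sum-of-squares form is the simplest to show but can lose precision when the mean is large relative to the spread.

```java
public class OneSampleTSketch {

    /** Hypothetical partial aggregate a mapper/combiner could emit. */
    static final class Partial {
        long n;        // number of observations seen
        double sum;    // sum of observations
        double sumSq;  // sum of squared observations

        void add(double x) { n++; sum += x; sumSq += x * x; }

        // Merging is plain addition, so it is associative and combiner-safe.
        void merge(Partial o) { n += o.n; sum += o.sum; sumSq += o.sumSq; }
    }

    /** Builds a partial aggregate from one input split (stand-in for a map task). */
    static Partial summarize(double[] split) {
        Partial p = new Partial();
        for (double x : split) p.add(x);
        return p;
    }

    /** t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation. */
    static double tStatistic(Partial p, double mu0) {
        double mean = p.sum / p.n;
        double variance = (p.sumSq - p.n * mean * mean) / (p.n - 1);
        double stdErr = Math.sqrt(variance / p.n);
        return (mean - mu0) / stdErr;
    }

    /** Degrees of freedom for the single-sample test. */
    static long degreesOfFreedom(Partial p) { return p.n - 1; }

    public static void main(String[] args) {
        // Two "splits" of the same sample, merged as a reducer would.
        Partial total = summarize(new double[] {1, 2, 3});
        total.merge(summarize(new double[] {4, 5}));
        System.out.println("t  = " + tStatistic(total, 2.0));
        System.out.println("df = " + degreesOfFreedom(total));
    }
}
```

The p-value and the reject/accept flag would then come from a t distribution with the computed degrees of freedom, which this sketch leaves out.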

Output
1) The value of the t-statistic.
2) The p-value for the test.
3) Flag that is true if the null hypothesis can be rejected with confidence 1 - alpha; false otherwise.

References
http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

Posted by "Dev Lakhani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259847#comment-13259847 ] 

Dev Lakhani commented on MAHOUT-1000:
-------------------------------------

I guess this was a naive attempt at creating an MR version of the Apache Commons Math statistics package. Following this implementation, the idea is to go on to extend it to ANOVAs, Wilcoxon tests, Pearson correlations, Kolmogorov-Smirnov and other R-like features (but in MR).

Yup, it could be done in Pig, but it would likely need a UDF: for example, the TTest in Commons Math uses a TDistribution to look up statistical values, so it is perhaps better to do the whole thing in Java. This also makes it easier to test and to control/tune the MR jobs.

I was just testing the waters really, to see if there is support for this; if so, there are plenty of basic stats tests that can be implemented for big data. This will require a bit of help from the community. If not, please feel free to close this entry.

Cheers


                
> Implementation of Single Sample T-Test using Map Reduce/Mahout
> --------------------------------------------------------------
>
>                 Key: MAHOUT-1000
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1000
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: Backlog
>         Environment: Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x
>            Reporter: Dev Lakhani
>              Labels: newbie
>             Fix For: Backlog
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Implement a map/reduce version of the single sample t test to test whether a sample of n subjects comes from a population in which the mean equals a particular value.
> For a large dataset, say n millions of rows, one can test whether the sample (large as it is) comes from the population mean.
> Input:
> 1) specified population mean to be tested against
> 2) hypothesis direction : i.e. "two.sided", "less", "greater".
> 3) confidence level or alpha
> 4) flag to indicate paired or not paired
> The procedure is as follows:
> 1. Use Map/Reduce to calculate the mean of the sample.
> 2. Use Map/Reduce to calculate standard error of the population mean.
> 3. Use Map/Reduce to calculate the t statistic
> 4. Estimate the degrees of freedom depending on equal sample variances 
> Output
> 1) The value of the t-statistic.
> 2) The p-value for the test.
> 3) Flag that is true if the null hypothesis can be rejected with confidence 1 - alpha; false otherwise.
> References
> http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html


[jira] [Commented] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258623#comment-13258623 ] 

Ted Dunning commented on MAHOUT-1000:
-------------------------------------

I am not sure that I see the value here.  All you need for this calculation is the means, the squared differences and the counts.

Do we really need this in Mahout when 3 lines of Pig suffice?
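
For illustration, the three quantities named above (counts, means, squared differences) can be kept as a small mergeable summary. The sketch below is an assumption-laden example, not anything from Mahout or Pig: the class name MomentsSketch and its methods are hypothetical. It uses Welford's update for single values and the pairwise merge formula of Chan et al. for combining partial results, which is numerically steadier than summing raw squares.

```java
public class MomentsSketch {

    /** Hypothetical summary: count, running mean, and sum of squared differences from the mean. */
    static final class Moments {
        long n;
        double mean;
        double m2;

        /** Welford's single-value update. */
        void add(double x) {
            n++;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
        }

        /** Combines two summaries (Chan et al. pairwise formula); neither input is modified. */
        Moments merge(Moments o) {
            Moments r = new Moments();
            r.n = n + o.n;
            if (r.n == 0) return r;
            double delta = o.mean - mean;
            r.mean = mean + delta * o.n / r.n;
            r.m2 = m2 + o.m2 + delta * delta * ((double) n * o.n / r.n);
            return r;
        }

        /** Unbiased sample variance; only meaningful for n >= 2. */
        double variance() { return m2 / (n - 1); }
    }

    /** Summarizes one chunk of data (stand-in for one map task). */
    static Moments of(double... xs) {
        Moments m = new Moments();
        for (double x : xs) m.add(x);
        return m;
    }

    public static void main(String[] args) {
        Moments all = of(1, 2, 3).merge(of(4, 5));
        System.out.println("n=" + all.n + " mean=" + all.mean + " var=" + all.variance());
    }
}
```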