Posted to issues@spark.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2014/06/03 23:14:01 UTC

[jira] [Updated] (SPARK-1969) Public available online summarizer for mean, variance, min, and max

     [ https://issues.apache.org/jira/browse/SPARK-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai updated SPARK-1969:
---------------------------

    Description: 
This change moves the private ColumnStatisticsAggregator class out of RowMatrix and exposes it as a publicly available DeveloperApi.

Changes:
1) Moved the trait from org.apache.spark.mllib.stat.MultivariateStatisticalSummary to org.apache.spark.mllib.stats.Summarizer
2) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stats.OnlineSummarizer
3) Added API documentation for OnlineSummarizer
4) Added unit tests for OnlineSummarizer

  was:
Basically, it will be a port of Mahout's OnlineSummarizer:

https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/OnlineSummarizer.java

Computes on-line estimates of mean, variance and all five quartiles (notably including the median). Since this is done in a completely incremental fashion (that is what is meant by on-line), estimates are available at any time and the amount of memory used is constant.

Somewhat surprisingly, the quantile estimates are about as good as you would get if you actually kept all of the samples.
 
The method used for mean and variance is Welford's method.  See
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
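
As an illustration of the single-pass update, here is a minimal Java sketch of Welford's method for one column, extended with min and max tracking. The class name WelfordAccumulator and its methods are hypothetical and are not taken from Spark or Mahout. The merge step uses the standard pairwise combination formula, on the assumption that partial results (for example, one per partition) need to be combined.

```java
/**
 * Minimal sketch of an online (single-pass) summarizer for one column.
 * Hypothetical class for illustration; not Spark's or Mahout's implementation.
 */
final class WelfordAccumulator {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Welford update: fold one sample into the running statistics. */
    void add(double x) {
        count++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean);  // uses both the old and the new mean
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    /** Pairwise merge of two partial summaries. */
    void merge(WelfordAccumulator other) {
        if (other.count == 0) {
            return; // nothing to fold in
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        mean += delta * other.count / total;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        count = total;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    long count() { return count; }
    double mean() { return mean; }
    /** Unbiased sample variance; 0.0 when fewer than two samples were seen. */
    double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }
    double min() { return min; }
    double max() { return max; }

    public static void main(String[] args) {
        WelfordAccumulator acc = new WelfordAccumulator();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            acc.add(x);
        }
        System.out.println("mean=" + acc.mean() + " variance=" + acc.variance()
                + " min=" + acc.min() + " max=" + acc.max());
    }
}
```

Memory use is constant per column: only the count, the mean, the sum of squared deviations, and the two extremes are kept, whatever the number of samples.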

The method used for computing the quartiles is a simplified form of the stochastic approximation method described in the article "Incremental Quantile Estimation for Massive Tracking" by Chen, Lambert and Pinheiro.
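
To give a rough idea of what a stochastic-approximation quantile estimate looks like, the Java sketch below uses a plain Robbins-Monro update with a decaying step size. This is a swapped-in textbook technique, not the algorithm from the Chen, Lambert and Pinheiro article and not Mahout's simplified form of it. The class name StochasticQuantile, the stepScale parameter, and the seeded uniform test data are all assumptions made for illustration.

```java
import java.util.Random;

/**
 * Sketch of an incremental quantile estimate via a Robbins-Monro update.
 * Hypothetical class; NOT the method from the cited article or from Mahout.
 */
final class StochasticQuantile {
    private final double p;         // target quantile, e.g. 0.5 for the median
    private final double stepScale; // assumed tuning constant, roughly the data scale
    private long count = 0;
    private double estimate = 0.0;

    StochasticQuantile(double p, double stepScale) {
        this.p = p;
        this.stepScale = stepScale;
    }

    /** Nudge the estimate up or down depending on where the sample falls. */
    void add(double x) {
        count++;
        if (count == 1) {
            estimate = x; // start from the first observation
            return;
        }
        double step = stepScale / count; // decaying step size
        double indicator = (x <= estimate) ? 1.0 : 0.0;
        estimate += step * (p - indicator);
    }

    double estimate() { return estimate; }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        StochasticQuantile median = new StochasticQuantile(0.5, 1.0);
        for (int i = 0; i < 100_000; i++) {
            median.add(rng.nextDouble()); // uniform on [0, 1), true median 0.5
        }
        System.out.println("estimated median = " + median.estimate());
    }
}
```

Like the mean and variance update, this keeps a constant amount of state per tracked quantile. The quality of the estimate depends on the choice of step size, which is why a practical implementation would need more care than this sketch.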


     Issue Type: Improvement  (was: New Feature)
        Summary: Public available online summarizer for mean, variance, min, and max  (was: Online Summarizer for mean, variance, min, max, and quartile)

> Public available online summarizer for mean, variance, min, and max
> -------------------------------------------------------------------
>
>                 Key: SPARK-1969
>                 URL: https://issues.apache.org/jira/browse/SPARK-1969
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>
> This change moves the private ColumnStatisticsAggregator class out of RowMatrix and exposes it as a publicly available DeveloperApi.
> Changes:
> 1) Moved the trait from org.apache.spark.mllib.stat.MultivariateStatisticalSummary to org.apache.spark.mllib.stats.Summarizer
> 2) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stats.OnlineSummarizer
> 3) Added API documentation for OnlineSummarizer
> 4) Added unit tests for OnlineSummarizer


