Posted to issues@commons.apache.org by "Anirudh Joshi (Jira)" <ji...@apache.org> on 2023/04/02 17:54:00 UTC

[jira] [Comment Edited] (STATISTICS-54) [GSoC] Summary statistics API for Java 8 streams

    [ https://issues.apache.org/jira/browse/STATISTICS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707700#comment-17707700 ] 

Anirudh Joshi edited comment on STATISTICS-54 at 4/2/23 5:53 PM:
-----------------------------------------------------------------

Thanks, Alex and Gilles, for your input. It clarified a few things I was confused about. I have a few questions about some of our choices:
 # When we say "We would incorporate the functionality directly into a new module in the Statistics project", this means we would create a new module in the [commons-statistics|https://github.com/apache/commons-statistics] project and port over the StorelessUnivariateStatistic implementations currently in the [commons-math-legacy/stat/descriptive|https://github.com/apache/commons-math/tree/master/commons-math-legacy/src/main/java/org/apache/commons/math4/legacy/stat/descriptive] package. Is my understanding correct?
 # Not all of the StorelessUnivariateStatistic implementations we have are part of the SummaryStatistics class (e.g. Kurtosis, PSquarePercentile). Would the goal of this project be to implement collectors for all implementations of StorelessUnivariateStatistic, or only for the statistics we compute as part of SummaryStatistics?
 # On the question of whether we need "a separate stream version for each of the statistics": do we plan to support callers passing a custom statistic implementation when computing SummaryStatistics? If so, I feel it is better to have a stream version for each individual statistic and to implement the SummaryStatistics class as a composition of them. That way callers can supply their own custom implementations if need be. But as Gilles noted, this would duplicate a lot of computation between classes (the classic example being Variance and StandardDeviation, where each would separately compute the same sum of squared deviations from the mean), which would be suboptimal in my opinion. Maybe we can group the common functionality in a separate class and share it between the dependent classes. For the example above, that would be a SumOfSquareDeviationFromMean class: SummaryStatistics would create one instance and initialize both Variance and StandardDeviation with it. I would like to know your thoughts on this.

{code:java}
// API outline only; method bodies are omitted.
public final class SumOfSquareDeviationFromMean {
    // Implementation
}

public class Variance implements DoubleSupplier {
    public static Variance of(double... values);
    public static Variance create(); // Could provide an implementation choice
    public static Variance with(SumOfSquareDeviationFromMean ss); // Called from SummaryStatistics
    public Variance add(double value);
    public Variance add(Variance other);
    public double getAsDouble();
}

public class StandardDeviation implements DoubleSupplier {
    public static StandardDeviation of(double... values);
    public static StandardDeviation create(); // Could provide an implementation choice
    public static StandardDeviation with(SumOfSquareDeviationFromMean ss); // Called from SummaryStatistics
    public StandardDeviation add(double value);
    public StandardDeviation add(StandardDeviation other);
    public double getAsDouble(); // Returns a double, as required by DoubleSupplier
}
{code}
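
To make the sharing idea a bit more concrete, here is a rough, self-contained sketch. It is not the Commons Math implementation, and everything in it is hypothetical: the class names only mirror the outline above, the accumulator uses Welford's online update, the {{combine}} method is my own addition (using the standard pairwise merge formula so that partial results from a parallel stream can be merged), and the bias-corrected (n - 1) variance with NaN for fewer than two values is a choice made only for this example.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical shared state: running mean and sum of squared deviations from it. */
final class SumOfSquareDeviationFromMean {
    private long n;
    private double mean;
    private double m2; // sum of (x - mean)^2 over all values seen so far

    /** Welford's online update for a single value. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Merges another accumulator into this one (pairwise merge formula). */
    void combine(SumOfSquareDeviationFromMean other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
    }

    long count() {
        return n;
    }

    double sum() {
        return m2;
    }
}

/** Reads the shared state; does no accumulation of its own. */
final class Variance implements DoubleSupplier {
    private final SumOfSquareDeviationFromMean shared;

    private Variance(SumOfSquareDeviationFromMean shared) {
        this.shared = shared;
    }

    static Variance with(SumOfSquareDeviationFromMean shared) {
        return new Variance(shared);
    }

    @Override
    public double getAsDouble() {
        long n = shared.count();
        // Assumption for this sketch: bias-corrected variance, NaN below two values.
        return n < 2 ? Double.NaN : shared.sum() / (n - 1);
    }
}

/** Derived from the same shared state via Variance. */
final class StandardDeviation implements DoubleSupplier {
    private final Variance variance;

    private StandardDeviation(Variance variance) {
        this.variance = variance;
    }

    static StandardDeviation with(SumOfSquareDeviationFromMean shared) {
        return new StandardDeviation(Variance.with(shared));
    }

    @Override
    public double getAsDouble() {
        return Math.sqrt(variance.getAsDouble());
    }
}
```

With this shape, a summary class would create one accumulator, feed every value to it once, and hand the same instance to both {{Variance.with(...)}} and {{StandardDeviation.with(...)}}. The accumulator can also be filled from a stream with {{DoubleStream.collect(SumOfSquareDeviationFromMean::new, SumOfSquareDeviationFromMean::add, SumOfSquareDeviationFromMean::combine)}}.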
 

We could also have two factory methods in the SummaryStatistics class.
{code:java}
// Include only the given statistics (with a fair warning that this may be suboptimal if used incorrectly)
public static DoubleStatisticSummary of(Statistic... statistics);

// Include ALL statistics
public static DoubleStatisticSummary of();
{code}
The caller could then decide which one to use based on need.
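
A minimal sketch of how the two factory methods might behave, under several assumptions of mine: {{Statistic}} is modelled as an enum with only four members, the {{accept}}, {{combine}} and {{get}} methods are hypothetical names, and the class is reduced to min, max, sum and mean to keep the example short. None of this is an existing Commons API.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical set of selectable statistics. */
enum Statistic { MIN, MAX, SUM, MEAN }

/** Hypothetical summary that only maintains the statistics that were requested. */
final class DoubleStatisticSummary {
    private final Set<Statistic> selected;
    private long n;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private double sum;

    private DoubleStatisticSummary(Set<Statistic> selected) {
        this.selected = selected;
    }

    /** Computes only the given statistics. */
    static DoubleStatisticSummary of(Statistic... statistics) {
        Set<Statistic> set = EnumSet.noneOf(Statistic.class);
        set.addAll(Arrays.asList(statistics));
        return new DoubleStatisticSummary(set);
    }

    /** Computes all statistics. */
    static DoubleStatisticSummary of() {
        return new DoubleStatisticSummary(EnumSet.allOf(Statistic.class));
    }

    /** Adds one value, updating only the state that a selected statistic needs. */
    void accept(double x) {
        n++;
        if (selected.contains(Statistic.MIN)) {
            min = Math.min(min, x);
        }
        if (selected.contains(Statistic.MAX)) {
            max = Math.max(max, x);
        }
        // SUM and MEAN share one running sum, so it is updated at most once.
        if (selected.contains(Statistic.SUM) || selected.contains(Statistic.MEAN)) {
            sum += x;
        }
    }

    /** Merges another summary; assumes both were created with the same selection. */
    void combine(DoubleStatisticSummary other) {
        n += other.n;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        sum += other.sum;
    }

    /** Returns a selected statistic, or throws if it was not requested. */
    double get(Statistic statistic) {
        if (!selected.contains(statistic)) {
            throw new IllegalArgumentException("Not computed: " + statistic);
        }
        switch (statistic) {
            case MIN: return min;
            case MAX: return max;
            case SUM: return sum;
            default: return sum / n; // MEAN
        }
    }
}
```

Such a class could be used with a Java 8 stream as {{DoubleStream.of(values).collect(() -> DoubleStatisticSummary.of(Statistic.MEAN, Statistic.MAX), DoubleStatisticSummary::accept, DoubleStatisticSummary::combine)}}. Note that SUM and MEAN share one running sum here, which is the same kind of sharing discussed for Variance and StandardDeviation.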

Please let me know. Thanks again for your input. This is helping me gain more clarity on the work to be done.



> [GSoC] Summary statistics API for Java 8 streams
> ------------------------------------------------
>
>                 Key: STATISTICS-54
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-54
>             Project: Commons Statistics
>          Issue Type: Wish
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Minor
>              Labels: full-time, gsoc, gsoc2022, gsoc2023
>             Fix For: 1.0
>
>
> Placeholder for tasks that could be undertaken in this year's [GSoC|https://summerofcode.withgoogle.com/].
> Ideas:
> - Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math {{stat.descriptive}} package including {{moments}}, {{rank}} and {{summary}} sub-packages.


