Posted to issues@commons.apache.org by "Anirudh Joshi (Jira)" <ji...@apache.org> on 2023/04/01 09:53:00 UTC

[jira] [Commented] (STATISTICS-54) [GSoC] Summary statistics API for Java 8 streams

    [ https://issues.apache.org/jira/browse/STATISTICS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707525#comment-17707525 ] 

Anirudh Joshi commented on STATISTICS-54:
-----------------------------------------

Hello [~aherbert] and [~erans]. Hope you are doing well. My name is Anirudh and I am interested in contributing to this project as part of GSoC 2023. I am working on my proposal and would like to discuss my ideas with the community before I finalize it, to check that I am thinking in the right direction.

I have been familiarizing myself with the commons-stat/stat/descriptive project over the past few days. I saw that the current implementation of SummaryStatistics works only with a sequential stream of values, since the combiner parameter of Stream::collect is never invoked in that case. Our goal is to add support for parallel streams too, since this would help reduce the time needed to compute summary statistics, especially when the dataset is large.
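
For comparison, the JDK's own DoubleSummaryStatistics already follows this pattern: its combine method serves as the combiner in collect. The minimal sketch below counts how often the combiner runs for a sequential and for a parallel stream. The class name CombinerDemo, the helper summarize and the sample values are mine and only for illustration. On OpenJDK I would expect zero combiner calls in the sequential case, although as far as I can tell the Stream documentation does not promise that.

```java
import java.util.DoubleSummaryStatistics;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.DoubleStream;

public class CombinerDemo {

    // Collects the values into a DoubleSummaryStatistics and records
    // every combiner invocation in the supplied counter.
    public static DoubleSummaryStatistics summarize(DoubleStream values,
                                                    AtomicInteger combinerCalls) {
        return values.collect(
                DoubleSummaryStatistics::new,
                DoubleSummaryStatistics::accept,
                (left, right) -> {
                    combinerCalls.incrementAndGet();
                    left.combine(right);
                });
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0, -1.0};

        AtomicInteger sequentialCalls = new AtomicInteger();
        DoubleSummaryStatistics sequential =
                summarize(DoubleStream.of(data), sequentialCalls);

        AtomicInteger parallelCalls = new AtomicInteger();
        DoubleSummaryStatistics parallel =
                summarize(DoubleStream.of(data).parallel(), parallelCalls);

        // Both runs should report the same count, sum, min and max;
        // only the number of combiner calls is expected to differ.
        System.out.println("sequential: " + sequential
                + ", combiner calls = " + sequentialCalls);
        System.out.println("parallel:   " + parallel
                + ", combiner calls = " + parallelCalls);
    }
}
```

The number of combiner calls in the parallel case depends on how the stream is split, so I would not rely on a specific value there.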

An important ingredient that we need to support streams is the `merge` functionality: the ability to merge two partially constructed `StorelessUnivariateStatistic` objects. Once this is implemented for all implementing classes of StorelessUnivariateStatistic, we would be able to compute partial SummaryStatistics objects and use the merge function to aggregate them into a single SummaryStatistics object that gives the statistics for the entire dataset.
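
To make the merge step concrete, here is a minimal sketch of a mergeable mean/variance state. The class name Moments and all of its methods are hypothetical and not part of Commons Math or Commons Statistics. The per-value update is Welford's online algorithm and the merge is the pairwise update formula of Chan et al.; I am only assuming that something along these lines would be suitable for the real implementation.

```java
public class Moments {
    private long n;      // number of values seen so far
    private double mean; // running mean
    private double m2;   // sum of squared deviations from the running mean

    // Add a single value (Welford's online update).
    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Fold another partial result into this one (pairwise update).
    public void merge(Moments other) {
        if (other.n == 0) {
            return; // nothing to merge
        }
        if (n == 0) {
            // This side is empty: copy the other side's state.
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    public long getN() {
        return n;
    }

    public double getMean() {
        return mean;
    }

    // Unbiased sample variance; NaN when fewer than two values were added.
    public double getVariance() {
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    public static void main(String[] args) {
        Moments left = new Moments();
        Moments right = new Moments();
        left.add(1.0);
        left.add(2.0);
        left.add(3.0);
        right.add(4.0);
        right.add(-1.0);
        left.merge(right);
        // Should agree, up to rounding, with processing all five values
        // in a single pass: mean 1.8 and sample variance 3.7.
        System.out.println(left.getMean() + " " + left.getVariance());
    }
}
```

The point of the sketch is that merge only needs the partial state (count, mean, sum of squared deviations) of each side, never the original values, which is what makes it usable as a combiner.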

My idea is to define a generic interface as follows:
{code:java}
public interface StatisticAccumulator<T extends StorelessUnivariateStatistic> {

    // Add a single value to the accumulator
    void add(double d);
    
    // Merge another accumulator; the bound ensures it accumulates the same statistic type T
    <U extends StatisticAccumulator<T>> void merge(U other);

    // Merge two partially constructed StorelessUnivariateStatistic objects 
    void merge(T other);

    // Get the statistic we are trying to accumulate
    T get();

} {code}
I would then provide implementations for the various statistics we have, such as MeanAccumulator, GeometricMeanAccumulator, VarianceAccumulator, etc.

A sample usage (assuming we have an implementation for MeanAccumulator) would look like this:
{code:java}
List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 4.0, -1.0);
Mean mean = data.parallelStream()
        .collect(MeanAccumulator::new, MeanAccumulator::add, MeanAccumulator::merge)
        .get(); {code}
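
For anyone who wants to run the usage above without the Commons Math classes, here is a self-contained stand-in. The nested MeanAccumulator below is hypothetical: it keeps a plain sum and count and returns a double rather than a Mean object, so it is not the implementation from my proof of concept and does not implement the StatisticAccumulator interface. The wrapper class MeanAccumulatorSketch and the helper mean are also just names I made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;

public class MeanAccumulatorSketch {

    // Hypothetical stand-in for the proposed MeanAccumulator; it keeps a
    // plain sum and count instead of delegating to Commons Math's Mean.
    static final class MeanAccumulator {
        private double sum;
        private long count;

        // Add a single value to the accumulator.
        void add(double d) {
            sum += d;
            count++;
        }

        // Fold another partial accumulator into this one.
        void merge(MeanAccumulator other) {
            sum += other.sum;
            count += other.count;
        }

        // Returns NaN for an empty accumulator.
        double get() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    // Same call shape as the sample usage: supplier, accumulator, combiner.
    public static double mean(List<Double> data) {
        return data.parallelStream()
                .collect(MeanAccumulator::new, MeanAccumulator::add, MeanAccumulator::merge)
                .get();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 4.0, -1.0);
        System.out.println(mean(data));
    }
}
```

A plain sum is less robust numerically than an incremental mean, so this stand-in only illustrates how the three arguments of collect line up with add and merge.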
I have a [proof of concept PR|https://github.com/apache/commons-math/compare/master...ani5rudh:commons-math:STATISTICS-54-Proof-Of-Concept] for my approach with implementation for MeanAccumulator.

I am still a student learning the principles of object-oriented design and modelling, so my approach may not be perfect. I would like to know your thoughts on it so that I can fix and improve my design. Your feedback is very valuable for my learning and for developing my skills.

I also wanted to know whether the scope of the project, as far as GSoC is concerned, is to add stream support along with unit tests for all the subclasses of `AbstractStorelessUnivariateStatistic` (around 17 of them), or only for a subset of these. I am asking to get clarity on the goals so that I can plan accordingly for the 12 weeks of the GSoC coding period. Please let me know. Thanks in advance!

> [GSoC] Summary statistics API for Java 8 streams
> ------------------------------------------------
>
>                 Key: STATISTICS-54
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-54
>             Project: Commons Statistics
>          Issue Type: Wish
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Minor
>              Labels: full-time, gsoc, gsoc2022, gsoc2023
>             Fix For: 1.0
>
>
> Placeholder for tasks that could be undertaken in this year's [GSoC|https://summerofcode.withgoogle.com/].
> Ideas:
> - Design an updated summary statistics API for use with Java 8 streams based on the summary statistic implementations in the Commons Math {{stat.descriptive}} package including {{moments}}, {{rank}} and {{summary}} sub-packages.


