Posted to dev@commons.apache.org by Virendra singh Rajpurohit <vi...@gmail.com> on 2019/06/02 12:45:12 UTC

[Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

I've been trying to write the summary statistics class and I have some doubts. There
is a class DoubleSummaryStatistics in the java.util package (there are two more
for Int and Long). I'll attach this file here.
Do I have to design SummaryStatistics in the same way? I mean, the description
of DoubleSummaryStatistics says: "This class is designed to work with (though
does not require) streams
<https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html>.
For example, you can compute summary statistics on a stream of doubles with:


 DoubleSummaryStatistics stats = doubleStream.collect(DoubleSummaryStatistics::new,
                                                      DoubleSummaryStatistics::accept,
                                                      DoubleSummaryStatistics::combine);"
Earlier, my understanding of the project was that the user just has to call
the function "getSummary()" and all the calculations will be done
automatically in streams. But as we can see, with DoubleSummaryStatistics we
have to call the collect() method.
Some functions like max, min, sum, count and average are
already defined in this class. So should I extend this class in my class or
not? Also, I'll have to add more statistics than max, min and sum; for that
I would have to override the accept() function, which is used by streams.

Warm Regards,
-- 
*Virendra Singh Rajpurohit*

*University of Petroleum and Energy Studies,Dehradun*
Linkedin:https://www.linkedin.com/in/virendra-singh-rajpurohit






Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Posted by Alex Herbert <al...@gmail.com>.

> On 2 Jun 2019, at 13:45, Virendra singh Rajpurohit <vi...@gmail.com> wrote:
> 
> I've been trying to write the summary statistics class and I have some doubts. There is a class DoubleSummaryStatistics in the java.util package (there are two more for Int and Long). I'll attach this file here.
> Do I have to design SummaryStatistics in the same way? I mean, the description of DoubleSummaryStatistics says: "This class is designed to work with (though does not require) streams <https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html>. For example, you can compute summary statistics on a stream of doubles with:
>  
>  DoubleSummaryStatistics stats = doubleStream.collect(DoubleSummaryStatistics::new,
>                                                       DoubleSummaryStatistics::accept, 
>                                                       DoubleSummaryStatistics::combine);"
> Earlier, my understanding of the project was that the user just has to call the function "getSummary()" and all the calculations will be done automatically in streams.

If you put all the work with streams inside the getSummary() function, then the user cannot decide how to build the stream (e.g. serial or parallel). So a design like the JDK class, which works with streams, would be better.
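
To illustrate the point about leaving stream construction to the caller, here is a minimal sketch using only the JDK class. The wrapper class StreamChoiceDemo and its summarize() method are hypothetical names introduced for the example; only the collect() call comes from the JDK documentation quoted above.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

class StreamChoiceDemo {
    /** Collects summary statistics from whatever stream the caller built. */
    static DoubleSummaryStatistics summarize(DoubleStream stream) {
        return stream.collect(DoubleSummaryStatistics::new,
                              DoubleSummaryStatistics::accept,
                              DoubleSummaryStatistics::combine);
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        // The caller, not the statistics class, decides how the stream runs.
        DoubleSummaryStatistics serial = summarize(DoubleStream.of(data));
        DoubleSummaryStatistics parallel = summarize(DoubleStream.of(data).parallel());
        System.out.println(serial.getAverage());
        System.out.println(parallel.getAverage());
    }
}
```

The same summarize() call serves both cases because the combine step lets partial results from a parallel stream be merged.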

> But as we can see, with DoubleSummaryStatistics we have to call the collect() method.
> Some functions like max, min, sum, count and average are already defined in this class. So should I extend this class in my class or not? Also, I'll have to add more statistics than max, min and sum; for that I would have to override the accept() function, which is used by streams.

You could extend this JDK class to add functionality. In the accept and combine methods, just call super.accept and super.combine, then do the extra work you require.

One useful stat that is missing from the class is variance. A first addition would be to extend DoubleSummaryStatistics and add a variance (plus standard deviation) function with a variant for the population variance (or population standard deviation).

Note that a method to add a second moment to another second moment is required. This is not present in math4 AFAIK. There is this parallel variance algorithm [1] that would allow you to implement the combine() method to join two instances of your summary statistics class.
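
A minimal sketch of what such an extension might look like, assuming Welford's online update in accept() and the parallel update from [1] in combine(). The class name VarianceSummaryStatistics, its fields and its getters are hypothetical and not part of the JDK or of any Commons release. Note that in combine() the moment update has to run before super.combine(), because it needs the counts from before the merge.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

/** Hypothetical subclass adding variance; names are illustrative only. */
class VarianceSummaryStatistics extends DoubleSummaryStatistics {
    private double mean; // running mean of the accepted values
    private double m2;   // sum of squared deviations from the running mean

    @Override
    public void accept(double value) {
        super.accept(value);          // JDK part: count, sum, min, max
        long n = getCount();          // already includes the new value
        double delta = value - mean;
        mean += delta / n;
        m2 += delta * (value - mean); // Welford's online update
    }

    @Override
    public void combine(DoubleSummaryStatistics other) {
        if (!(other instanceof VarianceSummaryStatistics)) {
            // A plain JDK instance carries no second moment to merge.
            throw new IllegalArgumentException("Cannot merge without a second moment");
        }
        VarianceSummaryStatistics that = (VarianceSummaryStatistics) other;
        long nA = getCount();
        long nB = that.getCount();
        if (nB > 0) {
            if (nA == 0) {
                mean = that.mean;
                m2 = that.m2;
            } else {
                // Parallel update: uses the counts from before the merge.
                long n = nA + nB;
                double delta = that.mean - mean;
                m2 += that.m2 + delta * delta * ((double) nA * nB / n);
                mean += delta * nB / n;
            }
        }
        super.combine(other);         // JDK part: count, sum, min, max
    }

    /** Sample variance (divides by n - 1); NaN for fewer than two values. */
    public double getVariance() {
        long n = getCount();
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    /** Population variance (divides by n); NaN when empty. */
    public double getPopulationVariance() {
        long n = getCount();
        return n > 0 ? m2 / n : Double.NaN;
    }

    /** Sample standard deviation. */
    public double getStandardDeviation() {
        return Math.sqrt(getVariance());
    }

    public static void main(String[] args) {
        VarianceSummaryStatistics stats =
            DoubleStream.of(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
                .parallel()
                .collect(VarianceSummaryStatistics::new,
                         VarianceSummaryStatistics::accept,
                         VarianceSummaryStatistics::combine);
        System.out.println(stats.getAverage());
        System.out.println(stats.getPopulationVariance());
    }
}
```

Rejecting a plain DoubleSummaryStatistics in combine() is one possible design choice; silently merging it would leave the second moment inconsistent with the count.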

Alex


[1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm




Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Posted by Gilles Sadowski <gi...@gmail.com>.
Hello.

Side note: Top-posting is quite annoying in these discussions...

Le dim. 2 juin 2019 à 21:27, Eric Barnhill <er...@gmail.com> a écrit :
>
> As discussed on prior threads, you should have both. There will need to be
> static convenience methods for a user who wants to make a very simple call,
> say Stats.mean(). But, as Alex said, this convenience class will just be a
> front end for the statistics functionality itself. That needs to be in its
> own classes (Mean(), Variance()), which can produce instances that give the
> user more flexibility. For example, storeless statistics like Mean(),
> Variance() or StandardDeviation() should be updatable, as Gilles said, or
> handle different kinds of streams, as Alex said. Yet these classes need to
> be designed so that they perform as well as simple implementations when
> desired.
>

Related discussion:
    https://issues.apache.org/jira/browse/STATISTICS-14

I agree with the requirement that "simple" usage must be possible.
However, it seems to me that the discussion is upside-down: simple
usage can always be provided by another layer (similar to the "toArray"
method in JDK's "List").  Seamless integration with streams is not
as obvious; hence it should not be an afterthought.
Unless I'm mistaken, another way to look at it is the "in-memory" vs
"storeless" divide, the latter being the most interesting case (when the
quantity can be computed) design-wise.

I suggest that the testing ground (read: code) be the variance:
see how it plays with a "DoubleStream", how it can also provide
"sum of squares" and "mean"; or how, inversely, "sum of squares" and
"mean" can be "combined" to provide the variance.
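
One way to try the "inverse" direction is sketched below: two small storeless statistics are composed into a variance and collected from a "DoubleStream". All class names here are hypothetical. The sketch assumes "sum of squares" means the sum of the raw squares x * x, and it uses the textbook identity (sumSq - n * mean^2) / (n - 1), which can lose precision through cancellation when the mean is large relative to the spread, so it is a testing ground rather than a recommended implementation.

```java
import java.util.function.DoubleConsumer;
import java.util.stream.DoubleStream;

/** Hypothetical storeless mean: keeps only a count and a running sum. */
class StorelessMean implements DoubleConsumer {
    private long n;
    private double sum;

    @Override
    public void accept(double x) { n++; sum += x; }
    public void combine(StorelessMean other) { n += other.n; sum += other.sum; }
    public long getN() { return n; }
    public double getResult() { return n == 0 ? Double.NaN : sum / n; }
}

/** Hypothetical storeless sum of the raw squares x * x. */
class StorelessSumOfSquares implements DoubleConsumer {
    private double sumSq;

    @Override
    public void accept(double x) { sumSq += x * x; }
    public void combine(StorelessSumOfSquares other) { sumSq += other.sumSq; }
    public double getResult() { return sumSq; }
}

/** Hypothetical variance built by combining the two statistics above. */
class ComposedVariance implements DoubleConsumer {
    private final StorelessMean mean = new StorelessMean();
    private final StorelessSumOfSquares sumSq = new StorelessSumOfSquares();

    @Override
    public void accept(double x) { mean.accept(x); sumSq.accept(x); }

    public void combine(ComposedVariance other) {
        mean.combine(other.mean);
        sumSq.combine(other.sumSq);
    }

    public double getMean() { return mean.getResult(); }
    public double getSumOfSquares() { return sumSq.getResult(); }

    /** Sample variance from the identity (sumSq - n * mean^2) / (n - 1). */
    public double getResult() {
        long n = mean.getN();
        if (n < 2) {
            return Double.NaN;
        }
        double m = mean.getResult();
        return (sumSq.getResult() - n * m * m) / (n - 1);
    }

    public static void main(String[] args) {
        ComposedVariance v = DoubleStream.of(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
            .collect(ComposedVariance::new,
                     ComposedVariance::accept,
                     ComposedVariance::combine);
        System.out.println(v.getMean());
        System.out.println(v.getResult());
    }
}
```

Because each component is trivially mergeable, combine() needs no special algorithm here; the price is the weaker numerical behaviour mentioned above.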

Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Posted by Eric Barnhill <er...@gmail.com>.
As discussed on prior threads, you should have both. There will need to be
static convenience methods for a user who wants to make a very simple call,
say Stats.mean(). But, as Alex said, this convenience class will just be a
front end for the statistics functionality itself. That needs to be in its
own classes (Mean(), Variance()), which can produce instances that give the
user more flexibility. For example, storeless statistics like Mean(),
Variance() or StandardDeviation() should be updatable, as Gilles said, or
handle different kinds of streams, as Alex said. Yet these classes need to
be designed so that they perform as well as simple implementations when
desired.
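
The two layers described above could be sketched roughly as follows. Stats.mean() and Mean are the names used in this message, but the method bodies and signatures (accept, combine, getResult, the varargs parameter) are assumptions made for illustration, not an agreed API.

```java
import java.util.stream.DoubleStream;

/** Hypothetical storeless, updatable mean. */
class Mean {
    private long n;
    private double mean;

    /** Updates the running mean with one more value. */
    public void accept(double x) {
        n++;
        mean += (x - mean) / n;
    }

    /** Merges another instance, e.g. a partial result of a parallel stream. */
    public void combine(Mean other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }

    public double getResult() {
        return n == 0 ? Double.NaN : mean;
    }
}

/** Hypothetical static front end for very simple calls. */
final class Stats {
    private Stats() {}

    /** Convenience call that delegates to the Mean class. */
    public static double mean(double... values) {
        return DoubleStream.of(values)
            .collect(Mean::new, Mean::accept, Mean::combine)
            .getResult();
    }
}
```

A user who only wants a number calls Stats.mean(1.0, 2.0, 3.0, 4.0); a user who needs to keep updating the statistic, or to control the stream, works with a Mean instance directly.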





