You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2013/01/07 17:03:22 UTC

[math] DiscreteEmpiricalDistribution

The EmpiricalDistribution class in the random package is designed to
support large samples.  It does not store all of data points in
memory, but instead bins the data and uses smoothing kernels within
the bins.  I have recently had the need for a discrete empirical
distribution - i.e., an implementation that stores the full dataset
in memory and creates the empirical distribution exactly from it. 
If there are no objections, I would like to open a JIRA and commit
o.a.c.m.distribution.DiscreteEmpiricalDistribution implementing
this.  I will document the approach and design decisions in the JIRA
if others are OK with this addition.

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] DiscreteEmpiricalDistribution

Posted by Ted Dunning <te...@gmail.com>.

This will be very useful.

Sampling from discrete ECDF's is also closely related to generating samples
from a multinomial distribution.  I did a bit of work on the latter
problem.  The result of that work is in

org.apache.mahout.math.random.Multinomial

The major difference that you will have is that you have an ordered domain
that you are sampling from while Multinomial is simply sampling from a set.
 It would be relatively easy to use Multinomial<Interval> to do what you
want where Interval represents a half open interval (and allows +Inf as a
right bound and -Inf as a left bound), but I think that you might gain
something from the ability to split an interval that would make Multinomial
irrelevant to your needs.

With Multinomial<Interval>, one strategy would be to delete an interval and
insert both halves which may be a bit more expensive than desired.  Doing
lots of deletions is also bad in Multinomial because it leaves an entry in
place with zero probability (because I was lazy).

Trying to mutate the Interval you are splitting so that it forms the left
half of the new interval doesn't actually help because modifying the
probability of an element just does a deletion and insertion.  Better to
use the first strategy.

It would be very easy to add a garbage collection step that eliminates zero
probability entries.  As I mentioned, I was just lazy.

Anyway, steal concept or code as you feel appropriate.

On Mon, Jan 7, 2013 at 8:03 AM, Phil Steitz <ph...@gmail.com> wrote:

> The EmpiricalDistribution class in the random package is designed to
> support large samples.  It does not store all of data points in
> memory, but instead bins the data and uses smoothing kernels within
> the bins.  I have recently had the need for a discrete empirical
> distribution - i.e., an implementation that stores the full dataset
> in memory and creates the empirical distribution exactly from it.
> If there are no objections, I would like to open a JIRA and commit
> o.a.c.m.distribution.DiscreteEmpiricalDistribution implementing
> this.  I will document the approach and design decisions in the JIRA
> if others are OK with this addition.
>
> Phil
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>