Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2011/09/05 22:44:34 UTC

[math] EmpiricalDistribution

I have a couple of proposals for this class:

0) Merge the interface and impl.   This is consistent with what we
are doing in some other places where we have only one implementation.
1) Extend this class to actually provide a distribution - i.e.
implement the Distribution interface. 
2) make the kernel used within bins configurable.  Currently, values
are generated (and the cdf would be computed) assuming a Gaussian
distribution within bins.  I think at least a uniform option should
be provided.
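
To make 2) concrete, here is roughly the usage I have in mind.  The
setter and the enum are made-up names - nothing like this exists today -
and I am assuming the interface/impl merge from 0):

    // Sketch only - setKernel/Kernel do not exist yet; assumes 0) so that
    // EmpiricalDistribution is the concrete class.
    double[] sampleData = new double[] {1.2, 3.4, 2.2, 5.1};       // example observations
    EmpiricalDistribution dist = new EmpiricalDistribution(1000);  // 1000 bins
    dist.setKernel(EmpiricalDistribution.Kernel.UNIFORM);          // default would stay GAUSSIAN
    dist.load(sampleData);
    double value = dist.getNextValue();  // now sampled with a uniform kernel within its bin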

Thanks in advance for any feedback on this or further suggestions
for improvement.

Phil



Re: [math] EmpiricalDistribution

Posted by Phil Steitz <ph...@gmail.com>.
On 9/7/11 11:20 AM, Mikkel Meyer Andersen wrote:
> 2011/9/7 Phil Steitz <ph...@gmail.com>:
>> On 9/6/11 8:58 AM, Mikkel Meyer Andersen wrote:
>>> 2011/9/6 Phil Steitz <ph...@gmail.com>:
>>>> On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
>>>>> 2011/9/5 Phil Steitz <ph...@gmail.com>:
>>>>>> I have a couple of proposals for this class:
>>>>>>
>>>>>> 0) Merge the interface and impl.   This is consistent with what we
>>>>>> are doing in some other places where we have only one implementation.
>>>>> Fine with me.
>>>>>> 1) Extend this class to actually provide a distribution - i.e.
>>>>>> implement the Distribution interface.
>>>>> Won't we have problems, e.g. with implementing cumulativeProbability?
>>>> The idea I had was to interpolate within bins.  So to compute the
>>>> cdf at x you would find its bin, sum the mass (based on number of
>>>> original sample points contained, like the sampling does) of the
>>>> bins below its containing bin and then use the defined kernel within
>>>> the bin to determine how much of its own bin's mass to include.
>>> Seems reasonable. But we might want to include a user-specified
>>> support - just something simple (the endpoints of an interval) - or
>>> else the highest and lowest observed values end up defining the
>>> support, which might not be a good idea.
>> By the latter, do you mean just interpolate linearly between lowest
>> and highest, or do you mean the lowest / highest actually observed
>> points in the bin?  The first is like using a uniform kernel in the
>> bins.  By "user-specified support" I guess you mean make the
>> interpolation strategy pluggable somehow, right?   What launched me
>> into thinking about making the kernel used for sampling configurable
>> was thinking about how uniform would probably be better / more
>> defensible for interpolating the cdf in some cases.  Then you
>> have to ask is it OK to use a different kernel for the sampling vs
>> cdf computation.  My instinct is to say no and keep it simple -
>> allow a uniform kernel to be chosen in place of the hard-coded
>> Gaussian there now and then use the configured kernel for both
>> sampling and cdf computation.  Even with mixed kernels, you will
>> probably in most cases end up with decent fidelity between sampling
>> results and the cdf; but I can imagine scenarios where Gaussian
>> kernels with coarse grids could lead to funny sampling distributions
>> that would not follow the linearly-interpolated cdf very well near
>> grid points.
>>
>> Phil
> "but I can imagine scenarios where Gaussian
> kernels with coarse grids could lead to funny sampling distributions
> that would not follow the linearly-interpolated cdf very well near
> grid points."
> Yes, precisely. Especially if trying to distribute the probability
> mass on a discrete grid :-).
>
> To clarify what I meant by user-specified support:
> If a user has observations 1, 3, 4, we would probably want to allow
> probability mass elsewhere than just at {1, 2, 3, 4} (2 is
> interpolated). What I mean is that it might make sense for the user to
> be able to specify that the distribution is discrete with a support of
> {0, 1, 2, 3, 4, 5} (2 is interpolated, and 0 and 5 are extrapolated).
> Similarly for continuous distributions.

Interesting idea.  The original intent of this class was to model
only continuous distributions - really just to construct a
continuous distribution from a sample.  The points within a bin are
used to estimate parameters of the kernel representing the
underlying continuous distribution within the bin.  For discrete
distributions we obviously already have the Frequency class, which
more or less represents the discrete distribution corresponding to a
sample.  The hybrid kind of idea where you parcel out the mass in a
user-defined way is interesting, but I think the example above would
best be modeled directly in a discrete distribution implementation. 
Could be we should give some thought to how to make it easier for
people to define discrete distributions or recover them from
Frequency instances.  Also, see the last comment below on combining
distributions.
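For example (sketch only - there is no such conversion today, and the
list-based representation is just a stand-in for whatever a discrete
distribution implementation would want):

    // Rough sketch: pull the empirical pmf back out of a Frequency instance.
    // Nothing here is existing [math] API beyond Frequency itself.
    // Needs java.util.ArrayList, java.util.Iterator, java.util.List and
    // org.apache.commons.math.stat.Frequency.
    Frequency freq = new Frequency();
    freq.addValue(1);
    freq.addValue(3);
    freq.addValue(4);

    List<Comparable<?>> values = new ArrayList<Comparable<?>>();
    List<Double> masses = new ArrayList<Double>();
    for (Iterator<Comparable<?>> it = freq.valuesIterator(); it.hasNext();) {
        Comparable<?> v = it.next();
        values.add(v);
        masses.add(freq.getPct(v));   // empirical probability of v
    }
    // values/masses could then back a discrete Distribution implementation,
    // possibly padded out to a user-specified support as in the example above.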
>
> Or is that too ambitious?
>
> Regarding kernels, I'm okay with only supporting uniform and Gaussian,
> but we might think about it - we might come up with a clever solution
> giving pluggable kernels almost for free (if we are lucky :-)).

As long as the distributions used can be parameterized using
SummaryStatistics info, that should not be too difficult to do -
even allowing different kernels for different bins, using lists of
bin indices.  Might be a little complicated for people to grok, but most
users would just use Gaussian or uniform kernels throughout.
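
Something like this is what I mean by parameterizing from
SummaryStatistics.  Sketch with made-up names - only
ContinuousDistribution, NormalDistributionImpl and SummaryStatistics are
real [math] classes:

    // Hypothetical pluggable kernel: map a bin's SummaryStatistics to a distribution.
    import org.apache.commons.math.distribution.ContinuousDistribution;
    import org.apache.commons.math.distribution.NormalDistributionImpl;
    import org.apache.commons.math.stat.descriptive.SummaryStatistics;

    public interface BinKernelFactory {
        /** Returns the within-bin distribution for a bin with the given statistics. */
        ContinuousDistribution createKernel(SummaryStatistics binStats);
    }

    // Roughly what is hard-coded today: a Gaussian fitted to the bin's mean and sd.
    class GaussianKernelFactory implements BinKernelFactory {
        public ContinuousDistribution createKernel(SummaryStatistics binStats) {
            return new NormalDistributionImpl(binStats.getMean(),
                                              binStats.getStandardDeviation());
        }
    }

A uniform factory could do the same from the bin's min/max (or the bin
endpoints), and per-bin factories could be supplied via a list keyed by
bin index.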

Your example of the augmented frequency distribution above
illustrates another thing that I have needed before but never
proposed to [math]: a facility for defining distributions
piecewise, either from densities or cdfs, rescaling as necessary to
make the composite a distribution.  What I am proposing for
EmpiricalDistribution is a specialized example.  But suppose you
could do Distribution composite = new CompositeDistribution(f,
fWeight, cut, g, gWeight) where f has support (a, b) and g has
support (c, d) with a < cut <= b and c <= cut < d.  The composite
would use the fWeight-weighted f cdf up to cut and the
gWeight-weighted g cdf above it, with the whole thing scaled so the
total mass is 1.  This could be iterated to create composite chains.
This is exactly what EmpiricalDistribution does now (for sampling,
and will do for probabilities when we add the cdf computation), with
bin boundaries as cut points and weights equal to the empirical
probabilities of the bins.
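
In code, the cdf of such a composite might look like this (sketch only -
CompositeDistribution is not a real class, and this is just one way to
read the rescaling):

    // Sketch of the composite described above: fWeight * F below the cut,
    // gWeight * G above it, rescaled so the total mass is 1.
    import org.apache.commons.math.MathException;
    import org.apache.commons.math.distribution.ContinuousDistribution;

    public class CompositeDistribution {
        private final ContinuousDistribution f;
        private final ContinuousDistribution g;
        private final double fWeight;
        private final double gWeight;
        private final double cut;

        public CompositeDistribution(ContinuousDistribution f, double fWeight,
                                     double cut,
                                     ContinuousDistribution g, double gWeight) {
            this.f = f;
            this.fWeight = fWeight;
            this.cut = cut;
            this.g = g;
            this.gWeight = gWeight;
        }

        public double cumulativeProbability(double x) throws MathException {
            // unscaled mass of the whole composite
            double total = fWeight * f.cumulativeProbability(cut)
                         + gWeight * (1.0 - g.cumulativeProbability(cut));
            double mass;
            if (x <= cut) {
                mass = fWeight * f.cumulativeProbability(x);
            } else {
                mass = fWeight * f.cumulativeProbability(cut)
                     + gWeight * (g.cumulativeProbability(x)
                                  - g.cumulativeProbability(cut));
            }
            return mass / total;   // rescale so the composite cdf runs from 0 to 1
        }
    }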

Phil
>
> Cheers, Mikkel.
>
>>>>>> 2) make the kernel used within bins configurable.  Currently, values
>>>>>> are generated (and the cdf would be computed) assuming a Gaussian
>>>>>> distribution within bins.  I think at least a uniform option should
>>>>>> be provided.
>>>>> +1, maybe it can be generalised to providing user-defined kernels.
>>>> Good idea.  Need to think about how to enable that.
>>>>
>>>> Thanks!
>>>>
>>>> Phil
>>>>>> Thanks in advance for any feedback on this or further suggestions
>>>>>> for improvement.
>>>>>>
>>>>>> Phil
>>>>>>
>>> Cheers, Mikkel.
>>>


Re: [math] EmpiricalDistribution

Posted by Mikkel Meyer Andersen <mi...@mikl.dk>.
2011/9/7 Phil Steitz <ph...@gmail.com>:
> On 9/6/11 8:58 AM, Mikkel Meyer Andersen wrote:
>> 2011/9/6 Phil Steitz <ph...@gmail.com>:
>>> On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
>>>> 2011/9/5 Phil Steitz <ph...@gmail.com>:
>>>>> I have a couple of proposals for this class:
>>>>>
>>>>> 0) Merge the interface and impl.   This is consistent with what we
>>>>> are doing in some other places where we have only one implementation.
>>>> Fine with me.
>>>>> 1) Extend this class to actually provide a distribution - i.e.
>>>>> implement the Distribution interface.
>>>> Won't we have problems, e.g. with implementing cumulativeProbability?
>>> The idea I had was to interpolate within bins.  So to compute the
>>> cdf at x you would find its bin, sum the mass (based on number of
>>> original sample points contained, like the sampling does) of the
>>> bins below its containing bin and then use the defined kernel within
>>> the bin to determine how much of its own bin's mass to include.
>> Seems reasonable. But we might want to include a user-specified
>> support - just something simple (the endpoints of an interval) - or
>> else the highest and lowest observed values end up defining the
>> support, which might not be a good idea.
>
> By the latter, do you mean just interpolate linearly between lowest
> and highest, or do you mean the lowest / highest actually observed
> points in the bin?  The first is like using a uniform kernel in the
> bins.  By "user-specified support" I guess you mean make the
> interpolation strategy pluggable somehow, right?   What launched me
> into thinking about making the kernel used for sampling configurable
> was thinking about how uniform would probably be better / more
> defensible for interpolating the cdf in some cases.  Then you
> have to ask is it OK to use a different kernel for the sampling vs
> cdf computation.  My instinct is to say no and keep it simple -
> allow a uniform kernel to be chosen in place of the hard-coded
> Gaussian there now and then use the configured kernel for both
> sampling and cdf computation.  Even with mixed kernels, you will
> probably in most cases end up with decent fidelity between sampling
> results and the cdf; but I can imagine scenarios where Gaussian
> kernels with coarse grids could lead to funny sampling distributions
> that would not follow the linearly-interpolated cdf very well near
> grid points.
>
> Phil
"but I can imagine scenarios where Gaussian
kernels with coarse grids could lead to funny sampling distributions
that would not follow the linearly-interpolated cdf very well near
grid points."
Yes, precisely. Especially if trying to distribute the probability
mass on a discrete grid :-).

To clarify what I meant by user-specified support:
If a user has observations 1, 3, 4, we would probably want to allow
probability mass elsewhere than just at {1, 2, 3, 4} (2 is
interpolated). What I mean is that it might make sense for the user to
be able to specify that the distribution is discrete with a support of
{0, 1, 2, 3, 4, 5} (2 is interpolated, and 0 and 5 are extrapolated).
Similarly for continuous distributions.

Or is that too ambitious?

Regarding kernels, I'm okay with only supporting uniform and Gaussian,
but we might think about it - we might come up with a clever solution
giving pluggable kernels almost for free (if we are lucky :-)).

Cheers, Mikkel.

>>>>> 2) make the kernel used within bins configurable.  Currently, values
>>>>> are generated (and the cdf would be computed) assuming a Gaussian
>>>>> distribution within bins.  I think at least a uniform option should
>>>>> be provided.
>>>> +1, maybe it can be generalised to providing user-defined kernels.
>>> Good idea.  Need to think about how to enable that.
>>>
>>> Thanks!
>>>
>>> Phil
>>>>> Thanks in advance for any feedback on this or further suggestions
>>>>> for improvement.
>>>>>
>>>>> Phil
>>>>>
>> Cheers, Mikkel.
>>


Re: [math] EmpiricalDistribution

Posted by Phil Steitz <ph...@gmail.com>.
On 9/6/11 8:58 AM, Mikkel Meyer Andersen wrote:
> 2011/9/6 Phil Steitz <ph...@gmail.com>:
>> On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
>>> 2011/9/5 Phil Steitz <ph...@gmail.com>:
>>>> I have a couple of proposals for this class:
>>>>
>>>> 0) Merge the interface and impl.   This is consistent with what we
>>>> are doing in some other places where we have only one implementation.
>>> Fine with me.
>>>> 1) Extend this class to actually provide a distribution - i.e.
>>>> implement the Distribution interface.
>>> Won't we have problems, e.g. with implementing cumulativeProbability?
>> The idea I had was to interpolate within bins.  So to compute the
>> cdf at x you would find its bin, sum the mass (based on number of
>> original sample points contained, like the sampling does) of the
>> bins below its containing bin and then use the defined kernel within
>> the bin to determine how much of its own bin's mass to include.
> Seems reasonable. But we might want to include a user-specified
> support - just something simple (the endpoints of an interval) - or
> else the highest and lowest observed values end up defining the
> support, which might not be a good idea.

By the latter, do you mean just interpolate linearly between lowest
and highest, or do you mean the lowest / highest actually observed
points in the bin?  The first is like using a uniform kernel in the
bins.  By "user-specified support" I guess you mean make the
interpolation strategy pluggable somehow, right?   What launched me
into thinking about making the kernel used for sampling configurable
was thinking about how uniform would probably be better / more
defensible for interpolating the cdf in some cases.  Then you
have to ask is it OK to use a different kernel for the sampling vs
cdf computation.  My instinct is to say no and keep it simple -
allow a uniform kernel to be chosen in place of the hard-coded
Gaussian there now and then use the configured kernel for both
sampling and cdf computation.  Even with mixed kernels, you will
probably in most cases end up with decent fidelity between sampling
results and the cdf; but I can imagine scenarios where Gaussian
kernels with coarse grids could lead to funny sampling distributions
that would not follow the linearly-interpolated cdf very well near
grid points.
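
To illustrate the "one kernel for both" point: if the configured kernel
is just a ContinuousDistribution, the same object can drive sampling (by
inversion) and the cdf, so the two cannot drift apart.  Sketch only -
the helper names are made up:

    // Sketch: one kernel object per bin serves both sampling and cdf computation.
    // Needs org.apache.commons.math.MathException,
    // org.apache.commons.math.distribution.ContinuousDistribution and java.util.Random.

    /** Inverse-cdf sampling from the bin's kernel (ignoring the p == 0 edge case). */
    double sampleFromKernel(ContinuousDistribution kernel, Random rng) throws MathException {
        return kernel.inverseCumulativeProbability(rng.nextDouble());
    }

    /** The same kernel's cdf, used when interpolating within the bin. */
    double kernelCdf(ContinuousDistribution kernel, double x) throws MathException {
        return kernel.cumulativeProbability(x);
    }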

Phil
>>>> 2) make the kernel used within bins configurable.  Currently, values
>>>> are generated (and the cdf would be computed) assuming a Gaussian
>>>> distribution within bins.  I think at least a uniform option should
>>>> be provided.
>>> +1, maybe it can be generalised to providing user-defined kernels.
>> Good idea.  Need to think about how to enable that.
>>
>> Thanks!
>>
>> Phil
>>>> Thanks in advance for any feedback on this or further suggestions
>>>> for improvement.
>>>>
>>>> Phil
>>>>
> Cheers, Mikkel.
>


Re: [math] EmpiricalDistribution

Posted by Mikkel Meyer Andersen <mi...@mikl.dk>.
2011/9/6 Phil Steitz <ph...@gmail.com>:
> On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
>> 2011/9/5 Phil Steitz <ph...@gmail.com>:
>>> I have a couple of proposals for this class:
>>>
>>> 0) Merge the interface and impl.   This is consistent with what we
>>> are doing in some other places where we have only one implementation.
>> Fine with me.
>>> 1) Extend this class to actually provide a distribution - i.e.
>>> implement the Distribution interface.
>> Won't we have problems, e.g. with implementing cumulativeProbability?
>
> The idea I had was to interpolate within bins.  So to compute the
> cdf at x you would find its bin, sum the mass (based on number of
> original sample points contained, like the sampling does) of the
> bins below its containing bin and then use the defined kernel within
> the bin to determine how much of its own bin's mass to include.
Seems reasonable. But we might want to include a user-specified
support - just something simple (the endpoints of an interval) - or
else the highest and lowest observed values end up defining the
support, which might not be a good idea.
>
>>> 2) make the kernel used within bins configurable.  Currently, values
>>> are generated (and the cdf would be computed) assuming a Gaussian
>>> distribution within bins.  I think at least a uniform option should
>>> be provided.
>> +1, maybe it can be generalised to providing user-defined kernels.
>
> Good idea.  Need to think about how to enable that.
>
> Thanks!
>
> Phil
>>> Thanks in advance for any feedback on this or further suggestions
>>> for improvement.
>>>
>>> Phil
>>>
Cheers, Mikkel.



Re: [math] EmpiricalDistribution

Posted by Phil Steitz <ph...@gmail.com>.
On 9/6/11 12:00 AM, Mikkel Meyer Andersen wrote:
> 2011/9/5 Phil Steitz <ph...@gmail.com>:
>> I have a couple of proposals for this class:
>>
>> 0) Merge the interface and impl.   This is consistent with what we
>> are doing in some other places where we have only one implementation.
> Fine with me.
>> 1) Extend this class to actually provide a distribution - i.e.
>> implement the Distribution interface.
> Won't we have problems, e.g. with implementing cumulativeProbability?

The idea I had was to interpolate within bins.  So to compute the
cdf at x you would find its bin, sum the mass (based on number of
original sample points contained, like the sampling does) of the
bins below its containing bin and then use the defined kernel within
the bin to determine how much of its own bin's mass to include.
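
In rough pseudo-Java (findBin, binMass and kernelCdf stand in for things
the class already tracks or would get from the configured kernel):

    // Sketch of the interpolation described above - not actual
    // EmpiricalDistributionImpl code.
    // binMass[i]      = (points in bin i) / (total points), same weights the sampling uses
    // kernelCdf(i, x) = cdf of bin i's kernel evaluated at x (possibly renormalised
    //                   to the bin's endpoints)
    public double cumulativeProbability(double x) {
        int binIndex = findBin(x);              // bin containing x
        double cum = 0.0;
        for (int i = 0; i < binIndex; i++) {
            cum += binMass[i];                  // full mass of the bins below x's bin
        }
        // plus the fraction of the containing bin's mass lying at or below x
        cum += binMass[binIndex] * kernelCdf(binIndex, x);
        return cum;
    }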

>> 2) make the kernel used within bins configurable.  Currently, values
>> are generated (and the cdf would be computed) assuming a Gaussian
>> distribution within bins.  I think at least a uniform option should
>> be provided.
> +1, maybe it can be generalised to providing user-defined kernels.

Good idea.  Need to think about how to enable that.

Thanks!

Phil
>> Thanks in advance for any feedback on this or further suggestions
>> for improvement.
>>
>> Phil
>>


Re: [math] EmpiricalDistribution

Posted by Mikkel Meyer Andersen <mi...@mikl.dk>.
2011/9/5 Phil Steitz <ph...@gmail.com>:
> I have a couple of proposals for this class:
>
> 0) Merge the interface and impl.   This is consistent with what we
> are doing in some other places where we have only one implementation.
Fine with me.
> 1) Extend this class to actually provide a distribution - i.e.
> implement the Distribution interface.
Won't we have problems, e.g. with implementing cumulativeProbability?
> 2) make the kernel used within bins configurable.  Currently, values
> are generated (and the cdf would be computed) assuming a Gaussian
> distribution within bins.  I think at least a uniform option should
> be provided.
+1, maybe it can be generalised to providing user-defined kernels.
>
> Thanks in advance for any feedback on this or further suggestions
> for improvement.
>
> Phil
>