Posted to dev@commons.apache.org by Mikkel Meyer Andersen <mi...@mikl.dk> on 2009/11/02 04:09:13 UTC

Re: [math] Generate random data using the Inverse CDF Method?

Phil,
I understand your opinion, but I don't agree with you (but I accept
that you have a different opinion about things just as I expect you to
accept mine). I'm sure this is quite common in projects like this, and
I am interested in hearing how matters like this are settled. Are
there any committers besides you - and how many? And do you guys
have some sort of private list where you discuss this or is it totally
up to you to decide how this ends (maybe waiting for others on the
list to have their say)? I don't mean to be rude or impolite, so
excuse me if I am; I just don't know how this stuff works.

Cheers, Mikkel.

2009/10/31 Ted Dunning <te...@gmail.com>:
> Is that a completely unchangeable opinion?
>
> Would you be willing to live with data generation in the distribution
> classes even if you don't particularly like it?
>
> Would user opinions sway your opinion if we had some way of collecting them?
>
> On Fri, Oct 30, 2009 at 5:28 PM, Phil Steitz <ph...@gmail.com> wrote:
>
>> > What do you think about the proposal at
>> > http://issues.apache.org/jira/browse/MATH-310 in regards to our
>> > discussion on the topic?
>>
>> As I said above, I am not in favor of adding random data generation
>> to the distributions package.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Ted Dunning <te...@gmail.com>.

I think that it would be good to have samplers in the random package.

One reasonable one would be InverseCumulativeSampler.

Another would be something like RecursiveLookupTableDiscreteSampler.

The generic nextSample routine could call the most generic sampler we have.
I would guess that would start with InverseCumulativeSampler and for
discrete distributions it might use a slightly more clever sampler that
switches from InverseCumulativeSampler to something more advanced after
being called a few times (that being the clue that some setup time would be
reasonable).
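
To make the InverseCumulativeSampler idea concrete, here is a minimal Java
sketch. It is an illustration only, not existing commons-math code: the class
shape, the exponentialInverseCdf helper and the use of java.util.Random as
the uniform source are all assumptions made for this example.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of an inversion-based sampler; not a commons-math class. */
final class InverseCumulativeSampler {
    private final DoubleUnaryOperator inverseCdf; // maps p in [0, 1) to the quantile x
    private final Random rng;                     // source of uniform deviates

    InverseCumulativeSampler(DoubleUnaryOperator inverseCdf, Random rng) {
        this.inverseCdf = inverseCdf;
        this.rng = rng;
    }

    /** Draws U uniform on [0, 1) and returns the quantile at U. */
    double nextSample() {
        return inverseCdf.applyAsDouble(rng.nextDouble());
    }

    /** Closed-form inverse CDF of an exponential distribution with the given mean. */
    static DoubleUnaryOperator exponentialInverseCdf(double mean) {
        return p -> -mean * Math.log(1.0 - p);
    }

    public static void main(String[] args) {
        InverseCumulativeSampler sampler =
            new InverseCumulativeSampler(exponentialInverseCdf(2.0), new Random(42L));
        int n = 100_000;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += sampler.nextSample();
        }
        // The sample mean should land near the distribution mean of 2.0.
        System.out.println("sample mean = " + sum / n);
    }
}
```

Any distribution that can supply an inverse CDF could be plugged in the same
way, which is what makes this a candidate for the most generic sampler.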

On Mon, Nov 2, 2009 at 1:06 PM, Mikkel Meyer Andersen <mi...@mikl.dk> wrote:

> Another way of implementing the functionality is to call some
> nextSample(Abstract{Continuous, Discrete}Distribution) in the Random
> package from the Distribution-classes
>

-- 
Ted Dunning, CTO
DeepDyve

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Mikkel Meyer Andersen <mi...@mikl.dk>.

I agree, Ted. Those seem like reasonable arguments.

Another way of implementing the functionality is to call some
nextSample(Abstract{Continuous, Discrete}Distribution) in the Random
package from the Distribution-classes (and
nextSample(ExponentialDistribution) equivalent to nextExponential for
an optimised implementation). Would that be a better way of doing it?
I think this is conceptually wrong, but I'm ready to do this if it
means it goes through.
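
A rough Java sketch of what such overloads could look like follows. All type
names here (ContinuousDist, ExponentialDist, RandomSampling) are hypothetical
stand-ins defined for the example, not the real commons-math classes, and the
specialised path simply reuses the closed-form exponential formula. The
specialisedCalls counter exists only to show which overload ran.

```java
import java.util.Random;

/** Hypothetical stand-in for a continuous distribution type. */
interface ContinuousDist {
    double inverseCumulativeProbability(double p);
}

/** Hypothetical stand-in for an exponential distribution. */
final class ExponentialDist implements ContinuousDist {
    private final double mean;

    ExponentialDist(double mean) { this.mean = mean; }

    double getMean() { return mean; }

    @Override
    public double inverseCumulativeProbability(double p) {
        return -mean * Math.log(1.0 - p);
    }
}

/** Hypothetical sampling entry point living in the random package. */
final class RandomSampling {
    private final Random rng;
    int specialisedCalls = 0; // counts how often the optimised overload was chosen

    RandomSampling(Random rng) { this.rng = rng; }

    /** Generic fallback: inversion through the distribution's inverse CDF. */
    double nextSample(ContinuousDist d) {
        return d.inverseCumulativeProbability(rng.nextDouble());
    }

    /** Specialised overload that forwards to the dedicated generator. */
    double nextSample(ExponentialDist d) {
        specialisedCalls++;
        return nextExponential(d.getMean());
    }

    double nextExponential(double mean) {
        return -mean * Math.log(1.0 - rng.nextDouble());
    }
}
```

One thing to keep in mind with this design: Java picks the overload from the
static type at compile time, so an ExponentialDist held in a ContinuousDist
variable takes the generic path unless the distribution itself calls
nextSample(this).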

Cheers, Mikkel.

2009/11/3 Ted Dunning <te...@gmail.com>:
> We should probably say which parts of the problem are important to us.  It
> begins to sound like we each care about slightly different aspects of the
> problem.
>
> The only points that I really care about are:
>
> - the user should have available some obvious way to sample from a
> distribution as a method on the distribution itself.  This need is not met
> by having a completely separate class in a different package that the user
> must somehow intuit the existence of.
>
> - the user should have the widest possible number of distributions that have
> *some* kind of sampling procedure that produces accurate samples.  Moreover,
> this wide availability should happen very soon.
>
> Note that neither of these points really implies much about implementation
> other than where the user of commons-math can find an access to
> implementations and that we implement something across many distributions
> very soon.
>
> These are points that I explicitly don't care about:
>
> - should the implementation be based on inverse cumulative distributions if
> available?  If there is another way to get lots of sampling algorithms
> implemented, I am all for it.  Marsaglia's table method for discrete
> distributions is an interesting option for some cases.  There may be other
> algorithms that could have wide applicability.  Multiple approaches might be
> a good idea, special purpose samplers for some cases (like normal or
> exponential distributions), kind of general methods like Marsaglia's method
> where it can be done.  If all of the common cases have special purpose, high
> quality generators, I don't see a problem with letting all of the other
> distributions that we haven't considered yet fall back to inverse cumulative
> methods.  But all of these considerations are not what I really care about.
> I only care about very wide availability of *some* sampling method.
>
> - should there be random number generators that provide more
> generality/flexibility/alternative implementations for sampling for various
> distributions.  This is an implementation question that can be answered many
> ways.  I think that lots of alternatives are good.  I even think that having
> pure implementations of one method or another might be an excellent way to
> allow us to stitch together the sampling available by default from the
> distribution.  All of these considerations, however, are not what I really
> care about.  What I care about is that all of these implementations should
> be ignorable by a less than devoted user of commons math.
>
> Now, it seems to me that the points that Phil cares most about fall mostly
> into the set of things that I care less about.  Moreover, some of the
> opinions that Phil has expressed have been stated in ways that I may have
> misinterpreted.  For instance, it sounded to me like Phil was saying that we
> shouldn't even implement the inverse cumulative sampler.  On reflection, I
> think that his real point is that we should not use the inverse cumulative
> method where there are better methods, especially if we already have
> implementations of the better methods.
>
> Likewise, it sounded to me like Phil was saying that we absolutely shouldn't
> allow easy access to a community consensus sampling algorithm from the
> distribution.  On further reflection, I think that his real point is that we
> simply should not be doing most implementation in the distribution function
> class, but should have a separate package to separate all that work away
> from the view of the users.  That sounds like a really good idea, if only to
> decrease the noise for the casual user of the distribution classes.
>
> This sounds like the germ of compromise.
>
> On Mon, Nov 2, 2009 at 3:03 AM, Phil Steitz <ph...@gmail.com> wrote:
>
>>  I just don't like your suggested implementation and package
>> placement.  I proposed an alternative (a generic method added
>> somewhere in the random package), which you did not like. There are
>> no doubt other better ways to do this.  Perhaps others have ideas?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Ted Dunning <te...@gmail.com>.

We should probably say which parts of the problem are important to us.  It
begins to sound like we each care about slightly different aspects of the
problem.

The only points that I really care about are:

- the user should have available some obvious way to sample from a
distribution as a method on the distribution itself.  This need is not met
by having a completely separate class in a different package that the user
must somehow intuit the existence of.

- the user should have the widest possible number of distributions that have
*some* kind of sampling procedure that produces accurate samples.  Moreover,
this wide availability should happen very soon.

Note that neither of these points really implies much about implementation
other than where the user of commons-math can find an access to
implementations and that we implement something across many distributions
very soon.

These are points that I explicitly don't care about:

- should the implementation be based on inverse cumulative distributions if
available?  If there is another way to get lots of sampling algorithms
implemented, I am all for it.  Marsaglia's table method for discrete
distributions is an interesting option for some cases.  There may be other
algorithms that could have wide applicability.  Multiple approaches might be
a good idea, special purpose samplers for some cases (like normal or
exponential distributions), kind of general methods like Marsaglia's method
where it can be done.  If all of the common cases have special purpose, high
quality generators, I don't see a problem with letting all of the other
distributions that we haven't considered yet fall back to inverse cumulative
methods.  But all of these considerations are not what I really care about.
I only care about very wide availability of *some* sampling method.

- should there be random number generators that provide more
generality/flexibility/alternative implementations for sampling for various
distributions.  This is an implementation question that can be answered many
ways.  I think that lots of alternatives are good.  I even think that having
pure implementations of one method or another might be an excellent way to
allow us to stitch together the sampling available by default from the
distribution.  All of these considerations, however, are not what I really
care about.  What I care about is that all of these implementations should
be ignorable by a less than devoted user of commons math.
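
As an illustration of a general-purpose discrete sampler with some setup
cost, here is a Java sketch. Note that this is not Marsaglia's table method;
it is the simpler cumulative-table lookup with binary search, swapped in
because it is short enough to show in full. The class name and constructor
are assumptions for this example.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical discrete sampler: cumulative table plus binary search. */
final class CumulativeTableSampler {
    private final double[] cumulative; // cumulative[i] = P(X <= i), last entry is 1.0
    private final Random rng;

    CumulativeTableSampler(double[] probabilities, Random rng) {
        cumulative = new double[probabilities.length];
        double running = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            if (probabilities[i] < 0.0) {
                throw new IllegalArgumentException("negative probability at index " + i);
            }
            running += probabilities[i];
            cumulative[i] = running;
        }
        if (!(running > 0.0)) {
            throw new IllegalArgumentException("probabilities must sum to a positive value");
        }
        // Normalise so the table ends at exactly 1.0 even if the input did not sum to 1.
        for (int i = 0; i < cumulative.length; i++) {
            cumulative[i] /= running;
        }
        this.rng = rng;
    }

    /** Returns the smallest index i with u < cumulative[i], for u uniform on [0, 1). */
    int nextSample() {
        double u = rng.nextDouble();
        int idx = Arrays.binarySearch(cumulative, u);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: first entry strictly greater than u
        } else {
            while (cumulative[idx] <= u) {
                idx++;      // exact hit: step past ties so zero-mass indices are skipped
            }
        }
        return idx;
    }
}
```

The setup is linear in the number of outcomes and each draw costs a binary
search, so it only pays off once a sampler is called repeatedly.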

Now, it seems to me that the points that Phil cares most about fall mostly
into the set of things that I care less about.  Moreover, some of the
opinions that Phil has expressed have been stated in ways that I may have
misinterpreted.  For instance, it sounded to me like Phil was saying that we
shouldn't even implement the inverse cumulative sampler.  On reflection, I
think that his real point is that we should not use the inverse cumulative
method where there are better methods, especially if we already have
implementations of the better methods.

Likewise, it sounded to me like Phil was saying that we absolutely shouldn't
allow easy access to a community consensus sampling algorithm from the
distribution.  On further reflection, I think that his real point is that we
simply should not be doing most implementation in the distribution function
class, but should have a separate package to separate all that work away
from the view of the users.  That sounds like a really good idea, if only to
decrease the noise for the casual user of the distribution classes.

This sounds like the germ of compromise.
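
One possible shape for that compromise, sketched in Java under stated
assumptions: the distribution exposes a sample method, but the generic
inversion logic lives in a separate helper class, and a distribution with a
better special-purpose method bypasses the fallback. Every name below
(Sampleable, InvertibleDistribution, GenericSamplers, UniformSketch,
NormalSketch) is hypothetical and invented for this illustration.

```java
import java.util.Random;

/** Hypothetical: anything the user can sample from directly. */
interface Sampleable {
    double sample(Random rng);
}

/** Sampling machinery kept apart from the distribution classes. */
final class GenericSamplers {
    private GenericSamplers() {}

    static double byInversion(InvertibleDistribution d, Random rng) {
        return d.inverseCumulativeProbability(rng.nextDouble());
    }
}

/** Distributions with an inverse CDF get inversion sampling as the default. */
abstract class InvertibleDistribution implements Sampleable {
    abstract double inverseCumulativeProbability(double p);

    @Override
    public double sample(Random rng) {
        return GenericSamplers.byInversion(this, rng);
    }
}

/** Relies on the generic fallback. */
final class UniformSketch extends InvertibleDistribution {
    private final double lower;
    private final double upper;

    UniformSketch(double lower, double upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    double inverseCumulativeProbability(double p) {
        return lower + p * (upper - lower);
    }
}

/** Uses a special-purpose generator instead of inversion. */
final class NormalSketch implements Sampleable {
    private final double mean;
    private final double sd;

    NormalSketch(double mean, double sd) {
        this.mean = mean;
        this.sd = sd;
    }

    @Override
    public double sample(Random rng) {
        return mean + sd * rng.nextGaussian();
    }
}
```

The casual user only ever sees sample(rng) on the distribution; which
algorithm sits behind it stays out of view.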

On Mon, Nov 2, 2009 at 3:03 AM, Phil Steitz <ph...@gmail.com> wrote:

>  I just don't like your suggested implementation and package
> placement.  I proposed an alternative (a generic method added
> somewhere in the random package), which you did not like. There are
> no doubt other better ways to do this.  Perhaps others have ideas?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Phil Steitz <ph...@gmail.com>.

Phil Steitz wrote:
> Mikkel Meyer Andersen wrote:
>> 2009/11/3 Luc Maisonobe <Lu...@free.fr>:
>>> There is at least one other regular committer, plus three other committers
>>> who have been active on the list over the last year. Phil is clearly one of the
>>> most involved maintainers and he has been here since the beginning.
>> Okay, thanks for the info. I know how much Phil means and I haven't
>> for a second doubted that.
> 
> One important thing to understand about how things work here is that
> there is no hierarchy among committers and in terms of ideas,
> patches, itches-to-scratch, etc. all - including non-committers - are
> on equal footing.  Just because I have been around for a while does
> not mean my ideas are any better than yours or anyone else's.
> 
>>> There are only two lists: the users list and the developers list (here).
>>> Both lists are archived and searchable.
>>>
>>> I have no preference on this specific topic, sorry. One important thing
>>> to me is also to keep backward compatibility (as strange as it might
>>> seem after the bunch of changes I introduced last summer).
>> I agree with this, at least to the degree where it is practically durable.
>>> Would the change imply that the random package would disappear ? In this
>>> case I would be against it. Would that change imply that low level "raw"
>>> generators would be in random and higher level generators in
>>> distribution ? In this case, I don't know what is better.
>>>
>>> One thing I would like to add at some time in the future would be better
>>> and more modern "raw" generators in the same spirit as the Mersenne
>>> Twister (typically I would like to add the WELL family of generators).
>>>
>>> From a user point of view, it is also important to be able to select a
>>> different raw generator underlying a high level one. This is used for
>>> example in Monte-Carlo analyses when one wants to reproduce a subset of
>>> an already generated sequence, or according to what has higher priority,
>>> generation speed or generation accuracy with respect to the desired
>>> repartition.
> 
> This is why I would like to keep the random data generation
> machinery in the random package.  As I stated elsewhere, I am +0/1
> on the idea of adding generic inversion-based generators that work
> with any invertible distribution; but I still do not see attaching
> them to the distribution implementations as a good idea.  This is
> for three reasons: 0) I see it as poor separation of concerns
> (admittedly this is a matter of taste, but I do not see sourcing
> random deviates as an essential behavior of a probability
> distribution)

A little more explanation of the separation of concerns issue.
Inference is another thing that one frequently does *with*
distributions.  This was in fact the application that led to the
introduction of the first distributions in commons-math.  But would
we add hypothesis testing to the distributions themselves? Obviously
not.  It is interesting to ask, for each distribution, how often you
would need to either generate random data from it or perform
hypothesis tests using it.  In addition to the obvious question of
separation of concerns, the variability in the answers to this is
another indication that neither of these is an essential behavior of
the class.

Phil

> 1) if the implementation is *only* inversion-based, it
> will be naive for some distributions and we do not want users to get
> a bad impl by default 2) to fix 1) we have to essentially refactor
> our package structure to place random data generation into the
> distributions package, causing users to have to instantiate
> distributions and also configure generators to get deviates.  I see
> it as simpler and more natural to use a RandomData instance.  I am
> -1 on dropping the random package for the reasons that Luc states.
> Therefore, I am not in favor of attaching this functionality to the
> distributions.
> 
> Phil
> 
> 
>>> Luc
>> Cheers, Mikkel.
>>
> 


Re: [math] Generate random data using the Inverse CDF Method?

Posted by Phil Steitz <ph...@gmail.com>.

Mikkel Meyer Andersen wrote:
> 2009/11/3 Luc Maisonobe <Lu...@free.fr>:
>> There is at least one other regular committer, plus three other committers
>> who have been active on the list over the last year. Phil is clearly one of the
>> most involved maintainers and he has been here since the beginning.
> Okay, thanks for the info. I know how much Phil means and I haven't
> for a second doubted that.

One important thing to understand about how things work here is that
there is no hierarchy among committers and in terms of ideas,
patches, itches-to-scratch, etc. all - including non-committers - are
on equal footing.  Just because I have been around for a while does
not mean my ideas are any better than yours or anyone else's.

> 
>> There are only two lists: the users list and the developers list (here).
>> Both lists are archived and searchable.
>>
>> I have no preference on this specific topic, sorry. One important thing
>> to me is also to keep backward compatibility (as strange as it might
>> seem after the bunch of changes I introduced last summer).
> I agree with this, at least to the degree where it is practically durable.
>> Would the change imply that the random package would disappear ? In this
>> case I would be against it. Would that change imply that low level "raw"
>> generators would be in random and higher level generators in
>> distribution ? In this case, I don't know what is better.
>>
>> One thing I would like to add at some time in the future would be better
>> and more modern "raw" generators in the same spirit as the Mersenne
>> Twister (typically I would like to add the WELL family of generators).
>>
>> From a user point of view, it is also important to be able to select a
>> different raw generator underlying a high level one. This is used for
>> example in Monte-Carlo analyses when one wants to reproduce a subset of
>> an already generated sequence, or according to what has higher priority,
>> generation speed or generation accuracy with respect to the desired
>> repartition.

This is why I would like to keep the random data generation
machinery in the random package.  As I stated elsewhere, I am +0/1
on the idea of adding generic inversion-based generators that work
with any invertible distribution; but I still do not see attaching
them to the distribution implementations as a good idea.  This is
for three reasons: 0) I see it as poor separation of concerns
(admittedly this is a matter of taste, but I do not see sourcing
random deviates as an essential behavior of a probability
distribution) 1) if the implementation is *only* inversion-based, it
will be naive for some distributions and we do not want users to get
a bad impl by default 2) to fix 1) we have to essentially refactor
our package structure to place random data generation into the
distributions package, causing users to have to instantiate
distributions and also configure generators to get deviates.  I see
it as simpler and more natural to use a RandomData instance.  I am
-1 on dropping the random package for the reasons that Luc states.
Therefore, I am not in favor of attaching this functionality to the
distributions.

Phil


>>
>> Luc
> Cheers, Mikkel.
> 


Re: [math] Generate random data using the Inverse CDF Method?

Posted by Mikkel Meyer Andersen <mi...@mikl.dk>.

2009/11/3 Luc Maisonobe <Lu...@free.fr>:
> There is at least one other regular committer, plus three other committers
> who have been active on the list over the last year. Phil is clearly one of the
> most involved maintainers and he has been here since the beginning.
Okay, thanks for the info. I know how much Phil means and I haven't
for a second doubted that.

> There are only two lists: the users list and the developers list (here).
> Both lists are archived and searchable.
>
> I have no preference on this specific topic, sorry. One important thing
> to me is also to keep backward compatibility (as strange as it might
> seem after the bunch of changes I introduced last summer).
I agree with this, at least to the degree where it is practically durable.
>
> Would the change imply that the random package would disappear ? In this
> case I would be against it. Would that change imply that low level "raw"
> generators would be in random and higher level generators in
> distribution ? In this case, I don't know what is better.
>
> One thing I would like to add at some time in the future would be better
> and more modern "raw" generators in the same spirit as the Mersenne
> Twister (typically I would like to add the WELL family of generators).
>
> From a user point of view, it is also important to be able to select a
> different raw generator underlying a high level one. This is used for
> example in Monte-Carlo analyses when one wants to reproduce a subset of
> an already generated sequence, or according to what has higher priority,
> generation speed or generation accuracy with respect to the desired
> repartition.
>
> Luc
Cheers, Mikkel.

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Luc Maisonobe <Lu...@free.fr>.

Phil Steitz a écrit :
> Mikkel Meyer Andersen wrote:
>> Phil,
>> I understand your opinion, but I don't agree with you (but I accept
>> that you have a different opinion about things just as I expect you to
>> accept mine). I'm sure this is quite common in projects like this, and
>> I am interested in hearing how matters like this are settled. Are
>> there any committers besides you - and how many? And do you guys

There is at least one other regular committer, plus three other committers
who have been active on the list over the last year. Phil is clearly one of the
most involved maintainers and he has been here since the beginning.

>> have some sort of private list where you discuss this or is it totally
>> up to you to decide how this ends (maybe waiting for others on the
>> list to have their say)? I don't mean to be rude or impolite, so
>> excuse me if I am; I just don't know how this stuff works.
> 
> All discussion happens on the public list. Sometimes it takes a
> while for us to reach consensus and often the consensus represents a
> compromise that we did not see as a possibility at first.

There are only two lists: the users list and the developers list (here).
Both lists are archived and searchable.

> I am open
> to making it easy to provide inversion-based random data generators.
>  I just don't like your suggested implementation and package
> placement.  I proposed an alternative (a generic method added
> somewhere in the random package), which you did not like. There are
> no doubt other better ways to do this.  Perhaps others have ideas?

I have no preference on this specific topic, sorry. One important thing
to me is also to keep backward compatibility (as strange as it might
seem after the bunch of changes I introduced last summer).

Would the change imply that the random package would disappear ? In this
case I would be against it. Would that change imply that low level "raw"
generators would be in random and higher level generators in
distribution ? In this case, I don't know what is better.

One thing I would like to add at some time in the future would be better
and more modern "raw" generators in the same spirit as the Mersenne
Twister (typically I would like to add the WELL family of generators).

From a user point of view, it is also important to be able to select a
different raw generator underlying a high level one. This is used for
example in Monte-Carlo analyses when one wants to reproduce a subset of
an already generated sequence, or according to what has higher priority,
generation speed or generation accuracy with respect to the desired
repartition.
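
A small Java sketch of that requirement, under stated assumptions: the high
level generator takes its raw uniform source as a constructor argument, so
the same seed replays the same sequence and a different raw generator can be
dropped in. ExponentialGenerator is a hypothetical class written for this
example; java.util.Random and java.util.SplittableRandom merely stand in for
raw generators such as the Mersenne Twister.

```java
import java.util.Random;
import java.util.SplittableRandom;
import java.util.function.DoubleSupplier;

/** Hypothetical high level generator with a pluggable raw uniform source. */
final class ExponentialGenerator {
    private final double mean;
    private final DoubleSupplier raw; // must return values uniform on [0, 1)

    ExponentialGenerator(double mean, DoubleSupplier raw) {
        this.mean = mean;
        this.raw = raw;
    }

    double next() {
        return -mean * Math.log(1.0 - raw.getAsDouble());
    }

    public static void main(String[] args) {
        Random rawA = new Random(2009L);
        Random rawB = new Random(2009L); // same seed, so the sequence can be replayed
        ExponentialGenerator first = new ExponentialGenerator(1.0, rawA::nextDouble);
        ExponentialGenerator replay = new ExponentialGenerator(1.0, rawB::nextDouble);

        // Skip the first three deviates of the replay to reproduce a later subset.
        double[] original = new double[6];
        for (int i = 0; i < original.length; i++) {
            original[i] = first.next();
        }
        for (int i = 0; i < 3; i++) {
            replay.next();
        }
        for (int i = 3; i < original.length; i++) {
            System.out.println(original[i] == replay.next()); // same seed, same deviate
        }

        // Swapping the raw generator needs no change to the high level class.
        SplittableRandom rawC = new SplittableRandom(2009L);
        ExponentialGenerator other = new ExponentialGenerator(1.0, rawC::nextDouble);
        System.out.println(other.next() >= 0.0);
    }
}
```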

Luc

> 
> Phil
> 
> 

Re: [math] Generate random data using the Inverse CDF Method?

Posted by Phil Steitz <ph...@gmail.com>.

Mikkel Meyer Andersen wrote:
> Phil,
> I understand your opinion, but I don't agree with you (but I accept
> that you have a different opinion about things just as I expect you to
> accept mine). I'm sure this is quite common in projects like this, and
> I am interested in hearing how matters like this are settled. Are
> there any committers besides you - and how many? And do you guys
> have some sort of private list where you discuss this or is it totally
> up to you to decide how this ends (maybe waiting for others on the
> list to have their say)? I don't mean to be rude or impolite, so
> excuse me if I am; I just don't know how this stuff works.

All discussion happens on the public list. Sometimes it takes a
while for us to reach consensus and often the consensus represents a
compromise that we did not see as a possibility at first.  I am open
to making it easy to provide inversion-based random data generators.
 I just don't like your suggested implementation and package
placement.  I proposed an alternative (a generic method added
somewhere in the random package), which you did not like. There are
no doubt other better ways to do this.  Perhaps others have ideas?

Phil
