You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Greg Sterijevski <gs...@gmail.com> on 2011/09/01 07:05:00 UTC

[math] Multiple Algorithms

Hello All,

This question popped into my head this evening, what is the right way to
handle multiple algorithms which purport to calculate the same thing? There
are, for example, a couple of ways to calculate the student t cdf. What is
the common's philosophy on deciding:

1. Whether to allow multiple algorithms.
2. How an algorithm is included.
    a.) Does a 'bug' or shortcoming need to be shown in the current
implementation?
    b.) Say that algorithm a works for a numerical range and b works best on
another. Are both included? Is a new 'meta' algorithm written which mixes
both a and b?
3. Does simplicity count?
4. Does speed matter?

A while back, Chris Nix reimplemented the SVD routine. I am not sure I
remember the old routine so I cannot say there was anything worth keeping
there. However was there a conscious decision to scrap it? Why not have it
live side by side with the new one? (Again, I am not saying the old
algorithm was better-Chris' contribution definitely was valuable). I think
we will run into these issues often.

Thoughts? If this has been discussed already, my apologies.

-Greg

Re: [math] Multiple Algorithms

Posted by Greg Sterijevski <gs...@gmail.com>.

Thanks for the response Phil.

This question has been on my mind for a month or so. I was working on some
Student T approximators (ergo my check in of the NIST data for Student T),
when I noticed that the commons Student T was based on the continuing
fraction version of the incomplete beta formulation. Then I updated my local
working copy and noticed you have made some changes to the Normal
distribution. I looked at that implementation and it is the one based on the
error function. There is nothing wrong with the choices that were made and I
am sure your error function is awesome. However, my experience has been
that, especially in the tails, you want other approximators (typically some
variation on a power series).

As for parallels, I think you are correct, the best one is the commons
implementation schema for random number generators. As I was looking back,
it struck me as odd that the old SVD was ditched. Again, not saying it did
not deserve it, just wanted to make sure I understood the philosophy behind
the decision making.

Thank you,

-Greg





On Thu, Sep 1, 2011 at 1:11 AM, Phil Steitz <ph...@gmail.com> wrote:

> On 8/31/11 10:05 PM, Greg Sterijevski wrote:
> > Hello All,
> >
> > This question popped into my head this evening, what is the right way to
> > handle multiple algorithms which purport to calculate the same thing?
> There
> > are, for example, a couple of ways to calculate the student t cdf. What
> is
> > the common's philosophy on deciding:
> >
> > 1. Whether to allow multiple algorithms.
> > 2. How an algorithm is included.
> >     a.) Does a 'bug' or shortcoming need to be shown in the current
> > implementation?
> >     b.) Say that algorithm a works for a numerical range and b works best
> on
> > another. Are both included? Is a new 'meta' algorithm written which mixes
> > both a and b?
> > 3. Does simplicity count?
> > 4. Does speed matter?
> >
> > A while back, Chris Nix reimplemented the SVD routine. I am not sure I
> > remember the old routine so I cannot say there was anything worth keeping
> > there. However was there a conscious decision to scrap it? Why not have
> it
> > live side by side with the new one? (Again, I am not saying the old
> > algorithm was better-Chris' contribution definitely was valuable). I
> think
> > we will run into these issues often.
> >
> > Thoughts? If this has been discussed already, my apologies.
>
> Well, at a high level, we tried to settle this at the very
> beginning.  Have a look at items 3 and 4 in the "guiding principles"
> on the [math] main page :)  That stuff comes from the original
> proposal for [math] and we have tried to stay more or less faithful
> to the principles laid out there.
>
> The integration, ode and solvers packages all try to do exactly what
> you are suggesting - when multiple algorithms, or even variations on
> an algorithm, exist and no single one can really do the job for all
> practical use cases, we welcome and incorporate implementations of
> multiple different ones.  How we do that varies by package and this
> is one place where our "interface pain" has led to some trauma.
> Initially, we pretty much uniformly separated interfaces from
> implementation precisely for this reason.  You can still see that in
> the distributions package and while it is a little ironic since we
> have been talking about collapsing the interfaces into the impls
> there, the setup now supports multiple different implementations of
> any of the distributions.
>
> I don't want to turn this into an abstract discussion of how best to
> support the strategy pattern in [math].  I think the "how to do it"
> part depends on the problem / package.  What we have tried to do so
> far and I think we should keep trying to do is:
>
> 0) If there is a single best algorithm that really does work well in
> almost all cases, make that "the obvious thing" and try to make our
> implementation of it as robust as possible.  If we provide only one
> implementation, try to make it the best one.  I would say put
> SimpleRegression, or Variance, for example, in this category.
>
> 1) When multiple different standard algorithms exist, make sure our
> API supports adding alternatives - including user-supplied
> alternatives - in the least-confusing way we can think of.  Welcome
> contributions of the alternatives and try to document as best we can
> how and when to use the different implementations.  Try to stick to
> standard algorithms, with good external references, so we don't have
> to turn our javadoc and/or user guide into a numerical analysis
> textbook.
>
> 2) Whenever possible, try to encapsulate the part that varies for
> different implementations, so the API remains simple and the
> variable part can be "plugged in."  Good examples of this are the
> RandomGenerators or the Solvers used by different classes.
>
> The general question is a good one; but the specific answer depends
> on the algorithms and classes involved.  In the two cases that you
> mention (t distribution cdf and SVD), we have to balance API
> complexity and more code to support against practical value.  SVD
> has been a big challenge for us, so I would say we should start at
> step 0) on that one; but I could easily be talked into supporting
> multiple impls given strong numerical arguments and multiple good,
> robust impls.  For the t distribution, I am not so sure.  A separate
> thread for that is probably best.  My intuition there is that like
> the other distributions, we and our users are best off with a single
> impl that does whatever it needs to do in different ranges to
> provide robust estimates, possibly allowing configuration settings
> to trade performance for accuracy.
>
> Phil
>
> >
> > -Greg
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [math] Multiple Algorithms

Posted by Phil Steitz <ph...@gmail.com>.

On 8/31/11 10:05 PM, Greg Sterijevski wrote:
> Hello All,
>
> This question popped into my head this evening, what is the right way to
> handle multiple algorithms which purport to calculate the same thing? There
> are, for example, a couple of ways to calculate the student t cdf. What is
> the common's philosophy on deciding:
>
> 1. Whether to allow multiple algorithms.
> 2. How an algorithm is included.
>     a.) Does a 'bug' or shortcoming need to be shown in the current
> implementation?
>     b.) Say that algorithm a works for a numerical range and b works best on
> another. Are both included? Is a new 'meta' algorithm written which mixes
> both a and b?
> 3. Does simplicity count?
> 4. Does speed matter?
>
> A while back, Chris Nix reimplemented the SVD routine. I am not sure I
> remember the old routine so I cannot say there was anything worth keeping
> there. However was there a conscious decision to scrap it? Why not have it
> live side by side with the new one? (Again, I am not saying the old
> algorithm was better-Chris' contribution definitely was valuable). I think
> we will run into these issues often.
>
> Thoughts? If this has been discussed already, my apologies.

Well, at a high level, we tried to settle this at the very
beginning.  Have a look at items 3 and 4 in the "guiding principles"
on the [math] main page :)  That stuff comes from the original
proposal for [math] and we have tried to stay more or less faithful
to the principles laid out there.

The integration, ode and solvers packages all try to do exactly what
you are suggesting - when multiple algorithms, or even variations on
an algorithm, exist and no single one can really do the job for all
practical use cases, we welcome and incorporate implementations of
multiple different ones.  How we do that varies by package and this
is one place where our "interface pain" has led to some trauma. 
Initially, we pretty much uniformly separated interfaces from
implementation precisely for this reason.  You can still see that in
the distributions package and while it is a little ironic since we
have been talking about collapsing the interfaces into the impls
there, the setup now supports multiple different implementations of
any of the distributions.

I don't want to turn this into an abstract discussion of how best to
support the strategy pattern in [math].  I think the "how to do it"
part depends on the problem / package.  What we have tried to do so
far and I think we should keep trying to do is:

0) If there is a single best algorithm that really does work well in
almost all cases, make that "the obvious thing" and try to make our
implementation of it as robust as possible.  If we provide only one
implementation, try to make it the best one.  I would say put
SimpleRegression, or Variance, for example, in this category.

1) When multiple different standard algorithms exist, make sure our
API supports adding alternatives - including user-supplied
alternatives - in the least-confusing way we can think of.  Welcome
contributions of the alternatives and try to document as best we can
how and when to use the different implementations.  Try to stick to
standard algorithms, with good external references, so we don't have
to turn our javadoc and/or user guide into a numerical analysis
textbook.

2) Whenever possible, try to encapsulate the part that varies for
different implementations, so the API remains simple and the
variable part can be "plugged in."  Good examples of this are the
RandomGenerators or the Solvers used by different classes.

The general question is a good one; but the specific answer depends
on the algorithms and classes involved.  In the two cases that you
mention (t distribution cdf and SVD), we have to balance API
complexity and more code to support against practical value.  SVD
has been a big challenge for us, so I would say we should start at
step 0) on that one; but I could easily be talked into supporting
multiple impls given strong numerical arguments and multiple good,
robust impls.  For the t distribution, I am not so sure.  A separate
thread for that is probably best.  My intuition there is that like
the other distributions, we and our users are best off with a single
impl that does whatever it needs to do in different ranges to
provide robust estimates, possibly allowing configuration settings
to trade performance for accuracy.

Phil

>
> -Greg
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org