You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Gilles <gi...@harfang.homelinux.org> on 2014/06/18 13:37:06 UTC

[Math] MATH-1129

Hi.

See
   https://issues.apache.org/jira/browse/MATH-1129

The problem reported was due to the sorting procedure not behaving
correctly in the presence of NaN.
This procedure could be replaced by an equivalent method from the JDK:
   java.util.Arrays.sort(double[],int,int)

Any objection?


Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Posted by venkatesha murthy <ve...@gmail.com>.
On Thu, Jun 19, 2014 at 2:10 AM, Luc Maisonobe <lu...@spaceroots.org> wrote:
>
> Le 18/06/2014 20:28, venkatesha m a écrit :
> > Hi Luc, Gilles
>
> Hi Venkat
>
> >
> > First of, Iam immensly thankful to all your comments on this patch.
> > Next, i am attaching my new patch with today's date(18-jun). However
> > please advise if i need to remove the old patch file if it confuses.
>
> OK, I'll roll back my current work and will restart from this new patch.
> Please wait until it is committed before submitting a new version,
> otherwise it will be a nightmare to keep track of changes. Let's do it
> in small tractable steps.
Hi Luc,

Sure. I will not submit another patch until till this is committed.
Thanks for taking this in.
I am just adding clarifications for few things below. However i concur
with you on most part.
>
>
> >
> > Please find my response below. The new patch has the suggested
> > changes in the switch case for nan handling; But; However i have my
> > view points on the different default nan strategies associated to
> > types. Please permit me to explain (sorry for long summary)
> >
> > First, i would like to leave the Default implementation of Percentile
> > as-is (Meaning in my MATH-1120 patch it is mapped to Type.CM)since
> > otherwise we will break user old expectation even for non nan and non
> > inf entries as well. The existing Percentile implementation also did
> > a copy of the array slice before doing kth slection(the old evaluate
> > method can be refered) and even i am doing the same it slightly in
> > different circumstance in the switch case with nan filtering. The
> > existing tests could fail if we change the default types to say R_8
> > (please refer to PercentileTest.java code as well to see the finer
> > variations in expected values that is being looked at for different
> > types. I have used R execution as the basis for setting those
> > values).
>
> Yes, the results would slightly change because the new algorithm you set
> up is better. I would not qualify this as "breaking" user code. For me,
> breaking user code is when the semantics or the API really changes.
>
> We are already plagued in [math] with too stringent backward
> compatibility, so adding another stricter interpretation that even
> slight numerical improvements should never occur would completely block
> all releases.
>
> Our current handling of NaNs fails and do not even fulfill what I have
> written in the Javadoc (see MATH-1129). So there *is* a bug. We cannot
> say fixing a bug changes results so we must preserve bugs at all costs!
>

In order to closely adhere with the existing test results of the
existing type CM(that need FIXED)
and to match with R results on R_x(that need REMOVED),i was forced for
different Nan strategies.


Now if we can do away with existing CM implementation (and use one of
R_x methods as default)
we can stick to one NanStrategy (REMOVED). I agree here.

(just fyi: Type.CM and R_6 have same indexing positioning logic but
R_6 differs for when to
return min and max elements of the array. please refer to R_6 java doc)
> >
> > Secondly, Percentile.java header comment states somewhere to an
> > effect that NaNs would be (left as-is and) handled by java's default
> > sort behavior and no removal being done.
>
> And this was not really done. NaNs were sorted inconsistently.
>
> > So for me to map this
> > behaviour to new implementation; it was NaNStrategy.FIXED that came
> > close and didnt require any of the existing test cases for the
> > existing Percentile behavior to change.
>
> I don't agree. The Javadoc reads:
>
>  This ordering makes Double.NaN</code> larger than any other value
>  (including Double.POSITIVE_INFINITY)
>
> So the best match would be NaNStrategy.MAXIMAL.
>
I was assuming Nan handling in Percentile to closely match with that
in NaturalRanking.java.
I used FIXED to satisfy Type.CM as FIXED places NaNs higher than +Inf
which is what is needed in CM.
However please ignore this comment if we decide to use one of R_x for
the default (instead of Type.CM)

> > What i am re-iterating here
> > is the existing behavior tests have completely passed with new
> > Type.CM and FIXED. (And now i have added several more tests including
> > different types as well).
> >
> > Thirdly, While all the R_x (where x :[1-9]) types as run and verified
> > by R tool; seemed to clearly convey the NaNs needed to be removed and
> > hence you see that i have used different strategy
> > NaNStrategy#REMOVED. I agree while multiple defaults are not wise to
> > have ; however; if we are forced to have Apache CM as supported type
> > (which is not one of R_x types) and we have the need to support
> > multiple variants (R1- R9) ; then it is inevitable to have type
> > sepcific NaNStrategy as per the need.
>
> I don't understand your rational. We are speaking about default here. As
> we indeed need some backward compatibility at API level, we cannot force
> users to select a specific strategy (as this would force users to change
> their existing code), so we need a default. You say it should be the
> triplet (Type.CM, NaNStrategy.FIXED, PivotingStrategy.MEDIAN_OF_3), I
> say it could be (Type.R_1, PivotingStrategy.MEDIAN_OF_3).
>
Well backward compatability is maintained even now and normal users
can just use either
Percentile(quantile) or Percentile(quantile,type) without bothering
about NanStrategy.

The constrctor with  Nanstrategy was an optional one. we could decide
to make it public or not.

> If NaNStrategy is really something that may be chosen independently of
> the Type, whatever we finally choose, we can say the default is a
> consistent selection of one triplet. Users of the new API who decide by
> themselve it is worth changing their code will have to do their own
> selection of the three components of the triplet. The Junit tests can of
> course use the full-blown new API and select something different, just
> to be consistent with the reference R results (using
> NaNStrategy.REMOVED). It doesn't need to be some second level default
> when the user has selected R_i, but did not select a NaNStrategy, then
> we select a better default automatically.
>
> If on the other hand NaNStrategy is not something that may be chosen
> independently of the Type (perhaps because the algorithm would not work
> in this case), then it should not be a separate choice and it should be
> hardwired in the Type, without any possibility for the user to change
> it. I doubt this is the case.
>
> > I also feel ; NaN handling
> > should be allowed for overriding atleast in a controlled manner as
> > different use cases may exist for needing this variation in nan
> > handling. Therefore IMO while we could avoid the public access to
> > change these defaults; it is relevant to support these variations of
> > nan handling on a per type and allow atleast sub classes to override
> > if a rare need arises. While the very name NaNStrategy reminds me of
> > different ways to look at that; i feel we will be much restricted if
> > we just said that we stick to one way of NaNHandling for all types.
> > Please let me know your thoughts.
>
> Subclassing seems too much for this. Also I see your latest patch has
> some protected fields to allow for this, which is not allowed in our
> checkstyle/findbugs settings (but this is a detail, we could replace the
> protected fields by private field and protected setters if really needed).
>
Are you saying to do away with protected constructor and instead allow
the Nan handling  in a public constructor/builder?

Regarding protected fields; (estimationType and nan map) sorry for the
slip and  will correct the same.
Could you please share your checkstyle/findbugs configuration.(For
some reason my findbugs/checkstyle report doesnt alert)

> The initial idea of your patch was sufficient to me : a few enums for
> the various parts, a few methods to selects the enums (you used
> constructors while I prefer builder pattern, but it's again only an
> implementation detail), and defaults for existing code which simply
> doesn't set any of these enums since they did not exist in previous
> version. It was simple and practical to me. The only thing I frowned
> upon was having several levels of defaults in the patch before today,
> and the inheritance complexity in the patch from today.
>
I was of an opinion that only constructors with 4+ parameters might
benefit with Builder.
But I agree with you and thanks for suggestion.I will try make the
builder changes in my local copy.

> >
> > Next, Regarding the PivotingStrategy; At first, i wanted to convey
> > here that to have all the partitioning, pivoting and selection in
> > separated classes/enums than inside main class. I have made it as
> > static due to the fact that; it is more of a non-functional
> > requirement and felt that it need not be set for every instance (more
> > of a global setting that doesnt vary across types). Please correct me
> > here and let know if it still needs to be per instance.
>
> Anything that is global and can be changed is dangerous in a low level
> library. You can never know how your code will be called. I do develop
> other libraries that use Apache Commons Math (several independent
> different layers of libraries), and other people do develop complete
> application on top of the libraries I develop. Now consider that in one
> of my middle level library I need this percentile for an obscure
> internal computation and I decide to use PivotingStrategy.CENTRAL. Then
> the girl who uses my library and completely ignore the fact I rely on
> Percentile deep inside the guts of my code can also decide to use
> Percentile with PivotingStrategy.MEDIAN_OF_3. Depending on her code
> begin called before or after mine, of worse, both codes being
> interleaved during the program main loop, this would give unexpected
> results.
>
> As a rule of thumb, class fields that can be modified are a nightmare in
> many code (unfortunately sometimes you cannot avoid them), and its even
> a worse nightmare when you do not control all parts of the code in a
> single team, which is obviously the case with libraries.
>
Ok, i understand and agree.
As enums cant read instance vars from containing class;i will need to
create a KthSelector outside and then pass it.

> > I also made
> > it package accessable/settable solely because medianOf3 method had
> > been package level for the sole intent of possible overriding of the
> > same within that package. Meaning; if some one really needed to
> > tinker around pivoting they need a way to do it which i have provided
> > it using a strategy.
>
> No, I really think it was package level because of a programming error,
> and I even think I made this programming error. This thing should have
> been private, and now we have to live with it, at least until 4.0 when
> we will be able to remove it.
>
So if this was unintended; Do i still need to provide a public setting
of PivotingStrategy

> > Next, In the current patch that i am going to attach as new dated
> > patch (since you have already started looking at the old one ; which
> > i would leave it as is). I find many utility type methods;
> > replaceNaN, removeNaN( Predicated Lists ) and copyOf(values,
> > begin.length) and as well as KthSelector with PivotingStrategy etc
> > all of which can perhaps make its way to MathArrays and MathUtils.
> > Please let me know.If so i will once again re-factor these changes
> > and submit the patch.
>
> Yes, it is probably interesting to move them in one of these utility
> classes, but no, don't refactor your patch yet. Wait until we have
> committed something so the next changes will be smaller. As Gilles said,
> the patch is really big now, so keeping modifying it completely just to
> do some fine tuning like this is not the way to go. Let's first have
> something that work, is reasonably documented, is compliant with our
> main development rules and produces results, and then we will fine tune
> it, move methods, adjust defaults, check performance ...
>
Sure, I will come up with this change and propose in a separate thread

> I had a difficult day and have a headache right now, so I will not do
> the commit just now. I need at least a few more hours to work on it. I
> hope to be able to commit something in about 24h, but no promise. Just
> refrain from doing another patch that would imply I redo all the same
> work once more. We are very close to have a first version of your work
> included in the main repository (with a few changes). Once it's in
> place, we will continue the discussion and tightening up the remaining
> loose ends.
>
Sure i will await for the commit.

thanks
venkat

> best regards,
> Luc
>
> >
> > Thanks for reading this through and for your time in reviewing .
> > Please let me know your opinion on all of these.
> >
> > thanks venkat.
> >
> >
> > -------------------------------------------- On Wed, 18/6/14, Gilles
> > <gi...@harfang.homelinux.org> wrote:
> >
> > Subject: Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion
> > about support on variations in, percentile calculation To:
> > dev@commons.apache.org Date: Wednesday, 18 June, 2014, 8:37 PM
> >
> > On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:
> >> Hi Gilles and Venkat,
> >>
> >> Le 18/06/2014 15:40, Gilles a écrit :
> >>> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe
> > wrote:
> >>>> Le 18/06/2014 14:32, Gilles a écrit :
> >>>>> Hello Luc.
> >>>>>
> >>>>>>>
> >>>>>>> See https://issues.apache.org/jira/browse/MATH-1129
> >>>>>>>
> >>>>>>> The problem reported was due to the
> > sorting procedure not behaving
> >>>>>>> correctly in the presence of NaN. This procedure could be
> >>>>>>> replaced by
> > an equivalent method from the JDK:
> >>>>>>> java.util.Arrays.sort(double[],int,int)
> >>>>>>>
> >>>>>>> Any objection?
> >>>>>>
> >>>>>> If you imply removing the select,
> > medianOf3 and partition methods,
> >>>>>> yes I would object. If you imply replacing
> > only the insertionSort method used
> >>>>>> for small sub-arrays, then I almost
> > agree.
> >>>>>
> >>>>> Issue 1129 concerns the latter: See the
> > comment in "insertionSort" in
> >>>>> the current code.
> >>>>>
> >>>>> However, for the former, you should really
> > have a look at MATH-1120
> >>>>> https://issues.apache.org/jira/browse/MATH-1120 The proposed
> >>>>> patch indeed touches those
> > methods.
> >>>>
> >>>> So I think it is worth fixing MATH-1120 first,
> > with the NaNStrategy and
> >>>> go back to MATH-1129 afterwards, to see if it
> > still applies of if
> >>>> MATH-1120 also fixed it.
> >>>
> >>> As per my last comment on MATH-1120, the
> > "NaNStrategy" part of the patch
> >>> is an extension of the initial feature request.
> >>
> >> Sure, but it is a really good addition and seems fair
> > to consider.
> >>
> >>>
> >>> As per the Javadoc, the CM code was originally
> > meant to handle (in some
> >>> way), the presence of NaN values within the data.
> > So I thought that this
> >>> should be fixed before extending the API (which
> > entailed additional
> >>> questions as the proposed patch is fairly
> > "massive").
> >>
> >> Yes, the patch is a big one, and it has been works on
> > for a while.
> >> As it has quite stabilized by now, I think it would be
> > the good time
> >> to include it (with some editing I could do) and to
> > start working with
> >> separate patches applied on top of this one.
> >>
> >> If we let it out of the code longer, it will become
> > more and more
> >> difficult for the contributor to work on it and to us
> > to include it.
> >>
> >>>
> >>> Especially, the new code is clearly untested for
> > performance, and this
> >>> seems to be an important argument by your previous
> > post.
> >>
> >> Yes, it is important as some issues were raised on it
> > (if you remember
> >> well, they were raised by one researcher from another
> > lab working on the
> >> same big project as you, just after the symposium when
> > we met).
> >>
> >>>
> >>> So, it is possible to add "isNaN" conditional, as I
> > did in "insertionSort"
> >>> and fix the current code without additional
> > impact.
> >>
> >> I really consider it as an ugly hack rather than
> > attempting to really
> >> fix the mess I did in this class.
> >>
> >>>
> >>> Or, we redesign (as per MATH-1120) which implies
> > forgetting about
> >>> performance for now. The patch removes
> > "insertionSort" altogether!
> >>
> >> Well, it is just one line calling Arrays.sort(work, begin, end); I
> >> can put the insertion sort back here while editing
> > the patch before
> >> committing it.
> >>
> >>> IMHO, it is too many changes at once...
> >>
> >> You know the popular saying about « premature
> > optimization ». I think it
> >> would be fair to temporarily forget about performance,
> > fix the issue and
> >> set up a new ground to work on based on MATH-1120
> > patch, then improve
> >> this by small corrections once the first huge patch has
> > been committed,
> >> and then look back at performances.
> >
> > That's fine with me. I was really waiting from someone else to apply
> > this patch. ;-)
> >
> > Gilles
> >
> >>
> >>
> >>>
> >>> What is the meaning of percentile when NaNs are
> > kept in the data
> >>> (strategy "FIXED")? For all the other strategies, the NaNs will
> >>> not
> > cause a problem
> >>> within the algorithms being discussed here (being
> > replaced by +inf
> >>> or -inf, or removed, or raising an exception).
> >>
> >> This is one point we should really discuss, but it can
> > be done later.
> >> I agree with you, FIXED is not appropriate here. We
> > could simply raise
> >> an exception if it is selected as not being meaningful
> > for this algorithm.
> >>
> >>>
> >>>
> >>>>
> >>>> I will have a look at the proposed patch,
> > including your comments.
> >>>
> >>> Thanks.
> >>
> >> So I did have a look and really think it is worth
> > applying it. I agree
> >> with your comments and will include your change for the
> > NaNs recoding.
> >> This change alone would solve MATH-1129 by the way.
> >>
> >> I also am quite puzzled with the various default
> > strategies and would
> >> propose only one. I would even suggest to select
> > immediately R8 instead
> >> of our legacy method, as it would nevertheless be an
> > improvement and
> >> would not break existing code (only provide better
> > results).
> >>
> >> As there are several customization area (method,
> > pivoting NaN strategy),
> >> it would also be interesting to use the builder
> > approach we started in
> >> optimization. This would also allow us to pass an
> > argument for the
> >> RANDOM pivoting where the user could select a specific
> > random generator
> >> with specific seed, I really do not like having a
> > random generator
> >> blindly created with the automatic time dependent
> > seed.
> >>
> >> Another point I already changed is the medianOf3 method
> > throwing an
> >> exception in addition to being deprecated. It is really
> > not friendly to
> >> users. So I replaced the exception by a call to the
> > appropriate enum
> >> method (but of course let the deprecation in place).
> >>
> >> I also changed the pivoting strategy to be an instance
> > field rather than
> >> a class field, just as all the other customization
> > parameters.
> >>
> >> best regards, Luc
> >>
> >>>
> >>> Gilles
> >>>
> >>>>> [...]
> >
> >
> > ---------------------------------------------------------------------
> >
> >
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
> >
> > ---------------------------------------------------------------------
> >
> >
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Posted by Luc Maisonobe <lu...@spaceroots.org>.
Le 18/06/2014 20:28, venkatesha m a écrit :
> Hi Luc, Gilles

Hi Venkat

> 
> First of, Iam immensly thankful to all your comments on this patch.
> Next, i am attaching my new patch with today's date(18-jun). However
> please advise if i need to remove the old patch file if it confuses.

OK, I'll roll back my current work and will restart from this new patch.
Please wait until it is committed before submitting a new version,
otherwise it will be a nightmare to keep track of changes. Let's do it
in small tractable steps.

> 
> Please find my response below. The new patch has the suggested
> changes in the switch case for nan handling; But; However i have my
> view points on the different default nan strategies associated to
> types. Please permit me to explain (sorry for long summary)
> 
> First, i would like to leave the Default implementation of Percentile
> as-is (Meaning in my MATH-1120 patch it is mapped to Type.CM)since
> otherwise we will break user old expectation even for non nan and non
> inf entries as well. The existing Percentile implementation also did
> a copy of the array slice before doing kth slection(the old evaluate
> method can be refered) and even i am doing the same it slightly in
> different circumstance in the switch case with nan filtering. The
> existing tests could fail if we change the default types to say R_8
> (please refer to PercentileTest.java code as well to see the finer
> variations in expected values that is being looked at for different
> types. I have used R execution as the basis for setting those
> values).

Yes, the results would slightly change because the new algorithm you set
up is better. I would not qualify this as "breaking" user code. For me,
breaking user code is when the semantics or the API really changes.

We are already plagued in [math] with too stringent backward
compatibility, so adding another stricter interpretation that even
slight numerical improvements should never occur would completely block
all releases.

Our current handling of NaNs fails and do not even fulfill what I have
written in the Javadoc (see MATH-1129). So there *is* a bug. We cannot
say fixing a bug changes results so we must preserve bugs at all costs!

> 
> Secondly, Percentile.java header comment states somewhere to an
> effect that NaNs would be (left as-is and) handled by java's default
> sort behavior and no removal being done.

And this was not really done. NaNs were sorted inconsistently.

> So for me to map this
> behaviour to new implementation; it was NaNStrategy.FIXED that came
> close and didnt require any of the existing test cases for the
> existing Percentile behavior to change.

I don't agree. The Javadoc reads:

 This ordering makes Double.NaN</code> larger than any other value
 (including Double.POSITIVE_INFINITY)

So the best match would be NaNStrategy.MAXIMAL.

> What i am re-iterating here
> is the existing behavior tests have completely passed with new
> Type.CM and FIXED. (And now i have added several more tests including
> different types as well).
> 
> Thirdly, While all the R_x (where x :[1-9]) types as run and verified
> by R tool; seemed to clearly convey the NaNs needed to be removed and
> hence you see that i have used different strategy
> NaNStrategy#REMOVED. I agree while multiple defaults are not wise to
> have ; however; if we are forced to have Apache CM as supported type
> (which is not one of R_x types) and we have the need to support
> multiple variants (R1- R9) ; then it is inevitable to have type
> sepcific NaNStrategy as per the need.

I don't understand your rational. We are speaking about default here. As
we indeed need some backward compatibility at API level, we cannot force
users to select a specific strategy (as this would force users to change
their existing code), so we need a default. You say it should be the
triplet (Type.CM, NaNStrategy.FIXED, PivotingStrategy.MEDIAN_OF_3), I
say it could be (Type.R_1, PivotingStrategy.MEDIAN_OF_3).

If NaNStrategy is really something that may be chosen independently of
the Type, whatever we finally choose, we can say the default is a
consistent selection of one triplet. Users of the new API who decide by
themselve it is worth changing their code will have to do their own
selection of the three components of the triplet. The Junit tests can of
course use the full-blown new API and select something different, just
to be consistent with the reference R results (using
NaNStrategy.REMOVED). It doesn't need to be some second level default
when the user has selected R_i, but did not select a NaNStrategy, then
we select a better default automatically.

If on the other hand NaNStrategy is not something that may be chosen
independently of the Type (perhaps because the algorithm would not work
in this case), then it should not be a separate choice and it should be
hardwired in the Type, without any possibility for the user to change
it. I doubt this is the case.

> I also feel ; NaN handling
> should be allowed for overriding atleast in a controlled manner as
> different use cases may exist for needing this variation in nan
> handling. Therefore IMO while we could avoid the public access to
> change these defaults; it is relevant to support these variations of
> nan handling on a per type and allow atleast sub classes to override
> if a rare need arises. While the very name NaNStrategy reminds me of
> different ways to look at that; i feel we will be much restricted if
> we just said that we stick to one way of NaNHandling for all types.
> Please let me know your thoughts.

Subclassing seems too much for this. Also I see your latest patch has
some protected fields to allow for this, which is not allowed in our
checkstyle/findbugs settings (but this is a detail, we could replace the
protected fields by private field and protected setters if really needed).

The initial idea of your patch was sufficient to me : a few enums for
the various parts, a few methods to selects the enums (you used
constructors while I prefer builder pattern, but it's again only an
implementation detail), and defaults for existing code which simply
doesn't set any of these enums since they did not exist in previous
version. It was simple and practical to me. The only thing I frowned
upon was having several levels of defaults in the patch before today,
and the inheritance complexity in the patch from today.

> 
> Next, Regarding the PivotingStrategy; At first, i wanted to convey
> here that to have all the partitioning, pivoting and selection in
> separated classes/enums than inside main class. I have made it as
> static due to the fact that; it is more of a non-functional
> requirement and felt that it need not be set for every instance (more
> of a global setting that doesnt vary across types). Please correct me
> here and let know if it still needs to be per instance.

Anything that is global and can be changed is dangerous in a low level
library. You can never know how your code will be called. I do develop
other libraries that use Apache Commons Math (several independent
different layers of libraries), and other people do develop complete
application on top of the libraries I develop. Now consider that in one
of my middle level library I need this percentile for an obscure
internal computation and I decide to use PivotingStrategy.CENTRAL. Then
the girl who uses my library and completely ignore the fact I rely on
Percentile deep inside the guts of my code can also decide to use
Percentile with PivotingStrategy.MEDIAN_OF_3. Depending on her code
begin called before or after mine, of worse, both codes being
interleaved during the program main loop, this would give unexpected
results.

As a rule of thumb, class fields that can be modified are a nightmare in
many code (unfortunately sometimes you cannot avoid them), and its even
a worse nightmare when you do not control all parts of the code in a
single team, which is obviously the case with libraries.

> I also made
> it package accessable/settable solely because medianOf3 method had
> been package level for the sole intent of possible overriding of the
> same within that package. Meaning; if some one really needed to
> tinker around pivoting they need a way to do it which i have provided
> it using a strategy.

No, I really think it was package level because of a programming error,
and I even think I made this programming error. This thing should have
been private, and now we have to live with it, at least until 4.0 when
we will be able to remove it.

> 
> Next, In the current patch that i am going to attach as new dated
> patch (since you have already started looking at the old one ; which
> i would leave it as is). I find many utility type methods;
> replaceNaN, removeNaN( Predicated Lists ) and copyOf(values,
> begin.length) and as well as KthSelector with PivotingStrategy etc
> all of which can perhaps make its way to MathArrays and MathUtils.
> Please let me know.If so i will once again re-factor these changes
> and submit the patch.

Yes, it is probably interesting to move them in one of these utility
classes, but no, don't refactor your patch yet. Wait until we have
committed something so the next changes will be smaller. As Gilles said,
the patch is really big now, so keeping modifying it completely just to
do some fine tuning like this is not the way to go. Let's first have
something that work, is reasonably documented, is compliant with our
main development rules and produces results, and then we will fine tune
it, move methods, adjust defaults, check performance ...

I had a difficult day and have a headache right now, so I will not do
the commit just now. I need at least a few more hours to work on it. I
hope to be able to commit something in about 24h, but no promise. Just
refrain from doing another patch that would imply I redo all the same
work once more. We are very close to have a first version of your work
included in the main repository (with a few changes). Once it's in
place, we will continue the discussion and tightening up the remaining
loose ends.

best regards,
Luc

> 
> Thanks for reading this through and for your time in reviewing .
> Please let me know your opinion on all of these.
> 
> thanks venkat.
> 
> 
> -------------------------------------------- On Wed, 18/6/14, Gilles
> <gi...@harfang.homelinux.org> wrote:
> 
> Subject: Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion
> about support on variations in, percentile calculation To:
> dev@commons.apache.org Date: Wednesday, 18 June, 2014, 8:37 PM
> 
> On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:
>> Hi Gilles and Venkat,
>> 
>> Le 18/06/2014 15:40, Gilles a écrit :
>>> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe
> wrote:
>>>> Le 18/06/2014 14:32, Gilles a écrit :
>>>>> Hello Luc.
>>>>> 
>>>>>>> 
>>>>>>> See https://issues.apache.org/jira/browse/MATH-1129
>>>>>>> 
>>>>>>> The problem reported was due to the
> sorting procedure not behaving
>>>>>>> correctly in the presence of NaN. This procedure could be
>>>>>>> replaced by
> an equivalent method from the JDK:
>>>>>>> java.util.Arrays.sort(double[],int,int)
>>>>>>> 
>>>>>>> Any objection?
>>>>>> 
>>>>>> If you imply removing the select,
> medianOf3 and partition methods,
>>>>>> yes I would object. If you imply replacing
> only the insertionSort method used
>>>>>> for small sub-arrays, then I almost
> agree.
>>>>> 
>>>>> Issue 1129 concerns the latter: See the
> comment in "insertionSort" in
>>>>> the current code.
>>>>> 
>>>>> However, for the former, you should really
> have a look at MATH-1120
>>>>> https://issues.apache.org/jira/browse/MATH-1120 The proposed
>>>>> patch indeed touches those
> methods.
>>>> 
>>>> So I think it is worth fixing MATH-1120 first,
> with the NaNStrategy and
>>>> go back to MATH-1129 afterwards, to see if it
> still applies of if
>>>> MATH-1120 also fixed it.
>>> 
>>> As per my last comment on MATH-1120, the
> "NaNStrategy" part of the patch
>>> is an extension of the initial feature request.
>> 
>> Sure, but it is a really good addition and seems fair
> to consider.
>> 
>>> 
>>> As per the Javadoc, the CM code was originally
> meant to handle (in some
>>> way), the presence of NaN values within the data.
> So I thought that this
>>> should be fixed before extending the API (which
> entailed additional
>>> questions as the proposed patch is fairly
> "massive").
>> 
>> Yes, the patch is a big one, and it has been works on
> for a while.
>> As it has quite stabilized by now, I think it would be
> the good time
>> to include it (with some editing I could do) and to
> start working with
>> separate patches applied on top of this one.
>> 
>> If we let it out of the code longer, it will become
> more and more
>> difficult for the contributor to work on it and to us
> to include it.
>> 
>>> 
>>> Especially, the new code is clearly untested for
> performance, and this
>>> seems to be an important argument by your previous
> post.
>> 
>> Yes, it is important as some issues were raised on it
> (if you remember
>> well, they were raised by one researcher from another
> lab working on the
>> same big project as you, just after the symposium when
> we met).
>> 
>>> 
>>> So, it is possible to add "isNaN" conditional, as I
> did in "insertionSort"
>>> and fix the current code without additional
> impact.
>> 
>> I really consider it as an ugly hack rather than
> attempting to really
>> fix the mess I did in this class.
>> 
>>> 
>>> Or, we redesign (as per MATH-1120) which implies
> forgetting about
>>> performance for now. The patch removes
> "insertionSort" altogether!
>> 
>> Well, it is just one line calling Arrays.sort(work, begin, end); I
>> can put the insertion sort back here while editing
> the patch before
>> committing it.
>> 
>>> IMHO, it is too many changes at once...
>> 
>> You know the popular saying about « premature
> optimization ». I think it
>> would be fair to temporarily forget about performance,
> fix the issue and
>> set up a new ground to work on based on MATH-1120
> patch, then improve
>> this by small corrections once the first huge patch has
> been committed,
>> and then look back at performances.
> 
> That's fine with me. I was really waiting from someone else to apply 
> this patch. ;-)
> 
> Gilles
> 
>> 
>> 
>>> 
>>> What is the meaning of percentile when NaNs are
> kept in the data
>>> (strategy "FIXED")? For all the other strategies, the NaNs will
>>> not
> cause a problem
>>> within the algorithms being discussed here (being
> replaced by +inf
>>> or -inf, or removed, or raising an exception).
>> 
>> This is one point we should really discuss, but it can
> be done later.
>> I agree with you, FIXED is not appropriate here. We
> could simply raise
>> an exception if it is selected as not being meaningful
> for this algorithm.
>> 
>>> 
>>> 
>>>> 
>>>> I will have a look at the proposed patch,
> including your comments.
>>> 
>>> Thanks.
>> 
>> So I did have a look and really think it is worth
> applying it. I agree
>> with your comments and will include your change for the
> NaNs recoding.
>> This change alone would solve MATH-1129 by the way.
>> 
>> I also am quite puzzled with the various default
> strategies and would
>> propose only one. I would even suggest to select
> immediately R8 instead
>> of our legacy method, as it would nevertheless be an
> improvement and
>> would not break existing code (only provide better
> results).
>> 
>> As there are several customization area (method,
> pivoting NaN strategy),
>> it would also be interesting to use the builder
> approach we started in
>> optimization. This would also allow us to pass an
> argument for the
>> RANDOM pivoting where the user could select a specific
> random generator
>> with specific seed, I really do not like having a
> random generator
>> blindly created with the automatic time dependent
> seed.
>> 
>> Another point I already changed is the medianOf3 method
> throwing an
>> exception in addition to being deprecated. It is really
> not friendly to
>> users. So I replaced the exception by a call to the
> appropriate enum
>> method (but of course let the deprecation in place).
>> 
>> I also changed the pivoting strategy to be an instance
> field rather than
>> a class field, just as all the other customization
> parameters.
>> 
>> best regards, Luc
>> 
>>> 
>>> Gilles
>>> 
>>>>> [...]
> 
> 
> ---------------------------------------------------------------------
>
> 
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
>
> 
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Posted by venkatesha m <ts...@yahoo.com.INVALID>.
Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am attaching my new patch with today's date(18-jun). However please advise if i need to remove the old patch file if it confuses.

Please find my response below. The new patch has the suggested changes in the switch case for nan handling; But; However i have my view points on the different default nan strategies associated to types. Please permit me to explain (sorry for long summary)

First,
i would like to leave the Default implementation of Percentile as-is (Meaning in my MATH-1120 patch it is mapped to Type.CM)since otherwise we will break user old expectation even for non nan and non inf entries as well. The existing Percentile implementation also did a copy of the array slice before doing kth slection(the old evaluate method can be refered) and even i am doing the same it slightly in different circumstance in the switch case with nan filtering. The existing tests could fail if we change the default types to say R_8 (please refer to PercentileTest.java code as well to see the finer variations in expected values that is being looked at for different types. I have used R execution as the basis for setting those values).

Secondly,
Percentile.java header comment states somewhere to an effect that NaNs would be (left as-is and) handled by java's default sort behavior and no removal being done. So for me to map this behaviour to new implementation; it was NaNStrategy.FIXED that came close and didnt require any of the existing test cases for the existing Percentile behavior to change. What i am re-iterating here is the existing behavior tests have completely passed with new Type.CM and FIXED. (And now i have added several more tests including different types as well).

Thirdly,
While all the R_x (where x :[1-9]) types as run and verified by R tool; seemed to clearly convey the NaNs needed to be removed and hence you see that i have used different strategy NaNStrategy#REMOVED.
I agree while multiple defaults are not wise to have ; however; if we are forced to have Apache CM as supported type (which is not one of R_x types) and we have the need to support multiple variants (R1- R9) ; then it is inevitable to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a controlled manner as different use cases may exist for needing this variation in nan handling. Therefore IMO while we could avoid the public access to change these defaults; it is relevant to support these variations of nan handling on a per type and allow atleast sub classes to override if a rare need arises.
While the very name NaNStrategy reminds me of different ways to look at that; i feel we will be much restricted if we just said that we stick to one way of NaNHandling for all types. Please let me know your thoughts.

Next,
Regarding the PivotingStrategy; At first, i wanted to convey here that to have all the partitioning, pivoting and selection in separated classes/enums than inside main class. I have made it as static due to the fact that; it is more of a non-functional requirement and felt that it need not be set for every instance (more of a global setting that doesnt vary across types).
Please correct me here and let know if it still needs to be per instance.
I also made it package accessable/settable solely because medianOf3 method had been package level for the sole intent of possible overriding of the same within that package. Meaning; if some one really needed to tinker around pivoting they need a way to do it which i have provided it using a strategy.

Next,
In the current patch that i am going to attach as new dated patch (since you have already started looking at the old one ; which i would leave it as is). I find many utility type methods; replaceNaN, removeNaN( Predicated Lists ) and copyOf(values, begin.length) and as well as KthSelector with PivotingStrategy etc all of which can perhaps make its way to MathArrays and MathUtils. Please let me know.If so i will once again re-factor these changes and submit the patch.

Thanks for reading this through and for your time in reviewing . Please let me know your opinion on all of these.

thanks
venkat.


--------------------------------------------
On Wed, 18/6/14, Gilles <gi...@harfang.homelinux.org> wrote:

 Subject: Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation
 To: dev@commons.apache.org
 Date: Wednesday, 18 June, 2014, 8:37 PM
 
 On Wed, 18 Jun 2014 16:39:12 +0200,
 Luc Maisonobe wrote:
 > Hi Gilles and Venkat,
 > 
 > Le 18/06/2014 15:40, Gilles a écrit :
 >> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe
 wrote:
 >>> Le 18/06/2014 14:32, Gilles a écrit :
 >>>> Hello Luc.
 >>>> 
 >>>>>> 
 >>>>>> See
 >>>>>>   https://issues.apache.org/jira/browse/MATH-1129
 >>>>>> 
 >>>>>> The problem reported was due to the
 sorting procedure not behaving
 >>>>>> correctly in the presence of NaN.
 >>>>>> This procedure could be replaced by
 an equivalent method from the JDK:
 >>>>>>   java.util.Arrays.sort(double[],int,int)
 >>>>>> 
 >>>>>> Any objection?
 >>>>> 
 >>>>> If you imply removing the select,
 medianOf3 and partition methods,
 >>>>> yes I
 >>>>> would object. If you imply replacing
 only the insertionSort method used
 >>>>> for small sub-arrays, then I almost
 agree.
 >>>> 
 >>>> Issue 1129 concerns the latter: See the
 comment in "insertionSort" in
 >>>> the
 >>>> current code.
 >>>> 
 >>>> However, for the former, you should really
 have a look at MATH-1120
 >>>>   https://issues.apache.org/jira/browse/MATH-1120
 >>>> The proposed patch indeed touches those
 methods.
 >>> 
 >>> So I think it is worth fixing MATH-1120 first,
 with the NaNStrategy and
 >>> go back to MATH-1129 afterwards, to see if it
 still applies of if
 >>> MATH-1120 also fixed it.
 >> 
 >> As per my last comment on MATH-1120, the
 "NaNStrategy" part of the patch
 >> is an extension of the initial feature request.
 > 
 > Sure, but it is a really good addition and seems fair
 to consider.
 > 
 >> 
 >> As per the Javadoc, the CM code was originally
 meant to handle (in some
 >> way), the presence of NaN values within the data.
 So I thought that this
 >> should be fixed before extending the API (which
 entailed additional
 >> questions as the proposed patch is fairly
 "massive").
 > 
 > Yes, the patch is a big one, and it has been works on
 for a while.
 > As it has quite stabilized by now, I think it would be
 the good time
 > to include it (with some editing I could do) and to
 start working with
 > separate patches applied on top of this one.
 > 
 > If we let it out of the code longer, it will become
 more and more
 > difficult for the contributor to work on it and to us
 to include it.
 > 
 >> 
 >> Especially, the new code is clearly untested for
 performance, and this
 >> seems to be an important argument by your previous
 post.
 > 
 > Yes, it is important as some issues were raised on it
 (if you remember
 > well, they were raised by one researcher from another
 lab working on the
 > same big project as you, just after the symposium when
 we met).
 > 
 >> 
 >> So, it is possible to add "isNaN" conditional, as I
 did in "insertionSort"
 >> and fix the current code without additional
 impact.
 > 
 > I really consider it as an ugly hack rather than
 attempting to really
 > fix the mess I did in this class.
 > 
 >> 
 >> Or, we redesign (as per MATH-1120) which implies
 forgetting about
 >> performance for now. The patch removes
 "insertionSort" altogether!
 > 
 > Well, it is just one line calling
 >   Arrays.sort(work, begin, end);
 > I can put the insertion sort back here while editing
 the patch before
 > committing it.
 > 
 >> IMHO, it is too many changes at once...
 > 
 > You know the popular saying about « premature
 optimization ». I think it
 > would be fair to temporarily forget about performance,
 fix the issue and
 > set up a new ground to work on based on MATH-1120
 patch, then improve
 > this by small corrections once the first huge patch has
 been committed,
 > and then look back at performances.
 
 That's fine with me. I was really waiting from someone else
 to apply
 this patch. ;-)
 
 Gilles
 
 > 
 > 
 >> 
 >> What is the meaning of percentile when NaNs are
 kept in the data
 >> (strategy "FIXED")?
 >> For all the other strategies, the NaNs will not
 cause a problem
 >> within the algorithms being discussed here (being
 replaced by +inf
 >> or -inf, or removed, or raising an exception).
 > 
 > This is one point we should really discuss, but it can
 be done later.
 > I agree with you, FIXED is not appropriate here. We
 could simply raise
 > an exception if it is selected as not being meaningful
 for this algorithm.
 > 
 >> 
 >> 
 >>> 
 >>> I will have a look at the proposed patch,
 including your comments.
 >> 
 >> Thanks.
 > 
 > So I did have a look and really think it is worth
 applying it. I agree
 > with your comments and will include your change for the
 NaNs recoding.
 > This change alone would solve MATH-1129 by the way.
 > 
 > I also am quite puzzled with the various default
 strategies and would
 > propose only one. I would even suggest to select
 immediately R8 instead
 > of our legacy method, as it would nevertheless be an
 improvement and
 > would not break existing code (only provide better
 results).
 > 
 > As there are several customization area (method,
 pivoting NaN strategy),
 > it would also be interesting to use the builder
 approach we started in
 > optimization. This would also allow us to pass an
 argument for the
 > RANDOM pivoting where the user could select a specific
 random generator
 > with specific seed, I really do not like having a
 random generator
 > blindly created with the automatic time dependent
 seed.
 > 
 > Another point I already changed is the medianOf3 method
 throwing an
 > exception in addition to being deprecated. It is really
 not friendly to
 > users. So I replaced the exception by a call to the
 appropriate enum
 > method (but of course let the deprecation in place).
 > 
 > I also changed the pivoting strategy to be an instance
 field rather than
 > a class field, just as all the other customization
 parameters.
 > 
 > best regards,
 > Luc
 > 
 >> 
 >> Gilles
 >> 
 >>>> [...]
 
 
 ---------------------------------------------------------------------
 To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
 For additional commands, e-mail: dev-help@commons.apache.org
 
 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Posted by Gilles <gi...@harfang.homelinux.org>.
On Wed, 18 Jun 2014 16:39:12 +0200, Luc Maisonobe wrote:
> Hi Gilles and Venkat,
>
> Le 18/06/2014 15:40, Gilles a écrit :
>> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
>>> Le 18/06/2014 14:32, Gilles a écrit :
>>>> Hello Luc.
>>>>
>>>>>>
>>>>>> See
>>>>>>   https://issues.apache.org/jira/browse/MATH-1129
>>>>>>
>>>>>> The problem reported was due to the sorting procedure not 
>>>>>> behaving
>>>>>> correctly in the presence of NaN.
>>>>>> This procedure could be replaced by an equivalent method from 
>>>>>> the JDK:
>>>>>>   java.util.Arrays.sort(double[],int,int)
>>>>>>
>>>>>> Any objection?
>>>>>
>>>>> If you imply removing the select, medianOf3 and partition 
>>>>> methods,
>>>>> yes I
>>>>> would object. If you imply replacing only the insertionSort 
>>>>> method used
>>>>> for small sub-arrays, then I almost agree.
>>>>
>>>> Issue 1129 concerns the latter: See the comment in "insertionSort" 
>>>> in
>>>> the
>>>> current code.
>>>>
>>>> However, for the former, you should really have a look at 
>>>> MATH-1120
>>>>   https://issues.apache.org/jira/browse/MATH-1120
>>>> The proposed patch indeed touches those methods.
>>>
>>> So I think it is worth fixing MATH-1120 first, with the NaNStrategy 
>>> and
>>> go back to MATH-1129 afterwards, to see if it still applies of if
>>> MATH-1120 also fixed it.
>>
>> As per my last comment on MATH-1120, the "NaNStrategy" part of the 
>> patch
>> is an extension of the initial feature request.
>
> Sure, but it is a really good addition and seems fair to consider.
>
>>
>> As per the Javadoc, the CM code was originally meant to handle (in 
>> some
>> way), the presence of NaN values within the data. So I thought that 
>> this
>> should be fixed before extending the API (which entailed additional
>> questions as the proposed patch is fairly "massive").
>
> Yes, the patch is a big one, and it has been works on for a while.
> As it has quite stabilized by now, I think it would be the good time
> to include it (with some editing I could do) and to start working 
> with
> separate patches applied on top of this one.
>
> If we let it out of the code longer, it will become more and more
> difficult for the contributor to work on it and to us to include it.
>
>>
>> Especially, the new code is clearly untested for performance, and 
>> this
>> seems to be an important argument by your previous post.
>
> Yes, it is important as some issues were raised on it (if you 
> remember
> well, they were raised by one researcher from another lab working on 
> the
> same big project as you, just after the symposium when we met).
>
>>
>> So, it is possible to add "isNaN" conditional, as I did in 
>> "insertionSort"
>> and fix the current code without additional impact.
>
> I really consider it as an ugly hack rather than attempting to really
> fix the mess I did in this class.
>
>>
>> Or, we redesign (as per MATH-1120) which implies forgetting about
>> performance for now. The patch removes "insertionSort" altogether!
>
> Well, it is just one line calling
>   Arrays.sort(work, begin, end);
> I can put the insertion sort back here while editing the patch before
> committing it.
>
>> IMHO, it is too many changes at once...
>
> You know the popular saying about « premature optimization ». I think 
> it
> would be fair to temporarily forget about performance, fix the issue 
> and
> set up a new ground to work on based on MATH-1120 patch, then improve
> this by small corrections once the first huge patch has been 
> committed,
> and then look back at performances.

That's fine with me. I was really waiting from someone else to apply
this patch. ;-)

Gilles

>
>
>>
>> What is the meaning of percentile when NaNs are kept in the data
>> (strategy "FIXED")?
>> For all the other strategies, the NaNs will not cause a problem
>> within the algorithms being discussed here (being replaced by +inf
>> or -inf, or removed, or raising an exception).
>
> This is one point we should really discuss, but it can be done later.
> I agree with you, FIXED is not appropriate here. We could simply 
> raise
> an exception if it is selected as not being meaningful for this 
> algorithm.
>
>>
>>
>>>
>>> I will have a look at the proposed patch, including your comments.
>>
>> Thanks.
>
> So I did have a look and really think it is worth applying it. I 
> agree
> with your comments and will include your change for the NaNs 
> recoding.
> This change alone would solve MATH-1129 by the way.
>
> I also am quite puzzled with the various default strategies and would
> propose only one. I would even suggest to select immediately R8 
> instead
> of our legacy method, as it would nevertheless be an improvement and
> would not break existing code (only provide better results).
>
> As there are several customization area (method, pivoting NaN 
> strategy),
> it would also be interesting to use the builder approach we started 
> in
> optimization. This would also allow us to pass an argument for the
> RANDOM pivoting where the user could select a specific random 
> generator
> with specific seed, I really do not like having a random generator
> blindly created with the automatic time dependent seed.
>
> Another point I already changed is the medianOf3 method throwing an
> exception in addition to being deprecated. It is really not friendly 
> to
> users. So I replaced the exception by a call to the appropriate enum
> method (but of course let the deprecation in place).
>
> I also changed the pivoting strategy to be an instance field rather 
> than
> a class field, just as all the other customization parameters.
>
> best regards,
> Luc
>
>>
>> Gilles
>>
>>>> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129 and Re: [MATH-1120] Needed opinion about support on variations in, percentile calculation

Posted by Luc Maisonobe <lu...@spaceroots.org>.
Hi Gilles and Venkat,

Le 18/06/2014 15:40, Gilles a écrit :
> On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
>> Le 18/06/2014 14:32, Gilles a écrit :
>>> Hello Luc.
>>>
>>>>>
>>>>> See
>>>>>   https://issues.apache.org/jira/browse/MATH-1129
>>>>>
>>>>> The problem reported was due to the sorting procedure not behaving
>>>>> correctly in the presence of NaN.
>>>>> This procedure could be replaced by an equivalent method from the JDK:
>>>>>   java.util.Arrays.sort(double[],int,int)
>>>>>
>>>>> Any objection?
>>>>
>>>> If you imply removing the select, medianOf3 and partition methods,
>>>> yes I
>>>> would object. If you imply replacing only the insertionSort method used
>>>> for small sub-arrays, then I almost agree.
>>>
>>> Issue 1129 concerns the latter: See the comment in "insertionSort" in
>>> the
>>> current code.
>>>
>>> However, for the former, you should really have a look at MATH-1120
>>>   https://issues.apache.org/jira/browse/MATH-1120
>>> The proposed patch indeed touches those methods.
>>
>> So I think it is worth fixing MATH-1120 first, with the NaNStrategy and
>> go back to MATH-1129 afterwards, to see if it still applies of if
>> MATH-1120 also fixed it.
> 
> As per my last comment on MATH-1120, the "NaNStrategy" part of the patch
> is an extension of the initial feature request.

Sure, but it is a really good addition and seems fair to consider.

> 
> As per the Javadoc, the CM code was originally meant to handle (in some
> way), the presence of NaN values within the data. So I thought that this
> should be fixed before extending the API (which entailed additional
> questions as the proposed patch is fairly "massive").

Yes, the patch is a big one, and it has been works on for a while.
As it has quite stabilized by now, I think it would be the good time
to include it (with some editing I could do) and to start working with
separate patches applied on top of this one.

If we let it out of the code longer, it will become more and more
difficult for the contributor to work on it and to us to include it.

> 
> Especially, the new code is clearly untested for performance, and this
> seems to be an important argument by your previous post.

Yes, it is important as some issues were raised on it (if you remember
well, they were raised by one researcher from another lab working on the
same big project as you, just after the symposium when we met).

> 
> So, it is possible to add "isNaN" conditional, as I did in "insertionSort"
> and fix the current code without additional impact.

I really consider it as an ugly hack rather than attempting to really
fix the mess I did in this class.

> 
> Or, we redesign (as per MATH-1120) which implies forgetting about
> performance for now. The patch removes "insertionSort" altogether!

Well, it is just one line calling
  Arrays.sort(work, begin, end);
I can put the insertion sort back here while editing the patch before
committing it.

> IMHO, it is too many changes at once...

You know the popular saying about « premature optimization ». I think it
would be fair to temporarily forget about performance, fix the issue and
set up a new ground to work on based on MATH-1120 patch, then improve
this by small corrections once the first huge patch has been committed,
and then look back at performances.


> 
> What is the meaning of percentile when NaNs are kept in the data
> (strategy "FIXED")?
> For all the other strategies, the NaNs will not cause a problem
> within the algorithms being discussed here (being replaced by +inf
> or -inf, or removed, or raising an exception).

This is one point we should really discuss, but it can be done later.
I agree with you, FIXED is not appropriate here. We could simply raise
an exception if it is selected as not being meaningful for this algorithm.

> 
> 
>>
>> I will have a look at the proposed patch, including your comments.
> 
> Thanks.

So I did have a look and really think it is worth applying it. I agree
with your comments and will include your change for the NaNs recoding.
This change alone would solve MATH-1129 by the way.

I also am quite puzzled with the various default strategies and would
propose only one. I would even suggest to select immediately R8 instead
of our legacy method, as it would nevertheless be an improvement and
would not break existing code (only provide better results).

As there are several customization area (method, pivoting NaN strategy),
it would also be interesting to use the builder approach we started in
optimization. This would also allow us to pass an argument for the
RANDOM pivoting where the user could select a specific random generator
with specific seed, I really do not like having a random generator
blindly created with the automatic time dependent seed.

Another point I already changed is the medianOf3 method throwing an
exception in addition to being deprecated. It is really not friendly to
users. So I replaced the exception by a call to the appropriate enum
method (but of course let the deprecation in place).

I also changed the pivoting strategy to be an instance field rather than
a class field, just as all the other customization parameters.

best regards,
Luc

> 
> Gilles
> 
>>> [...]
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129

Posted by Gilles <gi...@harfang.homelinux.org>.
On Wed, 18 Jun 2014 15:02:41 +0200, Luc Maisonobe wrote:
> Le 18/06/2014 14:32, Gilles a écrit :
>> Hello Luc.
>>
>>>>
>>>> See
>>>>   https://issues.apache.org/jira/browse/MATH-1129
>>>>
>>>> The problem reported was due to the sorting procedure not behaving
>>>> correctly in the presence of NaN.
>>>> This procedure could be replaced by an equivalent method from the 
>>>> JDK:
>>>>   java.util.Arrays.sort(double[],int,int)
>>>>
>>>> Any objection?
>>>
>>> If you imply removing the select, medianOf3 and partition methods, 
>>> yes I
>>> would object. If you imply replacing only the insertionSort method 
>>> used
>>> for small sub-arrays, then I almost agree.
>>
>> Issue 1129 concerns the latter: See the comment in "insertionSort" 
>> in the
>> current code.
>>
>> However, for the former, you should really have a look at MATH-1120
>>   https://issues.apache.org/jira/browse/MATH-1120
>> The proposed patch indeed touches those methods.
>
> So I think it is worth fixing MATH-1120 first, with the NaNStrategy 
> and
> go back to MATH-1129 afterwards, to see if it still applies of if
> MATH-1120 also fixed it.

As per my last comment on MATH-1120, the "NaNStrategy" part of the 
patch
is an extension of the initial feature request.

As per the Javadoc, the CM code was originally meant to handle (in some
way), the presence of NaN values within the data. So I thought that 
this
should be fixed before extending the API (which entailed additional
questions as the proposed patch is fairly "massive").

Especially, the new code is clearly untested for performance, and this
seems to be an important argument by your previous post.

So, it is possible to add "isNaN" conditional, as I did in 
"insertionSort"
and fix the current code without additional impact.

Or, we redesign (as per MATH-1120) which implies forgetting about
performance for now. The patch removes "insertionSort" altogether!
IMHO, it is too many changes at once...

What is the meaning of percentile when NaNs are kept in the data
(strategy "FIXED")?
For all the other strategies, the NaNs will not cause a problem
within the algorithms being discussed here (being replaced by +inf
or -inf, or removed, or raising an exception).


>
> I will have a look at the proposed patch, including your comments.

Thanks.

Gilles

>> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129

Posted by Luc Maisonobe <lu...@spaceroots.org>.
Le 18/06/2014 14:32, Gilles a écrit :
> Hello Luc.
> 
>>>
>>> See
>>>   https://issues.apache.org/jira/browse/MATH-1129
>>>
>>> The problem reported was due to the sorting procedure not behaving
>>> correctly in the presence of NaN.
>>> This procedure could be replaced by an equivalent method from the JDK:
>>>   java.util.Arrays.sort(double[],int,int)
>>>
>>> Any objection?
>>
>> If you imply removing the select, medianOf3 and partition methods, yes I
>> would object. If you imply replacing only the insertionSort method used
>> for small sub-arrays, then I almost agree.
> 
> Issue 1129 concerns the latter: See the comment in "insertionSort" in the
> current code.
> 
> However, for the former, you should really have a look at MATH-1120
>   https://issues.apache.org/jira/browse/MATH-1120
> The proposed patch indeed touches those methods.

So I think it is worth fixing MATH-1120 first, with the NaNStrategy and
go back to MATH-1129 afterwards, to see if it still applies of if
MATH-1120 also fixed it.

I will have a look at the proposed patch, including your comments.

> 
>>
>> However, I'm not sure this would be sufficient as the sort is done
>> partially in all the methods above. The former ones are used to split
>> the big array into smaller ones and sorting only the sub-arrays that are
>> needed to compute the percentile (it is really a selection algorithm,
>> not a full-sort algorithm). So I guess we should look at all the methods
>> at once to ensure proper handling of NaNs. The trick is that all
>> comparisons involving NaN return false (<, <=, >, >=, ==, !=),
>> regardless of the NaN being at the left hand side or right hand side of
>> the comparison (so we get the funny consequence that if a is a NaN, the
>> test a == a evaluates to false ...).
> 
> As of r1603217, "insertionSort" seems to behave correctly (at least on the
> example reported in MATH-1129).

Yes, but it would be fed with garbage as the former methods fail
(typically medianOf3 does not handle NaNs properly), so I guess in the
general case it would clearly fail.

> 
>>
>> The fast an insertion sort is used at the end is because it is very
>> simply and efficient for small arrays, which is exactly what happens
>> here: the array part has MIN_SELECT_SIZE elements or less, i.e. 15
>> elements or less.
> 
> That was my question, implicitly: is it worth not using the JDK, or IOW,
> is "insertionSort" so much faster than "java.util.Arrays.sort" that it is
> worth maintaining a custom implementation?

Yes it is as long as it is kept simple. Quicksort is not that good at
small arrays, and this method can be called a huge number of times on
numerous small arrays, to it should really consider this.

> 
>>
>> As I wrote this selection algorithm, I could do the check if you want.
> 
> What check?

Checking the algorithms and fixing them using the strategy "NaN is
larger everything", as already stated in the javadoc. But as MATH-1120
proposes a more complete solution and allows other strategies, I'll look
at it first.

best regards,
Luc

> 
> 
> Regards,
> Gilles
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129

Posted by Gilles <gi...@harfang.homelinux.org>.
Hello Luc.

>>
>> See
>>   https://issues.apache.org/jira/browse/MATH-1129
>>
>> The problem reported was due to the sorting procedure not behaving
>> correctly in the presence of NaN.
>> This procedure could be replaced by an equivalent method from the 
>> JDK:
>>   java.util.Arrays.sort(double[],int,int)
>>
>> Any objection?
>
> If you imply removing the select, medianOf3 and partition methods, 
> yes I
> would object. If you imply replacing only the insertionSort method 
> used
> for small sub-arrays, then I almost agree.

Issue 1129 concerns the latter: See the comment in "insertionSort" in 
the
current code.

However, for the former, you should really have a look at MATH-1120
   https://issues.apache.org/jira/browse/MATH-1120
The proposed patch indeed touches those methods.

>
> However, I'm not sure this would be sufficient as the sort is done
> partially in all the methods above. The former ones are used to split
> the big array into smaller ones and sorting only the sub-arrays that 
> are
> needed to compute the percentile (it is really a selection algorithm,
> not a full-sort algorithm). So I guess we should look at all the 
> methods
> at once to ensure proper handling of NaNs. The trick is that all
> comparisons involving NaN return false (<, <=, >, >=, ==, !=),
> regardless of the NaN being at the left hand side or right hand side 
> of
> the comparison (so we get the funny consequence that if a is a NaN, 
> the
> test a == a evaluates to false ...).

As of r1603217, "insertionSort" seems to behave correctly (at least on 
the
example reported in MATH-1129).

>
> The fast an insertion sort is used at the end is because it is very
> simply and efficient for small arrays, which is exactly what happens
> here: the array part has MIN_SELECT_SIZE elements or less, i.e. 15
> elements or less.

That was my question, implicitly: is it worth not using the JDK, or 
IOW,
is "insertionSort" so much faster than "java.util.Arrays.sort" that it 
is
worth maintaining a custom implementation?

>
> As I wrote this selection algorithm, I could do the check if you 
> want.

What check?


Regards,
Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1129

Posted by Luc Maisonobe <lu...@spaceroots.org>.
Le 18/06/2014 13:37, Gilles a écrit :
> Hi.

Hi Gilles,

> 
> See
>   https://issues.apache.org/jira/browse/MATH-1129
> 
> The problem reported was due to the sorting procedure not behaving
> correctly in the presence of NaN.
> This procedure could be replaced by an equivalent method from the JDK:
>   java.util.Arrays.sort(double[],int,int)
> 
> Any objection?

If you imply removing the select, medianOf3 and partition methods, yes I
would object. If you imply replacing only the insertionSort method used
for small sub-arrays, then I almost agree.

However, I'm not sure this would be sufficient as the sort is done
partially in all the methods above. The former ones are used to split
the big array into smaller ones and sorting only the sub-arrays that are
needed to compute the percentile (it is really a selection algorithm,
not a full-sort algorithm). So I guess we should look at all the methods
at once to ensure proper handling of NaNs. The trick is that all
comparisons involving NaN return false (<, <=, >, >=, ==, !=),
regardless of the NaN being at the left hand side or right hand side of
the comparison (so we get the funny consequence that if a is a NaN, the
test a == a evaluates to false ...).

The fast an insertion sort is used at the end is because it is very
simply and efficient for small arrays, which is exactly what happens
here: the array part has MIN_SELECT_SIZE elements or less, i.e. 15
elements or less.

As I wrote this selection algorithm, I could do the check if you want.

best regards,
Luc

> 
> 
> Regards,
> Gilles
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org