Posted to dev@commons.apache.org by Alex Herbert <al...@gmail.com> on 2021/11/14 09:29:30 UTC

[STATISTICS] Distribution support is connect

Both the discrete and continuous distribution interfaces have this property:

    /**
     * Indicates whether the support is connected, i.e. whether
     * all values between the lower and upper bound of the support
     * are included in the support.
     *
     * @return whether the support is connected.
     */
    boolean isSupportConnected();

It returns true for every distribution in the library.

Other stats libraries in Python, R, Matlab, Mathematica do not have
this property. The property is in commons Math3 and dates back 10
years to the package name change from Math2 to Math3. I did not chase
the commit history through SVN.

In Math3 only 5 real distributions and 1 discrete distribution test
this property. They all test it is true.

Interestingly the discrete distribution is the enumerated distribution
built from a set of discrete values. This may be a case for returning
false if certain values between the lower and upper range have a
probability of zero. But in this case it is valid behaviour to return
zero for the probability.
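
To make the enumerated case concrete, here is a minimal sketch. It is
not the Math3 class: EnumeratedSketch and its members are names made
up for illustration. It only shows that a value inside the bounds but
not in the enumerated set simply gets probability zero.

```java
import java.util.TreeMap;

/** Illustrative enumerated distribution; an assumption for this example, not the Math3 class. */
final class EnumeratedSketch {
    /** Probability mass keyed by value; the sorted keys give the bounds. */
    private final TreeMap<Integer, Double> pmf = new TreeMap<>();

    EnumeratedSketch(int[] values, double[] probabilities) {
        for (int i = 0; i < values.length; i++) {
            // Repeated values accumulate their probability.
            pmf.merge(values[i], probabilities[i], Double::sum);
        }
    }

    int getSupportLowerBound() {
        return pmf.firstKey();
    }

    int getSupportUpperBound() {
        return pmf.lastKey();
    }

    /** P(X = x); zero for any x that was not enumerated, including gaps inside the bounds. */
    double probability(int x) {
        return pmf.getOrDefault(x, 0.0);
    }

    public static void main(String[] args) {
        EnumeratedSketch d = new EnumeratedSketch(new int[] {1, 2, 5},
                                                  new double[] {0.25, 0.25, 0.5});
        // 3 lies between the bounds 1 and 5 but was never enumerated.
        System.out.println(d.probability(3));
    }
}
```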

It may be that this property was intended to be used to determine if
the distribution would throw an exception for certain values between
the lower and upper range for the support. In the [Statistics] version
of the distributions no exceptions are thrown. The return will be
either an appropriate extreme (+/- infinity) or NaN. There is also no
facility to determine what values within the support are not valid. So
the property isSupportConnected alone cannot be used to determine if the
value you are interested in is part of the support. This would require
an isSupported(double x) method.
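
As a rough sketch of the difference, assume a hypothetical
distribution that is uniform over [0, 1] and [2, 3]. The class
TwoIntervalUniform and the isSupported(double) method do not exist in
the library; they are invented here to show what the boolean property
cannot tell you.

```java
/** Illustrative only: uniform over [0, 1] and [2, 3], so the support has a gap (1, 2). */
final class TwoIntervalUniform {

    double getSupportLowerBound() {
        return 0;
    }

    double getSupportUpperBound() {
        return 3;
    }

    /** False, but this says nothing about where the gap is. */
    boolean isSupportConnected() {
        return false;
    }

    /** Hypothetical per-value query; not part of any existing interface. */
    boolean isSupported(double x) {
        return (x >= 0 && x <= 1) || (x >= 2 && x <= 3);
    }

    /** Density is 0.5 on each unit interval and zero elsewhere; no exception in the gap. */
    double density(double x) {
        return isSupported(x) ? 0.5 : 0;
    }

    public static void main(String[] args) {
        TwoIntervalUniform d = new TwoIntervalUniform();
        // Both values lie between the bounds; only the per-value query separates them.
        System.out.println(d.isSupported(0.5));
        System.out.println(d.isSupported(1.5));
    }
}
```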

I propose to remove this unused property from the distribution
interfaces prior to the initial 1.0 release to avoid this redundant
method.

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [STATISTICS] Distribution support is connect

Posted by Alex Herbert <al...@gmail.com>.
On Sun, 14 Nov 2021 at 19:19, Alex Herbert <al...@gmail.com> wrote:
>
>
>
> On Sun, 14 Nov 2021, 18:58 Phil Steitz, <ph...@gmail.com> wrote:
>> The reason it exists is to make the default inverse cum impl handle the
>> case where there is a gap in support.  If you want to add the
>> requirement that support is always connected, you can drop the code in
>> the default inverse cum that does this.
>
>
> The point is that this never occurs. This feature is only tested using a mock distribution that requires it.
>
> The feature can always be added back if a distribution requires it in the future. But currently there are no such distributions.

Further to the last post I'd like to say thanks Phil for pointing me
in the right direction.

A bit more context is in MATH-699 [1]. The isSupportConnected property is
used to determine if the CDF has a plateau and thus the inverse CDF
must eliminate the possibility that the returned value x is in a
region of the CDF which is a plateau and lower x to the minimum value
where CDF(x) >= p. So this functionality exists but is not used for
implementations in the library. It does this by a rather inefficient
binary search bracketed from the current solution x and the initial
lower bound which may be a long way from the solution x. Ideally some
information could be extracted from the Brent solver on the most
recently used bracketing interval to improve the scan downwards (i.e.
the most recent x value lower than the solution x that was visited by
the Brent solver).
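
A minimal sketch of that plateau step, under stated assumptions: this
is not the CM 3 code. The class PlateauInverseCdf, the example CDF,
and the method lowerToPlateauStart are made up for illustration, and
the Brent solver is replaced by a hard-coded starting value x that
sits on the plateau.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch only; not the Commons Math or Commons Statistics implementation. */
final class PlateauInverseCdf {

    /**
     * Hypothetical CDF with a gap in the support: mass 0.5 spread uniformly
     * over [0, 1] and mass 0.5 over [2, 3], so CDF(x) = 0.5 for all x in [1, 2].
     */
    static double cdf(double x) {
        if (x <= 0) {
            return 0;
        }
        if (x <= 1) {
            return 0.5 * x;
        }
        if (x <= 2) {
            return 0.5;
        }
        if (x <= 3) {
            return 0.5 + 0.5 * (x - 2);
        }
        return 1;
    }

    /**
     * Lowers a solution x to (approximately) the minimum value where CDF(x) >= p.
     * Requires a non-decreasing CDF with CDF(lower) < p <= CDF(x).
     */
    static double lowerToPlateauStart(DoubleUnaryOperator cdf, double p,
                                      double lower, double x, double tolerance) {
        double lo = lower; // invariant: CDF(lo) < p
        double hi = x;     // invariant: CDF(hi) >= p
        while (hi - lo > tolerance) {
            final double mid = 0.5 * (lo + hi);
            if (cdf.applyAsDouble(mid) >= p) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        // Suppose a root finder returned x = 1.7 for p = 0.5; it lies on the plateau.
        final double x = lowerToPlateauStart(PlateauInverseCdf::cdf, 0.5, 0.0, 1.7, 1e-12);
        System.out.println(x); // close to 1.0, the start of the plateau
    }
}
```

The bracket here always starts at the lower bound, which is the
inefficiency described above. A tighter lower end taken from the
solver would shorten the search without changing the result.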

Currently the inverse CDF for discrete distributions does not use this
property. So a simplification of the two distribution interfaces could
drop the methods. The AbstractContinuousDistribution could maintain
the method as protected and document it as relevant to solving the
inverse CDF. Or remove the property and support for plateaus. I would
favour the latter as this is engineering the abstract class for cases
that do not currently exist. In ticket MATH-699 you (Phil Steitz) even
stated "My inclination would be to keep the implementation in the base
class as simple as possible, documenting what it does and pushing the
responsibility for dealing with plateaus in the distribution to the
implementations that have these." This is something that I agree with.
It seems that the plateau issue is not relevant for the distributions
in the library. I would favour removing this support and noting in the
method that plateau support was removed, see MATH-699 and CM 3 for an
implementation.

I also note that when the abstract base class for the continuous
distribution was ported to STATISTICS the method to get the solver
absolute accuracy was dropped. It was not overridden by any class in
CM. However this may be of use for improvements to the default inverse
method to support very small p-values (see STATISTICS-36 [2]).

Alex

[1] https://issues.apache.org/jira/browse/MATH-699
[2] https://issues.apache.org/jira/browse/STATISTICS-36



Re: [STATISTICS] Distribution support is connect

Posted by Alex Herbert <al...@gmail.com>.
On Sun, 14 Nov 2021, 18:58 Phil Steitz, <ph...@gmail.com> wrote:

> The reason it exists is to make the default inverse cum impl handle the
> case where there is a gap in support.  If you want to add the
> requirement that support is always connected, you can drop the code in
> the default inverse cum that does this.


The point is that this never occurs. This feature is only tested using a
mock distribution that requires it.

The feature can always be added back if a distribution requires it in the
future. But currently there are no such distributions.

Alex



Re: [STATISTICS] Distribution support is connect

Posted by Phil Steitz <ph...@gmail.com>.

The reason it exists is to make the default inverse cum impl handle the 
case where there is a gap in support.  If you want to add the 
requirement that support is always connected, you can drop the code in 
the default inverse cum that does this.

Phil