Posted to dev@mahout.apache.org by Derek O'Callaghan <de...@ucd.ie> on 2010/09/21 19:57:39 UTC

Dirichlet - NormalModel.pdf() calculation problem

Hi Jeff,

I mentioned this issue in my last mail to the CDbw thread, but I thought 
I'd create a separate thread for it as it's a different problem 
(although similar).

When s0 is 1, NormalModel.computeParameters() will set stdDev to 
Double.MIN_VALUE. However, this causes a problem in subsequent calls to 
pdf() from DirichletState.adjustedProbability(). In such a case, the 
call to "double sd2 = stdDev * stdDev;" will set sd2 to 0, which causes 
pdf() to return NaN. This means that the call to 
UncommonDistributions.rMultinom() will return 0, and so (I think) all 
subsequent points will be assigned to cluster 0.

FYI I was able to work around this by changing the following in 
NormalModel.pdf():

return ex / (stdDev * SQRT2PI);

to:

double pdf = ex / (stdDev * SQRT2PI);
if (Double.isNaN(pdf)) {
  pdf = 0.0;
}
return pdf;
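
For context, here is a minimal self-contained sketch of the univariate case 
(simplified, not the actual NormalModel code) showing where the NaN comes 
from and where the check goes:

static final double SQRT2PI = Math.sqrt(2.0 * Math.PI);

// With stdDev == Double.MIN_VALUE, sd2 underflows to 0.0; when x equals the
// mean (as it does for the single point that formed the cluster), the
// exponent becomes -0.0 / 0.0 == NaN, so the pdf comes out as NaN.
static double pdf(double x, double mean, double stdDev) {
  double sd2 = stdDev * stdDev;
  double ex = Math.exp(-(x - mean) * (x - mean) / (2.0 * sd2));
  double pdf = ex / (stdDev * SQRT2PI);
  if (Double.isNaN(pdf)) {
    pdf = 0.0;
  }
  return pdf;
}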


As you mentioned in the other thread, 
AbstractCluster.computeParameters() will also set the radius to 
Double.MIN_VALUE when s0 is 1, although I'm not sure whether that's used 
anywhere that would cause a similar problem to the one in pdf() above.


Derek

Re: Dirichlet - NormalModel.pdf() calculation problem

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  The NormalModel pdf() and the GaussianCluster pdf() use different 
implementations; the latter uses UncommonDistributions.dNorm(), which 
returns Infinity if stdev==0. I will work on adjusting the stdev 
calculations with a very mild prior. You obviously know the right place 
to tickle the elephant here. Thanks!
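
For illustration, here is a rough scalar sketch of what folding in a mild 
prior might look like (hypothetical names; s0, s1 and s2 stand for the 
count, sum and sum of squares; this is not the actual computeParameters() 
code):

// Hypothetical sketch: blend the empirical variance with a small prior
// variance backed by a pseudo-count, so a singleton cluster (s0 == 1)
// still ends up with a usable, non-zero stdDev.
static double smoothedStdDev(double s0, double s1, double s2,
                             double priorVariance, double priorWeight) {
  double mean = s1 / s0;
  double empiricalVariance = Math.max(0.0, s2 / s0 - mean * mean);
  return Math.sqrt((s0 * empiricalVariance + priorWeight * priorVariance)
      / (s0 + priorWeight));
}

With priorVariance = 0.0001 and priorWeight = 1, a singleton cluster gets a 
stdDev of roughly 0.007 instead of Double.MIN_VALUE.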




Re: Dirichlet - NormalModel.pdf() calculation problem

Posted by Ted Dunning <te...@gmail.com>.
Actually NaN is the correct value here.  If stdDev == 0, then the
distribution is a delta function and the value is zero except where it is
infinite or undefined.

Much better is to prevent stdDev from being set to 0.  This is a problem
with all maximum-likelihood clustering techniques that don't use prior
distributions on the cluster parameters, especially the variance (aka the
standard deviation).  Applying even a very mild prior will prevent these
numerical problems, and applying a real prior could substantially improve
the performance of the clustering.  For instance, if the general scale of
your data is in the range of 1, then it is probably reasonable to expect
that clusters should not have a diameter (std) smaller than, say, 0.01.
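
As a rough sketch of that kind of floor (hypothetical names, not committed 
code):

// Keep the estimated stdDev from collapsing below a small fraction of the
// overall scale of the data (assumed here to be roughly 1.0).
static double flooredStdDev(double estimatedStdDev, double dataScale) {
  double minStdDev = 0.01 * dataScale;
  return Math.max(estimatedStdDev, minStdDev);
}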


Re: Dirichlet - NormalModel.pdf() calculation problem

Posted by Derek O'Callaghan <de...@ucd.ie>.
>  Oh that's brilliant! I have seen the same situation before 
> too but never found the reason for it. Personally, I'd prefer to 
> detect the divide by zero explicitly; something like:
> 
> if (stdDev > 0)
>     return ex / (stdDev * SQRT2PI);
> else
>     return 0;
>

Yep, that looks better than what I had, I'll use that instead.
 
> On the AbstractCluster point, since all Clusters (being 
> themselves Models in the latest refactoring) can now be used 
> directly by Dirichlet, the GaussianCluster subclass (which is 
> now equivalent to AssymetricSampledNormalModel if you check) 
> will have the same pdf problem. Check also 
> DistanceMeasureClusterDistribution which instantiates 
> DistanceMeasureClusters (equivalent to L1Models) for models and 
> GaussianClusterDistribution. Once these bake out a little I plan 
> to deprecate most of the current Dirichlet models (which were 
> experimental anyway and kind of a learning experience). There 
> are already unit tests for the new hierarchy that produce 
> equivalent results afaict.
>

Yeah, I had the same problem when using GaussianClusterDistribution; I can 
see that its pdf() will also generate NaN, as you say. I might hold off on 
Dirichlet for the moment, or I'll just use it with the workaround. I was 
just trying it out to see what kind of results I get.

Now I want to take a look at that clean eigenvectors problem :)
 

Re: Dirichlet - NormalModel.pdf() calculation problem

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Oh that's brilliant! I have seen the same situation before too but 
never found the reason for it. Personally, I'd prefer to detect the 
divide by zero explicitly; something like:

if (stdDev > 0)
     return ex / (stdDev * SQRT2PI);
else
     return 0;

On the AbstractCluster point, since all Clusters (being themselves 
Models in the latest refactoring) can now be used directly by Dirichlet, 
the GaussianCluster subclass (which is now equivalent to 
AssymetricSampledNormalModel if you check) will have the same pdf 
problem. Check also DistanceMeasureClusterDistribution which 
instantiates DistanceMeasureClusters (equivalent to L1Models) for models 
and GaussianClusterDistribution. Once these bake out a little I plan to 
deprecate most of the current Dirichlet models (which were experimental 
anyway and kind of a learning experience). There are already unit tests 
for the new hierarchy that produce equivalent results afaict.
