Posted to user@mahout.apache.org by Vasil Vasilev <va...@gmail.com> on 2011/06/27 10:49:38 UTC

Incorrect calculation of pdf

Hi,

Recently I wanted to use the Dirichlet clustering algorithm to cluster vectors
taken directly from vectorized text, whose dimensionality was around
50000. In this situation the algorithm fails to calculate the pdf of a
vector corresponding to a cluster center due to problems with numerical
precision during multiplication.

In this regard, what do you think of modifying the GaussianCluster.pdf()
method in such a way that it works with logarithmic probabilities?

Regards, Vasil

Re: Incorrect calculation of pdf

Posted by Ted Dunning <te...@gmail.com>.
Better to have a logPdf method that never does the exponentiation
internally.

But other than that detail, yes.

On Tue, Jun 28, 2011 at 8:56 AM, Jeff Eastman <je...@narus.com> wrote:

> In other words, you plan to take the log(pdf) of each term in the model
> vectors, sum them and exponentiate the result? It would be interesting to
> compare the results.

RE: Incorrect calculation of pdf

Posted by Jeff Eastman <je...@Narus.com>.
In other words, you plan to take the log(pdf) of each term in the model vectors, sum them and exponentiate the result? It would be interesting to compare the results.

-----Original Message-----
From: Vasil Vasilev [mailto:vavasilev@gmail.com] 
Sent: Tuesday, June 28, 2011 6:02 AM
To: user@mahout.apache.org
Subject: Re: Incorrect calculation of pdf

In fact my idea was very simple, although I do not know if it will work OK:
Do all calculations on logarithmic level and just before return -
exponentiate the result. This will not change the function's expected result

Re: Incorrect calculation of pdf

Posted by Ted Dunning <te...@gmail.com>.
That is a fine idea and often works well, especially where you are
multiplying or just comparing probabilities.  You need a different method
there, of course, to get the log probability.

Where you are adding probabilities, it doesn't work quite so simply.  Even
there, though, the correct method is to get all the log probabilities,
subtract the maximum value so that the largest log probability is 0, and then
exponentiate and add only those terms whose log is large enough to matter
(anything smaller than -60 after offsetting is effectively 0).  Then you take
the log of that sum and add back in the max log prob; take the exponent only
if you need the probability itself.

So you are correct that it is important to have the log pdf available as a
call.

On Tue, Jun 28, 2011 at 6:01 AM, Vasil Vasilev <va...@gmail.com> wrote:

> In fact my idea was very simple, although I do not know if it will work OK:
> Do all calculations on logarithmic level and just before return -
> exponentiate the result. This will not change the function's expected
> result

Re: Incorrect calculation of pdf

Posted by Vasil Vasilev <va...@gmail.com>.
In fact my idea was very simple, although I do not know if it will work OK:
do all calculations on a logarithmic scale and exponentiate the result just
before returning. This will not change the function's expected result.
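
A minimal sketch of this idea, in Python rather than Mahout's Java, assuming a Gaussian with independent dimensions and one shared standard deviation. The names `log_pdf` and `pdf_via_logs` are hypothetical and do not come from Mahout. It suggests that the result matches the ordinary density in low dimensions, but that the final exponentiation can still return 0.0 at around 50000 dimensions.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_pdf(point, mean, std):
    """Sum of per-dimension normal log densities (shared std, assumed)."""
    total = 0.0
    for x, m in zip(point, mean):
        z = (x - m) / std
        total += -0.5 * z * z - math.log(std) - LOG_SQRT_2PI
    return total

def pdf_via_logs(point, mean, std):
    """Work in logs and exponentiate only just before returning."""
    return math.exp(log_pdf(point, mean, std))

small = [0.0] * 3
big = [0.0] * 50000
print(pdf_via_logs(small, small, 1.0))  # an ordinary positive density
print(log_pdf(big, big, 1.0))           # finite, large and negative
print(pdf_via_logs(big, big, 1.0))      # 0.0: exp underflows below roughly -745
```

In this sketch the log value stays usable at 50000 dimensions even though the exponentiated value does not, which is the case for exposing the log density to callers directly.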

On Mon, Jun 27, 2011 at 9:03 PM, Ted Dunning <te...@gmail.com> wrote:

> Actually, pdf() should always be a pdf(), not a logPdf().  Many algorithms
> want one or the other.  Some don't much care because log is monotonic.  But
> we should do what the name implies.

Re: Incorrect calculation of pdf

Posted by Ted Dunning <te...@gmail.com>.
Actually, pdf() should always be a pdf(), not a logPdf().  Many algorithms
want one or the other.  Some don't much care because log is monotonic.  But
we should do what the name implies.
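
The monotonicity point can be illustrated with a short sketch. It is written in Python for brevity; `most_likely_cluster` and the sample values are hypothetical and are not taken from Mahout. Picking the best cluster by log density gives the same answer as picking it by density, as long as the densities are representable at all.

```python
import math

def most_likely_cluster(scores):
    """Index of the largest score; ties go to the lowest index."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Low dimension: pdf and log pdf pick the same cluster.
low_log = [-2.0, -1.0, -3.0]
low_pdf = [math.exp(v) for v in low_log]
print(most_likely_cluster(low_log), most_likely_cluster(low_pdf))  # 1 1

# High dimension (hypothetical values): every pdf underflows to 0.0.
high_log = [-45950.0, -45947.0, -46100.0]
high_pdf = [math.exp(v) for v in high_log]
print(high_pdf)                       # [0.0, 0.0, 0.0]
print(most_likely_cluster(high_pdf))  # 0, only because of the tie
print(most_likely_cluster(high_log))  # 1, the informative answer
```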

On Mon, Jun 27, 2011 at 10:15 AM, Jeff Eastman <je...@narus.com> wrote:

> A better approach would be to create a new Model and ModelDistribution that
> uses log arithmetic of your choosing. The initial models are very simple
> minded and are likely not adequate for real applications.

RE: Incorrect calculation of pdf

Posted by Jeff Eastman <je...@Narus.com>.
A better approach would be to create a new Model and ModelDistribution that use log arithmetic of your choosing. The initial models are very simple-minded and are likely not adequate for real applications.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, June 27, 2011 7:51 AM
To: user@mahout.apache.org
Subject: Re: Incorrect calculation of pdf

There should not be a change to an existing method.

It would be fine to add another method, perhaps called logPdf, that does
what you suggest.  This loss of precision is common with the normal
distribution in high dimensions.

Re: Incorrect calculation of pdf

Posted by Ted Dunning <te...@gmail.com>.
There should not be a change to an existing method.

It would be fine to add another method, perhaps called logPdf, that does
what you suggest.  This loss of precision is common with the normal
distribution in high dimensions.
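
A sketch of what that split might look like, written in Python for brevity. The class `DiagonalGaussian` is a hypothetical stand-in with an assumed shared standard deviation; it is not Mahout's GaussianCluster and makes no claim about how that class is implemented. The existing-style `pdf` is left as a plain product, and `log_pdf` is added alongside it.

```python
import math

class DiagonalGaussian:
    """Hypothetical cluster model with independent dimensions."""

    def __init__(self, mean, std):
        self.mean = list(mean)
        self.std = float(std)

    def pdf(self, point):
        """Unchanged behaviour: product of per-dimension densities."""
        norm = self.std * math.sqrt(2.0 * math.pi)
        result = 1.0
        for x, m in zip(point, self.mean):
            z = (x - m) / self.std
            result *= math.exp(-0.5 * z * z) / norm
        return result

    def log_pdf(self, point):
        """Added method: sums log densities and never exponentiates."""
        log_norm = math.log(self.std) + 0.5 * math.log(2.0 * math.pi)
        total = 0.0
        for x, m in zip(point, self.mean):
            z = (x - m) / self.std
            total += -0.5 * z * z - log_norm
        return total


center = [0.0] * 50000
model = DiagonalGaussian(center, 1.0)
print(model.pdf(center))      # 0.0: the product underflows
print(model.log_pdf(center))  # finite, large and negative
```

Callers that need a true density keep using `pdf`, while callers that only multiply or compare values can switch to `log_pdf` without the meaning of either name changing.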

On Mon, Jun 27, 2011 at 1:49 AM, Vasil Vasilev <va...@gmail.com> wrote:

> Hi,
>
> Recently I wanted to use Dirichlet clustering algorithm to cluster vectors
> directly taken out of vectorized text, whose dimensionality was around
> 50000. In this situation the algorithm fails to calculate the pdf of a
> vector corresponding to cluster center due to problems with numerical
> precision during multiplication.
>
> In this regard, what do you think of modifying the GaussianCluster.pdf()
> method in such way that it works with logarithmic probabilities?
>
> Regards, Vasil
>