Posted to dev@mahout.apache.org by "Jeff Eastman (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2011/12/22 19:45:30 UTC

[jira] [Issue Comment Edited] (MAHOUT-846) Improve Scalability of Gaussian Cluster For Wide Vectors

    [ https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174953#comment-13174953 ] 

Jeff Eastman edited comment on MAHOUT-846 at 12/22/11 6:43 PM:
---------------------------------------------------------------

The following code refactors pdf() to sum the exponents and perform the exponentiation only once. On DisplayDirichlet it produces results equivalent to the old version, and the math seems solid. I introduced two helper methods to make the computation clearer. I believe it will handle wide vectors better than the original, but I have yet to verify that.

{code}
  @Override
  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    Vector m = getCenter();
    // add a small prior to avoid divide by zero
    Vector s = getRadius().plus(0.0000001);
    return Math.exp(-(divideSquareAndSum(x.minus(m), s) / 2))
        / zProd(s.times(UncommonDistributions.SQRT2PI));
  }
  
  private double zProd(Vector s) {
    double prod = 1;
    for (int i = 0; i < s.size(); i++) {
      prod *= s.getQuick(i);
    }
    return prod;
  }
  
  private double divideSquareAndSum(Vector numerator, Vector denominator) {
    double result = 0;
    for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
      Element denom = it.next();
      double quotient = numerator.getQuick(denom.index()) / denom.get();
      result += quotient * quotient;
    }
    return result;
  }
{code}
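
One caveat, offered as a sketch rather than a verified fix: with tens of thousands of dimensions, both Math.exp(...) in the numerator and zProd(...) in the denominator can still leave the range of a double (for unit radii, the product of 41,807 factors of sqrt(2*pi) overflows). A common workaround is to stay in the log domain, computing log pdf = -0.5 * sum(((x_i - m_i) / s_i)^2) - sum(log(s_i * sqrt(2*pi))), and to exponentiate only if a caller really needs the density. The class below illustrates this with plain double[] arrays instead of Mahout's Vector. The class name, method name, and array-based signature are assumptions made for illustration and are not part of GaussianCluster.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: log density of a diagonal Gaussian.
final class DiagonalGaussianLogPdf {
  private static final double LOG_SQRT_2PI = 0.5 * Math.log(2.0 * Math.PI);
  // Same small prior as in the snippet above, to avoid dividing by zero.
  private static final double PRIOR = 1.0e-7;

  private DiagonalGaussianLogPdf() {}

  /** Returns log pdf(x) for independent normals with the given means and standard deviations. */
  static double logPdf(double[] x, double[] mean, double[] stdDev) {
    if (x.length != mean.length || x.length != stdDev.length) {
      throw new IllegalArgumentException("x, mean and stdDev must have the same length");
    }
    double sumOfSquares = 0.0; // sum of ((x_i - m_i) / s_i)^2
    double logNormalizer = 0.0; // sum of log(s_i * sqrt(2*pi))
    for (int i = 0; i < x.length; i++) {
      double s = stdDev[i] + PRIOR;
      double quotient = (x[i] - mean[i]) / s;
      sumOfSquares += quotient * quotient;
      logNormalizer += Math.log(s) + LOG_SQRT_2PI;
    }
    return -0.5 * sumOfSquares - logNormalizer;
  }

  public static void main(String[] args) {
    int width = 41807; // width quoted in the issue description
    double[] x = new double[width];
    double[] mean = new double[width];
    double[] stdDev = new double[width];
    Arrays.fill(stdDev, 1.0);
    double logP = logPdf(x, mean, stdDev);
    // The log value stays finite even though the density itself is not representable.
    System.out.println("log pdf      = " + logP);
    System.out.println("exp(log pdf) = " + Math.exp(logP));
  }
}
```

Code that only compares or normalizes densities could work with these log values directly. Whether that fits the Dirichlet clustering code has not been checked here.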
                
      was (Author: jeastman):
    The following code refactors the pdf() function to sum the exponents and perform the exponentiation only once. It produces equivalent results to the old version for DisplayDirichlet and the math seems solid. I introduced two helper methods to make it all clearer. I believe it will handle wide vectors better than the original but have yet to verify that.

{code}
  public double pdf(VectorWritable vw) {
    Vector x = vw.get();
    Vector m = getCenter();
    Vector s = getRadius().plus(0.0000001); // add a small prior to avoid divide by zero
    Vector xms2 = divideAndSquare(x.minus(m), s);
    return Math.exp(-(xms2.zSum() / 2))
        / zProd(s.times(UncommonDistributions.SQRT2PI));
  }
  
  private double zProd(Vector s) {
    double prod = 1;
    for (int i = 0; i < s.size(); i++) {
      prod *= s.getQuick(i);
    }
    return prod;
  }
  
  private Vector divideAndSquare(Vector numerator, Vector denominator) {
    Vector result = numerator.like();
    for (Iterator<Element> it = denominator.iterateNonZero(); it.hasNext();) {
      Element denom = it.next();
      double quotient = numerator.getQuick(denom.index()) / denom.get();
      result.setQuick(denom.index(), quotient * quotient);
    }
    return result;
  }
{code}
                  
> Improve Scalability of Gaussian Cluster For Wide Vectors
> --------------------------------------------------------
>
>                 Key: MAHOUT-846
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-846
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.6
>
>
> The pdf() implementation in GaussianCluster is pretty lame. It computes a running product of the per-element pdfs which, for wide input vectors (Reuters vectors are 41,807 wide), always underflows and returns 0. Here's the code:
> {noformat}
>   public double pdf(VectorWritable vw) {
>     Vector x = vw.get();
>     // return the product of the component pdfs
>     // TODO: is this reasonable? correct? It seems to work in some cases.
>     double pdf = 1;
>     for (int i = 0; i < x.size(); i++) {
>       // small prior on stdDev to avoid numeric instability when stdDev==0
>       pdf *= UncommonDistributions.dNorm(x.getQuick(i),
>           getCenter().getQuick(i), getRadius().getQuick(i) + 0.000001);
>     }
>     return pdf;
>   }
> {noformat}
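
The underflow is easy to reproduce outside Mahout. The sketch below multiplies per-element normal densities in the same way, using plain double[] arrays. The class and its dNorm helper are stand-ins written for this illustration, not Mahout's UncommonDistributions.dNorm. With unit standard deviations each factor is about 0.399, so the product drops below the smallest positive double after roughly 800 elements.

```java
import java.util.Arrays;

// Hypothetical stand-alone reproduction, not Mahout code.
final class RunningProductUnderflow {
  private RunningProductUnderflow() {}

  /** Density of a normal distribution with the given mean and standard deviation at x. */
  static double dNorm(double x, double mean, double stdDev) {
    double z = (x - mean) / stdDev;
    return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2.0 * Math.PI));
  }

  /** Running product of the component densities, as in the original pdf(). */
  static double productPdf(double[] x, double[] mean, double[] stdDev) {
    double pdf = 1.0;
    for (int i = 0; i < x.length; i++) {
      // small prior on stdDev, mirroring the original code
      pdf *= dNorm(x[i], mean[i], stdDev[i] + 0.000001);
    }
    return pdf;
  }

  public static void main(String[] args) {
    int width = 41807; // width quoted in the issue description
    double[] x = new double[width];
    double[] mean = new double[width];
    double[] stdDev = new double[width];
    Arrays.fill(stdDev, 1.0);
    // Even with every element exactly at the center, the product collapses to zero.
    System.out.println("pdf = " + productPdf(x, mean, stdDev));
  }
}
```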
