Posted to issues@spark.apache.org by "Hao Ren (JIRA)" <ji...@apache.org> on 2016/11/25 14:04:59 UTC

[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

    [ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695942#comment-15695942 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:04 PM:
-----------------------------------------------------------

After reading the code comments, I see that the implementation takes the degenerate case of the multivariate normal distribution into account:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of a Gaussian should always be smaller than 1, shouldn't it?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I face is that my covariance matrix gives the following eigenvalue vector 'd':

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 0.19116928711151343, 0.19218984168511, 0.22044130291811304, 0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

Thus,
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

Then u = -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 6.230441702072326 (u is the variable name used in the code).
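
For reference, here is a minimal standalone sketch (plain Scala + Breeze, not the Spark code itself) that reproduces these two constants from the eigenvalue vector and tolerance quoted above; note that mu.size equals d.length = 25 here:

{code}
// Standalone sketch: `d` and `tol` are copied verbatim from this comment.
import breeze.linalg.DenseVector

val d = DenseVector(
  2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4,
  0.0030052504431952055, 0.006867041289040775, 0.030351586260721354,
  0.03499956314691966, 0.04128248388411499, 0.055530636656481766,
  0.09840067120993062, 0.13259027660865316, 0.16729084354080376,
  0.18807175387781094, 0.19009666915093745, 0.19065188805766764,
  0.19116928711151343, 0.19218984168511, 0.22044130291811304,
  0.23164643534046853, 0.32957890755845165, 0.4557354551695869,
  0.639320905646873, 0.8327082373125074, 1.7966679300383896,
  2.5790389754725234)
val tol = 1.8514678433708895E-13

// Same expressions as in calculateCovarianceConstants():
// only the first eigenvalue (2.77E-17) is filtered out by the tolerance.
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
// -58.40781006437829, as quoted above
val u = -0.5 * (d.length * math.log(2.0 * math.Pi) + logPseudoDetSigma)
// 6.230441702072326, the `u` discussed above
{code}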

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
    val delta = x - breezeMu
    val v = rootSigmaInv * delta
    u + v.t * v * -0.5 // u is used here
  }
{code}

If `v.t * v * -0.5` is a small negative number, then the logpdf will be about 6, so pdf = exp(6) = 403.4287934927351.
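
Spelled out numerically (the quadratic-form value below is just a hypothetical small v.t * v, not something computed from real data):

{code}
// Hypothetical point near the mean: v.t * v is assumed to be ~0.46 here,
// so logpdf stays close to u and the density ends up far above 1.
val u = 6.230441702072326        // constant computed above
val quadForm = 0.46              // assumed small value of v.t * v
val logpdf = u + quadForm * -0.5 // roughly 6.0
val pdf = math.exp(logpdf)       // roughly 403, a density far above 1
{code}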

In the Gaussian mixture model case, some of the Gaussian distributions can have a much bigger 'u' value, which results in a pdf around 2E10.




> MultivariateGaussian does not check if covariance matrix is invertible
> ----------------------------------------------------------------------
>
>                 Key: SPARK-18581
>                 URL: https://issues.apache.org/jira/browse/SPARK-18581
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.2, 2.0.2
>            Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by MultivariateGaussian.pdf can be on the order of 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive semi-definite matrix.
> In my case, one feature is 0 for every data point.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 0, so the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally makes logpdf very big, and the pdf even bigger (> 1).
> As said in the comments of MultivariateGaussian.scala,
> """
> Singular values are considered to be non-zero only if they exceed a tolerance based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance matrix is non-invertible, which contradicts the assumption that it should be invertible.
> So we should check whether any singular value is smaller than the tolerance before computing the pseudo-determinant.
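
For illustration only, here is a sketch of the kind of guard the description asks for, reusing the d and tol names from calculateCovarianceConstants(); this is not the actual Spark code:

{code}
// Illustrative sketch, not the actual Spark implementation: fail fast when an
// eigenvalue falls below the tolerance instead of silently dropping it from
// the pseudo-determinant, since a dropped eigenvalue means the covariance
// matrix is effectively singular.
if (d.activeValuesIterator.exists(_ <= tol)) {
  throw new IllegalArgumentException(
    "Covariance matrix is singular: an eigenvalue is below the non-zero tolerance.")
}
{code}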


