Posted to user@spark.apache.org by Simon Dirmeier <si...@web.de> on 2018/05/24 07:19:43 UTC

Positive log-likelihood with Gaussian mixture

Dear all,

I am fitting a very trivial GMM with 2-10 components on 100 samples and 
5 features in pyspark, and some of the fitted log-likelihoods come out 
positive (see below). I don't understand how this is possible. Is this a 
bug or intended behaviour? Furthermore, for different seeds, the 
log-likelihoods sometimes even change sign. Is this due to EM only 
converging to a local maximum?

Cheers and thanks for your help,

Simon
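
For reference, df below is a Spark DataFrame with a 5-dimensional "features" 
column; since the actual data cannot be shared, a random stand-in of the same 
shape would look like this (just a sketch, assuming an active SparkSession 
named spark):

```
import numpy as np
from pyspark.ml.linalg import Vectors

# stand-in for the real data: 100 samples with 5 features each,
# wrapped into a DataFrame with a "features" vector column
np.random.seed(23)
X = np.random.normal(size=(100, 5))
df = spark.createDataFrame([(Vectors.dense(row),) for row in X], ["features"])
```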


```
from pyspark.ml.clustering import GaussianMixture

# log-likelihoods for k = 2..10 components, seed 23
for i in range(2, 10 + 1):
    km = GaussianMixture(tol=0.00001, maxIter=1000, k=i, seed=23)
    model = km.fit(df)
    print(i, model.summary.logLikelihood)

2 -197.37852947736653
3 -129.9873268616941
4 252.856072127079
5 58.104854133211305
6 102.05184634221902
7 -438.69872950609897
8 -521.9157414809579
9 684.7223627089136
10 -596.7165760632951

# the same with seed 5
for i in range(2, 10 + 1):
    km = GaussianMixture(tol=0.00001, maxIter=1000, k=i, seed=5)
    model = km.fit(df)
    print(i, model.summary.logLikelihood)

2 -237.6569055995205
3 193.6716647064348
4 222.8175404052819
5 201.28821925102105
6 74.02720327261291
7 -540.8607659051879
8 144.837051544231
9 -507.48261722455305
10 -689.1844483249996
```


Re: Positive log-likelihood with Gaussian mixture

Posted by Simon Dirmeier <si...@web.de>.
I see, thanks for clearing that up. I was aware of this for uniform 
distributions, but not for normal ones.
So that would mean some of the components have such a small variance 
that the log-likelihood ends up positive?
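
Something like this should show it, if model is one of the fitted 
GaussianMixtureModel objects from the earlier snippet (just a quick sketch):

```
# inspect the fitted mixture: weights, means and covariances per component;
# tiny diagonal entries in "cov" push the density above 1 near that mean,
# which is what makes the total log-likelihood positive
print(model.weights)
for row in model.gaussiansDF.collect():
    print(row["mean"], row["cov"])
```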

Cheers,
Simon

On 30.05.18 at 11:22, robin.east@xense.co.uk wrote:
> Positive log-likelihoods for continuous distributions are not unusual: 
> you are evaluating a pdf, not a probability. For example, a univariate 
> Gaussian pdf exceeds 1 at the mean once the standard deviation drops 
> below 1/sqrt(2*pi), about 0.4 (i.e. the variance drops below roughly 
> 0.16), at which point the log pdf is positive.
>
> On Tue, 29 May 2018 at 12:08 Simon Dirmeier <si...@web.de> wrote:
>
>     Hey,
>
>     sorry for the late reply. I cannot share the data, but the problem
>     can be reproduced easily, as below.
>     I wanted to check against sklearn and observed a similar behaviour,
>     i.e. a positive per-sample average log-likelihood
>     (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.score).
>
>     I don't think it is necessarily an issue with the implementation,
>     but maybe it is due to parameter identifiability or something similar?
>     As far as I can tell, the variances seem to be ok.
>
>     Thanks for looking into this.
>
>     Best,
>     Simon
>     import scipy
>     import sklearn.mixture
>     from scipy.stats import multivariate_normal
>     from sklearn.mixture import GaussianMixture
>
>     scipy.random.seed(23)
>     X = multivariate_normal.rvs(mean=scipy.ones(10), size=100)
>
>     dff = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), X)
>     df = spark.createDataFrame(dff, schema=["label", "features"])
>
>     for i in [100, 90, 80, 70, 60, 50]:
>         km = pyspark.ml.clustering.GaussianMixture(k=10, seed=23).fit(df.limit(i))
>         sk_gmm = sklearn.mixture.GaussianMixture(10, random_state=23).fit(X[:i, :])
>         print(df.limit(i).count(), X[:i, :].shape[0], km.summary.logLikelihood, sk_gmm.score(X[:i, :]))
>
>     100 100 368.37475644171036 -1.54949312502
>     90 90 1026.084529101155 1.16196607062
>     80 80 2245.427539835042 4.25769131857
>     70 70 1940.0122633489268 10.0949992881
>     60 60 2255.002313247103 14.0497823725
>     50 50 -140.82605873444814 21.2423016046
>
>


Re: Positive log-likelihood with Gaussian mixture

Posted by ro...@xense.co.uk.
Positive log-likelihoods for continuous distributions are not unusual: you are evaluating a pdf, not a probability. For example, a univariate Gaussian pdf exceeds 1 at the mean once the standard deviation drops below 1/sqrt(2*pi), about 0.4 (i.e. the variance drops below roughly 0.16), at which point the log pdf is positive.
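
A quick numerical check (just a sketch with scipy.stats.norm; any normal pdf implementation will do):

```
from scipy.stats import norm

# log-density of N(0, sigma^2) at its mean is -log(sigma * sqrt(2 * pi));
# it turns positive once sigma drops below 1 / sqrt(2 * pi) ~ 0.399
for sigma in [1.0, 0.5, 0.4, 0.3, 0.1]:
    print(sigma, norm.logpdf(0.0, loc=0.0, scale=sigma))
```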

On Tue, 29 May 2018 at 12:08 Simon Dirmeier <si...@web.de> wrote:

> 
> 
> Hey,
> 
> sorry for the late reply. I cannot share the data, but the problem can be
> reproduced easily, as below.
> I wanted to check against sklearn and observed a similar behaviour, i.e. a
> positive per-sample average log-likelihood
> ( http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.score ).
> 
> I don't think it is necessarily an issue with the implementation, but maybe
> it is due to parameter identifiability or something similar?
> As far as I can tell, the variances seem to be ok.
> 
> Thanks for looking into this.
> 
> Best,
> Simon
> 
> 
> 
> import scipy
> import sklearn.mixture
> from scipy.stats import multivariate_normal
> from sklearn.mixture import GaussianMixture
> 
> scipy.random.seed(23)
> X = multivariate_normal.rvs(mean=scipy.ones(10), size=100)
> 
> dff = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), X)
> df = spark.createDataFrame(dff, schema=["label", "features"])
> 
> for i in [100, 90, 80, 70, 60, 50]:
>     km = pyspark.ml.clustering.GaussianMixture(k=10, seed=23).fit(df.limit(i))
>     sk_gmm = sklearn.mixture.GaussianMixture(10, random_state=23).fit(X[:i, :])
>     print(df.limit(i).count(), X[:i, :].shape[0], km.summary.logLikelihood, sk_gmm.score(X[:i, :]))
> 
> 100 100 368.37475644171036 -1.54949312502
> 90 90 1026.084529101155 1.16196607062
> 80 80 2245.427539835042 4.25769131857
> 70 70 1940.0122633489268 10.0949992881
> 60 60 2255.002313247103 14.0497823725
> 50 50 -140.82605873444814 21.2423016046
>

Re: Positive log-likelihood with Gaussian mixture

Posted by Simon Dirmeier <si...@web.de>.
Hey,

sorry for the late reply. I cannot share the data, but the problem can be 
reproduced easily, as below.
I wanted to check against sklearn and observed a similar behaviour, i.e. a 
positive per-sample average log-likelihood 
(http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.score).

I don't think it is necessarily an issue with the implementation, but 
maybe it is due to parameter identifiability or something similar?
As far as I can tell, the variances seem to be ok.

Thanks for looking into this.

Best,
Simon
```
import scipy
import sklearn.mixture
import pyspark.ml.clustering
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from pyspark.ml.linalg import Vectors  # needed for Vectors.dense below

# assumes an active SparkSession named `spark`
scipy.random.seed(23)
X = multivariate_normal.rvs(mean=scipy.ones(10), size=100)

# wrap the samples into a Spark DataFrame with a "features" vector column
dff = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), X)
df = spark.createDataFrame(dff, schema=["label", "features"])

for i in [100, 90, 80, 70, 60, 50]:
    km = pyspark.ml.clustering.GaussianMixture(k=10, seed=23).fit(df.limit(i))
    sk_gmm = sklearn.mixture.GaussianMixture(10, random_state=23).fit(X[:i, :])
    print(df.limit(i).count(), X[:i, :].shape[0], km.summary.logLikelihood, sk_gmm.score(X[:i, :]))

100 100 368.37475644171036 -1.54949312502
90 90 1026.084529101155 1.16196607062
80 80 2245.427539835042 4.25769131857
70 70 1940.0122633489268 10.0949992881
60 60 2255.002313247103 14.0497823725
50 50 -140.82605873444814 21.2423016046
```
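
Note that the two likelihood columns are not on the same scale: as far as I 
can tell, Spark's summary.logLikelihood is the total log-likelihood summed 
over all rows, while sklearn's score() returns the per-sample average. 
Dividing the Spark value by the row count puts them side by side, e.g. 
(a sketch, reusing the objects from the last loop iteration):

```
# per-row average for the Spark model vs. sklearn's per-sample score();
# km, sk_gmm and i are left over from the last iteration of the loop (i == 50)
n = df.limit(i).count()
print(km.summary.logLikelihood / n, sk_gmm.score(X[:i, :]))
```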