Posted to user@spark.apache.org by Yanbo Liang <yb...@gmail.com> on 2016/10/07 15:35:59 UTC

Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I had a similar requirement in my work. I'm currently
porting the implementation from mllib to ml and will then expose the maximum
log likelihood. I will send the PR soon.

Thanks.
Yanbo
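
Until that lands, the value can be recomputed from a fitted mllib model, since
the mixture weights and component distributions are public. A minimal sketch,
assuming model is an already-fitted GaussianMixtureModel and data is the
RDD[Vector] it was trained on:

    import org.apache.spark.mllib.clustering.GaussianMixtureModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Recompute the training-data log likelihood from the model's public
    // fields: log L = sum_i log( sum_j w_j * N(x_i | mu_j, sigma_j) ).
    def logLikelihood(model: GaussianMixtureModel, data: RDD[Vector]): Double =
      data.map { x =>
        val density = model.weights.zip(model.gaussians)
          .map { case (w, g) => w * g.pdf(x) } // weighted component density
          .sum
        math.log(density)
      }.sum()

This costs an extra pass over the data, so it is only a stopgap next to
exposing the value already computed during training.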

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wa...@didichuxing.com>
wrote:

>
> Hi,
>
> Do you guys sometimes need to get the log likelihood of the EM algorithm in
> MLLIB?
>
> I mean the value at this line:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>
> Now copying the code here:
>
>
> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>
> // Create new distributions based on the partial assignments
> // (often referred to as the "M" step in literature)
> val sumWeights = sums.weights.sum
>
> if (shouldDistributeGaussians) {
>   val numPartitions = math.min(k, 1024)
>   val tuples =
>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>   }.collect().unzip
>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
> } else {
>   var i = 0
>   while (i < k) {
>     val (weight, gaussian) =
>       updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>     weights(i) = weight
>     gaussians(i) = gaussian
>     i = i + 1
>   }
> }
>
> llhp = llh // current becomes previous
> llh = sums.logLikelihood // this is the freshly computed log-likelihood
> iter += 1
> compute.destroy(blocking = false)
>
> In my application, I need the log likelihood to compare model fit across
> different numbers of clusters, and then pick the cluster count with the
> maximum log likelihood.
>
> Is it a good idea to expose this value?
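
For reference, the quantity accumulated in sums.logLikelihood above is the
standard Gaussian-mixture objective, the log likelihood of the data under the
current parameters:

    \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)

where w_j, \mu_j, and \Sigma_j are the weight, mean, and covariance of
component j. EM never decreases this quantity between iterations, which is why
the surrounding loop (not shown here) compares llh against llhp to decide
convergence.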
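As a sketch of the selection loop described in the question (fit one mixture
per candidate cluster count, keep the best scorer), reusing the logLikelihood
helper sketched above; the candidate range and seed are illustrative:

    import org.apache.spark.mllib.clustering.GaussianMixture

    // Fit one mixture per candidate k and score each on the training data.
    val candidates = Seq(2, 3, 4, 5, 6)
    val scored = candidates.map { k =>
      val model = new GaussianMixture().setK(k).setSeed(42L).run(data)
      (k, logLikelihood(model, data))
    }
    val (bestK, bestLL) = scored.maxBy(_._2)
    println(s"best k = $bestK, log likelihood = $bestLL")

One caveat: the attainable log likelihood tends to grow with k, so taking the
raw maximum favors larger candidates; a penalized criterion such as BIC or AIC
is the usual safeguard.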

Re: Could we expose log likelihood of EM algorithm in MLLIB?

Posted by Yanbo Liang <yb...@gmail.com>.
Let's move the discussion to JIRA. Thanks!

On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) <wa...@didichuxing.com>
wrote:

> https://issues.apache.org/jira/browse/SPARK-17825
>
> Actually I had already created a JIRA. Could you let me know your progress,
> to avoid duplicated work?
>
> Thanks!
>
> From: didi <wa...@didichuxing.com>
> Date: Saturday, October 8, 2016, 12:21 AM
> To: Yanbo Liang <yb...@gmail.com>
>
> Cc: "dev@spark.apache.org" <de...@spark.apache.org>, "user@spark.apache.org"
> <us...@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> Thanks for replying.
> When could you send out the PR?
>
> From: Yanbo Liang <yb...@gmail.com>
> Date: Friday, October 7, 2016, 11:35 PM
> To: didi <wa...@didichuxing.com>
> Cc: "dev@spark.apache.org" <de...@spark.apache.org>, "user@spark.apache.org"
> <us...@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> It's a good question, and I had a similar requirement in my work. I'm
> currently porting the implementation from mllib to ml and will then expose
> the maximum log likelihood. I will send the PR soon.
>
> Thanks.
> Yanbo
>
> On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wa...@didichuxing.com>
> wrote:
>
>>
>> Hi,
>>
>> Do you guys sometimes need to get the log likelihood of the EM algorithm in
>> MLLIB?
>>
>> I mean the value at this line:
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>>
>> Now copying the code here:
>>
>>
>> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>>
>> // Create new distributions based on the partial assignments
>> // (often referred to as the "M" step in literature)
>> val sumWeights = sums.weights.sum
>>
>> if (shouldDistributeGaussians) {
>>   val numPartitions = math.min(k, 1024)
>>   val tuples =
>>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>>   }.collect().unzip
>>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
>> } else {
>>   var i = 0
>>   while (i < k) {
>>     val (weight, gaussian) =
>>       updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>>     weights(i) = weight
>>     gaussians(i) = gaussian
>>     i = i + 1
>>   }
>> }
>>
>> llhp = llh // current becomes previous
>> llh = sums.logLikelihood // this is the freshly computed log-likelihood
>> iter += 1
>> compute.destroy(blocking = false)
>>
>> In my application, I need the log likelihood to compare model fit across
>> different numbers of clusters, and then pick the cluster count with the
>> maximum log likelihood.
>>
>> Is it a good idea to expose this value?
>>
>>
>>
>>
>


Re: Could we expose log likelihood of EM algorithm in MLLIB?

Posted by "王磊 (安全部)" <wa...@didichuxing.com>.
https://issues.apache.org/jira/browse/SPARK-17825

Actually I had already created a JIRA. Could you let me know your progress, to avoid duplicated work?

Thanks!

From: didi <wa...@didichuxing.com>
Date: Saturday, October 8, 2016, 12:21 AM
To: Yanbo Liang <yb...@gmail.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

Thanks for replying.
When could you send out the PR?

From: Yanbo Liang <yb...@gmail.com>
Date: Friday, October 7, 2016, 11:35 PM
To: didi <wa...@didichuxing.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I had a similar requirement in my work. I'm currently porting the implementation from mllib to ml and will then expose the maximum log likelihood. I will send the PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wa...@didichuxing.com> wrote:

Hi,

Do you guys sometimes need to get the log likelihood of the EM algorithm in MLLIB?

I mean the value in this line https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Now copying the code here:


        val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

        // Create new distributions based on the partial assignments
        // (often referred to as the "M" step in literature)
        val sumWeights = sums.weights.sum

        if (shouldDistributeGaussians) {
          val numPartitions = math.min(k, 1024)
          val tuples =
            Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
          val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
            updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
          }.collect().unzip
          Array.copy(ws.toArray, 0, weights, 0, ws.length)
          Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
        } else {
          var i = 0
          while (i < k) {
            val (weight, gaussian) =
              updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
            weights(i) = weight
            gaussians(i) = gaussian
            i = i + 1
          }
        }

        llhp = llh // current becomes previous
        llh = sums.logLikelihood // this is the freshly computed log-likelihood
        iter += 1
        compute.destroy(blocking = false)
In my application, I need the log likelihood to compare model fit across different numbers of clusters, and then pick the cluster count with the maximum log likelihood.

Is it a good idea to expose this value?






Re: Could we expose log likelihood of EM algorithm in MLLIB?

Posted by "王磊 (安全部)" <wa...@didichuxing.com>.
Thanks for replying.
When could you send out the PR?

From: Yanbo Liang <yb...@gmail.com>
Date: Friday, October 7, 2016, 11:35 PM
To: didi <wa...@didichuxing.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I had a similar requirement in my work. I'm currently porting the implementation from mllib to ml and will then expose the maximum log likelihood. I will send the PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wa...@didichuxing.com> wrote:

Hi,

Do you guys sometimes need to get the log likelihood of the EM algorithm in MLLIB?

I mean the value in this line https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Now copying the code here:


        val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

        // Create new distributions based on the partial assignments
        // (often referred to as the "M" step in literature)
        val sumWeights = sums.weights.sum

        if (shouldDistributeGaussians) {
          val numPartitions = math.min(k, 1024)
          val tuples =
            Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
          val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
            updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
          }.collect().unzip
          Array.copy(ws.toArray, 0, weights, 0, ws.length)
          Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
        } else {
          var i = 0
          while (i < k) {
            val (weight, gaussian) =
              updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
            weights(i) = weight
            gaussians(i) = gaussian
            i = i + 1
          }
        }

        llhp = llh // current becomes previous
        llh = sums.logLikelihood // this is the freshly computed log-likelihood
        iter += 1
        compute.destroy(blocking = false)
In my application, I need the log likelihood to compare model fit across different numbers of clusters, and then pick the cluster count with the maximum log likelihood.

Is it a good idea to expose this value?




