Posted to user@mahout.apache.org by yoshihiro fujimoto <yo...@gmail.com> on 2012/12/25 05:57:44 UTC

About Dirichlet clustering's threshold

Hi all,


https://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html

According to this page, a threshold can be specified for the Dirichlet driver.
The page explains that a threshold of 0 will emit all clusters with their
associated probabilities for each vector.
So I ran Dirichlet clustering with a threshold of 0.
However, the clusteredPoints/part-m-00000 sequence file is empty (its length
is 120 bytes).

In the Dirichlet process, can the result be empty when the threshold is 0?

Thanks,

Yoshihiro

Re: About Dirichlet clustering's threshold

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It could be a contradiction indeed. I wonder if you can help us to 
characterize it further, perhaps by reading the code or by running your 
data in sequential debug mode? Without a little more information it is 
difficult to get to the root of your problem.




Re: About Dirichlet clustering's threshold

Posted by yoshihiro fujimoto <yo...@gmail.com>.
Hi Jeff.

> Did you turn off most-likely classification?

Yes, I set the most-likely option to false.
In general, a pdf's range is between 0 and 1.
So, if the pdf threshold is set to 0, all points should be classified to all
of the clusters.
In fact, the sequence file is empty.

This seems like a contradiction to me.
I may be wrong, but is this a bug?

Thanks,
Yoshihiro.




Re: About Dirichlet clustering's threshold

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Here's a response to a similar question from a couple of months ago:

The classification phase of Dirichlet uses a most-likely assignment of 
points to clusters by default. This means that, unlike the training 
phase where points are assigned statistically to likely clusters, the 
classification may result in empty clusters even though those clusters 
have nonzero counts in the final iteration. You can disable most-likely 
assignment and set a pdf threshold - check the documentation - and 
points will be classified to all of the clusters that have pdf greater 
than the threshold.

Does this help? Did you turn off most-likely classification?
Jeff
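
To make the two assignment modes described above concrete, here is a minimal standalone sketch in Java. It is not Mahout's classifier code: the class name, the method names, and the sample pdf values are all hypothetical, and the strict "greater than" comparison is an assumption taken only from the wording "pdf greater than the threshold".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical sketch, not Mahout's classifier code. */
public class ThresholdClassificationSketch {

    /** Most-likely assignment: index of the single cluster with the highest pdf. */
    static int mostLikely(double[] pdfs) {
        int best = 0;
        for (int i = 1; i < pdfs.length; i++) {
            if (pdfs[i] > pdfs[best]) {
                best = i;
            }
        }
        return best;
    }

    /**
     * Threshold assignment: every cluster whose pdf is strictly greater than
     * the threshold (strictness is an assumption, see the text above).
     */
    static List<Integer> aboveThreshold(double[] pdfs, double threshold) {
        List<Integer> clusters = new ArrayList<>();
        for (int i = 0; i < pdfs.length; i++) {
            if (pdfs[i] > threshold) {
                clusters.add(i);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Hypothetical pdfs of one point under three clusters.
        double[] pdfs = {0.7, 0.3, 0.0};
        System.out.println(mostLikely(pdfs));          // prints 0
        System.out.println(aboveThreshold(pdfs, 0.0)); // prints [0, 1]
    }
}
```

Under this reading, a threshold of 0 would drop only clusters whose pdf is exactly 0, so each point would still be emitted for every cluster with a nonzero pdf. A completely empty output is therefore not explained by the threshold rule alone, assuming the comparison really works this way.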

