Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/22 20:02:29 UTC

CDbw and Evaluator results

I'm using mahout 0.6 and so may not be seeing the same results as you.

I take it that the inter-cluster distance of 0 is a bug and pruning 
should not happen very often?

I haven't used this before so I'm not sure if my CDbw or Evaluator 
results are wrong in other ways.

Should I create a bug for this in Jira?

On 5/17/12 2:33 PM, Jeff Eastman wrote:
> Hi Pat,
>
> I don't have a good answer here. Evidently, something in CDbw has 
> become broken and you are the first to notice. When I run 
> TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly 
> incorrect. The values for Canopy, MeanShift and Dirichlet are not so 
> obviously incorrect but I remain suspicious. Something must have 
> become broken in the recent clustering refactoring.
>
> From the method CDbwEvaluator.invalidCluster comment (used to enable 
> pruning):
>    * Return if the cluster is valid. Valid clusters must have more 
> than 2 representative points,
>    * and at least one of them must be different than the cluster 
> center. This is because the
>    * representative points extraction will duplicate the cluster 
> center if it is empty.
>
> Oddly enough, inspection of the test log indicates that only k-means 
> and fuzzy-k are not pruning clusters. Clearly some more investigation 
> is needed. I will take a look at it tomorrow. In the mean time if you 
> develop any additional insight please do share it with us.
>
> Thanks,
> Jeff
>
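As a rough, standalone illustration of the validity rule quoted above (plain double[] arrays and a Euclidean check, not the actual CDbwEvaluator code):

    import java.util.List;

    // Sketch of the pruning rule: a cluster is kept only if it has more than
    // 2 representative points and at least one of them differs from the
    // cluster center. The real evaluator works on Mahout Vector/Cluster
    // types; plain arrays are used here just to make the rule concrete.
    public class ValidClusterSketch {

      static boolean isValid(List<double[]> representativePoints, double[] center) {
        if (representativePoints.size() <= 2) {
          return false;                  // too few representative points
        }
        for (double[] p : representativePoints) {
          if (!closeTo(p, center, 1e-9)) {
            return true;                 // found a point distinct from the center
          }
        }
        return false;                    // every point duplicates the center
      }

      static boolean closeTo(double[] a, double[] b, double eps) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum) < eps;
      }
    }

Read this way, a cluster whose representative points all collapse onto its center (which the comment says happens when the cluster is empty) is the one that gets pruned.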
> On 5/17/12 3:53 PM, Pat Ferrel wrote:
>> I built a tool that iterates through a list of values for k on the 
>> same data and spits out the CDbw and ClusterEvaluator results each time.
>>
>> When the evaluator or CDbw prunes a cluster, how do I interpret that? 
>> They seem to throw out the same clusters on a given run. Also CDbw 
>> always returns an inter-cluster density of 0?
>>
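For anyone wanting to reproduce that kind of sweep, here is a minimal sketch of the idea; runClusteringAndEvaluate is a hypothetical placeholder for whatever drives k-means and the evaluators in your own setup, not a Mahout API:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of a k sweep: run clustering plus evaluation for each candidate
    // k and collect the scores so they can be compared side by side.
    public class KSweepSketch {

      // Hypothetical helper, not part of Mahout: cluster the data with the
      // given k and return the evaluator scores (e.g. intra- and
      // inter-cluster values).
      static Map<String, Double> runClusteringAndEvaluate(int k) {
        throw new UnsupportedOperationException("wire this to your own driver");
      }

      public static void main(String[] args) {
        int[] candidateKs = {10, 20, 40, 80, 160};
        Map<Integer, Map<String, Double>> results =
            new LinkedHashMap<Integer, Map<String, Double>>();
        for (int k : candidateKs) {
          results.put(k, runClusteringAndEvaluate(k));
        }
        // One row per k; look for the point where the scores stop improving.
        for (Map.Entry<Integer, Map<String, Double>> e : results.entrySet()) {
          System.out.println("k=" + e.getKey() + " -> " + e.getValue());
        }
      }
    }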
>> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>>> Yes, that is the paper I used to implement CDbw. I've tried it a few 
>>> times along with the simpler ClusterEvaluator metrics I took from 
>>> Mahout In Action and they look to be reasonable - see the tests - 
>>> though I have no way to judge their absolute values. Anything you 
>>> can contribute in this area would be most welcome. Perhaps a wiki page?
>>>
>>>
>>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>>> The paper referenced in the code is:
>>>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>>
>>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>>> Thanks, I've been looking at that. Is there a description of how 
>>>>> to interpret those values? An academic paper maybe? The 
>>>>> intra-cluster distance intuitively seems to correspond to 
>>>>> something like cohesion. I don't get the intuition behind 
>>>>> inter-cluster distances but Ted thinks they are the most important.
>>>>>
>>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute 
>>>>>> some quality metrics (inter-cluster distance, 
>>>>>> intra-cluster distance, ...) that you may find useful. Both 
>>>>>> calculate a set of representative points from the clustering 
>>>>>> output and compute the (n^2) metrics over these points rather 
>>>>>> than all of the points in each cluster.
>>>>>>
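To make those two numbers concrete, here is a small self-contained illustration of cohesion and separation computed over representative points: the mean pairwise distance inside one cluster, and the minimum distance between points of two different clusters. This is only a toy version of the idea, not the ClusterEvaluator or CDbwEvaluator code:

    // Toy illustration of intra- and inter-cluster distance over
    // representative points; not the actual Mahout evaluator code.
    public class ClusterMetricsSketch {

      static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }

      // Mean pairwise distance among one cluster's representative points
      // (cohesion: smaller is tighter).
      static double intraClusterDistance(double[][] reps) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < reps.length; i++) {
          for (int j = i + 1; j < reps.length; j++) {
            sum += euclidean(reps[i], reps[j]);
            pairs++;
          }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
      }

      // Minimum distance between representative points of two clusters
      // (separation: larger is better separated).
      static double interClusterDistance(double[][] repsA, double[][] repsB) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] a : repsA) {
          for (double[] b : repsB) {
            min = Math.min(min, euclidean(a, b));
          }
        }
        return min;
      }

      public static void main(String[] args) {
        double[][] c1 = {{0, 0}, {0, 1}, {1, 0}};
        double[][] c2 = {{5, 5}, {5, 6}, {6, 5}};
        System.out.println("intra(c1) = " + intraClusterDistance(c1));
        System.out.println("inter(c1, c2) = " + interClusterDistance(c1, c2));
      }
    }

On this reading, lower intra-cluster and higher inter-cluster values indicate tighter, better separated clusters, which is why a value that is always exactly 0, as reported earlier in the thread, looks suspicious.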
>>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>>> So many questions (what the best k is, how to choose t1 and t2, 
>>>>>>> how much dimensionality reduction helps) would have clear answers 
>>>>>>> if we had a way to judge the quality of clusters.
>>>>>>>
>>>>>>> Various methods were discussed here for a time: 
>>>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>>>
>>>>>>> Has there been any work on building a measure of quality?
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: CDbw and Evaluator results

Posted by Pat Ferrel <pa...@farfetchers.com>.
I tried this a few days ago, but I use bixo to generate seqfiles and the 
trunk puts me in dependency hell (some 4j version incompatibilities). I 
couldn't use kmeans on the old seqfiles either; it looks like the 
formats have changed?

Happy to write up a JIRA report though.
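On the format question: the kmeans job in this era reads a SequenceFile of Mahout VectorWritable values (typically with Text keys), so one way to sanity-check the bixo output is to write a few vectors by hand and cluster those. A rough sketch, with class names from the Hadoop/Mahout APIs as I remember them (worth double-checking against the version you actually build):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Sketch: write sparse vectors in the SequenceFile<Text, VectorWritable>
    // layout that kmeans reads. Paths and dimensions are made up for the
    // example.
    public class WriteVectorsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("kmeans-input/part-00000");

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
          Vector v = new RandomAccessSparseVector(1000);   // vector cardinality
          v.setQuick(3, 1.0);
          v.setQuick(42, 0.5);
          writer.append(new Text("doc-1"), new VectorWritable(v));
        } finally {
          writer.close();
        }
      }
    }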

On 5/23/12 6:27 AM, Jeff Eastman wrote:
> Can you try this again using trunk? If there is no improvement I think 
> a JIRA to investigate would be useful.

Re: CDbw and Evaluator results

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Can you try this again using trunk? If there is no improvement I think a 
JIRA to investigate would be useful.

On 5/22/12 2:02 PM, Pat Ferrel wrote:
> I'm using mahout 0.6 and so may not be seeing the same results as you.
>
> I take it that the inter-cluster distance of 0 is a bug and pruning 
> should not happen very often?
>
> I haven't used this before so I'm not sure if my CDbw or Evaluator 
> results are wrong in other ways.
>
> Should I create a bug for this in Jira?