Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/31 19:30:24 UTC

Cluster hierarchy with RowSimilarityJob

I need to find similar clusters and get cluster-to-cluster 
distances for several reasons.

The most likely tool for this is RowSimilarityJob. I imagine it 
would take a list of vectors (clusterid, list of the centroid's 
termid->weights) and produce a list of vectors (clusterid, list of 
clusterid->distance).

The clusters file has Key class org.apache.hadoop.io.Text (named 
vectors) and Value class org.apache.mahout.clustering.kmeans.Cluster, 
and it does not work as input to the RowID job. Looking at the actual 
values in the file, I suspect the algorithm would work, but because the 
value class is Cluster, RowID dies asking for 
org.apache.mahout.math.VectorWritable.

What is the easiest way to get RowID and RowSimilarity to work in this 
case?

If I need to modify one of these, which do you recommend? Maybe a new 
job that takes the Clusters and outputs each "center" as an IntWritable 
key (clusterID) with a VectorWritable value (centroid from the Cluster 
class)?

Re: Cluster hierarchy with RowSimilarityJob

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Ah, but reading about top-down clustering I found 
ClusterOutputPostProcessorDriver. It looks like this will extract the 
centroid vectors. Maybe all I need is top-down clustering, and I can 
calculate the distances with CosineDistanceMeasure directly. The 
sub-clusters are never huge in number, so this should never require a 
MapReduce implementation.
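
To make the in-memory idea concrete, here is a minimal sketch in plain 
Java of pairwise cosine distances between centroids. It does not use 
Mahout classes: the class name CentroidDistances, the method names, and 
the choice to represent each centroid as a sparse termid->weight map are 
all assumptions made for illustration. The handling of all-zero vectors 
is a convention picked for this sketch and is not claimed to match 
CosineDistanceMeasure.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: pairwise cosine distances
// between a small number of centroids held in memory.
public class CentroidDistances {

    // Euclidean length of a sparse termid -> weight vector.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity.
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            // Assumption for this sketch: an all-zero vector is treated
            // as maximally distant from everything.
            return 1.0;
        }
        return 1.0 - dot / (normA * normB);
    }

    // Input:  clusterid -> (termid -> weight)
    // Output: clusterid -> (other clusterid -> distance), self excluded.
    static Map<Integer, Map<Integer, Double>> pairwise(
            Map<Integer, Map<Integer, Double>> centroids) {
        Map<Integer, Map<Integer, Double>> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : centroids.entrySet()) {
            Map<Integer, Double> dists = new LinkedHashMap<>();
            for (Map.Entry<Integer, Map<Integer, Double>> col : centroids.entrySet()) {
                if (!row.getKey().equals(col.getKey())) {
                    dists.put(col.getKey(),
                              cosineDistance(row.getValue(), col.getValue()));
                }
            }
            out.put(row.getKey(), dists);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> centroids = new LinkedHashMap<>();
        Map<Integer, Double> c0 = new HashMap<>();
        c0.put(1, 0.5);
        c0.put(7, 1.5);
        Map<Integer, Double> c1 = new HashMap<>();
        c1.put(7, 2.0);
        centroids.put(0, c0);
        centroids.put(1, c1);
        System.out.println(pairwise(centroids));
    }
}
```

This is quadratic in the number of clusters, which is acceptable only 
because the number of sub-clusters is small. The output has the same 
shape as the (clusterid, list of clusterid->distance) result I wanted 
from RowSimilarityJob.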

Re: Cluster hierarchy with RowSimilarityJob

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Yes, I understand, but I'm trying something different and in any case 
need cluster-to-cluster distances.

Re: Cluster hierarchy with RowSimilarityJob

Posted by Paritosh Ranjan <pr...@xebia.com>.

You can also try Top Down Clustering if it suits your use case: find 
the bigger clusters first, then find smaller clusters within the bigger 
clusters, and so on.
https://cwiki.apache.org/MAHOUT/top-down-clustering.html
