Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/31 19:30:24 UTC
Cluster hierarchy with RowSimilarityJob
I need to find similar clusters and get cluster-to-cluster distances
for several reasons.
The most likely tool for this is the RowSimilarityJob. I imagine it
would take a list of vectors (clusterid, list of the centroid's
termid->weights) and calculate a list of vectors (clusterid, list of
clusterid->distance).
The clusters file has Key class org.apache.hadoop.io.Text (named
vectors) and Value class org.apache.mahout.clustering.kmeans.Cluster,
and it does not work as input to the RowID job. Looking at the actual
values in the file, I suspect the algorithm would work, but since the
value class is Cluster, RowID dies asking for
org.apache.mahout.math.VectorWritable.
What is the easiest way to get RowID and RowSimilarity to work in this
case?
If I need to modify one of these, which do you recommend? Maybe a new
job that takes the Clusters and outputs the "center" as an IntWritable
(clusterID) key with a VectorWritable (centroid from the Cluster class)
value?
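As a rough illustration of the re-keying step proposed above, here is a minimal sketch in plain Java. It deliberately avoids the Hadoop and Mahout APIs: ClusterRecord and CentroidRow are hypothetical stand-ins for the (Text, Cluster) input records and the (IntWritable, VectorWritable) output records, and a sparse vector is modelled as a termid->weight map. A real job would read and write SequenceFiles instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterRekeySketch {

    // Hypothetical stand-in for one (Text name, Cluster value) input record.
    // This is NOT Mahout's Cluster class.
    static final class ClusterRecord {
        final String name;
        final int clusterId;
        final Map<Integer, Double> center; // termid -> weight

        ClusterRecord(String name, int clusterId, Map<Integer, Double> center) {
            this.name = name;
            this.clusterId = clusterId;
            this.center = center;
        }
    }

    // Hypothetical stand-in for one (IntWritable, VectorWritable) output record.
    static final class CentroidRow {
        final int clusterId;
        final Map<Integer, Double> vector; // termid -> weight

        CentroidRow(int clusterId, Map<Integer, Double> vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    // Drop the text name and any other cluster bookkeeping; keep only the
    // integer id and a copy of the centroid vector.
    static List<CentroidRow> toCentroidRows(List<ClusterRecord> clusters) {
        List<CentroidRow> rows = new ArrayList<>();
        for (ClusterRecord c : clusters) {
            rows.add(new CentroidRow(c.clusterId, new TreeMap<>(c.center)));
        }
        return rows;
    }
}
```

The point of the sketch is only the shape of the transformation: one output row per cluster, keyed by the integer cluster id, carrying nothing but the centroid.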
Re: Cluster hierarchy with RowSimilarityJob
Posted by Pat Ferrel <pa...@occamsmachete.com>.
Ah, but reading about top-down clustering I found
ClusterOutputPostProcessorDriver. It looks like this will extract the
centroid vectors. Maybe all I need is top-down clustering, and I can
calculate distances with CosineDistanceMeasure directly, since this
should never require a MapReduce implementation: the sub-clusters are
never huge in number.
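For a small number of centroids, the in-memory computation described above could look roughly like the following. This is a plain-Java sketch, not Mahout code: the class and method names are made up for illustration, sparse centroids are modelled as termid->weight maps, and cosine distance is taken as 1 minus cosine similarity. Treating a zero vector as maximally distant is an assumption of this sketch.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class CentroidDistances {

    // Cosine distance = 1 - cosine similarity, over sparse termid -> weight maps.
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double denom = norm(a) * norm(b);
        // Assumption for this sketch: a zero vector is maximally distant.
        if (denom == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / denom;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // For each cluster id, the distance to every other cluster id:
    // (clusterid, list of clusterid -> distance).
    static SortedMap<Integer, SortedMap<Integer, Double>> pairwise(
            Map<Integer, Map<Integer, Double>> centroids) {
        SortedMap<Integer, SortedMap<Integer, Double>> out = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : centroids.entrySet()) {
            SortedMap<Integer, Double> dists = new TreeMap<>();
            for (Map.Entry<Integer, Map<Integer, Double>> col : centroids.entrySet()) {
                if (!row.getKey().equals(col.getKey())) {
                    dists.put(col.getKey(), cosineDistance(row.getValue(), col.getValue()));
                }
            }
            out.put(row.getKey(), dists);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> centroids = new TreeMap<>();
        centroids.put(0, Map.of(1, 1.0, 2, 1.0));
        centroids.put(1, Map.of(1, 2.0, 2, 2.0));
        centroids.put(2, Map.of(3, 5.0));
        System.out.println(pairwise(centroids));
    }
}
```

The double loop is quadratic in the number of clusters, which is fine when the sub-clusters are few; it is exactly the case where a distributed job would be overkill.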
On 3/31/12 10:53 AM, Pat Ferrel wrote:
> Yes, I understand but I'm trying something different and in any case
> need cluster to cluster distances.
Re: Cluster hierarchy with RowSimilarityJob
Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yes, I understand, but I'm trying something different and in any case
need cluster-to-cluster distances.
On 3/31/12 10:37 AM, Paritosh Ranjan wrote:
> You can also try Top Down Clustering if this suits your use case. Find
> out bigger clusters first, and then, find out smaller clusters in
> bigger clusters and so on.
> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
Re: Cluster hierarchy with RowSimilarityJob
Posted by Paritosh Ranjan <pr...@xebia.com>.
You can also try Top Down Clustering if this suits your use case. Find
the bigger clusters first, then find smaller clusters within the bigger
clusters, and so on.
https://cwiki.apache.org/MAHOUT/top-down-clustering.html