
Posted to user@mahout.apache.org by karthikeyan palanisamy <ka...@gmail.com> on 2009/11/18 07:51:32 UTC

Help needed for identifying the Clustered words

Hi Mahout Team,

     Thank you for Mahout, and for making it open source. I want to use the
results of Mahout for a research application that I am working on. I am
trying to examine and compare the results obtained from the k-means, Mean-Shift
and Dirichlet algorithms. I see that the clustering drivers take
sparse vectors as their input and give keyword-id/cluster-id pairs as their
output. Please help me retrieve the actual words from the
keyword ids (integers). How can I obtain the words
corresponding to the integers?

Thank you,
Karthikeyan.

Re: Help needed for identifying the Clustered words

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 18, 2009, at 1:51 AM, karthikeyan palanisamy wrote:

> Hi Mahout Team,
> 
>     Thank you for Mahout, and for making it open source. I want to use the
> results of Mahout for a research application that I am working on. I am
> trying to examine and compare the results obtained from the k-means, Mean-Shift
> and Dirichlet algorithms. I see that the clustering drivers take
> sparse vectors as their input and give keyword-id/cluster-id pairs as their
> output. Please help me retrieve the actual words from the
> keyword ids (integers). How can I obtain the words
> corresponding to the integers?

Part of it is going to depend on how you created the vectors. If you created them from Lucene using the vector tools in the utils module, then you should have a dictionary file that does the mapping from integer id to term. If you created them on your own, you need to maintain that mapping yourself.
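
For illustration, here is a minimal sketch of turning such a mapping into an id-to-word lookup. It assumes the dictionary is a tab-delimited text file with one "term, document frequency, index" row per line, possibly preceded by a count line and a "#" header. That layout is an assumption, so check it against the file you actually have. The class name DictionaryLookup and its load method are hypothetical and are not part of Mahout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Mahout class.
class DictionaryLookup {

    // Reads rows of the ASSUMED form "term<TAB>docFreq<TAB>index"
    // and returns a map from index to term.
    static Map<Integer, String> load(Reader in) throws IOException {
        Map<Integer, String> idToTerm = new HashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            // Skip blank lines and "#" header lines.
            // Note: this also skips any term that itself begins with '#'.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\t");
            // Skip rows without three columns, e.g. a leading count line.
            if (parts.length < 3) {
                continue;
            }
            idToTerm.put(Integer.parseInt(parts[2].trim()), parts[0]);
        }
        return idToTerm;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data in the assumed layout.
        String sample = "3\n"
                + "#term\tdoc freq\tidx\n"
                + "apple\t12\t0\n"
                + "banana\t7\t1\n"
                + "cherry\t3\t2\n";
        Map<Integer, String> idToTerm = load(new StringReader(sample));

        // Keyword ids as they might appear in clustering output.
        int[] keywordIds = {2, 0};
        for (int id : keywordIds) {
            System.out.println(id + " -> " + idToTerm.get(id));
        }
    }
}
```

To read a real file, pass a java.io.FileReader for its path to load instead of the StringReader. If you built the vectors yourself, writing out a file in a layout like this while assigning ids is one simple way to keep the mapping.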

FWIW, have a look at the ClusterDumper class in the utils submodule. There are also the SequenceFileDumper and the VectorDumper, which may come in handy.

-Grant