Posted to user@mahout.apache.org by Muruga Prabu M <mu...@gmail.com> on 2011/01/28 07:39:43 UTC

Output the unique document id with the cluster dump utility

Hi,

I created a sequence file from a directory of text documents using Mahout's
'seqdirectory' command. From that sequence file, I created Mahout vector files
with the 'seq2sparse' command, and then clustered the data with k-means. I am
running the programs on a Hadoop cluster with the HADOOP_HOME and HADOOP_CONF
environment variables set. The k-means command I used is as follows.

./mahout kmeans -i /home/exthadoop1/mahout-vector/tfidf-vectors \
  -o /home/exthadoop1/output -c clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 5 -ow -cd 1 -k 5 -cl
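
For context, the two preparatory steps described above (seqdirectory, then seq2sparse) can be sketched roughly as below. This is only an illustration under assumptions: DOCS_DIR and SEQ_DIR are hypothetical placeholder paths, not the ones actually used, and only VEC_DIR is inferred from the -i path of the kmeans command. The script prints the commands rather than running them, so the paths can be checked first.

```shell
#!/bin/sh
# Sketch only: prints the two preparatory commands instead of running them.
# Pipe the output to `sh` to execute once the paths look right.
MAHOUT="${MAHOUT:-./mahout}"
DOCS_DIR="${DOCS_DIR:-/path/to/text-docs}"             # hypothetical input directory
SEQ_DIR="${SEQ_DIR:-/path/to/seqfiles}"                # hypothetical intermediate directory
VEC_DIR="${VEC_DIR:-/home/exthadoop1/mahout-vector}"   # parent of tfidf-vectors used by kmeans

prep_commands() {
  # Step 1: turn the directory of text documents into a sequence file.
  echo "$MAHOUT seqdirectory -i $DOCS_DIR -o $SEQ_DIR"
  # Step 2: turn the sequence file into sparse vectors
  # (seq2sparse writes tfidf-vectors under its output directory).
  echo "$MAHOUT seq2sparse -i $SEQ_DIR -o $VEC_DIR"
}

prep_commands
```

Running `prep_commands | sh` would execute both steps in order; any further options passed to either command are not shown here.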

To analyze the output, I use the clusterdump utility, invoked with the
following options.

./mahout clusterdump --seqFileDir /home/exthadoop1/output/clusters-1 \
  --pointsDir /home/exthadoop1/output/clusteredPoints --output cluster.txt

In my cluster.txt file, I get the cluster name, the number of points in the
cluster, the coordinates of the centroid, the radius of the cluster, and the
weights and set of documents in the cluster. The problem is that each document
is represented only as a point in an n-dimensional space. Is there any way to
make clusterdump output the unique document id along with the coordinates of
each document? That would make it easier for me to see which documents are in
each cluster. I am also attaching my cluster.txt output.

Regards,
Murugaprabu Marimuthu