Posted to user@mahout.apache.org by Hossein Kazemi <ho...@gridline.nl> on 2012/04/11 12:15:03 UTC

Kmeans cluster mapping to actual document IDs

Hi,
I have clustered a set of documents using Mahout's Kmeans (map-reduce).
I used sparse vectors because of the large size of my corpus.
The book says that the folder named clusteredPoints contains the
mapping between the clustered documents and the document IDs. However,
all I can see is a "1:0", a feature vector, and a cluster ID. Where
can I find the actual document names/IDs?
thx

Re: Kmeans cluster mapping to actual document IDs

Posted by Baoqiang Cao <bq...@gmail.com>.

In my very limited experience, you need to pass the "-nv" option in the
seq2sparse step so that the clusterdump output shows the document IDs.
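
A minimal sketch of that step, assuming the documents have already been
converted to SequenceFiles. The directory names are hypothetical
placeholders, and the guard only keeps the snippet from failing when
mahout is not installed. The vectors have to be regenerated with "-nv"
and the clustering re-run on them before the names can show up in
clusteredPoints.

```shell
# Hypothetical paths; substitute your own input and output directories.
SEQ_DIR="corpus-seqfiles"   # SequenceFiles holding the raw documents
VEC_DIR="corpus-vectors"    # where seq2sparse writes the sparse vectors

# -nv makes seq2sparse emit named vectors, so each vector keeps the key
# (document ID) of the document it was built from.
if command -v mahout >/dev/null 2>&1; then
  mahout seq2sparse -i "$SEQ_DIR" -o "$VEC_DIR" -nv
else
  echo "mahout not on PATH; would run: mahout seq2sparse -i $SEQ_DIR -o $VEC_DIR -nv"
fi
```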

Best,
Baoqiang
