Posted to user@mahout.apache.org by Kasi Subrahmanyam <ka...@gmail.com> on 2014/02/11 07:02:10 UTC

Generating individual file for each record in clustering

Hi,
I have gone through the k-means and canopy clustering examples. As far as I
can see, before running clustering we need to convert the text files to
sequence files using Mahout's seqdirectory command. The input to this
command is a directory with one file per record, where the filename serves
as the record ID.

But I have more than 10 million records, stored in no more than 5 to 10
text files in HDFS. Creating 10 million files as input to seqdirectory
doesn't seem right. Each line of my text files holds one record, with the
ID and the record separated by a tab. Is there any other way to do this?
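
To make the layout concrete, here is a minimal Python sketch of how I
picture each line mapping to an (id, text) pair, which is the key/value
shape I understand seqdirectory to produce from filename and file content.
The function name parse_records and the sample lines are made up for
illustration, and writing the pairs into an actual sequence file would
still need the Hadoop API, which is not shown here.

```python
from typing import Iterable, Iterator, Tuple


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, text) pairs from 'id<TAB>record' lines.

    Hypothetical helper: it only illustrates the one-record-per-line
    layout and does not write a Hadoop sequence file.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the record survive.
        record_id, sep, text = line.partition("\t")
        if not sep:
            continue  # no tab found: not an 'id<TAB>record' line
        yield record_id, text


# Made-up sample lines standing in for one of the large text files.
sample = [
    "doc1\tfirst record text\n",
    "doc2\tsecond record\twith a tab\n",
    "\n",
]
pairs = list(parse_records(sample))
```

Each pair would then become one key/value entry, so no per-record file
would be needed.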

Thanks,
Subbu