Posted to common-user@hadoop.apache.org by Lukáš Kryške <lu...@hotmail.cz> on 2012/04/28 14:45:03 UTC
KMeans clustering on Hadoop infrastructure
Hello,
I am successfully running the K-Means clustering sample from the 'Mahout in Action' book (the example in Chapter 7.3) in my Hadoop environment. Now I need to extend the program to take the vectors from a file located in my HDFS. I need to cluster millions or billions of vectors, which are represented by comma-separated values in a .txt file in HDFS. The data are stored in this pattern:
x1,y1
x2,y2
...
xn,yn
As I understood from the book, I need to transform my .txt file with vectors into Hadoop's SequenceFile first. How can I do that most efficiently? And how do I tell KMeansDriver that the input path contains a SequenceFile with vectors?
Thanks for your help.
_________________
Best Regards,
Lukas Kryske
Re: KMeans clustering on Hadoop infrastructure
Posted by Robert Evans <ev...@yahoo-inc.com>.
You are likely going to get more help from talking to the Mahout mailing list.
https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives
--Bobby Evans
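For readers of this archive, here is a minimal sketch of the parsing half of the conversion asked about above. It assumes the .txt file holds one vector per line as comma-separated numbers; the class and method names (VectorLineParser, parseLine, parseLines) are hypothetical and are not part of Hadoop or Mahout. In a real job, each parsed array would typically be wrapped in a Mahout DenseVector and VectorWritable and appended to a SequenceFile, since KMeansDriver reads SequenceFiles whose values are VectorWritable. That writing step is left out here because it depends on the Hadoop and Mahout versions in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Hadoop or Mahout class.
final class VectorLineParser {

    // Parses one line such as "1.5,2.0" into its numeric components.
    // Assumption: one vector per line, fields separated by commas.
    static double[] parseLine(String line) {
        String[] fields = line.trim().split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Parses a sequence of lines and skips blank ones.
    static List<double[]> parseLines(Iterable<String> lines) {
        List<double[]> vectors = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                vectors.add(parseLine(line));
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Small in-memory sample standing in for lines read from HDFS.
        List<String> sample = Arrays.asList("1.0,2.0", "", " 3.5 , 4.25 ");
        for (double[] v : parseLines(sample)) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

With billions of vectors, the same per-line parsing would more likely run inside a map-only MapReduce job that emits the vectors in parallel, rather than in a single-process loop like this one.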