Posted to user@mahout.apache.org by "McConnell, Christopher (GE Global Research)" <mc...@ge.com> on 2011/02/02 20:34:45 UTC

KMeans Clustering Issues

All,

I've begun to look into Mahout on top of Hadoop, specifically for large-scale
cluster analysis.

However, I am running into an issue when calling
KMeansDriver.run(Configuration, Path, Path, Path, DistanceMeasure, double,
int, boolean, boolean) with the last argument (runSequential) set to false
and the data stored on HDFS.

I've seen several posts about this that describe a fix inside KMeansDriver
(adding the job.setJarByClass() method call), but I am still getting the
typical ClassNotFoundException: org.apache.mahout.math.Vector.

A quick overview: we've created a map job that takes our current dataset and
converts it into the sequence files the driver requires. We have then tried
a few different ways of calling KMeansDriver.run(): either within the same
driver as the previous MR job, or separately in a new JVM. Both of these
tests were run through the Hadoop environment. Next, I tried running a
standalone Java application, configured to read from HDFS but not launched
within the Hadoop environment; this gives us the same
ClassNotFoundException.
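
For the standalone case, one way to narrow this down is to ask the JVM
whether it can see the class at all, and from which jar. The following is
only a minimal sketch: the class name ClasspathCheck and its describe()
helper are made up for illustration, and it only inspects the JVM it runs
in (the client side). It says nothing about the task JVMs on the cluster,
which get their classes from the submitted job jar.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    /**
     * Reports where a class would be loaded from in this JVM, or that it
     * is missing from this JVM's classpath. Hypothetical helper, not part
     * of Mahout or Hadoop.
     */
    static String describe(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK typically have no code source
            return className + " -> " + (src == null ? "(bootstrap/JDK)" : src.getLocation());
        } catch (ClassNotFoundException e) {
            return className + " -> NOT FOUND";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the exception message in this thread
        System.out.println(describe("org.apache.mahout.math.Vector"));
        // Any further class names can be passed on the command line
        for (String name : args) {
            System.out.println(describe(name));
        }
    }
}
```

If this prints NOT FOUND when run with the same classpath as the standalone
application, the Mahout math jar is missing on the client side. If it prints
a jar location, the exception is more likely raised in the task JVMs.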

Our versions are Mahout 0.4, Hadoop 0.20.1+169.89 and Hadoop 0.20.2 (we have
multiple clusters for testing).

I have done other tests with the KMeansDriver that did work. For example,
running the method in memory works fine. We can also run the clustering
over MapReduce if the job is launched through a java -jar command and the
data is stored locally. Finally, I can execute the mahout binary with the
kmeans argument (./mahout kmeans -c path -i path -x #), which also works
fine; however, we do not want to rely on creating multiple stages or running
multiple (and separate) applications.

Any thoughts are appreciated.
Thanks,
Chris


Christopher McConnell
Computer Scientist
Advanced Computing Lab
Edison Engineering Development Program
GE Global Research

T  +1 518 387 5176
mcconnel@ge.com

One Research Circle
Niskayuna, NY 12309

GE Imagination at Work

Re: KMeans Clustering Issues

Posted by Timothy Potter <th...@gmail.com>.

Hi Chris,

If I'm reading your message correctly, it sounds like you are trying to pass
sequence files as input to the clustering job. The clustering jobs require
vectors as input, not just sequence files. So make sure you are pointing to
the output of seq2sparse, which would be something like: path/tfidf-vectors
or path/tf-vectors

Cheers,
Tim

On Wed, Feb 2, 2011 at 1:21 PM, Jeff Eastman <je...@narus.com> wrote:

> Sounds like you might not be using the mahout-core-0.4-job.jar file? Also,
> we don't run on Hadoop 0.20.1, only 0.20.2. Finally, trunk always has the
> latest and greatest patches in it and the clustering stuff is quite stable
> there.
>
> Jeff

RE: KMeans Clustering Issues

Posted by Jeff Eastman <je...@Narus.com>.

Sounds like you might not be using the mahout-core-0.4-job.jar file? Also, we don't run on Hadoop 0.20.1, only 0.20.2. Finally, trunk always has the latest and greatest patches in it and the clustering stuff is quite stable there.

Jeff
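
To check the job-jar suggestion above, it can help to confirm that the jar
actually submitted to Hadoop bundles the missing class. The sketch below
assumes the usual Hadoop job-jar layout, where dependencies sit either at
the jar root or as nested jars under lib/. The class name JobJarCheck and
its find() helper are hypothetical and not part of Mahout or Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarInputStream;

public class JobJarCheck {

    /**
     * Returns the location of className inside the jar (either a top-level
     * entry or "lib/x.jar!/entry" for a nested jar), or null if absent.
     */
    static String find(String jarPath, String className) throws IOException {
        String wanted = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            // Case 1: the class is unpacked at the root of the job jar
            if (jar.getEntry(wanted) != null) {
                return wanted;
            }
            // Case 2: the class lives in a dependency jar nested under lib/
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry e = entries.nextElement();
                String name = e.getName();
                if (name.startsWith("lib/") && name.endsWith(".jar")) {
                    try (InputStream in = jar.getInputStream(e);
                         JarInputStream nested = new JarInputStream(in)) {
                        JarEntry inner;
                        while ((inner = nested.getNextJarEntry()) != null) {
                            if (inner.getName().equals(wanted)) {
                                return name + "!/" + wanted;
                            }
                        }
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the jar being submitted; the default is the file name from this thread
        String jarPath = args.length > 0 ? args[0] : "mahout-core-0.4-job.jar";
        String hit = find(jarPath, "org.apache.mahout.math.Vector");
        System.out.println(hit == null ? "Vector.class not bundled in " + jarPath : "found: " + hit);
    }
}
```

If the class is not found in the jar you submit, the task JVMs have no way
to load it unless it is installed on every node's classpath separately.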
