Posted to user@mahout.apache.org by Lokendra Singh <ls...@gmail.com> on 2011/01/11 15:48:19 UTC

Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Hi Everyone,

I have been trying to run KMeans clustering from Mahout versions 0.3 and
0.4, directly from my Java program, by calling KMeansDriver.runJob()
(mahout-0.3) and KMeansDriver.run() (mahout-0.4) on my Hadoop cluster
(0.20.2).

Unfortunately, it fails for both versions of Mahout (see errors below),
apparently because the Vector class is not found, although mahout-math.jar
(which contains org.apache.mahout.math.Vector) is present on the classpath
while running the programs. (I have also included $HADOOP_HOME/conf in my
classpath.)

The programs are able to write the vectors and clusters to HDFS but fail
during the map-reduce step:

Error (mahout-0.3): http://pastebin.mandriva.com/21630
Error (mahout-0.4): http://pastebin.mandriva.com/21631

PS: However, I am able to run KMeans from the Hadoop command line.

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Lokendra Singh <ls...@gmail.com>.
Hi all,

i) I added the extra required jars/libs to the environment variables
HADOOP_CLASSPATH and CLASSPATH (all jars, including the Hadoop installation
jars) in $HADOOP_HOME/conf/hadoop-env.sh
and restarted the Hadoop cluster

ii) While running the Java program, I pointed the classpath to
$HADOOP_HOME/conf, i.e.
java -cp $HADOOP_HOME/conf prog

iii) And it worked for both Mahout versions (0.3 and 0.4)!
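In shell terms, the working setup above looks roughly like the following config/command fragment (jar names and paths are illustrative, not taken from the thread):

```shell
# i) In $HADOOP_HOME/conf/hadoop-env.sh, add the Mahout jars to the
#    daemons' classpath, then restart the cluster:
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/opt/mahout/mahout-core-0.4.jar:/opt/mahout/mahout-math-0.4.jar"
export CLASSPATH="$CLASSPATH:/opt/mahout/mahout-core-0.4.jar:/opt/mahout/mahout-math-0.4.jar"

# ii) Launch the driver with $HADOOP_HOME/conf on the client classpath
#     so it picks up the cluster configuration (core-site.xml etc.):
java -cp "$HADOOP_HOME/conf:myprog.jar:/opt/mahout/mahout-core-0.4.jar:/opt/mahout/mahout-math-0.4.jar" prog
```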


Regards
Lokendra


On Wed, Jan 12, 2011 at 10:59 AM, Ted Dunning <te...@gmail.com> wrote:

> It isn't a matter of classpath.
>
> It is a matter of your program having to run on many parallel machines that
> share nothing.
>
> That means that the classpath on your original machine may contain the
> correct jars, but those jars aren't even on the other machines involved in
> the computation.  Remember hadoop is for DISTRIBUTED computation.
>
> If it is just a matter of Mahout's dependencies, then you can use the
> job jar instead of the mahout-core jar and be good to go.
>
> If it is your own jars that are being left out, then you need to tell
> Hadoop about them.  I am unsure what command-line methods that might require.
>
> On Tue, Jan 11, 2011 at 9:01 PM, Lokendra Singh <lsingh.969@gmail.com>
> wrote:
>
> > @Ted: If the classpath jars are not visible to Hadoop internally, what is
> > the global classpath used by Hadoop and how do I change it? I had tried
> > putting the required jars in '$HADOOP_HOME/lib' as well, thinking it would
> > be a central repository of libs, but to no avail.
> >
>

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Ted Dunning <te...@gmail.com>.
It isn't a matter of classpath.

It is a matter of your program having to run on many parallel machines that
share nothing.

That means that the classpath on your original machine may contain the
correct jars, but those jars aren't even on the other machines involved in
the computation.  Remember hadoop is for DISTRIBUTED computation.

If it is just a matter of Mahout's dependencies, then you can use the
job jar instead of the mahout-core jar and be good to go.

If it is your own jars that are being left out, then you need to tell
Hadoop about them.  I am unsure what command-line methods that might require.

On Tue, Jan 11, 2011 at 9:01 PM, Lokendra Singh <ls...@gmail.com> wrote:

> @Ted: If the classpath jars are not visible to Hadoop internally, what is
> the global classpath used by Hadoop and how do I change it? I had tried
> putting the required jars in '$HADOOP_HOME/lib' as well, thinking it would
> be a central repository of libs, but to no avail.
>

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I guess this stands close to my question in another email thread.

There is an MR option for Hadoop to grab extra jars, but I have never used
it in conjunction with ToolRunner, which is presumably what Mahout uses
everywhere. Besides, I am not sure that I would not be overriding the result
of JobConf.setJar(), which I think most Mahout jobs use to set the MR
classpath.

I guess a little experimentation is key, but I am fairly sure that placing
jars into $HADOOP_HOME/lib is not going to help by default.
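For reference, the generic-options route mentioned above looks like this command fragment when the driver runs through ToolRunner/GenericOptionsParser (the driver class, jar names, and HDFS paths here are hypothetical):

```shell
# -libjars is understood only by GenericOptionsParser, i.e. drivers
# launched via ToolRunner; Hadoop copies the listed jars to the cluster
# and adds them to each task's classpath for this job only.
hadoop jar myjob.jar com.example.KMeansLauncher \
  -libjars /opt/mahout/mahout-math-0.4.jar \
  /user/hduser/input /user/hduser/output
```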

On Tue, Jan 11, 2011 at 9:01 PM, Lokendra Singh <ls...@gmail.com> wrote:

> Hi,
>
> @Sean: I am actually trying to run KMeans directly via the
> KMeansDriver.runJob() method in my Java program (not using Hadoop
> classes like RunJar), so even putting the .job file in my classpath
> makes no difference.
>
> @Ted: If the classpath jars are not visible to Hadoop internally, what is
> the global classpath used by Hadoop and how do I change it? I had tried
> putting the required jars in '$HADOOP_HOME/lib' as well, thinking it would
> be a central repository of libs, but to no avail.
>
>
> Regards
> Lokendra
>
>
> On Tue, Jan 11, 2011 at 10:01 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > Yes, you should be using the "job" .jar file, not the regular .jar
> > file. The "job" file is a .jar file which contains all dependencies.
> > Are you doing this?
> >
> > On Tue, Jan 11, 2011 at 4:29 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > > I don't know the ultimate cause, but the proximal cause is probably
> that
> > the
> > > mahout-math.jar is not
> > > being passed to the mappers and reducers.  The fact that it is in your
> > class
> > > path is only part of the
> > > battle because hadoop has to know that it needs to pass this jar to the
> > > other machines running your
> > > program.
> > >
> >
>

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Lokendra Singh <ls...@gmail.com>.
Hi,

@Sean: I am actually trying to run KMeans directly via the
KMeansDriver.runJob() method in my Java program (not using Hadoop
classes like RunJar), so even putting the .job file in my classpath
makes no difference.

@Ted: If the classpath jars are not visible to Hadoop internally, what is
the global classpath used by Hadoop and how do I change it? I had tried
putting the required jars in '$HADOOP_HOME/lib' as well, thinking it would
be a central repository of libs, but to no avail.


Regards
Lokendra


On Tue, Jan 11, 2011 at 10:01 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes, you should be using the "job" .jar file, not the regular .jar
> file. The "job" file is a .jar file which contains all dependencies.
> Are you doing this?
>
> On Tue, Jan 11, 2011 at 4:29 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> > I don't know the ultimate cause, but the proximal cause is probably that
> the
> > mahout-math.jar is not
> > being passed to the mappers and reducers.  The fact that it is in your
> class
> > path is only part of the
> > battle because hadoop has to know that it needs to pass this jar to the
> > other machines running your
> > program.
> >
>

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Sean Owen <sr...@gmail.com>.
Yes, you should be using the "job" .jar file, not the regular .jar
file. The "job" file is a .jar file which contains all dependencies.
Are you doing this?
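Concretely, running through the self-contained "job" artifact looks something like this command fragment (the artifact name assumes a typical Mahout 0.4 build; driver options elided):

```shell
# The *.job file bundles mahout-math and the other dependencies, so
# Hadoop ships everything the mappers and reducers need to the cluster.
hadoop jar mahout-examples-0.4.job \
  org.apache.mahout.clustering.kmeans.KMeansDriver [options]
```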

On Tue, Jan 11, 2011 at 4:29 PM, Ted Dunning <te...@gmail.com> wrote:
>
> I don't know the ultimate cause, but the proximal cause is probably that the
> mahout-math.jar is not
> being passed to the mappers and reducers.  The fact that it is in your class
> path is only part of the
> battle because hadoop has to know that it needs to pass this jar to the
> other machines running your
> program.
>

Re: Running KMeans Directly from Java Program on Hadoop-0.20.2 and 'Vector' ClassNotFound error

Posted by Ted Dunning <te...@gmail.com>.
I don't know the ultimate cause, but the proximal cause is probably that the
mahout-math.jar is not
being passed to the mappers and reducers.  The fact that it is in your class
path is only part of the
battle because hadoop has to know that it needs to pass this jar to the
other machines running your
program.

Others can comment on exactly why this might be happening, but I strongly
suspect that the normal Mahout command-line program does something to
indicate that the mahout-math jar file is important. Obviously, the
corollary is that you are not doing this in your program.

Btw... don't use 0.3.  It is very old and Mahout is still moving pretty
fast.

On Tue, Jan 11, 2011 at 6:48 AM, Lokendra Singh <ls...@gmail.com> wrote:

> Hi Everyone,
>
> I have been trying to run KMeans clustering from Mahout versions 0.3 and
> 0.4, directly from my Java program, by calling KMeansDriver.runJob()
> (mahout-0.3) and KMeansDriver.run() (mahout-0.4) on my Hadoop cluster
> (0.20.2).
>
> Unfortunately, it fails for both versions of Mahout (see errors below),
> apparently because the Vector class is not found, although mahout-math.jar
> (which contains org.apache.mahout.math.Vector) is present on the classpath
> while running the programs. (I have also included $HADOOP_HOME/conf in my
> classpath.)
>
> The programs are able to write the vectors and clusters to HDFS but fail
> during the map-reduce step:
>
> Error (mahout-0.3): http://pastebin.mandriva.com/21630
> Error (mahout-0.4): http://pastebin.mandriva.com/21631
>
> PS: However, I am able to run KMeans from the Hadoop command line.
>