Posted to user@mahout.apache.org by Gabor Makrai <ma...@gmail.com> on 2012/12/18 12:47:33 UTC

Run Mahout remotely

Hi,

I have a huge Java project where the created jar file is larger than 50 MB.
I've already integrated the Mahout K-Means clustering algorithm by using the
KMeansDriver.run function with the appropriate parameters. This program
runs on the users' computers, so when Mahout tries to start a new MR job,
it uploads the whole jar file every time. (It sometimes takes more
than 10 minutes because of the client's slow internet connection.)
I know that Hadoop MR supports job submission without uploading the jar
file if the jobConf.setJar or jobConf.setJarByClass calls are skipped.
I found in the Mahout source code that these functions are always called,
but it would be good for me to somehow skip them. Is it possible to solve
my problem without copy-pasting all the corresponding code and commenting
out these function calls?
(Of course, if these functions are skipped, it is necessary to upload the
jar file manually and add it to the DistributedCache for every job
submission.)
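The manual workaround described above can be sketched roughly as follows. This is a sketch only, not Mahout's actual code path: the class name and DFS paths are hypothetical, and it assumes the classic org.apache.hadoop.filecache.DistributedCache API that shipped with Hadoop at the time.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: instead of letting jobConf.setJar(...) trigger a
// fresh upload on every submission, keep one copy of the jar in the DFS
// and put it on the task classpath via the DistributedCache.
public class RemoteJarSubmit {

  public static void configure(Configuration conf) throws Exception {
    Path jarOnDfs = new Path("/apps/myapp/myapp.jar"); // hypothetical path
    FileSystem fs = FileSystem.get(conf);

    // Upload once (or reuse the existing copy) instead of once per job.
    if (!fs.exists(jarOnDfs)) {
      fs.copyFromLocalFile(new Path("target/myapp.jar"), jarOnDfs);
    }

    // Replacement for the implicit per-job upload: make the DFS copy
    // visible on every task's classpath.
    DistributedCache.addFileToClassPath(jarOnDfs, conf);
  }
}
```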

Thanks,
Gabor

Re: Run Mahout remotely

Posted by Ted Dunning <te...@gmail.com>.
One method for dealing with this is to always submit jobs from a machine
near or in the cluster.  This is a pain because you wind up having to
compile twice or transfer the jar occasionally to this machine.  The
(slightly) good news is that rsync is often quite clever about moving jars
incrementally.  On my home connection, this can be a 10x speedup for moving
a jar.
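That rsync approach might look like the following; the gateway host, user, and paths are hypothetical.

```shell
# Incrementally sync the job jar to a machine near (or in) the cluster.
# rsync transfers only changed blocks, which helps a lot when most of a
# fat jar (the dependencies) is unchanged between builds.
rsync -avz --partial target/myapp-job.jar user@cluster-gateway:jobs/

# Then submit from the gateway, where the upload to the cluster is fast.
ssh user@cluster-gateway 'hadoop jar jobs/myapp-job.jar com.example.MyDriver'
```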

Another thing you can do which is a bit more invasive is try using MapR's
distribution for Hadoop.  The trick there is to put most of your dependency
jars into the DFS and then make sure that all nodes can see them via locally
mounted NFS.  If you put these jars into the class path of your map-reduce
program, you can then submit just a tiny job jar because all of the
dependencies will already be in the cluster.  This can *massively* decrease
job startup times.
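A minimal sketch of that dependencies-in-the-DFS idea using the stock Hadoop DistributedCache (jar names and paths are hypothetical; MapR's NFS mount additionally makes the same files visible as local paths on every node):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: dependency jars live in the DFS, uploaded once
// ahead of time (e.g. with `hadoop fs -put lib/*.jar /apps/myapp/lib/`),
// so only a tiny job jar travels over the wire at submission time.
public class ThinJobJar {

  public static void addDeps(Configuration conf) throws Exception {
    String[] deps = {
        "/apps/myapp/lib/mahout-core.jar",   // hypothetical paths
        "/apps/myapp/lib/mahout-math.jar"
    };
    for (String dep : deps) {
      // Put each pre-uploaded jar on the task classpath.
      DistributedCache.addFileToClassPath(new Path(dep), conf);
    }
  }
}
```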

On Tue, Dec 18, 2012 at 3:47 AM, Gabor Makrai <ma...@gmail.com> wrote:

> Hi,
>
> I have a huge Java project where the created jar file is larger than 50 MB.
> I've already integrated the Mahout K-Means clustering algorithm by using the
> KMeansDriver.run function with the appropriate parameters. This program
> runs on the users' computers, so when Mahout tries to start a new MR job,
> it uploads the whole jar file every time. (It sometimes takes more
> than 10 minutes because of the client's slow internet connection.)
> I know that Hadoop MR supports job submission without uploading the jar
> file if the jobConf.setJar or jobConf.setJarByClass calls are skipped.
> I found in the Mahout source code that these functions are always called,
> but it would be good for me to somehow skip them. Is it possible to solve
> my problem without copy-pasting all the corresponding code and commenting
> out these function calls?
> (Of course, if these functions are skipped, it is necessary to upload the
> jar file manually and add it to the DistributedCache for every job
> submission.)
>
> Thanks,
> Gabor
>