Posted to user@pig.apache.org by Something Something <ma...@gmail.com> on 2011/11/16 07:24:30 UTC

Distributing our jars to all machines in a cluster

Until now we have been manually copying our Jars to all machines in our
Hadoop cluster.  This worked while the cluster was small, but now the
cluster is getting bigger.  What's the best way to start a Hadoop Job so
that the Jar is automatically distributed to all machines in the cluster?

I read the doc at:
http://hadoop.apache.org/common/docs/current/commands_manual.html#jar

Would -libjars do the trick?  But we need to use 'hadoop job' for that,
right?  Until now, we were using 'hadoop jar' to start all our jobs.

Needless to say, we are getting our feet wet with Hadoop, so appreciate
your help with our dumb questions.

Thanks.

PS:  We use Pig a lot, which automatically does this, so there must be a
clean way to do this.

Re: Distributing our jars to all machines in a cluster

Posted by Bejoy Ks <be...@gmail.com>.
Hi
       To distribute application-specific jars or files you can do it with
the 'hadoop jar' command itself, like:
hadoop jar sample.jar com.test.Samples.Application -files file1.txt,file2.csv -libjars custom_connector.jar,json_util.jar input_dir output_dir
But this happens every time the job is run.  If the job runs frequently,
there are many jars to distribute, or multiple jobs depend on the same
jars, then rather than shipping the jars on every submission it is better
to pre-distribute them across your nodes and include them in the classpath
of all the nodes.
        AFAIK you don't use "hadoop job" to submit your MR job.  It is used
for working with a job (setting priorities, killing it, monitoring status,
etc.) once the job is registered with the job tracker (i.e. for running
jobs).

Hope it helps!...

Regards
Bejoy.K.S
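
Note that -files and -libjars are generic Hadoop options: they are parsed
by GenericOptionsParser, so they only take effect if the driver class is
run through ToolRunner (or parses the generic options itself).  A minimal
sketch of such a driver, with a made-up class name and the mapper/reducer
setup left out, might look like:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class Application extends Configured implements Tool {

    // run() only sees the job arguments; -files/-libjars have already been
    // stripped out and applied to the Configuration by ToolRunner.
    public int run(String[] args) throws Exception {
      Job job = new Job(getConf(), "sample job");
      job.setJarByClass(Application.class);
      // set mapper, reducer and key/value classes here ...
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new Application(), args));
    }
  }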

On Wed, Nov 16, 2011 at 12:09 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Until now we were manually copying our Jars to all machines in a Hadoop
> cluster.  This used to work until our cluster size was small.  Now our
> cluster is getting bigger.  What's the best way to start a Hadoop Job that
> automatically distributes the Jar to all machines in a cluster?
>
> I read the doc at:
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>
> Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> right?  Until now, we were using 'hadoop jar' to start all our jobs.
>
> Needless to say, we are getting our feet wet with Hadoop, so appreciate
> your help with our dumb questions.
>
> Thanks.
>
> PS:  We use Pig a lot, which automatically does this, so there must be a
> clean way to do this.
>
>

Re: Distributing our jars to all machines in a cluster

Posted by Bejoy Ks <be...@gmail.com>.
Hi
      You can find usage examples of -libjars and -files at the following
Apache URL:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Usage

"Running wordcount example with -libjars, -files and -archives:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
mylib.jar -archives myarchive.zip input output Here, myarchive.zip will be
placed and unzipped into a directory by the name "myarchive.zip"."

I have implemented some map-reduce projects successfully with the -files
option and never faced any issue.  I have used the -libjars option with
Sqoop as well, and there too it worked flawlessly.

If your jars keep changing often, then -libjars is the preferred option.
If they are fairly static and there are many dependent jars, then the job
tracker doesn't need to distribute the jars to the task tracker nodes
every time you submit a job; in those cases pre-distributing the dependent
jars explicitly across the nodes one time is the better approach.

I'm not entirely sure how it works for Hive and Pig.  I believe Pig and
Hive parse the Pig Latin / HiveQL into map-reduce jobs, and maybe those
jobs are packaged and distributed across the nodes, though I'm really not
sure.  Experts, please correct me if I'm wrong.
AFAIK shipping jars/files across the cluster should be done internally
using the -libjars/-files options in pretty much all tools that use
map reduce under the hood.

Regards
Bejoy.K.S

On Wed, Nov 16, 2011 at 8:12 PM, Something Something <
mailinglists19@gmail.com> wrote:

> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
> 'hadoop jar'.  Also, as per the documentation (
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>
> Generic Options
>
> The following options are supported by dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>
> , fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>
> , fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>
> , job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job>
>  and fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>
> .
>
>
>
> Does it work for you?  If it does, please let me know.  "Pre-distributing"
> definitely works, but is that the best way?  If you have a big cluster and
> Jars are changing often it will be time-consuming.
>
> Also, how does Pig do it?  We update Pig UDFs often and put them only on
> the 'client' machine (machine that starts the Pig job) and the UDF becomes
> available to all machines in the cluster - automagically!  Is Pig doing the
> pre-distributing for us?
>
> Thanks for your patience & help with our questions.
>
> On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> Hmm... there must be a different way 'cause we don't need to do that to
>> run Pig jobs.
>>
>>
>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>wrote:
>>
>>> There might be different ways but currently we are storing our jars onto
>>> HDFS and register them from there. They will be copied to the machine once
>>> the job starts. Is that an option?
>>>
>>> Daan.
>>>
>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>
>>> > Until now we were manually copying our Jars to all machines in a Hadoop
>>> > cluster.  This used to work until our cluster size was small.  Now our
>>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>>> that
>>> > automatically distributes the Jar to all machines in a cluster?
>>> >
>>> > I read the doc at:
>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>> >
>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>> >
>>> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
>>> > your help with our dumb questions.
>>> >
>>> > Thanks.
>>> >
>>> > PS:  We use Pig a lot, which automatically does this, so there must be
>>> a
>>> > clean way to do this.
>>>
>>>
>>
>

Re: Distributing our jars to all machines in a cluster

Posted by Praveen Sripati <pr...@gmail.com>.
Hi,

Here are the different ways of distributing 3rd party jars with the
application.

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Thanks,
Praveen

On Wed, Nov 16, 2011 at 11:30 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Libjars works if your MR job is initialized correctly. Here's a code
> snippet:
>
>  public static void main(String[] args) throws Exception {
>    GenericOptionsParser optParser = new GenericOptionsParser(args);
>    int exitCode = ToolRunner.run(optParser.getConfiguration(),
>        new MyMRJob(),
>        optParser.getRemainingArgs());
>    System.exit(exitCode);
>  }
>
> Pig works by re-jarring your whole application, and there's an
> outstanding patch to make it run libjars -- which works, I've been
> running it in production at Twitter.
>
> -D
>
> On Wed, Nov 16, 2011 at 9:00 AM, Something Something
> <ma...@gmail.com> wrote:
> > I agree.  It will eventually get us in trouble.  That's why we want to
> get
> > the -libjars option to work, but it's not working.. arrrghhh..  It's the
> > simplest things in engineering that take the longest time... -:)
> >
> > Can you see why this may not work?
> >
> > /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> > /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> > -libjars /Users/xyz/modules/something/target/my.jar,
> > /Users/xyz/avro-tools-1.5.4.jar
> >
> > On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> > <fv...@xebia.com> wrote:
> >>
> >> You use maven jar-with-deps default assembly? That layout works too, but
> >> it will give you problems eventually when you have different classes
> with
> >> the same package and name.
> >> Java jar files are regular ZIP files. They can contain duplicate
> entries.
> >> I don't know whether your packaging creates duplicates in them, but if
> it
> >> does, it could be the cause of your problem.
> >> Try checking your jar for a duplicate license dir in the META-INF
> >> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort
> |
> >> uniq -d)
> >>
> >> Friso
> >>
> >> On 16 nov. 2011, at 17:33, Something Something wrote:
> >>
> >> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by
> Maven
> >> I get this:
> >>
> >> Mkdirs failed to create
> >> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
> >>
> >>
> >> Do you recall coming across this?  Our 'all-in-one' jar is not exactly
> how
> >> you have described it.  It doesn't contain any JARs, but it has all the
> >> classes from all the dependent JARs.
> >>
> >>
> >> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> >> <fv...@xebia.com> wrote:
> >>>
> >>> We usually package my jobs as a single jar that contains a /lib
> directory
> >>> in the jar that contains all other jars that the job code depends on.
> Hadoop
> >>> understands this layout when run as 'hadoop jar'. So the jar layout
> would be
> >>> something like:
> >>> /META-INF/manifest.mf
> >>> /com/mypackage/MyMapperClass.class
> >>> /com/mypackage/MyReducerClass.class
> >>> /lib/dependency1.jar
> >>> /lib/dependency2.jar
> >>> etc.
> >>> If you use Maven or some other build tool with dependency management,
> you
> >>> can usually produce this jar as part of your build. We also have Maven
> write
> >>> the main class to the manifest, such that there is no need to type it.
> So
> >>> for us, submitting a job looks like:
> >>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
> >>> Then Hadoop will take care of submitting and distributing, etc. Of
> course
> >>> you pay the penalty of always sending all of your dependencies over
> the wire
> >>> (the job jar gets replicated to 10 machines by
> default). Pre-distributing
> >>> sounds tedious and error prone to me. What if you have different jobs
> that
> >>> require different versions of the same dependency?
> >>>
> >>> HTH,
> >>> Friso
> >>>
> >>>
> >>>
> >>>
> >>> On 16 nov. 2011, at 15:42, Something Something wrote:
> >>>
> >>> Bejoy - Thanks for the reply.  The '-libjars' is not working for me
> with
> >>> 'hadoop jar'.  Also, as per the documentation
> >>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> ):
> >>>
> >>> Generic Options
> >>>
> >>> The following options are supported
> >>> by dfsadmin, fs, fsck, job and fetchdt.
> >>>
> >>>
> >>>
> >>> Does it work for you?  If it does, please let me know.
> >>>  "Pre-distributing" definitely works, but is that the best way?  If
> you have
> >>> a big cluster and Jars are changing often it will be time-consuming.
> >>>
> >>> Also, how does Pig do it?  We update Pig UDFs often and put them only
> on
> >>> the 'client' machine (machine that starts the Pig job) and the UDF
> becomes
> >>> available to all machines in the cluster - automagically!  Is Pig
> doing the
> >>> pre-distributing for us?
> >>>
> >>> Thanks for your patience & help with our questions.
> >>>
> >>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
> >>> <ma...@gmail.com> wrote:
> >>>>
> >>>> Hmm... there must be a different way 'cause we don't need to do that
> to
> >>>> run Pig jobs.
> >>>>
> >>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> There might be different ways but currently we are storing our jars
> >>>>> onto HDFS and register them from there. They will be copied to the
> machine
> >>>>> once the job starts. Is that an option?
> >>>>>
> >>>>> Daan.
> >>>>>
> >>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
> >>>>>
> >>>>> > Until now we were manually copying our Jars to all machines in a
> >>>>> > Hadoop
> >>>>> > cluster.  This used to work until our cluster size was small.  Now
> >>>>> > our
> >>>>> > cluster is getting bigger.  What's the best way to start a Hadoop
> Job
> >>>>> > that
> >>>>> > automatically distributes the Jar to all machines in a cluster?
> >>>>> >
> >>>>> > I read the doc at:
> >>>>> >
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> >>>>> >
> >>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
> >>>>> > that,
> >>>>> > right?  Until now, we were using 'hadoop jar' to start all our
> jobs.
> >>>>> >
> >>>>> > Needless to say, we are getting our feet wet with Hadoop, so
> >>>>> > appreciate
> >>>>> > your help with our dumb questions.
> >>>>> >
> >>>>> > Thanks.
> >>>>> >
> >>>>> > PS:  We use Pig a lot, which automatically does this, so there must
> >>>>> > be a
> >>>>> > clean way to do this.
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>

Re: Distributing our jars to all machines in a cluster

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Libjars works if your MR job is initialized correctly. Here's a code snippet:

  public static void main(String[] args) throws Exception {
    GenericOptionsParser optParser = new GenericOptionsParser(args);
    int exitCode = ToolRunner.run(optParser.getConfiguration(),
        new MyMRJob(),
        optParser.getRemainingArgs());
    System.exit(exitCode);
  }

Pig works by re-jarring your whole application, and there's an
outstanding patch to make it run libjars -- which works, I've been
running it in production at Twitter.

-D
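
Assuming MyMRJob above implements the Tool interface, a submission that
ships extra jars might look something like this (jar names and paths are
made up; note there must be no spaces in the comma-separated list, since
the whole list has to reach the parser as a single argument):

  hadoop jar my-job.jar com.example.MyMRJob -libjars /path/to/dep1.jar,/path/to/dep2.jar input output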

On Wed, Nov 16, 2011 at 9:00 AM, Something Something
<ma...@gmail.com> wrote:
> I agree.  It will eventually get us in trouble.  That's why we want to get
> the -libjars option to work, but it's not working.. arrrghhh..  It's the
> simplest things in engineering that take the longest time... -:)
>
> Can you see why this may not work?
>
> /Users/xyz/hadoop-0.20.2/bin/hadoop jar
> /Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
> -libjars /Users/xyz/modules/something/target/my.jar,
> /Users/xyz/avro-tools-1.5.4.jar
>
> On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven
> <fv...@xebia.com> wrote:
>>
>> You use maven jar-with-deps default assembly? That layout works too, but
>> it will give you problems eventually when you have different classes with
>> the same package and name.
>> Java jar files are regular ZIP files. They can contain duplicate entries.
>> I don't know whether your packaging creates duplicates in them, but if it
>> does, it could be the cause of your problem.
>> Try checking your jar for a duplicate license dir in the META-INF
>> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort |
>> uniq -d)
>>
>> Friso
>>
>> On 16 nov. 2011, at 17:33, Something Something wrote:
>>
>> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven
>> I get this:
>>
>> Mkdirs failed to create
>> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>>
>>
>> Do you recall coming across this?  Our 'all-in-one' jar is not exactly how
>> you have described it.  It doesn't contain any JARs, but it has all the
>> classes from all the dependent JARs.
>>
>>
>> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
>> <fv...@xebia.com> wrote:
>>>
>>> We usually package my jobs as a single jar that contains a /lib directory
>>> in the jar that contains all other jars that the job code depends on. Hadoop
>>> understands this layout when run as 'hadoop jar'. So the jar layout would be
>>> something like:
>>> /META-INF/manifest.mf
>>> /com/mypackage/MyMapperClass.class
>>> /com/mypackage/MyReducerClass.class
>>> /lib/dependency1.jar
>>> /lib/dependency2.jar
>>> etc.
>>> If you use Maven or some other build tool with dependency management, you
>>> can usually produce this jar as part of your build. We also have Maven write
>>> the main class to the manifest, such that there is no need to type it. So
>>> for us, submitting a job looks like:
>>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>> Then Hadoop will take care of submitting and distributing, etc. Of course
>>> you pay the penalty of always sending all of your dependencies over the wire
>>> (the job jar gets replicated to 10 machines by default). Pre-distributing
>>> sounds tedious and error prone to me. What if you have different jobs that
>>> require different versions of the same dependency?
>>>
>>> HTH,
>>> Friso
>>>
>>>
>>>
>>>
>>> On 16 nov. 2011, at 15:42, Something Something wrote:
>>>
>>> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
>>> 'hadoop jar'.  Also, as per the documentation
>>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>>
>>> Generic Options
>>>
>>> The following options are supported
>>> by dfsadmin, fs, fsck, job and fetchdt.
>>>
>>>
>>>
>>> Does it work for you?  If it does, please let me know.
>>>  "Pre-distributing" definitely works, but is that the best way?  If you have
>>> a big cluster and Jars are changing often it will be time-consuming.
>>>
>>> Also, how does Pig do it?  We update Pig UDFs often and put them only on
>>> the 'client' machine (machine that starts the Pig job) and the UDF becomes
>>> available to all machines in the cluster - automagically!  Is Pig doing the
>>> pre-distributing for us?
>>>
>>> Thanks for your patience & help with our questions.
>>>
>>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
>>> <ma...@gmail.com> wrote:
>>>>
>>>> Hmm... there must be a different way 'cause we don't need to do that to
>>>> run Pig jobs.
>>>>
>>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>
>>>> wrote:
>>>>>
>>>>> There might be different ways but currently we are storing our jars
>>>>> onto HDFS and register them from there. They will be copied to the machine
>>>>> once the job starts. Is that an option?
>>>>>
>>>>> Daan.
>>>>>
>>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>>
>>>>> > Until now we were manually copying our Jars to all machines in a
>>>>> > Hadoop
>>>>> > cluster.  This used to work until our cluster size was small.  Now
>>>>> > our
>>>>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>>>>> > that
>>>>> > automatically distributes the Jar to all machines in a cluster?
>>>>> >
>>>>> > I read the doc at:
>>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>>> >
>>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
>>>>> > that,
>>>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>>>> >
>>>>> > Needless to say, we are getting our feet wet with Hadoop, so
>>>>> > appreciate
>>>>> > your help with our dumb questions.
>>>>> >
>>>>> > Thanks.
>>>>> >
>>>>> > PS:  We use Pig a lot, which automatically does this, so there must
>>>>> > be a
>>>>> > clean way to do this.
>>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Distributing our jars to all machines in a cluster

Posted by Something Something <ma...@gmail.com>.
I agree.  It will eventually get us in trouble.  That's why we want to get
the -libjars option to work, but it's not working.. arrrghhh..  It's the
simplest things in engineering that take the longest time... -:)

Can you see why this may not work?

/Users/xyz/hadoop-0.20.2/bin/hadoop jar
/Users/xyz/modules/something/target/my.jar com.xyz.common.MyMapReduce
-libjars /Users/xyz/modules/something/target/my.jar,
/Users/xyz/avro-tools-1.5.4.jar


On Wed, Nov 16, 2011 at 8:51 AM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

>  You use maven jar-with-deps default assembly? That layout works too, but
> it will give you problems eventually when you have different classes with
> the same package and name.
>
>  Java jar files are regular ZIP files. They can contain duplicate
> entries. I don't know whether your packaging creates duplicates in them,
> but if it does, it could be the cause of your problem.
>
>  Try checking your jar for a duplicate license dir in the META-INF
> (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort |
> uniq -d)
>
>
>  Friso
>
>
>  On 16 nov. 2011, at 17:33, Something Something wrote:
>
> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven
> I get this:
>
> Mkdirs failed to create
> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>
>
> Do you recall coming across this?  Our 'all-in-one' jar is not exactly how
> you have described it.  It doesn't contain any JARs, but it has all the
> classes from all the dependent JARs.
>
>
> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <
> fvanvollenhoven@xebia.com> wrote:
>
>> We usually package my jobs as a single jar that contains a /lib directory
>> in the jar that contains all other jars that the job code depends on.
>> Hadoop understands this layout when run as 'hadoop jar'. So the jar layout
>> would be something like:
>>
>> /META-INF/manifest.mf
>>  /com/mypackage/MyMapperClass.class
>>  /com/mypackage/MyReducerClass.class
>>  /lib/dependency1.jar
>>  /lib/dependency2.jar
>>  etc.
>>
>>  If you use Maven or some other build tool with dependency management,
>> you can usually produce this jar as part of your build. We also have Maven
>> write the main class to the manifest, such that there is no need to type
>> it. So for us, submitting a job looks like:
>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>
>>  Then Hadoop will take care of submitting and distributing, etc. Of
>> course you pay the penalty of always sending all of your dependencies over
>> the wire (the job jar gets replicated to 10 machines by
>> default). Pre-distributing sounds tedious and error prone to me. What if
>> you have different jobs that require different versions of the same
>> dependency?
>>
>>
>>  HTH,
>> Friso
>>
>>
>>
>>
>>
>>  On 16 nov. 2011, at 15:42, Something Something wrote:
>>
>> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
>> 'hadoop jar'.  Also, as per the documentation (
>> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>
>>  Generic Options
>>
>> The following options are supported by dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>
>> , fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>
>> , fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>
>> , job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job>
>>  and fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>
>> .
>>
>>
>>
>> Does it work for you?  If it does, please let me know.
>>  "Pre-distributing" definitely works, but is that the best way?  If you
>> have a big cluster and Jars are changing often it will be time-consuming.
>>
>> Also, how does Pig do it?  We update Pig UDFs often and put them only on
>> the 'client' machine (machine that starts the Pig job) and the UDF becomes
>> available to all machines in the cluster - automagically!  Is Pig doing the
>> pre-distributing for us?
>>
>> Thanks for your patience & help with our questions.
>>
>>  On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> Hmm... there must be a different way 'cause we don't need to do that to
>>> run Pig jobs.
>>>
>>>
>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>wrote:
>>>
>>>> There might be different ways but currently we are storing our jars
>>>> onto HDFS and register them from there. They will be copied to the machine
>>>> once the job starts. Is that an option?
>>>>
>>>> Daan.
>>>>
>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>
>>>> > Until now we were manually copying our Jars to all machines in a
>>>> Hadoop
>>>> > cluster.  This used to work until our cluster size was small.  Now our
>>>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>>>> that
>>>> > automatically distributes the Jar to all machines in a cluster?
>>>> >
>>>> > I read the doc at:
>>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>> >
>>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
>>>> that,
>>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>>> >
>>>> > Needless to say, we are getting our feet wet with Hadoop, so
>>>> appreciate
>>>> > your help with our dumb questions.
>>>> >
>>>> > Thanks.
>>>> >
>>>> > PS:  We use Pig a lot, which automatically does this, so there must
>>>> be a
>>>> > clean way to do this.
>>>>
>>>>
>>>
>>
>>
>
>

Re: Distributing our jars to all machines in a cluster

Posted by Friso van Vollenhoven <fv...@xebia.com>.
You use maven jar-with-deps default assembly? That layout works too, but it will give you problems eventually when you have different classes with the same package and name.

Java jar files are regular ZIP files. They can contain duplicate entries. I don't know whether your packaging creates duplicates in them, but if it does, it could be the cause of your problem.

Try checking your jar for a duplicate license dir in the META-INF (something like: unzip -l <your-jar-name>.jar | awk '{print $4}' | sort | uniq -d)


Friso


On 16 nov. 2011, at 17:33, Something Something wrote:

Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven I get this:

Mkdirs failed to create /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license


Do you recall coming across this?  Our 'all-in-one' jar is not exactly how you have described it.  It doesn't contain any JARs, but it has all the classes from all the dependent JARs.


On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <fv...@xebia.com>> wrote:
We usually package my jobs as a single jar that contains a /lib directory in the jar that contains all other jars that the job code depends on. Hadoop understands this layout when run as 'hadoop jar'. So the jar layout would be something like:

/META-INF/manifest.mf
/com/mypackage/MyMapperClass.class
/com/mypackage/MyReducerClass.class
/lib/dependency1.jar
/lib/dependency2.jar
etc.

If you use Maven or some other build tool with dependency management, you can usually produce this jar as part of your build. We also have Maven write the main class to the manifest, such that there is no need to type it. So for us, submitting a job looks like:
hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN

Then Hadoop will take care of submitting and distributing, etc. Of course you pay the penalty of always sending all of your dependencies over the wire (the job jar gets replicated to 10 machines by default). Pre-distributing sounds tedious and error prone to me. What if you have different jobs that require different versions of the same dependency?


HTH,
Friso





On 16 nov. 2011, at 15:42, Something Something wrote:

Bejoy - Thanks for the reply.  The '-libjars' is not working for me with 'hadoop jar'.  Also, as per the documentation (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):

Generic Options

The following options are supported by dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>, fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>, fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>, job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job> and fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>.



Does it work for you?  If it does, please let me know.  "Pre-distributing" definitely works, but is that the best way?  If you have a big cluster and Jars are changing often it will be time-consuming.

Also, how does Pig do it?  We update Pig UDFs often and put them only on the 'client' machine (machine that starts the Pig job) and the UDF becomes available to all machines in the cluster - automagically!  Is Pig doing the pre-distributing for us?

Thanks for your patience & help with our questions.


On Wed, Nov 16, 2011 at 6:29 AM, Something Something <ma...@gmail.com>> wrote:
Hmm... there must be a different way 'cause we don't need to do that to run Pig jobs.


On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>> wrote:
There might be different ways but currently we are storing our jars onto HDFS and register them from there. They will be copied to the machine once the job starts. Is that an option?

Daan.

On 16 Nov 2011, at 07:24, Something Something wrote:

> Until now we were manually copying our Jars to all machines in a Hadoop
> cluster.  This used to work until our cluster size was small.  Now our
> cluster is getting bigger.  What's the best way to start a Hadoop Job that
> automatically distributes the Jar to all machines in a cluster?
>
> I read the doc at:
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>
> Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> right?  Until now, we were using 'hadoop jar' to start all our jobs.
>
> Needless to say, we are getting our feet wet with Hadoop, so appreciate
> your help with our dumb questions.
>
> Thanks.
>
> PS:  We use Pig a lot, which automatically does this, so there must be a
> clean way to do this.







Re: Distributing our jars to all machines in a cluster

Posted by John Conwell <jo...@iamjohn.me>.
I think a small program that adds your jars to the distributed cache
should take care of your issue, as described at:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html
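
As a rough sketch (class and path names are made up, and the jar has to be
in HDFS before it can be added), pushing a dependency jar onto the task
classpath through the distributed cache might look like:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CacheJarExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Copy the dependency jar into HDFS (paths are illustrative).
      Path hdfsJar = new Path("/user/xyz/lib/json_util.jar");
      fs.copyFromLocalFile(new Path("/local/lib/json_util.jar"), hdfsJar);

      // Add the HDFS copy to the classpath of every map and reduce task.
      // Do this on the Configuration before constructing the Job, so the
      // setting is carried into the submitted job.
      DistributedCache.addFileToClassPath(hdfsJar, conf);

      // ... build and submit the Job with this conf as usual.
    }
  }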


On Wed, Nov 16, 2011 at 8:33 AM, Something Something <
mailinglists19@gmail.com> wrote:

> Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven
> I get this:
>
> Mkdirs failed to create
> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>
>
> Do you recall coming across this?  Our 'all-in-one' jar is not exactly how
> you have described it.  It doesn't contain any JARs, but it has all the
> classes from all the dependent JARs.
>
>
> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <
> fvanvollenhoven@xebia.com> wrote:
>
> >  We usually package my jobs as a single jar that contains a /lib
> directory
> > in the jar that contains all other jars that the job code depends on.
> > Hadoop understands this layout when run as 'hadoop jar'. So the jar
> layout
> > would be something like:
> >
> > /META-INF/manifest.mf
> >  /com/mypackage/MyMapperClass.class
> >  /com/mypackage/MyReducerClass.class
> >  /lib/dependency1.jar
> >  /lib/dependency2.jar
> >  etc.
> >
> >  If you use Maven or some other build tool with dependency management,
> > you can usually produce this jar as part of your build. We also have
> Maven
> > write the main class to the manifest, such that there is no need to type
> > it. So for us, submitting a job looks like:
> > hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
> >
> >  Then Hadoop will take care of submitting and distributing, etc. Of
> > course you pay the penalty of always sending all of your dependencies
> over
> > the wire (the job jar gets replicated to 10 machines by
> > default). Pre-distributing sounds tedious and error prone to me. What if
> > you have different jobs that require different versions of the same
> > dependency?
> >
> >
> >  HTH,
> > Friso
> >
> >
> >
> >
> >
> >  On 16 nov. 2011, at 15:42, Something Something wrote:
> >
> > Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
> > 'hadoop jar'.  Also, as per the documentation (
> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
> >
> >  Generic Options
> >
> > The following options are supported by dfsadmin<
> http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin
> >
> > , fs<
> http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>
> > , fsck<
> http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>
> > , job<
> http://hadoop.apache.org/common/docs/current/commands_manual.html#job>
> >  and fetchdt<
> http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>
> > .
> >
> >
> >
> > Does it work for you?  If it does, please let me know.
>  "Pre-distributing"
> > definitely works, but is that the best way?  If you have a big cluster
> and
> > Jars are changing often it will be time-consuming.
> >
> > Also, how does Pig do it?  We update Pig UDFs often and put them only on
> > the 'client' machine (machine that starts the Pig job) and the UDF
> becomes
> > available to all machines in the cluster - automagically!  Is Pig doing
> the
> > pre-distributing for us?
> >
> > Thanks for your patience & help with our questions.
> >
> >  On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
> > mailinglists19@gmail.com> wrote:
> >
> >> Hmm... there must be a different way 'cause we don't need to do that to
> >> run Pig jobs.
> >>
> >>
> >> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <daan.gerits@gmail.com
> >wrote:
> >>
> >>> There might be different ways but currently we are storing our jars
> onto
> >>> HDFS and register them from there. They will be copied to the machine
> once
> >>> the job starts. Is that an option?
> >>>
> >>> Daan.
> >>>
> >>> On 16 Nov 2011, at 07:24, Something Something wrote:
> >>>
> >>> > Until now we were manually copying our Jars to all machines in a
> Hadoop
> >>> > cluster.  This used to work until our cluster size was small.  Now
> our
> >>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
> >>> that
> >>> > automatically distributes the Jar to all machines in a cluster?
> >>> >
> >>> > I read the doc at:
> >>> >
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> >>> >
> >>> > Would -libjars do the trick?  But we need to use 'hadoop job' for
> that,
> >>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
> >>> >
> >>> > Needless to say, we are getting our feet wet with Hadoop, so
> appreciate
> >>> > your help with our dumb questions.
> >>> >
> >>> > Thanks.
> >>> >
> >>> > PS:  We use Pig a lot, which automatically does this, so there must
> be
> >>> a
> >>> > clean way to do this.
> >>>
> >>>
> >>
> >
> >
>



-- 

Thanks,
John C

Re: Distributing our jars to all machines in a cluster

Posted by Something Something <ma...@gmail.com>.
Thanks Bejoy & Friso.  When I use the all-in-one jar file created by Maven
I get this:

Mkdirs failed to create
/Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license


Do you recall coming across this?  Our 'all-in-one' jar is not exactly how
you have described it.  It doesn't contain any JARs, but it has all the
classes from all the dependent JARs.


On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven <
fvanvollenhoven@xebia.com> wrote:

>  We usually package my jobs as a single jar that contains a /lib directory
> in the jar that contains all other jars that the job code depends on.
> Hadoop understands this layout when run as 'hadoop jar'. So the jar layout
> would be something like:
>
> /META-INF/manifest.mf
>  /com/mypackage/MyMapperClass.class
>  /com/mypackage/MyReducerClass.class
>  /lib/dependency1.jar
>  /lib/dependency2.jar
>  etc.
>
>  If you use Maven or some other build tool with dependency management,
> you can usually produce this jar as part of your build. We also have Maven
> write the main class to the manifest, such that there is no need to type
> it. So for us, submitting a job looks like:
> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>
>  Then Hadoop will take care of submitting and distributing, etc. Of
> course you pay the penalty of always sending all of your dependencies over
> the wire (the job jar gets replicated to 10 machines by
> default). Pre-distributing sounds tedious and error prone to me. What if
> you have different jobs that require different versions of the same
> dependency?
>
>
>  HTH,
> Friso
>
>
>
>
>
>  On 16 nov. 2011, at 15:42, Something Something wrote:
>
> Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
> 'hadoop jar'.  Also, as per the documentation (
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>
>  Generic Options
>
> The following options are supported by dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>
> , fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>
> , fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>
> , job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job>
>  and fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>
> .
>
>
>
> Does it work for you?  If it does, please let me know.  "Pre-distributing"
> definitely works, but is that the best way?  If you have a big cluster and
> Jars are changing often it will be time-consuming.
>
> Also, how does Pig do it?  We update Pig UDFs often and put them only on
> the 'client' machine (machine that starts the Pig job) and the UDF becomes
> available to all machines in the cluster - automagically!  Is Pig doing the
> pre-distributing for us?
>
> Thanks for your patience & help with our questions.
>
>  On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> Hmm... there must be a different way 'cause we don't need to do that to
>> run Pig jobs.
>>
>>
>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>wrote:
>>
>>> There might be different ways but currently we are storing our jars onto
>>> HDFS and register them from there. They will be copied to the machine once
>>> the job starts. Is that an option?
>>>
>>> Daan.
>>>
>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>
>>> > Until now we were manually copying our Jars to all machines in a Hadoop
>>> > cluster.  This used to work until our cluster size was small.  Now our
>>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>>> that
>>> > automatically distributes the Jar to all machines in a cluster?
>>> >
>>> > I read the doc at:
>>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>> >
>>> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
>>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>>> >
>>> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
>>> > your help with our dumb questions.
>>> >
>>> > Thanks.
>>> >
>>> > PS:  We use Pig a lot, which automatically does this, so there must be
>>> a
>>> > clean way to do this.
>>>
>>>
>>
>
>

Re: Distributing our jars to all machines in a cluster

Posted by Friso van Vollenhoven <fv...@xebia.com>.
We usually package our jobs as a single jar that contains a /lib directory holding all the other jars that the job code depends on. Hadoop understands this layout when run as 'hadoop jar'. So the jar layout would be something like:

/META-INF/manifest.mf
/com/mypackage/MyMapperClass.class
/com/mypackage/MyReducerClass.class
/lib/dependency1.jar
/lib/dependency2.jar
etc.

If you use Maven or some other build tool with dependency management, you can usually produce this jar as part of your build. We also have Maven write the main class to the manifest, such that there is no need to type it. So for us, submitting a job looks like:
hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN

Then Hadoop will take care of submitting and distributing, etc. Of course you pay the penalty of always sending all of your dependencies over the wire (the job jar gets replicated to 10 machines by default). Pre-distributing sounds tedious and error prone to me. What if you have different jobs that require different versions of the same dependency?


HTH,
Friso





On 16 nov. 2011, at 15:42, Something Something wrote:

Bejoy - Thanks for the reply.  The '-libjars' is not working for me with 'hadoop jar'.  Also, as per the documentation (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):

Generic Options

The following options are supported by dfsadmin<http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>, fs<http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>, fsck<http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>, job<http://hadoop.apache.org/common/docs/current/commands_manual.html#job> and fetchdt<http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>.



Does it work for you?  If it does, please let me know.  "Pre-distributing" definitely works, but is that the best way?  If you have a big cluster and Jars are changing often it will be time-consuming.

Also, how does Pig do it?  We update Pig UDFs often and put them only on the 'client' machine (machine that starts the Pig job) and the UDF becomes available to all machines in the cluster - automagically!  Is Pig doing the pre-distributing for us?

Thanks for your patience & help with our questions.


On Wed, Nov 16, 2011 at 6:29 AM, Something Something <ma...@gmail.com>> wrote:
Hmm... there must be a different way 'cause we don't need to do that to run Pig jobs.


On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>> wrote:
There might be different ways but currently we are storing our jars onto HDFS and register them from there. They will be copied to the machine once the job starts. Is that an option?

Daan.

On 16 Nov 2011, at 07:24, Something Something wrote:

> Until now we were manually copying our Jars to all machines in a Hadoop
> cluster.  This used to work until our cluster size was small.  Now our
> cluster is getting bigger.  What's the best way to start a Hadoop Job that
> automatically distributes the Jar to all machines in a cluster?
>
> I read the doc at:
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>
> Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> right?  Until now, we were using 'hadoop jar' to start all our jobs.
>
> Needless to say, we are getting our feet wet with Hadoop, so appreciate
> your help with our dumb questions.
>
> Thanks.
>
> PS:  We use Pig a lot, which automatically does this, so there must be a
> clean way to do this.





Re: Distributing our jars to all machines in a cluster

Posted by Something Something <ma...@gmail.com>.
Bejoy - Thanks for the reply.  The '-libjars' is not working for me with
'hadoop jar'.  Also, as per the documentation (
http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):

Generic Options

The following options are supported by
dfsadmin <http://hadoop.apache.org/common/docs/current/commands_manual.html#dfsadmin>,
fs <http://hadoop.apache.org/common/docs/current/commands_manual.html#fs>,
fsck <http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck>,
job <http://hadoop.apache.org/common/docs/current/commands_manual.html#job>
and fetchdt <http://hadoop.apache.org/common/docs/current/commands_manual.html#fetchdt>.



Does it work for you?  If it does, please let me know.  "Pre-distributing"
definitely works, but is that the best way?  If you have a big cluster and
the jars change often, it becomes time-consuming.

Also, how does Pig do it?  We update Pig UDFs often and put them only on
the 'client' machine (machine that starts the Pig job) and the UDF becomes
available to all machines in the cluster - automagically!  Is Pig doing the
pre-distributing for us?

Thanks for your patience & help with our questions.

On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
mailinglists19@gmail.com> wrote:

> Hmm... there must be a different way 'cause we don't need to do that to
> run Pig jobs.
>
>
> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com>wrote:
>
>> There might be different ways but currently we are storing our jars onto
>> HDFS and register them from there. They will be copied to the machine once
>> the job starts. Is that an option?
>>
>> Daan.
>>
>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>
>> > Until now we were manually copying our Jars to all machines in a Hadoop
>> > cluster.  This used to work until our cluster size was small.  Now our
>> > cluster is getting bigger.  What's the best way to start a Hadoop Job
>> that
>> > automatically distributes the Jar to all machines in a cluster?
>> >
>> > I read the doc at:
>> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>> >
>> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
>> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
>> >
>> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
>> > your help with our dumb questions.
>> >
>> > Thanks.
>> >
>> > PS:  We use Pig a lot, which automatically does this, so there must be a
>> > clean way to do this.
>>
>>
>

Re: Distributing our jars to all machines in a cluster

Posted by Something Something <ma...@gmail.com>.
Hmm... there must be a different way 'cause we don't need to do that to run
Pig jobs.

On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <da...@gmail.com> wrote:

> There might be different ways but currently we are storing our jars onto
> HDFS and register them from there. They will be copied to the machine once
> the job starts. Is that an option?
>
> Daan.
>
> On 16 Nov 2011, at 07:24, Something Something wrote:
>
> > Until now we were manually copying our Jars to all machines in a Hadoop
> > cluster.  This used to work until our cluster size was small.  Now our
> > cluster is getting bigger.  What's the best way to start a Hadoop Job
> that
> > automatically distributes the Jar to all machines in a cluster?
> >
> > I read the doc at:
> > http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> >
> > Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> > right?  Until now, we were using 'hadoop jar' to start all our jobs.
> >
> > Needless to say, we are getting our feet wet with Hadoop, so appreciate
> > your help with our dumb questions.
> >
> > Thanks.
> >
> > PS:  We use Pig a lot, which automatically does this, so there must be a
> > clean way to do this.
>
>

Re: Distributing our jars to all machines in a cluster

Posted by Daan Gerits <da...@gmail.com>.
There might be different ways, but currently we are storing our jars on HDFS and registering them from there. They are copied to each machine once the job starts. Is that an option?
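
For a plain MapReduce job, one way to get that effect (just a sketch; the
HDFS path and class names below are made up) is to register the jar through
the distributed cache, which pulls it onto each task node and puts it on the
task classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class HdfsJarRegistration {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The jar has to be on HDFS already, e.g. uploaded beforehand with
    // 'hadoop fs -put my-udfs.jar /libs/my-udfs.jar' (hypothetical path).
    DistributedCache.addFileToClassPath(new Path("/libs/my-udfs.jar"), conf);

    // Build the job from this conf; the jar is localised on every task
    // node and shows up on the task's classpath.
    Job job = new Job(conf, "job-with-hdfs-jar");
    // ... set mapper/reducer, input and output paths, then submit ...
  }
}

With Pig, I believe REGISTER can also point at the jar's HDFS location,
though you may want to verify that with your Pig version.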

Daan.

On 16 Nov 2011, at 07:24, Something Something wrote:

> Until now we were manually copying our Jars to all machines in a Hadoop
> cluster.  This used to work until our cluster size was small.  Now our
> cluster is getting bigger.  What's the best way to start a Hadoop Job that
> automatically distributes the Jar to all machines in a cluster?
> 
> I read the doc at:
> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
> 
> Would -libjars do the trick?  But we need to use 'hadoop job' for that,
> right?  Until now, we were using 'hadoop jar' to start all our jobs.
> 
> Needless to say, we are getting our feet wet with Hadoop, so appreciate
> your help with our dumb questions.
> 
> Thanks.
> 
> PS:  We use Pig a lot, which automatically does this, so there must be a
> clean way to do this.