Posted to general@hadoop.apache.org by web service <wb...@gmail.com> on 2010/11/12 03:17:30 UTC

running hadoop jobs from within a program

Hi,
  Currently I run my sample hadoop job from a bash script using the
following command ...

[code]
tmp="$HADOOP_BIN jar $JAR_LOC $MAIN_CLASS /user/joe/input/input-$i/ \
/user/vadmin/output/output-$i/"
$tmp
[/code]

However, I would like to write a timer that does some cleanup after the
jobs are complete and restarts the jobs after x hours. What I am looking
for is the ability to invoke a job from within a program, rather than
through the jar command.

-Mac

Re: running hadoop jobs from within a program

Posted by web service <wb...@gmail.com>.

Thanks, had figured it out. It is fun to figure out how things work :)

On Sun, Nov 14, 2010 at 4:22 AM, Harsh J <qw...@gmail.com> wrote:

> Hello,
>
> On Fri, Nov 12, 2010 at 10:25 PM, web service <wb...@gmail.com> wrote:
> > Thanks, but submitting three different jobs say using
> >
> > JobClient.submitjob(jobconf1);
> > JobClient.submitjob(jobconf2);
> > JobClient.submitjob(jobconf3)
> >
> > different from running -
> > tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-1/
> > /user/vadmin/output/output-1/
> > tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-2/
> > /user/vadmin/output/output-2/
> > tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-3/
> > /user/vadmin/output/output-3/
>
> It isn't different. In both cases a new JobID is assigned for each job
> created and its specific configuration is associated to it upon
> submission.
>
> >
> > I guess every job can have specific jvm options. and I hope that every
> > submitted job runs in a separate jvm, No ?
>
> Yes, each Task (Map or Reduce, under the Job) runs in a separate JVM
> (although JVMs can be reused using a tweak).
>
> --
> Harsh J
> www.harshj.com
>

Re: running hadoop jobs from within a program

Posted by Harsh J <qw...@gmail.com>.

Hello,

On Fri, Nov 12, 2010 at 10:25 PM, web service <wb...@gmail.com> wrote:
> Thanks, but submitting three different jobs say using
>
> JobClient.submitjob(jobconf1);
> JobClient.submitjob(jobconf2);
> JobClient.submitjob(jobconf3)
>
> different from running -
> tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-1/
> /user/vadmin/output/output-1/
> tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-2/
> /user/vadmin/output/output-2/
> tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-3/
> /user/vadmin/output/output-3/

It isn't different. In both cases a new JobID is assigned to each job
created, and its specific configuration is associated with it upon
submission.

>
> I guess every job can have specific jvm options. and I hope that every
> submitted job runs in a separate jvm, No ?

Yes, each Task (Map or Reduce, under the Job) runs in a separate JVM
(although JVMs can be reused using a tweak).
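
As a rough illustration of "specific configuration per job", here is a
minimal Java sketch. It uses a plain Map as a stand-in for a JobConf, so
it is not Hadoop's actual API. The class and method names are
hypothetical, and the heap sizes a caller passes in are only examples.
The property names are the ones used by the old mapred API, and should
be checked against the Hadoop version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerJobSettings {
    /**
     * Builds an independent settings map for job number i.
     * The map is only a stand-in for a JobConf (an assumption made so the
     * sketch compiles without Hadoop on the classpath).
     */
    static Map<String, String> settingsFor(int i, String childJvmOpts) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Input and output paths follow the pattern used in this thread.
        conf.put("mapred.input.dir", "/user/joe/input/input-" + i + "/");
        conf.put("mapred.output.dir", "/user/vadmin/output/output-" + i + "/");
        // JVM options passed to the task JVMs of this job only.
        conf.put("mapred.child.java.opts", childJvmOpts);
        return conf;
    }
}
```

Each call returns a fresh map, so changing the options for one job does
not affect the others, which mirrors how each submitted job keeps its
own configuration.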

-- 
Harsh J
www.harshj.com

Re: running hadoop jobs from within a program

Posted by web service <wb...@gmail.com>.

Thanks, but is submitting three different jobs, say using

JobClient.submitJob(jobconf1);
JobClient.submitJob(jobconf2);
JobClient.submitJob(jobconf3);

different from running -
tmp="$HADOOP_BIN jar $JAR_LOC $MAIN_CLASS /user/joe/input/input-1/ \
/user/vadmin/output/output-1/"
tmp="$HADOOP_BIN jar $JAR_LOC $MAIN_CLASS /user/joe/input/input-2/ \
/user/vadmin/output/output-2/"
tmp="$HADOOP_BIN jar $JAR_LOC $MAIN_CLASS /user/joe/input/input-3/ \
/user/vadmin/output/output-3/"

I guess every job can have its own JVM options, and I hope that every
submitted job runs in a separate JVM, no?

On Fri, Nov 12, 2010 at 12:55 AM, daniel sikar <ds...@gmail.com> wrote:

> I suggest you write a loop in your bash script, grepping for finished,
> then take it from there.
> Also, you can submit the same job as many times as you like.
>
> On 12 November 2010 02:17, web service <wb...@gmail.com> wrote:
> > Hi,
> >  Currently I run my sample hadoop job from a bash script using the
> > following command ...
> >
> > [code]
> > tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-$i/
> > /user/vadmin/output/output-$i/
> > $tmp
> > [/code]
> >
> > However, I would want to write a timer that would do some cleanup after
> the
> > jobs are  complete and restart the jobs after x hours. What I am looking
> for
> > is
> > the ability to invoke job from within a program and not the jar command
> > thing.
> >
> > -Mac
> >
>

Re: running hadoop jobs from within a program

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Mac,

You should take a look at Oozie; it will allow you to do what you describe.

You can either build Oozie from https://github.com/yahoo/oozie or
download the CDH3b3 distribution from http://www.cloudera.com/downloads/
(Oozie is preconfigured to work with CDH3b3 Hadoop).

Hope this helps.

Alejandro

Re: running hadoop jobs from within a program

Posted by daniel sikar <ds...@gmail.com>.

I suggest you write a loop in your bash script, grepping for finished,
then take it from there.
Also, you can submit the same job as many times as you like.
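
The same loop can also be written in Java rather than bash. What follows
is only a sketch under stated assumptions: JobCycle is a hypothetical
class, and each job is represented by a plain Callable<Boolean> that
blocks until the job finishes and returns true on success. No Hadoop
classes are used, so how a Callable actually launches a job is left to
the caller.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class JobCycle {
    private final List<Callable<Boolean>> jobs;
    private final Runnable cleanup;

    public JobCycle(List<Callable<Boolean>> jobs, Runnable cleanup) {
        this.jobs = jobs;
        this.cleanup = cleanup;
    }

    /** Runs every job in order, then the cleanup; returns how many jobs succeeded. */
    public int runOnce() {
        int succeeded = 0;
        for (Callable<Boolean> job : jobs) {
            try {
                if (job.call()) {
                    succeeded++;
                }
            } catch (Exception e) {
                // A failing job is logged and the remaining jobs still run.
                System.err.println("job failed: " + e);
            }
        }
        cleanup.run();
        return succeeded;
    }

    /** Repeats runOnce(), waiting `delay` between the end of one cycle and the next start. */
    public ScheduledExecutorService start(long delay, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            try {
                runOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("cycle failed: " + e);
            }
        }, 0, delay, unit);
        return timer;
    }
}
```

Calling start(x, TimeUnit.HOURS) gives the "restart after x hours"
behaviour, and shutdown() on the returned executor stops it. In a real
Hadoop program each Callable might wrap something like
JobClient.runJob(conf).isSuccessful() from the old
org.apache.hadoop.mapred API, but check that against the version you run.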

Re: running hadoop jobs from within a program

Posted by web service <wb...@gmail.com>.

Would submitting, say, 3 jobs from a JobClient be different from
invoking the command below 3 times?

On Thu, Nov 11, 2010 at 7:17 PM, web service <wb...@gmail.com> wrote:

> Hi,
>   Currently I run my sample hadoop job from a bash script using the
> following command ...
>
> [code]
> tmp="$HADOOP_BIN jar $JAR_LOC  $MAIN_CLASS /user/joe/input/input-$i/
> /user/vadmin/output/output-$i/
> $tmp
> [/code]
>
> However, I would want to write a timer that would do some cleanup after the
> jobs are  complete and restart the jobs after x hours. What I am looking for
> is
> the ability to invoke job from within a program and not the jar command
> thing.
>
> -Mac
>