Posted to hdfs-user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2012/12/14 05:39:05 UTC

How to submit Tool jobs programmatically in parallel?

I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
in parallel.

I'd like to run an s3distcp job in parallel as well, but the interface to
that job is a Tool, e.g. ToolRunner.run(...).

ToolRunner blocks until the job completes, though, so presumably I'd need to
create a thread pool to run these jobs in parallel.

But creating multiple threads to submit concurrent jobs via ToolRunner, each
blocking on its job's completion, just feels improper. Is there an
alternative?
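The thread pool anticipated above can be sketched with a plain java.util.concurrent ExecutorService. This is a minimal, hypothetical sketch and is not s3distcp-specific: the body of toolTask is a placeholder where the blocking ToolRunner.run(...) call would go (the S3DistCp class name in the comment is an assumption about the EMR tool's entry point).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelToolRunner {

    // One task per blocking Tool invocation.  In real code the body would be
    // something like:
    //   return ToolRunner.run(new Configuration(), new S3DistCp(), args);
    // (hypothetical -- the exact Tool class depends on which distcp you use).
    static Callable<Integer> toolTask(final String[] args) {
        return new Callable<Integer>() {
            public Integer call() {
                return 0;  // placeholder exit code for the blocking run
            }
        };
    }

    // Submit every job at once, then block until all have finished,
    // returning the worst (highest) exit code seen.
    public static int runAll(List<String[]> argLists) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, argLists.size()));
        try {
            List<Future<Integer>> results = new ArrayList<Future<Integer>>();
            for (String[] args : argLists) {
                results.add(pool.submit(toolTask(args)));
            }
            int worstExit = 0;
            for (Future<Integer> f : results) {
                worstExit = Math.max(worstExit, f.get()); // waits for that job
            }
            return worstExit;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real ToolRunner, each call() would return ToolRunner.run(...) on its own Tool instance; a Tool instance shouldn't be shared across threads, since run() mutates its configuration.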


Re: How to submit Tool jobs programmatically in parallel?

Posted by George Datskos <ge...@jp.fujitsu.com>.
Dave,

DistCp needs to be blocking (it intentionally uses runJob instead of the
asynchronous submitJob).  After the job completes it needs to "finalize"
permissions and other attributes (see the tools.DistCp.finalize method).

If you need to run multiple DistCp jobs in parallel, I'd go with your
initial suggestion of using a thread pool.



George


> Can I do that with s3distcp / distcp?  The job is being configured in 
> the run() method of s3distcp (as it implements Tool).  So I think I 
> can't use this approach. I use this for the jobs I control of course, 
> but the problem is things like distcp where I don't control the 
> configuration.
>
> Dave
>
> *From:* Manoj Babu [mailto:manoj444@gmail.com]
> *Sent:* Friday, December 14, 2012 12:57 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: How to submit Tool jobs programmatically in parallel?
>
> David,
>
> You can try submitJob() instead of runJob(), like below.
>
> JobClient jc = new JobClient(job);
>
> jc.submitJob(job);
>
>
> Cheers!
>
> Manoj.
>
>
>
> On Fri, Dec 14, 2012 at 10:09 AM, David Parks <davidparks21@yahoo.com 
> <ma...@yahoo.com>> wrote:
>
> I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
> in parallel.
>
> I'd like to run an s3distcp job in parallel as well, but the interface to
> that job is a Tool, e.g. ToolRunner.run(...).
>
> ToolRunner blocks until the job completes though, so presumably I'd 
> need to
> create a thread pool to run these jobs in parallel.
>
> But creating multiple threads to submit concurrent jobs via ToolRunner,
> blocking on the jobs completion, just feels improper. Is there an
> alternative?
>


Re: How to submit Tool jobs programmatically in parallel?

Posted by Manoj Babu <ma...@gmail.com>.
Can you show some sample code of submitting a distcp job?

Cheers!
Manoj.



On Fri, Dec 14, 2012 at 11:44 AM, David Parks <da...@yahoo.com> wrote:

> Can I do that with s3distcp / distcp?  The job is being configured in the
> run() method of s3distcp (as it implements Tool).  So I think I can't use
> this approach. I use this for the jobs I control of course, but the problem
> is things like distcp where I don't control the configuration.
>
> Dave
>
> *From:* Manoj Babu [mailto:manoj444@gmail.com]
> *Sent:* Friday, December 14, 2012 12:57 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: How to submit Tool jobs programmatically in parallel?
>
> David,
>
> You can try submitJob() instead of runJob(), like below.
>
> JobClient jc = new JobClient(job);
>
> jc.submitJob(job);
>
> Cheers!
>
> Manoj.
>
> On Fri, Dec 14, 2012 at 10:09 AM, David Parks <da...@yahoo.com>
> wrote:
>
> I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
> in parallel.
>
> I'd like to run an s3distcp job in parallel as well, but the interface to
> that job is a Tool, e.g. ToolRunner.run(...).
>
> ToolRunner blocks until the job completes though, so presumably I'd need to
> create a thread pool to run these jobs in parallel.
>
> But creating multiple threads to submit concurrent jobs via ToolRunner,
> blocking on the jobs completion, just feels improper. Is there an
> alternative?
>
RE: How to submit Tool jobs programmatically in parallel?

Posted by David Parks <da...@yahoo.com>.
Can I do that with s3distcp / distcp?  The job is being configured in the
run() method of s3distcp (as it implements Tool).  So I think I can't use
this approach. I use this for the jobs I control of course, but the problem
is things like distcp where I don't control the configuration.

Dave

From: Manoj Babu [mailto:manoj444@gmail.com]
Sent: Friday, December 14, 2012 12:57 PM
To: user@hadoop.apache.org
Subject: Re: How to submit Tool jobs programmatically in parallel?

David,

You can try submitJob() instead of runJob(), like below.

JobClient jc = new JobClient(job);

jc.submitJob(job);

Cheers!

Manoj.

On Fri, Dec 14, 2012 at 10:09 AM, David Parks <da...@yahoo.com>
wrote:

I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
in parallel.

I'd like to run an s3distcp job in parallel as well, but the interface to
that job is a Tool, e.g. ToolRunner.run(...).

ToolRunner blocks until the job completes though, so presumably I'd need to
create a thread pool to run these jobs in parallel.

But creating multiple threads to submit concurrent jobs via ToolRunner,
blocking on the jobs completion, just feels improper. Is there an
alternative?


Re: How to submit Tool jobs programmatically in parallel?

Posted by Manoj Babu <ma...@gmail.com>.
David,

You can try submitJob() instead of runJob(), like below.

JobClient jc = new JobClient(job);
jc.submitJob(job);



Cheers!
Manoj.



On Fri, Dec 14, 2012 at 10:09 AM, David Parks <da...@yahoo.com> wrote:

> I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
> in parallel.
>
> I'd like to run an s3distcp job in parallel as well, but the interface to
> that job is a Tool, e.g. ToolRunner.run(...).
>
> ToolRunner blocks until the job completes though, so presumably I'd need to
> create a thread pool to run these jobs in parallel.
>
> But creating multiple threads to submit concurrent jobs via ToolRunner,
> blocking on the jobs completion, just feels improper. Is there an
> alternative?
>
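The submit-then-poll pattern that submitJob() enables can be sketched as follows: the old-API JobClient.submitJob(conf) returns immediately with a RunningJob handle that the caller polls. To keep the sketch self-contained, the JobHandle interface below is a hypothetical stand-in for org.apache.hadoop.mapred.RunningJob, mirroring its real isComplete()/isSuccessful() methods.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class AsyncSubmitExample {

    // Stand-in for Hadoop's org.apache.hadoop.mapred.RunningJob, the handle
    // that JobClient.submitJob(conf) returns immediately after submission.
    public interface JobHandle {
        boolean isComplete() throws Exception;
        boolean isSuccessful() throws Exception;
    }

    // Poll a set of already-submitted jobs until every one completes,
    // returning false if any job failed.
    public static boolean waitForAll(List<JobHandle> jobs, long pollMillis)
            throws Exception {
        boolean allOk = true;
        List<JobHandle> pending = new ArrayList<JobHandle>(jobs);
        while (!pending.isEmpty()) {
            Iterator<JobHandle> it = pending.iterator();
            while (it.hasNext()) {
                JobHandle j = it.next();
                if (j.isComplete()) {        // non-blocking status check
                    allOk &= j.isSuccessful();
                    it.remove();
                }
            }
            if (!pending.isEmpty()) {
                Thread.sleep(pollMillis);    // back off between polls
            }
        }
        return allOk;
    }
}
```

Note this only helps for jobs you configure yourself; as discussed elsewhere in the thread, a Tool like s3distcp builds its job inside run(), so there is no JobConf to hand to submitJob().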
