Posted to user@oozie.apache.org by Matt Goeke <go...@gmail.com> on 2012/11/06 21:36:51 UTC

Oozie launcher job pool differentiation

All,

I sent an email about this a while ago (deadlock in Oozie due to launcher
oversubscription) and we were able to avoid the situation temporarily by
staggering our coordinators in groups. We are now at a point where the
overhead of staggering the pools / the cost of maintaining that scheduling
structure is too high. I know I could avoid this situation if we had a
larger mapper pool, but this is not possible at the moment with the
available hardware.

After finding a blog post that describes submitting the launcher jobs to a
separate queue (
http://downright-amazed.blogspot.com/2012/02/configure-oozies-launcher-job.html)
I became curious whether this could alleviate our problems even though we
are using user-based pools in the fair scheduler.

Does anyone have any experience with this or know if this will work? What
is the practical difference in specifying a queue for Oozie when I am
already being directed to a pool?

--
Matt

Re: Oozie launcher job pool differentiation

Posted by Matt Goeke <go...@gmail.com>.

Quick follow-up: can this property be defined globally (e.g. in
oozie-site.xml)? I cannot get it to be picked up unless I specify it
within the properties of each action, which seems rather unwieldy.
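
If the property does have to live in each action, one way to make that
less unwieldy is to script the edit. The following is a minimal Python
sketch, not something Oozie ships: the function name add_launcher_pool and
the pool name passed to it are hypothetical, and it only touches actions
that already carry a <configuration> block. Only the property name comes
from this thread.

```python
import xml.etree.ElementTree as ET

# Property name quoted in this thread; everything else here is illustrative.
LAUNCHER_POOL_PROPERTY = "oozie.launcher.mapred.fairscheduler.pool"


def _namespace(tag: str) -> str:
    """Return the '{uri}' prefix of an ElementTree tag, or '' if it has none."""
    return tag[: tag.index("}") + 1] if tag.startswith("{") else ""


def add_launcher_pool(workflow_xml: str, pool_name: str) -> str:
    """Add the launcher pool property to every <configuration> block lacking it."""
    root = ET.fromstring(workflow_xml)
    root_ns = _namespace(root.tag)
    if root_ns:
        # Keep the workflow's own namespace as the default one when serializing.
        ET.register_namespace("", root_ns[1:-1])

    # Materialize the element list first, because the loop adds children.
    for element in list(root.iter()):
        ns = _namespace(element.tag)
        if element.tag[len(ns):] != "configuration":
            continue
        existing = {p.findtext(f"{ns}name") for p in element.findall(f"{ns}property")}
        if LAUNCHER_POOL_PROPERTY in existing:
            continue  # already set for this action; leave its value alone
        prop = ET.SubElement(element, f"{ns}property")
        ET.SubElement(prop, f"{ns}name").text = LAUNCHER_POOL_PROPERTY
        ET.SubElement(prop, f"{ns}value").text = pool_name

    return ET.tostring(root, encoding="unicode")
```

Running the function twice on the same workflow leaves a single copy of
the property, so it should be safe to apply as part of a deployment step.
Actions without a <configuration> element are left untouched, because
where that element may appear depends on the action's schema.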

--
Matt


On Wed, Nov 7, 2012 at 10:28 AM, Matt Goeke <go...@gmail.com> wrote:

> Harsh,
>
> That was a perfect explanation and a confirmation to some tests I just
> finished this morning :)
>
> Thank you again.
>
> --
> Matt
>
>
> On Wed, Nov 7, 2012 at 10:18 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> In a FairScheduler environment, especially where max-running-job
>> limits are configured, it is recommended to override the Oozie
>> launcher job's pool to be different than the actual required working
>> pool (for actions that launch other MR jobs).
>>
>> If your scheduler is configured to pick ${user.name} up automatically,
>> then your Oozie launcher config must use the super-override pool name
>> config:
>>
>> oozie.launcher.mapred.fairscheduler.pool=launcherpoolname
>>
>> Your target pool for launchers can still carry limitations, but it
>> should no longer deadlock your actual MR execution (after which the
>> launcher dies away anyway).
>>
>> Does this help, Matt?
>>
>> On Wed, Nov 7, 2012 at 2:06 AM, Matt Goeke <go...@gmail.com> wrote:
>> > All,
>> >
>> > I sent an email about this a while ago (deadlock in oozie due to launcher
>> > over subscription) and we were able to avoid the situation temporarily by
>> > staggering our coordinators in groups. We are now at a point where the
>> > overhead of staggering the pools / the cost of maintaining that scheduling
>> > structure is too high. I know I could avoid this situation if we had a
>> > larger mapper pool but this is not possible at the moment with the
>> > available hardware.
>> >
>> > After finding a blog post that references submitting the launcher jobs to a
>> > separate queue (
>> > http://downright-amazed.blogspot.com/2012/02/configure-oozies-launcher-job.html)
>> > I became curious if this could alleviate our problems even if we are using
>> > user based pools in the fair scheduler.
>> >
>> > Does anyone have any experience with this or know if this will work? What
>> > is the practical differentiation of specifying a queue for Oozie when I am
>> > being directed to a pool already?
>> >
>> > --
>> > Matt
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Oozie launcher job pool differentiation

Posted by Matt Goeke <go...@gmail.com>.

Harsh,

That was a perfect explanation and confirmation of some tests I just
finished this morning :)

Thank you again.

--
Matt



Re: Oozie launcher job pool differentiation

Posted by Matt Goeke <go...@gmail.com>.

I am sorry to keep reviving this issue, but even after rolling out this fix
and confirming that launchers and actions are routed to separate pools
(verified on the fairscheduler page) we are still able to deadlock Oozie
after a set number of jobs are submitted. As soon as I do a 'hadoop
job -kill <workflow-id>' on a set number of the active workflow ids,
everything just starts working again as if there were no issues. I am now
starting to wonder if this issue is more on the side of the fair scheduler
/ jobtracker than Oozie, but overall I am running out of ideas.

We are currently running about 32 pools in our fairscheduler config and the
general statistics are below:
- Our total capacity is roughly 250+ mappers and 100+ reducers
- Most pools have the default weight, 2 min mappers and 84 max mappers
- The Oozie action pool has a weight of 4, 100 min mappers, 200 max mappers
and 200 max concurrent jobs
- The Oozie launcher pool has a weight of 2, 100 min mappers, 200 max
mappers and 200 max concurrent jobs
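
The figures above can be sanity-checked with a little arithmetic. This is
only a sketch under stated assumptions: "roughly 250+" mappers is taken as
exactly 250, "most pools" is taken to mean all 30 pools other than the two
Oozie pools, and each launcher is assumed to hold one map slot while it
waits for its action. The function name pool_budget is made up for the
example.

```python
def pool_budget(total_map_slots, default_pools, default_min,
                action_min, launcher_min, launcher_max):
    """Summarize how the configured pool limits compare with cluster capacity."""
    sum_of_mins = default_pools * default_min + action_min + launcher_min
    return {
        # Map slots promised as minimum shares across all pools.
        "sum_of_min_shares": sum_of_mins,
        # How far the promised minimums exceed what the cluster has.
        "min_share_overcommit": max(0, sum_of_mins - total_map_slots),
        # Map slots left for every other pool if launchers reach their cap.
        "slots_left_if_launchers_at_max": max(0, total_map_slots - launcher_max),
    }


budget = pool_budget(
    total_map_slots=250,  # assumption: "roughly 250+" taken as exactly 250
    default_pools=30,     # assumption: all 32 pools except the two Oozie pools
    default_min=2,
    action_min=100,
    launcher_min=100,
    launcher_max=200,
)
print(budget)
```

Under these assumptions the minimum shares add up to 260 map slots against
a capacity of 250, and a launcher pool running at its 200-mapper cap would
leave 50 map slots for the action pool and every other pool combined.
Whether that is what stalls the workflows here is not established by the
numbers alone, but it may be worth checking.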

Does anyone see any issues with this setup? Is there any reason why, given
this config, neither of those pools can hit the total cap specified?

Thank you again for any suggestions, and as always if you guys want any more
detailed information (logs, workflow descriptions, etc.) I am more than
happy to provide it.

--
Matt

Re: Oozie launcher job pool differentiation

Posted by Harsh J <ha...@cloudera.com>.

In a FairScheduler environment, especially where max-running-job
limits are configured, it is recommended to override the Oozie
launcher job's pool to be different than the actual required working
pool (for actions that launch other MR jobs).

If your scheduler is configured to pick ${user.name} up automatically,
then your Oozie launcher config must use the super-override pool name
config:

oozie.launcher.mapred.fairscheduler.pool=launcherpoolname

Your target pool for launchers can still carry limitations, but it
should no longer deadlock your actual MR execution (after which the
launcher dies away anyway).
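
To make the reasoning above concrete, here is a toy Python model of pools
with a max-running-jobs limit. It is a deliberately simplified sketch, not
how the FairScheduler is implemented: each workflow is one launcher job
that submits one child job and keeps running until that child finishes,
jobs are admitted in rounds, and the function name run_workflows and the
pool names are invented for the example.

```python
def run_workflows(n_workflows, max_running, launcher_pool, action_pool):
    """Return how many workflows finish before the model stops making progress."""
    running = {pool: 0 for pool in max_running}
    waiting_launchers = n_workflows
    waiting_children = 0  # child jobs submitted but not yet admitted
    finished = 0

    while True:
        progressed = False

        # Admit launchers while the launcher pool is under its job limit.
        while waiting_launchers and running[launcher_pool] < max_running[launcher_pool]:
            waiting_launchers -= 1
            running[launcher_pool] += 1
            waiting_children += 1  # a running launcher submits its child job
            progressed = True

        # Admit child jobs while the action pool is under its job limit.
        admitted = 0
        while waiting_children and running[action_pool] < max_running[action_pool]:
            waiting_children -= 1
            running[action_pool] += 1
            admitted += 1
            progressed = True

        # An admitted child runs to completion; it and its launcher then exit.
        running[action_pool] -= admitted
        running[launcher_pool] -= admitted
        finished += admitted

        if not progressed:
            return finished


# Launchers and children share one pool limited to 3 running jobs.
shared = run_workflows(5, {"etl": 3}, "etl", "etl")

# Same load, but launchers have their own pool.
split = run_workflows(5, {"launchers": 3, "etl": 3}, "launchers", "etl")

print(shared, split)
```

In the shared-pool case the launchers fill the pool's job limit, so no
child job is ever admitted and nothing finishes. With a separate launcher
pool, the launcher limit only caps how many workflows are in flight at
once, and the child jobs still get admitted.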

Does this help, Matt?




-- 
Harsh J