Posted to user@oozie.apache.org by Matt Goeke <go...@gmail.com> on 2012/10/10 02:00:44 UTC

Oozie jobs get stuck in RUNNING state indefinitely

All,

We have a nightly ETL process with 80+ workflows associated with it, all
staged through coordinators. Right now I have to throttle the start-up of
groups of workflows across an 8-hour period, but I would love to just let
them all run at the same time. The issue I run into is that up to a certain
number of running workflows all of the node transitions work perfectly, but
as soon as I cross a threshold of X running jobs (I haven't had time to pin
down the exact number yet, but it is around 10-15) it is almost as if I hit
a resource deadlock within Oozie. Essentially all of the node transitions
freeze, even something as simple as a conditional block; all of the MR jobs
associated with the workflows sit at either 100% or 0% (in the 0% case a
job has been staged but its map task has no log associated with it); and
new jobs can be staged but no transitions occur. It can sit in this state
indefinitely until I kill off some of the workflows, and once I get back
under that magic threshold everything starts up again and transitions occur
as if nothing ever happened.

In an effort to figure out what is going on, I have put it into this state
and looked at many different things:
1) external resource contention
2) resource contention on the box Oozie runs on (including DB connections
to the MySQL instance that houses the Oozie schema)
3) JMX data from the Oozie server
4) JobTracker/FairScheduler pool properties
5) log output found in /var/log/oozie/
None of these indicate anything deadlocked or any resource being capped.

I am at the point where the next steps would be source diving / turning on
debug logging in Tomcat and setting remote breakpoints, but I would love to
hear if anyone has ideas on tweaks I can try first. From the logs it seems
as if threads are still actively checking whether jobs have completed (JMX
stack traces seem to indicate the same thing), so it would almost seem as
if there is some live-lock being hit where job callbacks are not able to be
processed.

We have a workaround at the moment, but since this is clearly a virtual
limitation and not some external resource limitation, I would love to know
if it can be fixed. Logs, stack traces, oozie-site properties, and pretty
much anything else can be provided if needed to help iron out what is
going on.
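
For reference, the per-coordinator throttling I'm doing today uses the
standard coordinator controls, roughly like the sketch below (app name,
dates, and paths are illustrative, not our real config):

```xml
<coordinator-app name="etl-workflow-01" frequency="${coord:days(1)}"
                 start="2012-10-01T02:00Z" end="2013-10-01T02:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- how many materialized actions may run at once -->
    <concurrency>1</concurrency>
    <!-- cap how many actions may sit in WAITING/READY -->
    <throttle>1</throttle>
  </controls>
  <action>
    <workflow>
      <app-path>${nameNode}/user/etl/apps/workflow-01</app-path>
    </workflow>
  </action>
</coordinator-app>
```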

--
Matt

Re: Oozie jobs get stuck in RUNNING state indefinitely

Posted by Matt Goeke <go...@gmail.com>.
Alejandro and Mona,

Thank you for the quick response! One thing I did do in the course of my
testing was to kick off all of the jobs and then manually move half of them
to a separate pool through the FairScheduler page, to see if it was a pool
resource conflict. The initial pool didn't fill back up to its user cap
(which would have allowed more jobs to be kicked off in it) after I did
that, but I am still happy to try bumping the cap to see if I can raise
that threshold.

I'll let you know the result after I test that change.

--
Matt

On Tue, Oct 9, 2012 at 7:37 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> It looks like the cluster/queue capacity is being exceeded.
>
> Adding to Mona's answer, you could configure oozie launcher jobs to
> run in their own scheduler queue, thus not competing with regular jobs
> for slots.
>
> Thx

Re: Oozie jobs get stuck in RUNNING state indefinitely

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
It looks like the cluster/queue capacity is being exceeded.

Adding to Mona's answer, you could configure the Oozie launcher jobs to
run in their own scheduler queue, so they do not compete with regular
jobs for slots.
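
Something along these lines in each action's configuration (the queue
name here is illustrative, and the queue has to exist in your scheduler
config):

```xml
<action name="etl-step">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- send the Oozie launcher map task to its own queue -->
      <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>launcher-queue</value>
      </property>
      <!-- the actual MR job stays in the regular queue -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
    </configuration>
  </map-reduce>
  <ok to="next"/>
  <error to="fail"/>
</action>
```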

Thx

On Tue, Oct 9, 2012 at 5:26 PM, Mona Chitnis <ch...@yahoo-inc.com> wrote:
> Matt,
>
> When an Oozie job starts, the Launcher which is a map-only job, occupies a
> map slot. Now, the jobtracker/resource-manager does a calculation of
> maximum allowable M-R slots per user, which is a function of your cluster
> size. We had come across this scenario once with a small 7-node cluster
> where Launcher map-tasks themselves filled up the resource slot quota and
> caused deadlock. I believe this issue looks similar. Can you try bumping
> up the resource availability in the Capacity-scheduler.xml?
> --
> Mona Chitnis



-- 
Alejandro

Re: Oozie jobs get stuck in RUNNING state indefinitely

Posted by Mona Chitnis <ch...@yahoo-inc.com>.
Matt,

When an Oozie job starts, the launcher, which is a map-only job, occupies
a map slot. The JobTracker/ResourceManager calculates the maximum
allowable M-R slots per user, which is a function of your cluster size.
We came across this scenario once with a small 7-node cluster where the
launcher map tasks themselves filled up the resource slot quota and
caused a deadlock. This issue looks similar. Can you try bumping up the
resource availability in capacity-scheduler.xml?
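
To make the failure mode concrete, here is a toy back-of-the-envelope
model (the numbers are illustrative, not from your cluster): each running
workflow pins one launcher map slot, and if the launchers alone reach the
per-user map-slot cap, no child MR job can ever get a slot, so nothing
finishes and nothing is released.

```python
def launchers_deadlock(running_workflows, per_user_map_slot_cap):
    """Toy model: each running workflow pins one map-only launcher task.

    If the launchers alone consume the whole per-user map-slot quota,
    the child MR jobs they are waiting on can never get a slot, so
    nothing completes and no slot is ever freed: a deadlock.
    """
    launcher_slots = running_workflows  # one launcher per workflow
    free_slots_for_real_work = per_user_map_slot_cap - launcher_slots
    return free_slots_for_real_work <= 0

# With a cap of ~12 map slots per user, 10-15 concurrent workflows is
# exactly the range where the launchers eat the whole quota.
print(launchers_deadlock(8, 12))   # below the threshold: work proceeds
print(launchers_deadlock(14, 12))  # above it: everything freezes
```

This would also explain why killing a few workflows un-sticks everything:
it frees launcher slots, the child jobs drain, and transitions resume.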
--
Mona Chitnis



