You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@oozie.apache.org by Chris White <ch...@gmail.com> on 2012/11/16 01:21:29 UTC

More Oozie 'deadlocking'

Similar to Andy Kartashov's recent posts on the hadoop user mailing list
('Oozie apparent concurrency deadlocking'), i am experiencing something
similar with my Oozie instance (CDH3u1). Recently the oozie server has
become very slow to respond (2+ mins) to job listings, job info requests
etc. Periods between actions completing and the next action executing has
also increased to 10s of minutes too (but requests for instrumentation /
config / queuedumps respond immediately).

Some background about my current state: I have around 20 coordinator jobs
all running, most of which are scheduled to run at 00:30, and operate on
similar input patterns (all data for the previous day). We had a data flow
outage recently and the input data (+parent folders) that the coord
instances depend on were not present for some 10 days, causing most of the
jobs to timeout (timeout for all jobs is 1 hour).

Examining the logs i see a number of things:

For months, my logs have been littered with exceptions similar to this:

CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
ACTION exception callable [coord_action_ready], E0607: Other error in
operation [java.lang.NullPointerException], {1}
org.apache.oozie.command.CommandException: E0607: Other error in operation
[java.lang.NullPointerException], {1}
  at o.a.o.command.Command.call(Command.java:273)
  at
o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
  at
o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
Caused by: java.lang.NullPointerException
  at o.a.o.command.Command.setLogInfo(Command.java:427)
  at
o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
  at o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
  at o.a.o.c.Command.call(Command.java:202)

More recently, i see endless listings of the following log entry (20/30 a
second):

WARN CallableQueueService:528 max concurrent for callable
[coord_action_input] exceeded, requeing with [500]ms delay

If i perform a queuedump (oozie admin -queuedump), i see ~150
coord_action_input entries with delays between -1 and 500, occasionally
into the 10,000's. Querying this service every 5 minutes, i appear to be
gaining a few new entries every couple of minutes (i've set up a script to
log the count every 5 mins to see what happens overnight)

I have tried to retrospectively fix up the missing paths in HDFS, hoping
that might resolve the queue, and i've also gone in and forcibly set the
status of all the coordinator jobs in the COORD_JOBS mysql table to
SUSPENDED. The server comes and goes, everything works for a while after a
restart but then it gets back into the 'non-responsive' state for anything
job related.

Short of truncating my COORD_* tables and starting from scratch, has anyone
got any insight as to what might be happening? Or what i can do to diagnose
further

Thanks

Chris

Re: More Oozie 'deadlocking'

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Chris, Oozie2 has some issues with coordinator jobs when there is
backlog, repetitive internal commands fill up the processing queue. In
Oozie3 this has been address by adding an ID to the commands and
deduping them in the processing queue (if already in, ignore the new
queueing). Because of significant internal changes from Oozie2 to
Oozie3 is not possible to backport this fix without lot of work.

In your case, if you already have all your backlog data available, and
you have enough cluster capacity to process it, restarting the oozie
server wouldd do the trick. Worst case you could kill the coordinator
job and manually purge its entries from the DB.

If you can move to Oozie3, this issue you are seeing would go away.

Thx

Alejandro
PS: if you have questions specific to CDH releases, please use the
cdh-user@cloudera.org alias.

On Thu, Nov 15, 2012 at 6:31 PM, Chris White <ch...@gmail.com> wrote:
> Mohammed,
>
> CDH3u1, which i think relates to something based around 2.3.0 (Jul 2011). I
> tried to run CDH3u4 version of the oozie tarball, but that has some
> incompatibilities with the hadoop core (0.20.2, again CDH3u1). Are there
> any JIRA ticket numbers that you can think of so i can try and manually
> patch (assuming they are 'simple' fixes).
>
> Thanks
>
> Chris
>
>
> On Thu, Nov 15, 2012 at 8:20 PM, Mohammad Islam <mi...@yahoo.com> wrote:
>
>> Hi Chris,
>>
>> Most of them are know issues. Those are mostly taken care of in recent
>> releases.
>>
>> What is your Oozie version? I'm sure it is something released more than a
>> year back (by looking into the log output). I would recommend to use the
>> latest 3.2 Oozie.
>>
>> Also there is hadoop bug regarding the callback to oozie to report the
>> hadoop job's status. Most of the case, that is the reason of 10 minutes
>> delay. You can minimize it by reducing the property value to 5 minutes
>> {say) from 10 minutes in oozie-site.xml.
>>
>>
>> <property>
>>         <name>oozie.service.ActionCheckerService.action.check.delay</name>
>>         <value>300</value>
>>         <description>
>>             The time, in seconds, between an ActionCheck for the same
>> action.
>>         </description>
>>     </property>
>>
>> Regards,
>> Mohammad
>>
>>
>> ________________________________
>>  From: Chris White <ch...@gmail.com>
>> To: user@oozie.apache.org
>> Sent: Thursday, November 15, 2012 4:21 PM
>> Subject: More Oozie 'deadlocking'
>>
>> Similar to Andy Kartashov's recent posts on the hadoop user mailing list
>> ('Oozie apparent concurrency deadlocking'), i am experiencing something
>> similar with my Oozie instance (CDH3u1). Recently the oozie server has
>> become very slow to respond (2+ mins) to job listings, job info requests
>> etc. Periods between actions completing and the next action executing has
>> also increased to 10s of minutes too (but requests for instrumentation /
>> config / queuedumps respond immediately).
>>
>> Some background about my current state: I have around 20 coordinator jobs
>> all running, most of which are scheduled to run at 00:30, and operate on
>> similar input patterns (all data for the previous day). We had a data flow
>> outage recently and the input data (+parent folders) that the coord
>> instances depend on were not present for some 10 days, causing most of the
>> jobs to timeout (timeout for all jobs is 1 hour).
>>
>> Examining the logs i see a number of things:
>>
>> For months, my logs have been littered with exceptions similar to this:
>>
>> CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
>> ACTION exception callable [coord_action_ready], E0607: Other error in
>> operation [java.lang.NullPointerException], {1}
>> org.apache.oozie.command.CommandException: E0607: Other error in operation
>> [java.lang.NullPointerException], {1}
>>   at o.a.o.command.Command.call(Command.java:273)
>>   at
>>
>> o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
>>   at
>>
>> o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
>> Caused by: java.lang.NullPointerException
>>   at o.a.o.command.Command.setLogInfo(Command.java:427)
>>   at
>> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
>>   at
>> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
>>   at o.a.o.c.Command.call(Command.java:202)
>>
>> More recently, i see endless listings of the following log entry (20/30 a
>> second):
>>
>> WARN CallableQueueService:528 max concurrent for callable
>> [coord_action_input] exceeded, requeing with [500]ms delay
>>
>> If i perform a queuedump (oozie admin -queuedump), i see ~150
>> coord_action_input entries with delays between -1 and 500, occasionally
>> into the 10,000's. Querying this service every 5 minutes, i appear to be
>> gaining a few new entries every couple of minutes (i've set up a script to
>> log the count every 5 mins to see what happens overnight)
>>
>> I have tried to retrospectively fix up the missing paths in HDFS, hoping
>> that might resolve the queue, and i've also gone in and forcibly set the
>> status of all the coordinator jobs in the COORD_JOBS mysql table to
>> SUSPENDED. The server comes and goes, everything works for a while after a
>> restart but then it gets back into the 'non-responsive' state for anything
>> job related.
>>
>> Short of truncating my COORD_* tables and starting from scratch, has anyone
>> got any insight as to what might be happening? Or what i can do to diagnose
>> further
>>
>> Thanks
>>
>> Chris
>>



-- 
Alejandro

Re: More Oozie 'deadlocking'

Posted by Chris White <ch...@gmail.com>.
Mohammed,

CDH3u1, which i think relates to something based around 2.3.0 (Jul 2011). I
tried to run CDH3u4 version of the oozie tarball, but that has some
incompatibilities with the hadoop core (0.20.2, again CDH3u1). Are there
any JIRA ticket numbers that you can think of so i can try and manually
patch (assuming they are 'simple' fixes).

Thanks

Chris


On Thu, Nov 15, 2012 at 8:20 PM, Mohammad Islam <mi...@yahoo.com> wrote:

> Hi Chris,
>
> Most of them are know issues. Those are mostly taken care of in recent
> releases.
>
> What is your Oozie version? I'm sure it is something released more than a
> year back (by looking into the log output). I would recommend to use the
> latest 3.2 Oozie.
>
> Also there is hadoop bug regarding the callback to oozie to report the
> hadoop job's status. Most of the case, that is the reason of 10 minutes
> delay. You can minimize it by reducing the property value to 5 minutes
> {say) from 10 minutes in oozie-site.xml.
>
>
> <property>
>         <name>oozie.service.ActionCheckerService.action.check.delay</name>
>         <value>300</value>
>         <description>
>             The time, in seconds, between an ActionCheck for the same
> action.
>         </description>
>     </property>
>
> Regards,
> Mohammad
>
>
> ________________________________
>  From: Chris White <ch...@gmail.com>
> To: user@oozie.apache.org
> Sent: Thursday, November 15, 2012 4:21 PM
> Subject: More Oozie 'deadlocking'
>
> Similar to Andy Kartashov's recent posts on the hadoop user mailing list
> ('Oozie apparent concurrency deadlocking'), i am experiencing something
> similar with my Oozie instance (CDH3u1). Recently the oozie server has
> become very slow to respond (2+ mins) to job listings, job info requests
> etc. Periods between actions completing and the next action executing has
> also increased to 10s of minutes too (but requests for instrumentation /
> config / queuedumps respond immediately).
>
> Some background about my current state: I have around 20 coordinator jobs
> all running, most of which are scheduled to run at 00:30, and operate on
> similar input patterns (all data for the previous day). We had a data flow
> outage recently and the input data (+parent folders) that the coord
> instances depend on were not present for some 10 days, causing most of the
> jobs to timeout (timeout for all jobs is 1 hour).
>
> Examining the logs i see a number of things:
>
> For months, my logs have been littered with exceptions similar to this:
>
> CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
> ACTION exception callable [coord_action_ready], E0607: Other error in
> operation [java.lang.NullPointerException], {1}
> org.apache.oozie.command.CommandException: E0607: Other error in operation
> [java.lang.NullPointerException], {1}
>   at o.a.o.command.Command.call(Command.java:273)
>   at
>
> o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
>   at
>
> o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
> Caused by: java.lang.NullPointerException
>   at o.a.o.command.Command.setLogInfo(Command.java:427)
>   at
> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
>   at
> o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
>   at o.a.o.c.Command.call(Command.java:202)
>
> More recently, i see endless listings of the following log entry (20/30 a
> second):
>
> WARN CallableQueueService:528 max concurrent for callable
> [coord_action_input] exceeded, requeing with [500]ms delay
>
> If i perform a queuedump (oozie admin -queuedump), i see ~150
> coord_action_input entries with delays between -1 and 500, occasionally
> into the 10,000's. Querying this service every 5 minutes, i appear to be
> gaining a few new entries every couple of minutes (i've set up a script to
> log the count every 5 mins to see what happens overnight)
>
> I have tried to retrospectively fix up the missing paths in HDFS, hoping
> that might resolve the queue, and i've also gone in and forcibly set the
> status of all the coordinator jobs in the COORD_JOBS mysql table to
> SUSPENDED. The server comes and goes, everything works for a while after a
> restart but then it gets back into the 'non-responsive' state for anything
> job related.
>
> Short of truncating my COORD_* tables and starting from scratch, has anyone
> got any insight as to what might be happening? Or what i can do to diagnose
> further
>
> Thanks
>
> Chris
>

Re: More Oozie 'deadlocking'

Posted by Mohammad Islam <mi...@yahoo.com>.
Hi Chris,

Most of them are know issues. Those are mostly taken care of in recent releases.

What is your Oozie version? I'm sure it is something released more than a year back (by looking into the log output). I would recommend to use the latest 3.2 Oozie.

Also there is hadoop bug regarding the callback to oozie to report the hadoop job's status. Most of the case, that is the reason of 10 minutes delay. You can minimize it by reducing the property value to 5 minutes {say) from 10 minutes in oozie-site.xml.


<property>
        <name>oozie.service.ActionCheckerService.action.check.delay</name>
        <value>300</value>
        <description>
            The time, in seconds, between an ActionCheck for the same action.
        </description>
    </property>

Regards,
Mohammad


________________________________
 From: Chris White <ch...@gmail.com>
To: user@oozie.apache.org 
Sent: Thursday, November 15, 2012 4:21 PM
Subject: More Oozie 'deadlocking'
 
Similar to Andy Kartashov's recent posts on the hadoop user mailing list
('Oozie apparent concurrency deadlocking'), i am experiencing something
similar with my Oozie instance (CDH3u1). Recently the oozie server has
become very slow to respond (2+ mins) to job listings, job info requests
etc. Periods between actions completing and the next action executing has
also increased to 10s of minutes too (but requests for instrumentation /
config / queuedumps respond immediately).

Some background about my current state: I have around 20 coordinator jobs
all running, most of which are scheduled to run at 00:30, and operate on
similar input patterns (all data for the previous day). We had a data flow
outage recently and the input data (+parent folders) that the coord
instances depend on were not present for some 10 days, causing most of the
jobs to timeout (timeout for all jobs is 1 hour).

Examining the logs i see a number of things:

For months, my logs have been littered with exceptions similar to this:

CallableQueueService$CompositeCallable:528 - USER[-] GROUP[-] TOKEN APP JOB
ACTION exception callable [coord_action_ready], E0607: Other error in
operation [java.lang.NullPointerException], {1}
org.apache.oozie.command.CommandException: E0607: Other error in operation
[java.lang.NullPointerException], {1}
  at o.a.o.command.Command.call(Command.java:273)
  at
o.a.o.s.CallableQueueService$CompositeCallable.call(CallableQueueService.java:211)
  at
o.a.o.s.CallableQueueService$CallableWrapper.run(CallableQueueService.java:128)
Caused by: java.lang.NullPointerException
  at o.a.o.command.Command.setLogInfo(Command.java:427)
  at
o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:107)
  at o.a.o.c.c.CoordActionReadyCommand.call(CoordActionReadyCommand.java:27)
  at o.a.o.c.Command.call(Command.java:202)

More recently, i see endless listings of the following log entry (20/30 a
second):

WARN CallableQueueService:528 max concurrent for callable
[coord_action_input] exceeded, requeing with [500]ms delay

If i perform a queuedump (oozie admin -queuedump), i see ~150
coord_action_input entries with delays between -1 and 500, occasionally
into the 10,000's. Querying this service every 5 minutes, i appear to be
gaining a few new entries every couple of minutes (i've set up a script to
log the count every 5 mins to see what happens overnight)

I have tried to retrospectively fix up the missing paths in HDFS, hoping
that might resolve the queue, and i've also gone in and forcibly set the
status of all the coordinator jobs in the COORD_JOBS mysql table to
SUSPENDED. The server comes and goes, everything works for a while after a
restart but then it gets back into the 'non-responsive' state for anything
job related.

Short of truncating my COORD_* tables and starting from scratch, has anyone
got any insight as to what might be happening? Or what i can do to diagnose
further

Thanks

Chris