Posted to user@oozie.apache.org by Purshotam Shah <pu...@yahoo-inc.com.INVALID> on 2015/09/01 02:23:53 UTC

Re: Killed coordinator keeping bundle from ending

This error message should not cause any issue with the bundle change command. If you look at the code, the error message is thrown after the DB update has been made.
Can you check the bundle end time after the change command and see whether it has changed? There could be other reasons keeping the bundle running.
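To make that ordering concrete, here is a minimal Python sketch of a change command that persists the new end time first and only afterwards raises the collected per-coordinator errors. It is a hypothetical model based on the description in this message, not Oozie's Java implementation; `ChangeError` and `change_bundle_end_time` are illustrative names.

```python
# Hypothetical model of the ordering described above: the change is
# persisted first, and per-coordinator failures are reported afterwards.
# Names are illustrative only; this is not Oozie's Java API.


class ChangeError(Exception):
    """Stands in for Oozie's E1320 'Bundle Job change error'."""


def change_bundle_end_time(bundle, coords, new_end_time):
    report = []
    for coord in coords:
        if coord["status"] == "KILLED":
            # A killed coordinator cannot be changed; record why and move on.
            report.append(f"[{coord['id']} : Coord is in killed state ]")
            continue
        coord["end_time"] = new_end_time
    # The bundle row is updated regardless of the per-coordinator failures...
    bundle["end_time"] = new_end_time
    # ...and only then is the collected report raised as an error.
    if report:
        raise ChangeError("E1320: Bundle Job change error, [" + "".join(report) + "]")
    return bundle


if __name__ == "__main__":
    bundle = {"id": "B1", "end_time": "2015-09-30T00:00Z"}
    coords = [
        {"id": "C1", "status": "RUNNING", "end_time": "2015-09-30T00:00Z"},
        {"id": "C2", "status": "KILLED", "end_time": "2015-09-30T00:00Z"},
    ]
    try:
        change_bundle_end_time(bundle, coords, "2015-09-01T00:00Z")
    except ChangeError as err:
        print(err)  # E1320: Bundle Job change error, [[C2 : Coord is in killed state ]]
    print(bundle["end_time"])  # 2015-09-01T00:00Z
```

Under this ordering an E1320 response does not by itself mean the end time was left unchanged, which is why reading the end time back is the useful check.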

      From: Oren Mazor <or...@shopify.com>
 To: "user@oozie.apache.org" <us...@oozie.apache.org> 
 Sent: Monday, August 31, 2015 12:36 PM
 Subject: Killed coordinator keeping bundle from ending
   
Hey all,

I have a bundle in which some coordinator jobs have entered the killed state.

When I try to update the end time of the bundle, the change fails with:

< oozie-error-message: E1320: Bundle Job change error, [[
0001505-150826221231844-oozie-oozi-C : Coord is in killed state ][
0001496-150826221231844-oozie-oozi-C : Coord is in killed state ][
0001506-150826221231844-oozie-oozi-C : Coord is in killed state ][
0001514-150826221231844-oozie-oozi-C : Coord is in killed state ]]

The result is that all of the coordinators in the bundle have terminated, but
the killed ones are keeping the bundle in the RUNNING state, because this
bit of code only checks whether any results are present:
https://github.com/apache/oozie/blob/48b64bc9438137517e24b37b674c5a8893db67c3/core/src/main/java/org/apache/oozie/command/bundle/BundleJobChangeXCommand.java#L203
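If the command only checks whether the E1320 report is non-empty, the report itself can at least be inspected to see whether it contains anything other than killed-coordinator rejections. The sketch below assumes the message keeps the `[<coord-id> : <reason> ]` layout shown above and that coordinator IDs end in `-C`; both are inferred from this one message, and `parse_e1320` and `only_killed_rejections` are hypothetical helper names.

```python
import re

# Matches "<coord-id> : <reason> ]" entries inside an E1320 message.
# Assumes coordinator IDs end in "-C", as in the message quoted above.
_ENTRY = re.compile(r"([^\s\[\]]+-C)\s*:\s*([^\[\]]+?)\s*\]")


def parse_e1320(message):
    """Return a list of (coordinator_id, reason) pairs from an E1320 message."""
    return _ENTRY.findall(message)


def only_killed_rejections(message):
    """True if at least one coordinator was rejected and all were KILLED."""
    entries = parse_e1320(message)
    return bool(entries) and all(
        reason == "Coord is in killed state" for _, reason in entries
    )


if __name__ == "__main__":
    message = (
        "E1320: Bundle Job change error, [[\n"
        "0001505-150826221231844-oozie-oozi-C : Coord is in killed state ][\n"
        "0001496-150826221231844-oozie-oozi-C : Coord is in killed state ]]"
    )
    for coord_id, reason in parse_e1320(message):
        print(coord_id, "->", reason)
    print(only_killed_rejections(message))  # True
```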

This is on Oozie 4.1.0. Is manually killing the bundle the only solution?

thanks!
Oren


   

Re: Killed coordinator keeping bundle from ending

Posted by Oren Mazor <or...@shopify.com>.
Hey Purshotam,

I checked the bundle end time in the Postgres database, and it has definitely
been updated.

I'm not seeing anything in the logs that would explain the bundle not
going into a done state (or even DONEWITHERROR, which is totally
acceptable).

I've started digging through the database and noticed a few odd things:

1. In the coord_jobs table, the coordinators for that bundle are all either
SUCCEEDED or DONEWITHERROR.
2. But in the bundle_actions table, five bundle actions for those
coordinators are still marked as RUNNING.
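A join along these lines can list the bundle actions whose status disagrees with their coordinator's. The sketch runs against an in-memory SQLite database holding a hypothetical, minimal subset of the two tables; the column names (`id`, `coord_id`, `bundle_id`, `status`) are assumptions to verify against the actual Oozie schema before pointing the query at PostgreSQL, and the rows are illustrative only.

```python
import sqlite3

# Minimal, hypothetical subset of the Oozie tables named above. Only the
# columns needed for the comparison are modelled; verify the real column
# names before running the query against the production database.
SCHEMA = """
CREATE TABLE coord_jobs (id TEXT PRIMARY KEY, bundle_id TEXT, status TEXT);
CREATE TABLE bundle_actions (bundle_action_id TEXT PRIMARY KEY,
                             bundle_id TEXT, coord_id TEXT, status TEXT);
"""

# Bundle actions whose recorded status disagrees with their coordinator's.
MISMATCH_SQL = """
SELECT ba.coord_id, ba.status AS bundle_action_status, cj.status AS coord_status
FROM bundle_actions ba
JOIN coord_jobs cj ON cj.id = ba.coord_id
WHERE ba.bundle_id = ? AND ba.status <> cj.status
ORDER BY ba.coord_id
"""


def find_mismatches(conn, bundle_id):
    """Return (coord_id, bundle_action_status, coord_status) rows that differ."""
    return conn.execute(MISMATCH_SQL, (bundle_id,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Illustrative rows shaped like the situation described above.
    conn.executemany("INSERT INTO coord_jobs VALUES (?, ?, ?)", [
        ("C1", "B1", "SUCCEEDED"),
        ("C2", "B1", "DONEWITHERROR"),
    ])
    conn.executemany("INSERT INTO bundle_actions VALUES (?, ?, ?, ?)", [
        ("B1_a", "B1", "C1", "RUNNING"),
        ("B1_b", "B1", "C2", "DONEWITHERROR"),
    ])
    for row in find_mismatches(conn, "B1"):
        print(row)  # ('C1', 'RUNNING', 'SUCCEEDED')
```

With a PostgreSQL driver such as psycopg2, the `?` placeholder would become `%s`.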

I didn't realize Oozie could get into this kind of out-of-sync state. So I
picked one of these supposedly running coordinators and started looking
through its logs, which I hadn't done before because it was marked as succeeded.

I found this:

"E1308: Bundle Action Status  [RUNNING] is not matching with coordinator
previous status [SUCCEEDED]., Error Code: E1308"
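Read literally, E1308 describes a compare-and-set style guard: the bundle action is only updated when its stored status equals the coordinator's previous status. The sketch below is a hypothetical model of such a guard (not Oozie's code; `update_bundle_action` and `StatusMismatch` are illustrative names). It shows why, once the stored bundle-action status is stale, every later update from that coordinator would be rejected and the row would stay RUNNING.

```python
# Hypothetical model of a guard like the one E1308 reports.
# This is NOT Oozie's implementation; names are illustrative.


class StatusMismatch(Exception):
    """Stands in for Oozie's E1308 error."""


def update_bundle_action(bundle_action, coord_prev_status, coord_new_status):
    """Copy a coordinator's new status onto its bundle action, but only if
    the bundle action still holds the coordinator's previous status."""
    if bundle_action["status"] != coord_prev_status:
        raise StatusMismatch(
            f"E1308: Bundle Action Status [{bundle_action['status']}] is not "
            f"matching with coordinator previous status [{coord_prev_status}]."
        )
    bundle_action["status"] = coord_new_status
    return bundle_action


if __name__ == "__main__":
    stale = {"coord_id": "C1", "status": "RUNNING"}
    try:
        # Whatever the new status is, the guard rejects the update.
        update_bundle_action(stale, "SUCCEEDED", "SUCCEEDED")
    except StatusMismatch as err:
        print(err)
    print(stale["status"])  # RUNNING
```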

Digging further up, I noticed this:

"org.apache.oozie.command.CommandException: E1022: Cannot delete
running/completed coordinator action:
[0001507-150826221231844-oozie-oozi-C@2]"

This error came from an earlier step in which I tried to set a bundle end
time that was in the past. Is it possible that this somehow caused the two
tables to fall out of sync?
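One plausible reading of the E1022 error: moving the end time earlier means coordinator actions at or after the new end time would have to be removed, and actions that are already running or completed cannot be. The sketch below models that check under two assumptions not confirmed in this thread: actions with a nominal time at or after the new end time are the affected ones, and only WAITING or READY actions count as deletable. `plan_end_time_change` is a hypothetical name.

```python
from datetime import datetime

# Statuses this sketch treats as safe to delete. This set is an ASSUMPTION
# for illustration; Oozie's actual rule may differ.
DELETABLE = {"WAITING", "READY"}


def plan_end_time_change(actions, new_end):
    """Split actions at/after new_end into deletable ones and blockers.

    actions: list of (action_id, nominal_time, status) tuples.
    Returns (deletable_ids, blocked_ids).
    """
    deletable, blocked = [], []
    for action_id, nominal, status in actions:
        if nominal < new_end:
            continue  # still inside the shortened window; left untouched
        (deletable if status in DELETABLE else blocked).append(action_id)
    return deletable, blocked


if __name__ == "__main__":
    actions = [
        ("C@1", datetime(2015, 8, 1), "SUCCEEDED"),
        ("C@2", datetime(2015, 8, 2), "SUCCEEDED"),
        ("C@3", datetime(2015, 8, 3), "WAITING"),
    ]
    deletable, blocked = plan_end_time_change(actions, datetime(2015, 8, 2))
    print(deletable, blocked)  # ['C@3'] ['C@2']
```

In this model any non-empty `blocked` list is where an E1022-style error would surface, while the rest of the change may already have gone through.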
