Posted to dev@oozie.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2014/05/17 03:03:34 UTC

[jira] [Commented] (OOZIE-1849) If the underlying job finishes while a Workflow is suspended, Oozie can take a while to realize it

    [ https://issues.apache.org/jira/browse/OOZIE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000577#comment-14000577 ] 

Robert Kanter commented on OOZIE-1849:
--------------------------------------

I think we should go with the second solution.  When you SUSPEND a workflow, any currently RUNNING actions are still listed as RUNNING, so I think it makes sense that they can transition to a terminal state from there.  It is the workflow that is SUSPENDED, not the current action.

> If the underlying job finishes while a Workflow is suspended, Oozie can take a while to realize it
> --------------------------------------------------------------------------------------------------
>
>                 Key: OOZIE-1849
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1849
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 4.0.1
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>
> Suppose you have a Workflow and you suspend it while one of the actions is still RUNNING.  The underlying MR/Pig/etc job will continue running (as expected, because we can't pause those).  However, if that job finishes while the workflow is SUSPENDED, the CallbackServlet will receive the callback, but the ActionCheckXCommand won't update the action:
> {noformat}
> 2014-05-16 17:40:57,959  INFO CallbackServlet:541 - SERVER[rkanter-mbp.local] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-140516173529928-oozie-rkan-W] ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] callback for action [0000002-140516173529928-oozie-rkan-W@mr-node]
> 2014-05-16 17:40:57,985  WARN ActionCheckXCommand:544 - SERVER[rkanter-mbp.local] USER[rkanter] GROUP[-] TOKEN[] APP[map-reduce-wf] JOB[0000002-140516173529928-oozie-rkan-W] ACTION[0000002-140516173529928-oozie-rkan-W@mr-node] E0818: Action [0000002-140516173529928-oozie-rkan-W@mr-node] status is running but WF Job [0000002-140516173529928-oozie-rkan-W] status is [SUSPENDED]. Expected status is RUNNING., Error Code: E0818
> {noformat}
> If you then resume the workflow, the action will stay RUNNING for up to 10 minutes (the default fallback polling interval), at which point the ActionCheckerService will run an ActionCheckXCommand that will pass, check the job, and finally mark the action as SUCCESSFUL.
> We should fix this by one of the following:
> # ResumeXCommand should also queue an ActionCheckXCommand (if the workflow was SUSPENDED) so we don't have to wait for the ActionCheckerService
> # ActionCheckXCommand's precondition check should allow SUSPENDED workflows
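
To make the second option concrete, here is a minimal, self-contained sketch of what relaxing the precondition could look like. It is not Oozie's actual code: the class ActionCheckPreconditionSketch, the method mayCheckAction, and both enums are hypothetical stand-ins, and the real ActionCheckXCommand precondition may involve checks that are not modeled here. The only behavior it illustrates is the one described above: a RUNNING action is rejected when its workflow is SUSPENDED under the strict rule, and accepted under the relaxed one.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; none of these types are Oozie's real classes.
final class ActionCheckPreconditionSketch {

    // Simplified stand-ins for workflow and action statuses (assumed names).
    enum WorkflowStatus { PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED }
    enum ActionStatus { PREP, RUNNING, OK, ERROR, KILLED }

    // Behavior described in the issue: the check only passes for RUNNING workflows.
    static final Set<WorkflowStatus> STRICT =
            EnumSet.of(WorkflowStatus.RUNNING);

    // Proposed option 2: also let the check pass for SUSPENDED workflows.
    static final Set<WorkflowStatus> RELAXED =
            EnumSet.of(WorkflowStatus.RUNNING, WorkflowStatus.SUSPENDED);

    // Returns true if an action check may proceed for this action/workflow pair.
    static boolean mayCheckAction(ActionStatus action,
                                  WorkflowStatus workflow,
                                  Set<WorkflowStatus> allowedWorkflowStatuses) {
        return action == ActionStatus.RUNNING
                && allowedWorkflowStatuses.contains(workflow);
    }

    public static void main(String[] args) {
        // Callback arrives while the workflow is SUSPENDED and the action is RUNNING.
        System.out.println("strict:  " + mayCheckAction(
                ActionStatus.RUNNING, WorkflowStatus.SUSPENDED, STRICT));
        System.out.println("relaxed: " + mayCheckAction(
                ActionStatus.RUNNING, WorkflowStatus.SUSPENDED, RELAXED));
    }
}
```

With the relaxed set, the callback received while the workflow is SUSPENDED could update the action immediately, so a later resume would not need to wait for the fallback polling interval.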



--
This message was sent by Atlassian JIRA
(v6.2#6252)