Posted to dev@oozie.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2014/06/12 04:44:02 UTC

[jira] [Updated] (OOZIE-1879) Workflow Rerun causes error depending on the order of forked nodes

     [ https://issues.apache.org/jira/browse/OOZIE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated OOZIE-1879:
---------------------------------

    Attachment: OOZIE-1879.patch

The patch enforces the proper ordering of the calls to LiteWorkflowInstance#signal during reruns by sorting the list that is iterated over to call #signal recursively.  The sort uses a Comparator that looks up the endTime of each node and orders the nodes by it.  To make this work, I had to update code in other places to get the endTime of each action into LiteWorkflowInstance, and I also had to persist this information with the LiteWorkflowInstance object during serialization.
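
To illustrate the idea (this is only a rough sketch, not the code in OOZIE-1879.patch), sorting the forked nodes by endTime before signaling could look something like the following.  The class RerunSignalOrder, the EndTimeComparator helper, the signalOrder method, and the Map of node name to endTime are all hypothetical names made up for this example; placing nodes without a recorded endTime last is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Oozie.
class RerunSignalOrder {

    // Orders node names by the endTime recorded for each node in the original run.
    static class EndTimeComparator implements Comparator<String> {
        private final Map<String, Date> endTimes;

        EndTimeComparator(Map<String, Date> endTimes) {
            this.endTimes = endTimes;
        }

        @Override
        public int compare(String nodeA, String nodeB) {
            Date a = endTimes.get(nodeA);
            Date b = endTimes.get(nodeB);
            // Assumption: nodes with no recorded endTime sort after all others.
            if (a == null && b == null) {
                return 0;
            }
            if (a == null) {
                return 1;
            }
            if (b == null) {
                return -1;
            }
            return a.compareTo(b);
        }
    }

    // Returns a copy of the nodes (given in fork XML order) sorted by endTime.
    // Collections.sort is stable, so nodes with equal endTimes keep their XML order.
    static List<String> signalOrder(List<String> nodesInXmlOrder, Map<String, Date> endTimes) {
        List<String> sorted = new ArrayList<String>(nodesInXmlOrder);
        Collections.sort(sorted, new EndTimeComparator(endTimes));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Date> endTimes = new HashMap<String, Date>();
        endTimes.put("shell1", new Date(2000L)); // shell1 finished second
        endTimes.put("shell2", new Date(1000L)); // shell2 finished first
        // The fork lists shell1 before shell2, but shell2 should be signaled first.
        System.out.println(signalOrder(Arrays.asList("shell1", "shell2"), endTimes));
        // prints [shell2, shell1]
    }
}
```

In this sketch the rerun would then call #signal for each node in the returned order, which matches the order in which the actions completed during the original run.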

I verified that it works using a workflow as described in this JIRA (I also turned that into a unit test in the patch); I also checked it against a more complicated workflow with multiple fork levels and more actions per fork.

> Workflow Rerun causes error depending on the order of forked nodes
> ------------------------------------------------------------------
>
>                 Key: OOZIE-1879
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1879
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: trunk
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>            Priority: Blocker
>         Attachments: OOZIE-1879.patch
>
>
> Suppose you have a workflow like this:
> {noformat}
> start --> fork
> fork --> shell1, shell2
> shell1 --> join
> shell2 --> join
> join --> shell3
> shell3 --> end
> {noformat}
> And all but shell3 are successful.  
> Assuming you fix the problem with shell3, if you do a rerun, the following two outcomes can happen:
> # If shell1 finished before shell2, then the rerun succeeds
> # If shell2 finished before shell1, then the rerun fails
> The error in the second outcome is simply this log message:
> {noformat}
> 2014-05-29 17:17:03,735 ERROR org.apache.oozie.workflow.lite.LiteWorkflowInstance: SERVER[cdh5-1.cloudera.local] USER[pdvorak] GROUP[-] TOKEN[] APP[test-rerun-wf] JOB[0000004-140521220856264-oozie-oozi-W] ACTION[0000004-140521220856264-oozie-oozi-W@join] invalid execution path [/shell1/]
> {noformat}
> After a bunch of digging, I discovered that during a rerun with the above workflow or similar workflows, LiteWorkflowInstance#signal gets called for each action in the fork node in the order in which they are listed in the fork node's XML; however, during the original run, LiteWorkflowInstance#signal gets called for each action in the order in which they complete (i.e. by endTime).  When these two orders don't match, you get the above error.  The general fix is therefore to ensure that during a rerun, LiteWorkflowInstance#signal gets called for each action in the fork node in the order in which they originally completed.  And if you think about it, that is more correct than the current behavior anyway.


