You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oozie.apache.org by "Janos Makai (Jira)" <ji...@apache.org> on 2022/11/10 16:33:00 UTC

[jira] [Created] (OOZIE-3670) Actions can stuck while running in a Fork-Join workflow

Janos Makai created OOZIE-3670:
----------------------------------

             Summary: Actions can stuck while running in a Fork-Join workflow
                 Key: OOZIE-3670
                 URL: https://issues.apache.org/jira/browse/OOZIE-3670
             Project: Oozie
          Issue Type: Bug
          Components: core
    Affects Versions: 5.2.1
            Reporter: Janos Makai
            Assignee: Janos Makai


Fork node splits one path of execution into multiple concurrent paths of execution and the join node waits until every concurrent execution path of a previous fork node arrives to it. Given a scenario, when one of the paths [action] fails for some exotic reason - in our case (see attachment) with an EL Error - then the workflow job itself will fail as well, however the other actions running parallelly under the same workflow job will stuck in RUNNING state until they are purged, which can lead to Oozie slow-down in extreme cases.

This behaviour can be reproduced using the attached [forkjoin.xml{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531918/531918_forkjoin.xml], [job.properties{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531916/531916_job.properties],  and [helloworld.sh{^}!https://jira.cloudera.com/images/icons/link_attachment_7.gif|width=7,height=7,align=absmiddle!{^}|https://jira.cloudera.com/secure/attachment/531917/531917_helloworld.sh].
In the above workflow, [action2] will fail due to ELError because
{code:java}
<value>${variableThatWillCauseELError}</value> {code}
could not be evaluated, but at the same time [action1] tries to complete itself but remains in RUNNING state.

We have examined the situation at surface level, but we need to get a deeper understanding regarding the mechanism of fork-join workflows to proceed further.

Suspected classes are for starting point:
 - org.apache.oozie.workflow.lite.LiteWorkflowInstance
 - org.apache.oozie.command.wf.ActionCheckXCommand
 - what if we do not throw Exception in org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition ?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)