You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oozie.apache.org by "Peter Bacsko (JIRA)" <ji...@apache.org> on 2016/07/11 14:31:11 UTC

[jira] [Comment Edited] (OOZIE-1978) Forkjoin validation code is ridiculously slow in some cases

    [ https://issues.apache.org/jira/browse/OOZIE-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370872#comment-15370872 ] 

Peter Bacsko edited comment on OOZIE-1978 at 7/11/16 2:30 PM:
--------------------------------------------------------------

HI

I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph:

{code}
     A1--->A2--->A3    A7--->A8--->A9 
     |           |     |            |
S--> F1          J1-->F2            J2--->E
     |           |     |            |
     A4--->A5--->A6    A10-->A11-->A12
{code}

The problem is that ALL possible execution path is covered by the {{validateForkJoin()}} method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths. 

Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of {{LinkedList.contains()}}. 

I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.


was (Author: pbacsko):
HI

I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph:

{code}
     A1--->A2--->A3    A7--->A8--->A9 
     |           |     |            |
S--> F1          J1-->F2            J2--->E
     |           |     |            |
     A4--->A5--->A6    A10-->A11-->A12
{code}

The problem is that ALL possible execution path is covered by the {{validateForkJoin()}} method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths. 

Eg. we have 10 ForkJoin pairs with 5 paths each, that's 500 path it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of {{LinkedList.contains()}}. 

I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.

> Forkjoin validation code is ridiculously slow in some cases
> -----------------------------------------------------------
>
>                 Key: OOZIE-1978
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1978
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: trunk, 4.0.1
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>             Fix For: trunk
>
>         Attachments: workflow.xml
>
>
> We've had a few users who have run into problems where submitting a workflow appears to hang (in the case of a subworkflow, it's similar but stuck in PREP).  It turns out that if you wait long enough, it will actually go through and the workflow will run normally.  The problem is that the forkjoin validation code is taking a really long time.
> The attached example has a series of 20 forks where each fork has 6 actions (it's based on an actual workflow, but all of the names were changed and the actions were all replaced by simple shell actions).  One of our support guys said it took 1-2 hours , but on my computer it was taking {color:red}*15+ hours*{color} (I had to cancel it)
> While this example doesn't have any nested forks, those can also take a long time too.
> It's easy to verify that it's the forkjoin validation code that's taking so long by looking at a jstack of the Oozie server and seeing deep recursive calls to {{org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin}}.  I also noticed a lot of sitting around in calls LinkedList.contains.  
> I think we have 3 options:
> # See if we can make the existing code faster somehow.  Perhaps there's a way to parallelize it?  Maybe there's some redundant checking that we can identify and skip? Change some data structures? etc
> # See if we can write a new way to do this validation.  I had originally completely rewritten this code a while ago, and we've since made a few fixes to catch edge cases and things.  Perhaps it needs another rewrite?
> # Try to identify when it's taking a long time and at least let the user know what's happening or something.  Right now, it just appears that the Oozie CLI has hung and the job doesn't show up in the Oozie server.  Most users aren't going to wait more than a minute or two.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)