Posted to user@oozie.apache.org by "Heller, Chris" <ch...@akamai.com> on 2014/02/20 16:05:13 UTC

Dynamic parallel tasks in Oozie

Hi,

I’m trying to figure out the best way to implement a workflow in Oozie.

I am creating a workflow which splits an input into multiple outputs.

Then I want to run another process over each output.

The trouble is I cannot know a priori how many outputs I will have, so I don’t see how to set up a workflow to run the post-processing stage.

Ideally the next stage would be a fork/join type of scenario, since each output can be processed independently. But I can’t see any way to set up the fork paths without using some sort of XML generation preprocessor.

Does anyone have a suggestion of how to proceed? Am I stuck doing workflow generation? Or is there another way to structure this workflow using the existing primitives?

Thanks,
Chris
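
For illustration, the workflow-generation route mentioned above could look roughly like the sketch below. It is not from the thread: the function name, the node names, the `inputPath` property and the schema version are all assumptions. It emits one sub-workflow action per output path inside a fork/join. As far as I know the workflow schema expects a fork to have at least two paths, so the sketch rejects shorter lists.

```python
import xml.etree.ElementTree as ET

WF_NS = "uri:oozie:workflow:0.4"  # assumed schema version; adjust to your Oozie release


def build_fork_join_workflow(output_paths, sub_wf_app_path, name="post-process"):
    """Return workflow XML with one forked sub-workflow action per output path."""
    if len(output_paths) < 2:
        # Assumption: a fork needs at least two paths to be schema-valid.
        raise ValueError("need at least two output paths to build a fork")

    root = ET.Element("workflow-app", {"name": name, "xmlns": WF_NS})
    ET.SubElement(root, "start", {"to": "fork-outputs"})
    fork = ET.SubElement(root, "fork", {"name": "fork-outputs"})

    for i, path in enumerate(output_paths):
        action_name = f"process-{i}"  # hypothetical naming scheme
        ET.SubElement(fork, "path", {"start": action_name})

        action = ET.SubElement(root, "action", {"name": action_name})
        sub_wf = ET.SubElement(action, "sub-workflow")
        ET.SubElement(sub_wf, "app-path").text = sub_wf_app_path
        ET.SubElement(sub_wf, "propagate-configuration")
        conf = ET.SubElement(sub_wf, "configuration")
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = "inputPath"  # hypothetical property name
        ET.SubElement(prop, "value").text = path
        ET.SubElement(action, "ok", {"to": "join-outputs"})
        ET.SubElement(action, "error", {"to": "fail"})

    ET.SubElement(root, "join", {"name": "join-outputs", "to": "end"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Post-processing failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_fork_join_workflow(["/out/a", "/out/b"], "/apps/post-process"))
```

The generated XML would still have to be written to HDFS and submitted as its own workflow once the first stage has produced its outputs.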

Re: Dynamic parallel tasks in Oozie

Posted by "Heller, Chris" <ch...@akamai.com>.
Yeah this is exactly the type of functionality that I was looking for. I
would certainly make use of such a feature.

I’m curious, how did your implementation go about defining the inputs to
the shards?

-Chris

On 2/20/14, 5:24 PM, "Alejandro Abdelnur" <tu...@gmail.com> wrote:

>This is what I refer as sharding, it can be seen as a special type of
>fork/join where all shards are doing the same actions on different
>datasets
>and the number of shards depends on the number of datasets.
>
>A while ago I've rewritten the workflow lib, cleanning it up a bit and
>adding this capability. But never got completed. If there is interest we
>could create an umbrella JIRA and complete the integration.
>
>Thanks.
>
>
>On Thu, Feb 20, 2014 at 1:47 PM, Mona Chitnis <ch...@yahoo-inc.com>
>wrote:
>
>> If you use the sub-workflow construct, then it would do some error
>> reporting for you. If a sub-workflow fails, the parent workflow also
>>gets
>> updated to failed. Also in Oozie 4.0, the JIRA OOZIE-1264 The "parent"
>> property of a subworkflow should be the ID of the parent workflow, helps
>> get the dependency graph using IDs.
>>
>>
>> On 2/20/14, 12:52 PM, "Heller, Chris" <ch...@akamai.com> wrote:
>>
>> >Mona,
>> >
>> >Thanks. That is the road I'm headed down. At the moment.
>> >
>> >I'll create a Java action which takes the files (or a path glob -- or
>> >something) as input, and create multiple Oozie tasks based on that
>>input,
>> >and then 'wait' for those tasks to complete.
>> >
>> >A feature like this built into the workflow certainly would be nice,
>>since
>> >it would better integrate error handling I think.
>> >
>> >-Chris
>> >
>> >On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
>> >
>> >>Hi Chris,
>> >>
>> >>There isn’t a way of dynamic parallel tasks within the same Oozie
>> >>workflow
>> >>XML currently. But you can do some programmatically. Using Oozie Java
>> >>API,
>> >>you can start a dynamic number of sub-workflows based on the number of
>> >>outputs.
>> >>
>> >>
>> >>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
>> >>
>> >>>Hi,
>> >>>
>> >>>I’m trying to figure out the best way to implement a workflow in
>>Oozie.
>> >>>
>> >>>I am creating a workflow which splits an input into multiple outputs.
>> >>>
>> >>>Then for each output I want to run another process over each.
>> >>>
>> >>>The trouble is I cannot know a-priori how many outputs I will have,
>>and
>> >>>so to post process each I don’t see how to setup a workflow to run
>>the
>> >>>next stage.
>> >>>
>> >>>Ideally the next stage would be a fork/join type of scenario, since
>>each
>> >>>output can be processed independently. But there isn’t any way I can
>>see
>> >>>to setup the fork paths without using some sort of XML generation
>> >>>preprocessor.
>> >>>
>> >>>Does anyone have a suggestion of how to proceed? Am I stuck doing
>> >>>workflow generation? Or is there another way to structure this
>>workflow
>> >>>using the existing primitives?
>> >>>
>> >>>Thanks,
>> >>>Chris
>> >>
>>
>>

Re: Dynamic parallel tasks in Oozie

Posted by Virag Kothari <vi...@yahoo-inc.com>.
Yes, I think that should be added. It's been waiting on RB for a long time
:)

Thanks,
virag

On 2/20/14 2:24 PM, "Alejandro Abdelnur" <tu...@gmail.com> wrote:

>This is what I refer as sharding, it can be seen as a special type of
>fork/join where all shards are doing the same actions on different
>datasets
>and the number of shards depends on the number of datasets.
>
>A while ago I've rewritten the workflow lib, cleanning it up a bit and
>adding this capability. But never got completed. If there is interest we
>could create an umbrella JIRA and complete the integration.
>
>Thanks.
>
>
>On Thu, Feb 20, 2014 at 1:47 PM, Mona Chitnis <ch...@yahoo-inc.com>
>wrote:
>
>> If you use the sub-workflow construct, then it would do some error
>> reporting for you. If a sub-workflow fails, the parent workflow also
>>gets
>> updated to failed. Also in Oozie 4.0, the JIRA OOZIE-1264 The "parent"
>> property of a subworkflow should be the ID of the parent workflow, helps
>> get the dependency graph using IDs.
>>
>>
>> On 2/20/14, 12:52 PM, "Heller, Chris" <ch...@akamai.com> wrote:
>>
>> >Mona,
>> >
>> >Thanks. That is the road I'm headed down. At the moment.
>> >
>> >I'll create a Java action which takes the files (or a path glob -- or
>> >something) as input, and create multiple Oozie tasks based on that
>>input,
>> >and then 'wait' for those tasks to complete.
>> >
>> >A feature like this built into the workflow certainly would be nice,
>>since
>> >it would better integrate error handling I think.
>> >
>> >-Chris
>> >
>> >On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
>> >
>> >>Hi Chris,
>> >>
>> >>There isn’t a way of dynamic parallel tasks within the same Oozie
>> >>workflow
>> >>XML currently. But you can do some programmatically. Using Oozie Java
>> >>API,
>> >>you can start a dynamic number of sub-workflows based on the number of
>> >>outputs.
>> >>
>> >>
>> >>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
>> >>
>> >>>Hi,
>> >>>
>> >>>I’m trying to figure out the best way to implement a workflow in
>>Oozie.
>> >>>
>> >>>I am creating a workflow which splits an input into multiple outputs.
>> >>>
>> >>>Then for each output I want to run another process over each.
>> >>>
>> >>>The trouble is I cannot know a-priori how many outputs I will have,
>>and
>> >>>so to post process each I don’t see how to setup a workflow to run
>>the
>> >>>next stage.
>> >>>
>> >>>Ideally the next stage would be a fork/join type of scenario, since
>>each
>> >>>output can be processed independently. But there isn’t any way I can
>>see
>> >>>to setup the fork paths without using some sort of XML generation
>> >>>preprocessor.
>> >>>
>> >>>Does anyone have a suggestion of how to proceed? Am I stuck doing
>> >>>workflow generation? Or is there another way to structure this
>>workflow
>> >>>using the existing primitives?
>> >>>
>> >>>Thanks,
>> >>>Chris
>> >>
>>
>>


Re: Dynamic parallel tasks in Oozie

Posted by Alejandro Abdelnur <tu...@gmail.com>.
This is what I refer to as sharding. It can be seen as a special type of
fork/join where all shards perform the same actions on different datasets
and the number of shards depends on the number of datasets.

A while ago I rewrote the workflow lib, cleaning it up a bit and adding
this capability, but the work was never completed. If there is interest, we
could create an umbrella JIRA and complete the integration.

Thanks.


On Thu, Feb 20, 2014 at 1:47 PM, Mona Chitnis <ch...@yahoo-inc.com> wrote:

> If you use the sub-workflow construct, then it would do some error
> reporting for you. If a sub-workflow fails, the parent workflow also gets
> updated to failed. Also in Oozie 4.0, the JIRA OOZIE-1264 The "parent"
> property of a subworkflow should be the ID of the parent workflow, helps
> get the dependency graph using IDs.
>
>
> On 2/20/14, 12:52 PM, "Heller, Chris" <ch...@akamai.com> wrote:
>
> >Mona,
> >
> >Thanks. That is the road I'm headed down. At the moment.
> >
> >I'll create a Java action which takes the files (or a path glob -- or
> >something) as input, and create multiple Oozie tasks based on that input,
> >and then 'wait' for those tasks to complete.
> >
> >A feature like this built into the workflow certainly would be nice, since
> >it would better integrate error handling I think.
> >
> >-Chris
> >
> >On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
> >
> >>Hi Chris,
> >>
> >>There isn’t a way of dynamic parallel tasks within the same Oozie
> >>workflow
> >>XML currently. But you can do some programmatically. Using Oozie Java
> >>API,
> >>you can start a dynamic number of sub-workflows based on the number of
> >>outputs.
> >>
> >>
> >>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
> >>
> >>>Hi,
> >>>
> >>>I’m trying to figure out the best way to implement a workflow in Oozie.
> >>>
> >>>I am creating a workflow which splits an input into multiple outputs.
> >>>
> >>>Then for each output I want to run another process over each.
> >>>
> >>>The trouble is I cannot know a-priori how many outputs I will have, and
> >>>so to post process each I don’t see how to setup a workflow to run the
> >>>next stage.
> >>>
> >>>Ideally the next stage would be a fork/join type of scenario, since each
> >>>output can be processed independently. But there isn’t any way I can see
> >>>to setup the fork paths without using some sort of XML generation
> >>>preprocessor.
> >>>
> >>>Does anyone have a suggestion of how to proceed? Am I stuck doing
> >>>workflow generation? Or is there another way to structure this workflow
> >>>using the existing primitives?
> >>>
> >>>Thanks,
> >>>Chris
> >>
>
>

Re: Dynamic parallel tasks in Oozie

Posted by "Heller, Chris" <ch...@akamai.com>.
How would I invoke a sub-workflow from the Java API? Just create a
workflow that only contains a sub-workflow?

On 2/20/14, 4:47 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:

>If you use the sub-workflow construct, then it would do some error
>reporting for you. If a sub-workflow fails, the parent workflow also gets
>updated to failed. Also in Oozie 4.0, the JIRA OOZIE-1264 The "parent"
>property of a subworkflow should be the ID of the parent workflow, helps
>get the dependency graph using IDs.
>
>
>On 2/20/14, 12:52 PM, "Heller, Chris" <ch...@akamai.com> wrote:
>
>>Mona,
>>
>>Thanks. That is the road I’m headed down. At the moment.
>>
>>I’ll create a Java action which takes the files (or a path glob -- or
>>something) as input, and create multiple Oozie tasks based on that input,
>>and then ‘wait’ for those tasks to complete.
>>
>>A feature like this built into the workflow certainly would be nice,
>>since
>>it would better integrate error handling I think.
>>
>>-Chris
>>
>>On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
>>
>>>Hi Chris,
>>>
>>>There isn’t a way of dynamic parallel tasks within the same Oozie
>>>workflow
>>>XML currently. But you can do some programmatically. Using Oozie Java
>>>API,
>>>you can start a dynamic number of sub-workflows based on the number of
>>>outputs.
>>>
>>>
>>>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
>>>
>>>>Hi,
>>>>
>>>>I’m trying to figure out the best way to implement a workflow in Oozie.
>>>>
>>>>I am creating a workflow which splits an input into multiple outputs.
>>>>
>>>>Then for each output I want to run another process over each.
>>>>
>>>>The trouble is I cannot know a-priori how many outputs I will have, and
>>>>so to post process each I don’t see how to setup a workflow to run the
>>>>next stage.
>>>>
>>>>Ideally the next stage would be a fork/join type of scenario, since
>>>>each
>>>>output can be processed independently. But there isn’t any way I can
>>>>see
>>>>to setup the fork paths without using some sort of XML generation
>>>>preprocessor.
>>>>
>>>>Does anyone have a suggestion of how to proceed? Am I stuck doing
>>>>workflow generation? Or is there another way to structure this workflow
>>>>using the existing primitives?
>>>>
>>>>Thanks,
>>>>Chris
>>>
>

Re: Dynamic parallel tasks in Oozie

Posted by Mona Chitnis <ch...@yahoo-inc.com>.
If you use the sub-workflow construct, it will do some error reporting
for you: if a sub-workflow fails, the parent workflow is also marked as
failed. Also, in Oozie 4.0, JIRA OOZIE-1264 (The "parent" property of a
subworkflow should be the ID of the parent workflow) helps you get the
dependency graph using IDs.


On 2/20/14, 12:52 PM, "Heller, Chris" <ch...@akamai.com> wrote:

>Mona,
>
>Thanks. That is the road I’m headed down. At the moment.
>
>I’ll create a Java action which takes the files (or a path glob -- or
>something) as input, and create multiple Oozie tasks based on that input,
>and then ‘wait’ for those tasks to complete.
>
>A feature like this built into the workflow certainly would be nice, since
>it would better integrate error handling I think.
>
>-Chris
>
>On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
>
>>Hi Chris,
>>
>>There isn’t a way of dynamic parallel tasks within the same Oozie
>>workflow
>>XML currently. But you can do some programmatically. Using Oozie Java
>>API,
>>you can start a dynamic number of sub-workflows based on the number of
>>outputs.
>>
>>
>>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
>>
>>>Hi,
>>>
>>>I’m trying to figure out the best way to implement a workflow in Oozie.
>>>
>>>I am creating a workflow which splits an input into multiple outputs.
>>>
>>>Then for each output I want to run another process over each.
>>>
>>>The trouble is I cannot know a-priori how many outputs I will have, and
>>>so to post process each I don’t see how to setup a workflow to run the
>>>next stage.
>>>
>>>Ideally the next stage would be a fork/join type of scenario, since each
>>>output can be processed independently. But there isn’t any way I can see
>>>to setup the fork paths without using some sort of XML generation
>>>preprocessor.
>>>
>>>Does anyone have a suggestion of how to proceed? Am I stuck doing
>>>workflow generation? Or is there another way to structure this workflow
>>>using the existing primitives?
>>>
>>>Thanks,
>>>Chris
>>


Re: Dynamic parallel tasks in Oozie

Posted by "Heller, Chris" <ch...@akamai.com>.
Mona,

Thanks. That is the road I’m headed down at the moment.

I’ll create a Java action which takes the files (or a path glob, or
something) as input, creates multiple Oozie tasks based on that input,
and then ‘waits’ for those tasks to complete.

A feature like this built into the workflow would certainly be nice, since
I think it would integrate error handling better.

-Chris
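
As a rough illustration of the "path glob as input" idea, the sketch below turns a glob pattern into one set of job properties per matching file. It is written in Python for brevity and is not the Java action itself. The function name and the `inputPath` property are hypothetical, and the directory listing is assumed to have been fetched already (in a real action it would come from HDFS).

```python
from fnmatch import fnmatchcase


def job_properties_for_glob(listing, pattern, app_path):
    """Build one property dict per path in `listing` that matches `pattern`.

    Note: fnmatch-style matching lets `*` match `/` as well, which is looser
    than shell or HDFS globbing; that is acceptable for this sketch.
    """
    matched = sorted(p for p in listing if fnmatchcase(p, pattern))
    return [
        {
            "oozie.wf.application.path": app_path,  # workflow to run per file
            "inputPath": path,  # hypothetical property read by that workflow
        }
        for path in matched
    ]


if __name__ == "__main__":
    files = ["/data/out/part-00001", "/data/out/_SUCCESS", "/data/out/part-00000"]
    for props in job_properties_for_glob(files, "/data/out/part-*", "/apps/post"):
        print(props)
```

Each resulting dict would then be submitted as its own job, so the number of jobs follows the number of matching files.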

On 2/20/14, 3:43 PM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:

>Hi Chris,
>
>There isn’t a way of dynamic parallel tasks within the same Oozie workflow
>XML currently. But you can do some programmatically. Using Oozie Java API,
>you can start a dynamic number of sub-workflows based on the number of
>outputs.
>
>
>On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:
>
>>Hi,
>>
>>I’m trying to figure out the best way to implement a workflow in Oozie.
>>
>>I am creating a workflow which splits an input into multiple outputs.
>>
>>Then for each output I want to run another process over each.
>>
>>The trouble is I cannot know a-priori how many outputs I will have, and
>>so to post process each I don’t see how to setup a workflow to run the
>>next stage.
>>
>>Ideally the next stage would be a fork/join type of scenario, since each
>>output can be processed independently. But there isn’t any way I can see
>>to setup the fork paths without using some sort of XML generation
>>preprocessor.
>>
>>Does anyone have a suggestion of how to proceed? Am I stuck doing
>>workflow generation? Or is there another way to structure this workflow
>>using the existing primitives?
>>
>>Thanks,
>>Chris
>

Re: Dynamic parallel tasks in Oozie

Posted by Mona Chitnis <ch...@yahoo-inc.com>.
Hi Chris,

There currently isn’t a way to run dynamic parallel tasks within a single
Oozie workflow XML, but you can do it programmatically: using the Oozie
Java API, you can start a dynamic number of sub-workflows based on the
number of outputs.
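
A minimal sketch of that submit-then-wait pattern, in Python rather than against the real Java API. The `client` object is hypothetical: it is assumed to offer `run(props)` returning a job ID and `status(job_id)` returning a status string, which is roughly the role `OozieClient.run` and `OozieClient.getJobInfo(...).getStatus()` play in Java. The `inputPath` property name is also an assumption.

```python
import time

# Workflow job states after which no further change is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def run_for_each_output(client, app_path, output_paths,
                        poll_seconds=10.0, sleep=time.sleep):
    """Submit one workflow per output path, then poll until all have finished.

    Returns a dict mapping each output path to its final status string.
    """
    pending = {}
    for path in output_paths:
        props = {
            "oozie.wf.application.path": app_path,
            "inputPath": path,  # hypothetical property read by the workflow
        }
        pending[path] = client.run(props)  # all jobs are started up front

    final = {}
    while pending:
        for path, job_id in list(pending.items()):
            status = client.status(job_id)
            if status in TERMINAL_STATES:
                final[path] = status
                del pending[path]
        if pending:
            sleep(poll_seconds)  # avoid hammering the server between polls
    return final
```

A caller would typically treat any status other than SUCCEEDED in the returned dict as a failure of the whole step, since error handling is not integrated the way it would be inside a single workflow.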


On 2/20/14, 7:05 AM, "Heller, Chris" <ch...@akamai.com> wrote:

>Hi,
>
>I’m trying to figure out the best way to implement a workflow in Oozie.
>
>I am creating a workflow which splits an input into multiple outputs.
>
>Then for each output I want to run another process over each.
>
>The trouble is I cannot know a-priori how many outputs I will have, and
>so to post process each I don’t see how to setup a workflow to run the
>next stage.
>
>Ideally the next stage would be a fork/join type of scenario, since each
>output can be processed independently. But there isn’t any way I can see
>to setup the fork paths without using some sort of XML generation
>preprocessor.
>
>Does anyone have a suggestion of how to proceed? Am I stuck doing
>workflow generation? Or is there another way to structure this workflow
>using the existing primitives?
>
>Thanks,
>Chris