Posted to user@oozie.apache.org by Matt Goeke <go...@gmail.com> on 2012/07/17 20:15:21 UTC

Oozie: asynchronous forking

All,

Does anyone know if it is possible to do asynchronous forking in Oozie?
We are currently running a set of ETL extractions as pairs of actions (a
Sqoop action followed by a Hive transformation), but we would like the
Sqoop actions to run serially and each Hive action to be kicked off
asynchronously when its paired Sqoop job finishes. The Sqoop actions are
serial because we want to limit the number of concurrent mappers hitting
the data source; we could do that through the fair scheduler, but it
would require a pool per data source. Attached is a picture of the
suggested ETL flow.

If anyone has any suggestions on best practices around this, I would
love to hear them.

Thanks,
Matt

Re: Oozie: asynchronous forking

Posted by Matt Goeke <go...@gmail.com>.
All,

Thank you for the input on this. After testing a few different
approaches, the nested fork workflow is the route we will take. As much
as I wish I could view the DAG execution visually, I have to admit that
the current console is still extremely effective, even for some of these
complicated DAGs.

@Alejandro: I tried the coordinator route on a small subset of our
overall loads, and it became increasingly difficult to monitor errors
because of the uncertainty about when the jobs would execute. That model
would obviously only get more difficult as I add more jobs to the mix.
It was a fun exercise though :)

--
Matt Goeke

On Wed, Jul 25, 2012 at 2:14 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> Matt,
>
> Using coordinators to kick the hive jobs when the sqoop outputs become
> available would be an option to keep thing simple. The only constrain is
> that you'll need to model that assuming all your inputs/outputs have a
> fixed frequency.
>
> Thx
>
> On Thu, Jul 19, 2012 at 12:03 PM, Mona Chitnis <ch...@yahoo-inc.com>wrote:
>
>> Matt,
>>
>> Virag's illustration explains the approach very well.
>>
>> However, you mentioned a requirement 'forking but not require all of the
>> forked nodes to rejoin the primary workflow'. The fork-join pair construct
>> in Oozie will mandate the forked Hive extractions to ultimately join the
>> main workflow. So is there any different requirement of asynchronous
>> behavior that is not getting fulfilled yet?
>>
>> --
>> Mona Chitnis
>>
>>
>>
>>
>> On 7/19/12 11:35 AM, "Virag Kothari" <vi...@yahoo-inc.com> wrote:
>>
>> >Matt,
>> >
>> >To me, the nested forks option you are considering looks good. Its also
>> >better to have the join in pair.
>> >For eg., if you have S1, S2, S3 as your serial sqoop extractions
>> >And H1, H2, H3 are the corresponding asynchronous Hive extraction
>> >Then, you can have
>> >
>> >S1 -> fork1
>> >Fork1 -> {S2, H1}
>> >S2 -> Fork2
>> >Fork2 -> {S3, H2}
>> >S3-> h3
>> >{H3, H2} -> Join2
>> >{Join2, H1} ->Join1
>> >
>> >However, I have not encountered many workflows using sqoop and hive. So
>> in
>> >terms of workflow design, you can get opinion from other people in
>> >community.
>> >
>> >Thanks,
>> >Virag
>> >
>> >
>> >On 7/18/12 10:52 AM, "Matt Goeke" <go...@gmail.com> wrote:
>> >
>> >> Let me see if I can give a better summary of what we are trying to do.
>> >>Our
>> >> use case is such that we have a set of mySQL instances and we would
>> >>like to
>> >> control the number of connections that we establish to them for sqoop
>> >> extractions. Within each instance we can have several tables we
>> >> are targeting for that daily extraction. Our ETL process involves the
>> >> mentioned sqoop table extractions into a Hive warehouse and then a
>> >> transformation from the Hive staging area into a date partitioned set
>> of
>> >> Hive tables (with a few column name transformations as well). We would
>> >>like
>> >> to establish an Oozie workflow per mySQL instance and use the DAG to
>> >> properly queue sqoop table extractions such that no more than one sqoop
>> >> action is happening at any time. The issue I am running into is that I
>> >>need
>> >> to find a way to have the Hive extraction run asynchronously from the
>> >> serial Sqoop queue. In other words I would like to avoid 1) having the
>> >>next
>> >> sqoop table extraction have to wait on the previous Hive transformation
>> >>and
>> >> 2) not having to move all of the Hive transformations to the bottom of
>> >>the
>> >> DAG (I would like to be able to run them as soon as the sqoop table has
>> >> been extracted).
>> >>
>> >> I have tinkered with the thought of having a coordinator job staged for
>> >> every Hive transformation and then doing a data availability clause
>> that
>> >> allowed it to run but this gets more difficult when you are trying to
>> >>watch
>> >> data folders that have been directly imported into Hive. The other
>> >>route I
>> >> have looked into is a series of nested forks in which I call the Hive
>> >> transformation and the next Sqoop action in parallel from a completed
>> >>Sqoop
>> >> Action.
>> >>
>> >> Let me know if there are any documented best practices around these
>> >>kind of
>> >> flows or if I need to try to balance this across more than just Oozie.
>> >>
>> >> --
>> >> Matt Goeke
>> >>
>> >> On Tue, Jul 17, 2012 at 3:07 PM, Virag Kothari <vi...@yahoo-inc.com>
>> >>wrote:
>> >>
>> >>> Matt,
>> >>> Its always better to have a join for the corresponding fork. I think
>> it
>> >>> would be better if you clarify in the question more about your
>> workflow
>> >>> design and the requirement for asynchronous spikes.
>> >>>
>> >>> Thanks,
>> >>> Virag
>> >>>
>> >>>
>> >>> On 7/17/12 2:30 PM, "Matt Goeke" <go...@gmail.com> wrote:
>> >>>
>> >>>> Virag,
>> >>>>
>> >>>> Thanks for the response. I have read the workflow spec and while I
>> >>> realize
>> >>>> there is the ability to fork within a workflow my issue is that all
>> >>>>forks
>> >>>> must be paired with joins. What I was looking for was some way to
>> fork
>> >>> but
>> >>>> not require all of the forked nodes to rejoin the primary workflow
>> >>>>(hence
>> >>>> some of the nodes becoming asynchronous spikes). I feel like this
>> >>>> capability might already exist and this might just be an issue of
>> >>>> workflow/subworkflow composition.
>> >>>>
>> >>>> --
>> >>>> Matt Goeke
>> >>>>
>> >>>> On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <vi...@yahoo-inc.com>
>> >>> wrote:
>> >>>>
>> >>>>> Hi Matt,
>> >>>>> I think you can fork the hive actions using the fork/join control
>> >>>>>nodes
>> >>> in
>> >>>>> Oozie.
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFun
>> >>>ctio
>> >>>>> nalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes.
>> >>>>>
>> >>>>> I have no idea why the attachment doesn't work.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Virag
>> >>>>>
>> >>>>>
>> >>>>> On 7/17/12 12:13 PM, "Matt Goeke" <go...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Apparently when I put an imagur link in the reply the spam score
>> >>>>>>gets
>> >>>>> high
>> >>>>>> enough that the delivery is denied... is there anyway to link an
>> >>>>>>image?
>> >>>>>> Also, if not then is there anything I can clarify in the question
>> >>>>>>that
>> >>>>>> would make it more straightforward?
>> >>>>>>
>> >>>>>> --
>> >>>>>> Matt Goeke
>> >>>>>>
>> >>>>>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis
>> >>>>>><chitnis@yahoo-inc.com
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> The attachment hasn't come through. This had happened with an
>> >>>>>>>earlier
>> >>>>>>> email with the Oozie Meetup slides attachments too. Any solutions?
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Mona Chitnis
>> >>>>>>>
>> >>>>>>> From: Matt Goeke <goeke.matthew@gmail.com<mailto:
>> >>>>> goeke.matthew@gmail.com>>
>> >>>>>>> Reply-To: "oozie-users@incubator.apache.org<mailto:
>> >>>>>>> oozie-users@incubator.apache.org>"
>> >>>>>>><oozie-users@incubator.apache.org
>> >>>>>>> <ma...@incubator.apache.org>>
>> >>>>>>> To: "oozie-users@incubator.apache.org<mailto:
>> >>>>>>> oozie-users@incubator.apache.org>"
>> >>>>>>><oozie-users@incubator.apache.org
>> >>>>>>> <ma...@incubator.apache.org>>
>> >>>>>>> Subject: Oozie: asynchronous forking
>> >>>>>>>
>> >>>>>>> All,
>> >>>>>>>
>> >>>>>>> Does anyone know if it is possible to do asynchronous forking in
>> >>> Oozie?
>> >>>>>>> Currently we are running a set of ETL extractions that are pairs
>> of
>> >>>>> actions
>> >>>>>>> (sqoop action then a hive transformation) but we would like to
>> have
>> >>> the
>> >>>>>>> Sqoop actions be serial and the Hive actions be called
>> >>>>>>>asynchronously
>> >>>>> when
>> >>>>>>> the paired Sqoop job finishes. The reason the Sqoop actions are
>> >>>>>>>serial
>> >>>>> is
>> >>>>>>> we would like to limit the number of concurrent mappers hitting
>> the
>> >>> data
>> >>>>>>> source and we could do this through the fair scheduler but that
>> >>>>>>>would
>> >>>>>>> require a pool per data source. Attached is a picture of suggested
>> >>>>>>>ETL
>> >>>>> flow.
>> >>>>>>>
>> >>>>>>> If anyone has any suggestions on best practices around this I
>> would
>> >>> love
>> >>>>>>> to hear them.
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Matt
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >
>>
>>
>
>
> --
> Alejandro
>

Re: Oozie: asynchronous forking

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Matt,

Using coordinators to kick off the Hive jobs when the Sqoop outputs
become available would be an option to keep things simple. The only
constraint is that you'll need to model it assuming all your
inputs/outputs have a fixed frequency.

Thx



-- 
Alejandro
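As a rough illustration of the coordinator-per-transformation idea above: the sketch below assumes one coordinator app per Hive transformation, triggered when the corresponding Sqoop output directory appears. Every name, path, date, and frequency here is invented for illustration, and it assumes the Sqoop import writes a `_SUCCESS` done-flag the coordinator can watch (which, as noted elsewhere in the thread, is awkward for data imported directly into Hive-managed directories).

```xml
<!-- Sketch only: illustrative names/paths, daily frequency assumed. -->
<coordinator-app name="hive-transform-h1" frequency="${coord:days(1)}"
                 start="2012-07-20T00:00Z" end="2013-07-20T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- The dataset models the Sqoop extraction's daily output directory. -->
    <dataset name="sqoop-out-t1" frequency="${coord:days(1)}"
             initial-instance="2012-07-20T00:00Z" timezone="UTC">
      <uri-template>hdfs://nn/user/etl/staging/t1/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The Hive workflow is held until today's Sqoop output is available. -->
    <data-in name="input" dataset="sqoop-out-t1">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://nn/user/etl/apps/hive-transform-h1</app-path>
    </workflow>
  </action>
</coordinator-app>
```

This matches the fixed-frequency constraint Alejandro mentions: the dataset and coordinator frequencies have to be modeled up front.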

Re: Oozie: asynchronous forking

Posted by Mona Chitnis <ch...@yahoo-inc.com>.
Matt,

Virag's illustration explains the approach very well.

However, you mentioned a requirement of 'forking but not requiring all
of the forked nodes to rejoin the primary workflow'. The fork-join pair
construct in Oozie mandates that the forked Hive extractions ultimately
join the main workflow. So is there some asynchronous behavior you need
that is still not covered?

--
Mona Chitnis






Re: Oozie: asynchronous forking

Posted by Virag Kothari <vi...@yahoo-inc.com>.
Matt,

To me, the nested forks option you are considering looks good. It is
also better to keep each join paired with its fork.
For example, if S1, S2, S3 are your serial Sqoop extractions and H1, H2,
H3 are the corresponding asynchronous Hive extractions, then you can
have:

S1 -> Fork1
Fork1 -> {S2, H1}
S2 -> Fork2
Fork2 -> {S3, H2}
S3 -> H3
{H3, H2} -> Join2
{Join2, H1} -> Join1

However, I have not encountered many workflows using Sqoop and Hive, so
in terms of workflow design you may want to get opinions from other
people in the community.

Thanks,
Virag
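The nested fork/join shape Virag describes might look roughly like this as an Oozie workflow definition. This is only a sketch: the action bodies are elided as comments, and all node names are illustrative, not taken from any real workflow in the thread.

```xml
<!-- Sketch only: action bodies elided, names invented for illustration. -->
<workflow-app name="etl-nested-forks" xmlns="uri:oozie:workflow:0.2">
  <start to="S1"/>
  <action name="S1"><!-- sqoop extraction 1 -->
    <ok to="fork1"/><error to="fail"/>
  </action>
  <fork name="fork1">          <!-- S2 proceeds while H1 runs -->
    <path start="S2"/>
    <path start="H1"/>
  </fork>
  <action name="S2"><!-- sqoop extraction 2 -->
    <ok to="fork2"/><error to="fail"/>
  </action>
  <fork name="fork2">          <!-- S3 proceeds while H2 runs -->
    <path start="S3"/>
    <path start="H2"/>
  </fork>
  <action name="S3"><!-- sqoop extraction 3 -->
    <ok to="H3"/><error to="fail"/>
  </action>
  <action name="H3"><!-- hive transformation 3 -->
    <ok to="join2"/><error to="fail"/>
  </action>
  <action name="H2"><!-- hive transformation 2 -->
    <ok to="join2"/><error to="fail"/>
  </action>
  <join name="join2" to="join1"/>   <!-- closes fork2: {H3, H2} -->
  <action name="H1"><!-- hive transformation 1 -->
    <ok to="join1"/><error to="fail"/>
  </action>
  <join name="join1" to="end"/>     <!-- closes fork1: {join2, H1} -->
  <kill name="fail"><message>ETL run failed</message></kill>
  <end name="end"/>
</workflow-app>
```

Note how each join closes exactly one fork, so the Sqoop chain stays serial while each Hive transformation starts as soon as its paired extraction finishes.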




Re: Oozie: asynchronous forking

Posted by Matt Goeke <go...@gmail.com>.
Let me see if I can give a better summary of what we are trying to do.
Our use case is that we have a set of MySQL instances, and we would like
to control the number of connections we establish to them for Sqoop
extractions. Within each instance there can be several tables we are
targeting for daily extraction. Our ETL process involves the mentioned
Sqoop table extractions into a Hive warehouse and then a transformation
from the Hive staging area into a date-partitioned set of Hive tables
(with a few column-name transformations as well). We would like to
establish an Oozie workflow per MySQL instance and use the DAG to
properly queue Sqoop table extractions so that no more than one Sqoop
action is running at any time. The issue I am running into is that I
need a way to have the Hive transformations run asynchronously from the
serial Sqoop queue. In other words, I would like to avoid 1) having the
next Sqoop table extraction wait on the previous Hive transformation,
and 2) having to move all of the Hive transformations to the bottom of
the DAG (I would like to run them as soon as the Sqoop table has been
extracted).

I have toyed with the idea of staging a coordinator job for every Hive
transformation and using a data-availability clause to trigger it, but
this gets more difficult when you are trying to watch data folders that
have been imported directly into Hive. The other route I have looked
into is a series of nested forks, in which I call the Hive
transformation and the next Sqoop action in parallel after a completed
Sqoop action.

Let me know if there are any documented best practices around these
kinds of flows, or if I need to balance this across more than just
Oozie.

--
Matt Goeke

On Tue, Jul 17, 2012 at 3:07 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:

> Matt,
> Its always better to have a join for the corresponding fork. I think it
> would be better if you clarify in the question more about your workflow
> design and the requirement for asynchronous spikes.
>
> Thanks,
> Virag
>
>
> On 7/17/12 2:30 PM, "Matt Goeke" <go...@gmail.com> wrote:
>
> > Virag,
> >
> > Thanks for the response. I have read the workflow spec and while I
> realize
> > there is the ability to fork within a workflow my issue is that all forks
> > must be paired with joins. What I was looking for was some way to fork
> but
> > not require all of the forked nodes to rejoin the primary workflow (hence
> > some of the nodes becoming asynchronous spikes). I feel like this
> > capability might already exist and this might just be an issue of
> > workflow/subworkflow composition.
> >
> > --
> > Matt Goeke
> >
> > On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <vi...@yahoo-inc.com>
> wrote:
> >
> >> Hi Matt,
> >> I think you can fork the hive actions using the fork/join control nodes
> in
> >> Oozie.
> >>
> >>
> >> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
> >>
> >> I have no idea why the attachment doesn't work.
> >>
> >> Thanks,
> >> Virag
> >>
> >>
> >> On 7/17/12 12:13 PM, "Matt Goeke" <go...@gmail.com> wrote:
> >>
> >>> Apparently when I put an imgur link in the reply the spam score gets
> >>> high enough that the delivery is denied... is there any way to link an
> >>> image?
> >>> Also, if not then is there anything I can clarify in the question that
> >>> would make it more straightforward?
> >>>
> >>> --
> >>> Matt Goeke
> >>>
> >>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <chitnis@yahoo-inc.com
> >>> wrote:
> >>>
> >>>> The attachment hasn't come through. This had happened with an earlier
> >>>> email with the Oozie Meetup slides attachments too. Any solutions?
> >>>>
> >>>> --
> >>>> Mona Chitnis
> >>>>
> >>>> From: Matt Goeke <goeke.matthew@gmail.com>
> >>>> Reply-To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> >>>> To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> >>>> Subject: Oozie: asynchronous forking
> >>>>
> >>>> All,
> >>>>
> >>>> Does anyone know if it is possible to do asynchronous forking in
> Oozie?
> >>>> Currently we are running a set of ETL extractions that are pairs of
> >> actions
> >>>> (sqoop action then a hive transformation) but we would like to have
> the
> >>>> Sqoop actions be serial and the Hive actions be called asynchronously
> >> when
> >>>> the paired Sqoop job finishes. The reason the Sqoop actions are serial
> >> is
> >>>> we would like to limit the number of concurrent mappers hitting the
> data
> >>>> source and we could do this through the fair scheduler but that would
> >>>> require a pool per data source. Attached is a picture of suggested ETL
> >> flow.
> >>>>
> >>>> If anyone has any suggestions on best practices around this I would
> love
> >>>> to hear them.
> >>>>
> >>>> Thanks,
> >>>> Matt
> >>>>
> >>
> >>
>
>

Re: Oozie: asynchronous forking

Posted by Virag Kothari <vi...@yahoo-inc.com>.
Matt, 
It's always better to have a join for the corresponding fork. I think it
would be better if you clarify in the question more about your workflow
design and the requirement for asynchronous spikes.

Thanks,
Virag


On 7/17/12 2:30 PM, "Matt Goeke" <go...@gmail.com> wrote:

> Virag,
> 
> Thanks for the response. I have read the workflow spec and while I realize
> there is the ability to fork within a workflow my issue is that all forks
> must be paired with joins. What I was looking for was some way to fork but
> not require all of the forked nodes to rejoin the primary workflow (hence
> some of the nodes becoming asynchronous spikes). I feel like this
> capability might already exist and this might just be an issue of
> workflow/subworkflow composition.
> 
> --
> Matt Goeke
> 
> On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
> 
>> Hi Matt,
>> I think you can fork the hive actions using the fork/join control nodes in
>> Oozie.
>> 
>> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>> 
>> I have no idea why the attachment doesn't work.
>> 
>> Thanks,
>> Virag
>> 
>> 
>> On 7/17/12 12:13 PM, "Matt Goeke" <go...@gmail.com> wrote:
>> 
>>> Apparently when I put an imgur link in the reply the spam score gets
>>> high enough that the delivery is denied... is there any way to link an
>>> image?
>>> Also, if not then is there anything I can clarify in the question that
>>> would make it more straightforward?
>>> 
>>> --
>>> Matt Goeke
>>> 
>>> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <chitnis@yahoo-inc.com
>>> wrote:
>>> 
>>>> The attachment hasn't come through. This had happened with an earlier
>>>> email with the Oozie Meetup slides attachments too. Any solutions?
>>>> 
>>>> --
>>>> Mona Chitnis
>>>> 
>>>> From: Matt Goeke <goeke.matthew@gmail.com>
>>>> Reply-To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
>>>> To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
>>>> Subject: Oozie: asynchronous forking
>>>> 
>>>> All,
>>>> 
>>>> Does anyone know if it is possible to do asynchronous forking in Oozie?
>>>> Currently we are running a set of ETL extractions that are pairs of
>> actions
>>>> (sqoop action then a hive transformation) but we would like to have the
>>>> Sqoop actions be serial and the Hive actions be called asynchronously
>> when
>>>> the paired Sqoop job finishes. The reason the Sqoop actions are serial
>> is
>>>> we would like to limit the number of concurrent mappers hitting the data
>>>> source and we could do this through the fair scheduler but that would
>>>> require a pool per data source. Attached is a picture of suggested ETL
>> flow.
>>>> 
>>>> If anyone has any suggestions on best practices around this I would love
>>>> to hear them.
>>>> 
>>>> Thanks,
>>>> Matt
>>>> 
>> 
>> 


Re: Oozie: asynchronous forking

Posted by Matt Goeke <go...@gmail.com>.
Virag,

Thanks for the response. I have read the workflow spec, and while I realize
there is the ability to fork within a workflow, my issue is that all forks
must be paired with joins. What I was looking for was some way to fork
without requiring all of the forked nodes to rejoin the primary workflow (hence
some of the nodes becoming asynchronous spikes). I feel like this
capability might already exist and this might just be an issue of
workflow/subworkflow composition.
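If sub-workflow composition turns out to be the right decomposition, the
calling side would just be a sub-workflow action, something like this (the app
path and node names are illustrative):

```xml
<action name="hive-transform-subwf">
    <sub-workflow>
        <!-- child workflow containing the Hive transformation -->
        <app-path>hdfs:///apps/hive-transform-wf</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="next-sqoop-action"/>
    <error to="fail"/>
</action>
```

One caveat: the parent workflow still waits for the sub-workflow to finish
before transitioning, so on its own this does not make the branch asynchronous.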

--
Matt Goeke

On Tue, Jul 17, 2012 at 2:00 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:

> Hi Matt,
> I think you can fork the hive actions using the fork/join control nodes in
> Oozie.
>
> http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
>
> I have no idea why the attachment doesn't work.
>
> Thanks,
> Virag
>
>
> On 7/17/12 12:13 PM, "Matt Goeke" <go...@gmail.com> wrote:
>
> > Apparently when I put an imgur link in the reply the spam score gets
> > high enough that the delivery is denied... is there any way to link an
> > image?
> > Also, if not then is there anything I can clarify in the question that
> > would make it more straightforward?
> >
> > --
> > Matt Goeke
> >
> > On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <chitnis@yahoo-inc.com
> >wrote:
> >
> >> The attachment hasn't come through. This had happened with an earlier
> >> email with the Oozie Meetup slides attachments too. Any solutions?
> >>
> >> --
> >> Mona Chitnis
> >>
> >> From: Matt Goeke <goeke.matthew@gmail.com>
> >> Reply-To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> >> To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> >> Subject: Oozie: asynchronous forking
> >>
> >> All,
> >>
> >> Does anyone know if it is possible to do asynchronous forking in Oozie?
> >> Currently we are running a set of ETL extractions that are pairs of
> actions
> >> (sqoop action then a hive transformation) but we would like to have the
> >> Sqoop actions be serial and the Hive actions be called asynchronously
> when
> >> the paired Sqoop job finishes. The reason the Sqoop actions are serial
> is
> >> we would like to limit the number of concurrent mappers hitting the data
> >> source and we could do this through the fair scheduler but that would
> >> require a pool per data source. Attached is a picture of suggested ETL
> flow.
> >>
> >> If anyone has any suggestions on best practices around this I would love
> >> to hear them.
> >>
> >> Thanks,
> >> Matt
> >>
>
>

Re: Oozie: asynchronous forking

Posted by Virag Kothari <vi...@yahoo-inc.com>.
Hi Matt,
I think you can fork the Hive actions using the fork/join control nodes in
Oozie:
http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html#a3.1.5_Fork_and_Join_Control_Nodes
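A minimal fork/join of the Hive actions along those lines (node names are
placeholders, and the action bodies are omitted):

```xml
<fork name="fork-hive">
    <path start="hive-table1"/>
    <path start="hive-table2"/>
</fork>

<action name="hive-table1">
    <!-- Hive action body omitted -->
    <ok to="join-hive"/>
    <error to="fail"/>
</action>

<action name="hive-table2">
    <!-- Hive action body omitted -->
    <ok to="join-hive"/>
    <error to="fail"/>
</action>

<join name="join-hive" to="end"/>
```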

I have no idea why the attachment doesn't work.

Thanks,
Virag


On 7/17/12 12:13 PM, "Matt Goeke" <go...@gmail.com> wrote:

> Apparently when I put an imgur link in the reply the spam score gets high
> enough that the delivery is denied... is there any way to link an image?
> Also, if not then is there anything I can clarify in the question that
> would make it more straightforward?
> 
> --
> Matt Goeke
> 
> On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <ch...@yahoo-inc.com>wrote:
> 
>> The attachment hasn't come through. This had happened with an earlier
>> email with the Oozie Meetup slides attachments too. Any solutions?
>> 
>> --
>> Mona Chitnis
>> 
>> From: Matt Goeke <go...@gmail.com>
>> Reply-To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
>> To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
>> Subject: Oozie: asynchronous forking
>> 
>> All,
>> 
>> Does anyone know if it is possible to do asynchronous forking in Oozie?
>> Currently we are running a set of ETL extractions that are pairs of actions
>> (sqoop action then a hive transformation) but we would like to have the
>> Sqoop actions be serial and the Hive actions be called asynchronously when
>> the paired Sqoop job finishes. The reason the Sqoop actions are serial is
>> we would like to limit the number of concurrent mappers hitting the data
>> source and we could do this through the fair scheduler but that would
>> require a pool per data source. Attached is a picture of suggested ETL flow.
>> 
>> If anyone has any suggestions on best practices around this I would love
>> to hear them.
>> 
>> Thanks,
>> Matt
>> 


Re: Oozie: asynchronous forking

Posted by Matt Goeke <go...@gmail.com>.
Apparently when I put an imgur link in the reply the spam score gets high
enough that the delivery is denied... is there any way to link an image?
Also, if not then is there anything I can clarify in the question that
would make it more straightforward?

--
Matt Goeke

On Tue, Jul 17, 2012 at 11:22 AM, Mona Chitnis <ch...@yahoo-inc.com>wrote:

> The attachment hasn't come through. This had happened with an earlier
> email with the Oozie Meetup slides attachments too. Any solutions?
>
> --
> Mona Chitnis
>
> From: Matt Goeke <go...@gmail.com>
> Reply-To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> To: "oozie-users@incubator.apache.org" <oozie-users@incubator.apache.org>
> Subject: Oozie: asynchronous forking
>
> All,
>
> Does anyone know if it is possible to do asynchronous forking in Oozie?
> Currently we are running a set of ETL extractions that are pairs of actions
> (sqoop action then a hive transformation) but we would like to have the
> Sqoop actions be serial and the Hive actions be called asynchronously when
> the paired Sqoop job finishes. The reason the Sqoop actions are serial is
> we would like to limit the number of concurrent mappers hitting the data
> source and we could do this through the fair scheduler but that would
> require a pool per data source. Attached is a picture of suggested ETL flow.
>
> If anyone has any suggestions on best practices around this I would love
> to hear them.
>
> Thanks,
> Matt
>

Re: Oozie: asynchronous forking

Posted by Mona Chitnis <ch...@yahoo-inc.com>.
The attachment hasn't come through. This happened with an earlier email containing the Oozie Meetup slides attachments, too. Any solutions?

--
Mona Chitnis

From: Matt Goeke <go...@gmail.com>
Reply-To: "oozie-users@incubator.apache.org" <oo...@incubator.apache.org>
To: "oozie-users@incubator.apache.org" <oo...@incubator.apache.org>
Subject: Oozie: asynchronous forking

All,

Does anyone know if it is possible to do asynchronous forking in Oozie? Currently we are running a set of ETL extractions that are pairs of actions (sqoop action then a hive transformation) but we would like to have the Sqoop actions be serial and the Hive actions be called asynchronously when the paired Sqoop job finishes. The reason the Sqoop actions are serial is we would like to limit the number of concurrent mappers hitting the data source and we could do this through the fair scheduler but that would require a pool per data source. Attached is a picture of suggested ETL flow.

If anyone has any suggestions on best practices around this I would love to hear them.

Thanks,
Matt