Posted to dev@falcon.apache.org by "Srikanth Sundarrajan (JIRA)" <ji...@apache.org> on 2013/07/14 18:24:49 UTC
[jira] [Comment Edited] (FALCON-48) Pipeline entity for Falcon

    [ https://issues.apache.org/jira/browse/FALCON-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708063#comment-13708063 ] 

Srikanth Sundarrajan edited comment on FALCON-48 at 7/14/13 4:24 PM:
---------------------------------------------------------------------

Yes, the ask makes sense. It looks like a common use case.

Can we define a pipeline as the flow between two feed instances?

Consider for example:

Two source feeds A & B are defined in each of the clusters X1 & X2 and are transformed (process P1) into A1 & B1 respectively, again in both clusters X1 & X2. Now consider another process (P2) which consumes A1 & B1 in each of the clusters and produces feed C1 in each of them. And let us say C1 is replicated (multiple sources, single target) from X1 & X2 to X3.

In this case the dependency graph would look something like this (refer to FALCON-37):

{noformat}
        X1                X2
  ==============    ==============
  A           B     A           B
  |           |     |           |
  |           |     |           |
  A1          B1    A1          B1
    |        |        |        |
      |    |            |    |
        ||                ||
        C1                C1 
           |            |
             |        |
               |    |
                 ||
                 C1 
           ===============
                 X3

{noformat}
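
For illustration only, the graph above could be captured as an adjacency map whose nodes are (feed, cluster) pairs, with a path check between a source and a target. This is a rough sketch and not a Falcon API: the names build_edges and reachable are hypothetical, and the edges are simply read off the diagram.

```python
from collections import deque


def build_edges():
    """Hypothetical encoding of the diagram as (feed, cluster) -> successors."""
    edges = {}
    for cluster in ("X1", "X2"):
        # P1 transforms A -> A1 and B -> B1 within the same cluster.
        edges[("A", cluster)] = [("A1", cluster)]
        edges[("B", cluster)] = [("B1", cluster)]
        # P2 consumes A1 and B1 and produces C1 within the same cluster.
        edges[("A1", cluster)] = [("C1", cluster)]
        edges[("B1", cluster)] = [("C1", cluster)]
        # C1 is replicated from X1 and X2 to X3.
        edges[("C1", cluster)] = [("C1", "X3")]
    return edges


def reachable(edges, source, target):
    """Breadth-first search: True if target can be reached from source."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for successor in edges.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False
```

With this encoding, reachable(build_edges(), ("A", "X1"), ("C1", "X3")) follows A -> A1 -> C1 on X1 and then the replication edge to X3.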

In the above flow, a pipeline can be defined as any combination of source feed(s) and a target feed.

Using a few notations to represent this:

{noformat}
* - indicates "All clusters"
#local - indicates the specific cluster on which the source originated
{noformat}

Possible pipeline abstractions (as long as there is a path in the graph between source & targets): 

{noformat}
1. (A@*,B@*) - (C1@X3)
2. (A@X1)    - (A1@#local)
3. (A@*)     - (C1@#local)
{noformat}
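
As a rough sketch of how the notation might be expanded into concrete source/target pairs: the code below assumes that "*" means every cluster on which the feed is defined and that "#local" resolves to the cluster of the source it is paired with. The names FEED_CLUSTERS and expand are hypothetical, and the target feed is written as C1, as in the diagram.

```python
# Assumed layout from the example: the clusters each feed is defined on.
FEED_CLUSTERS = {
    "A": ["X1", "X2"],
    "B": ["X1", "X2"],
    "A1": ["X1", "X2"],
    "B1": ["X1", "X2"],
    "C1": ["X1", "X2", "X3"],
}


def expand(sources, target):
    """Expand 'feed@cluster' specs into ((feed, cluster), (feed, cluster)) pairs.

    '*' in a source expands to every cluster the feed is defined on;
    '#local' in the target resolves to the cluster of the paired source.
    """
    target_feed, target_cluster = target.split("@")
    pairs = []
    for spec in sources:
        feed, cluster = spec.split("@")
        clusters = FEED_CLUSTERS[feed] if cluster == "*" else [cluster]
        for source_cluster in clusters:
            if target_cluster == "#local":
                resolved = source_cluster
            else:
                resolved = target_cluster
            pairs.append(((feed, source_cluster), (target_feed, resolved)))
    return pairs
```

For example, expand(["A@*"], "C1@#local") pairs A on X1 with C1 on X1 and A on X2 with C1 on X2. Each resulting pair would still need to be checked for a path in the dependency graph.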
Comments welcome.

                
      was (Author: sriksun):
    Yes makes sense. Does it make sense to define a pipeline as the flow between two feed instance. 

Consider for example:

Two source feeds A & B defined in clusters X1 & X2 each and then are transformed (process P1) to A1 & B1 respectively again in both the clusters X1 & X2. Now consider another process (P2) which consumes A1 & B1 in each of the clusters and produce feed C1 in each of the clusters. And let us say C1 is replicated (multiple source, single target) from X1 & X2 to X3.

In this case the dependency graph would look something like (Refer FALCON-37)

{noformat}
        X1                X2
  ==============    ==============
  A           B     A           B
  |           |     |           |
  |           |     |           |
  A1          B1    A1          B1
    |        |        |        |
      |    |            |    |
        ||                ||
        C1                C1 
           |            |
             |        |
               |    |
                 ||
                 C1 
           ===============
                 X3

{noformat}

In the above flow pipeline can be defined as any pair of source feed & target feed combination. 

Using a few notations to represent this:
* - indicates "All clusters"
#local - indicates the specific cluster on which the source originated

Possible pipeline abstractions (as long as there is a path in the graph between source & targets): 

1. (A@*,B@*) - C@X3
2. (A@X1) - (A1@#local)
3. (A@*) - (C@#local)

Comments welcome.

                  
> Pipeline entity for Falcon
> --------------------------
>
>                 Key: FALCON-48
>                 URL: https://issues.apache.org/jira/browse/FALCON-48
>             Project: Falcon
>          Issue Type: Wish
>          Components: general
>            Reporter: Sanjeev T
>            Priority: Minor
>              Labels: operability
>
> Falcon should also have a pipeline entity.
> * A pipeline entity can comprise the complete DAG for a given set of processes and feeds, within a cluster or across clusters.
> * How this helps:
>    * setting up a pipeline should take care of submitting the relevant
>      feeds and processes
>    * if a cluster has an issue, a particular pipeline can be processed
>      on another cluster
>    * to build a monitoring system for a pipeline
>    * run a particular pipeline for a given time window
>    * cases like backlog and catch-up can be handled easily
>    * for Pipeline(A) to complete, we can suspend Pipeline(B)
>      if they have a dependency

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira