Posted to dev@falcon.apache.org by "Sowmya Ramesh (JIRA)" <ji...@apache.org> on 2014/08/13 20:18:12 UTC
[jira] [Comment Edited] (FALCON-594) Process lineage information for Retention policies

    [ https://issues.apache.org/jira/browse/FALCON-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095846#comment-14095846 ] 

Sowmya Ramesh edited comment on FALCON-594 at 8/13/14 6:17 PM:
---------------------------------------------------------------

Several approaches have been identified for recording lineage information when an eviction (retention) policy runs.

*Approach 1:*

When the eviction policy executes, delete the identified feed instance vertices from the graph. For completeness, the associated entity vertices should also be deleted, i.e. a cascade delete.

Pros:
- Because the identified feed instance vertices are deleted, the graph DB does not keep growing, so there are no storage space issues.

Cons:
- Eviction history is not preserved, so this information cannot be retrieved at a later point in time.
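
A minimal sketch of Approach 1, using a toy in-memory graph rather than Falcon's actual graph DB. The `Graph` class, the vertex ids and the "instance-of" edge label are all hypothetical. The sketch removes only the instance vertex and the edges touching it; which associated vertices a full cascade delete should cover is left open here.

```python
# Toy in-memory graph; every name here is hypothetical, not Falcon's API.
class Graph:
    def __init__(self):
        self.vertices = {}   # vertex id -> property dict
        self.edges = []      # (source id, label, target id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def remove_vertex(self, vid):
        # Removing a vertex also removes every edge that touches it.
        del self.vertices[vid]
        self.edges = [e for e in self.edges if vid not in (e[0], e[2])]


def evict_by_delete(graph, instance_ids):
    """Approach 1: drop each evicted feed instance vertex and its edges."""
    for vid in instance_ids:
        graph.remove_vertex(vid)


g = Graph()
g.add_vertex("clicks", type="feed-entity")
g.add_vertex("clicks/2014-08-01", type="feed-instance")
g.add_vertex("clicks/2014-08-02", type="feed-instance")
g.add_edge("clicks/2014-08-01", "instance-of", "clicks")
g.add_edge("clicks/2014-08-02", "instance-of", "clicks")

evict_by_delete(g, ["clicks/2014-08-01"])
```

After the call nothing in the graph indicates that the first instance ever existed, which is the drawback listed under Cons.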

*Approach 2:*

- When the eviction policy executes, delete the identified feed instance vertices [cascade delete].
- For each identified feed entity vertex, create a common Evicted vertex and add an edge with the label "evicted". Add a property recording the evicted feed instance vertex [fi], the timestamp of eviction [ti] and the WF id [wi]. Instead of creating a new common vertex, a self loop can be added.

Pros:
- Because the identified feed instance vertices are deleted, the graph DB does not keep growing, so there are no storage space issues.
- Some details about each eviction are stored in the graph DB, so they can be retrieved later.

Cons:
- Requires more storage than Approach 1, because some eviction details are kept.
- A property [fi, ti, wi] is added for each evicted instance. To get the eviction details this property has to be parsed, which leads to performance issues.
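
A rough sketch of Approach 2 on toy in-memory structures, not Falcon's graph API. It assumes the [fi, ti, wi] details are kept as "fi|ti|wi" strings in a "records" property on a single "evicted" edge per feed entity; that encoding, the "EVICTED" vertex id and all other names are hypothetical.

```python
# Toy in-memory structures; every name here is hypothetical, not Falcon's API.
vertices = {
    "clicks": {"type": "feed-entity"},
    "clicks/2014-08-01": {"type": "feed-instance"},
    "EVICTED": {"type": "evicted"},   # the common Evicted vertex
}
# Edge key is (source, label, target); the value is the edge's property dict.
edges = {("clicks/2014-08-01", "instance-of", "clicks"): {}}


def evict_and_record(feed, instance, timestamp, workflow_id):
    """Approach 2: delete the instance vertex but keep a compact record."""
    del vertices[instance]
    for key in [k for k in edges if instance in (k[0], k[2])]:
        del edges[key]
    # Assumption: one "evicted" edge per feed entity holds every record,
    # each encoded as "fi|ti|wi".
    props = edges.setdefault((feed, "evicted", "EVICTED"), {"records": []})
    props["records"].append(f"{instance}|{timestamp}|{workflow_id}")


def eviction_history(feed):
    """Reading the history back means parsing every stored record."""
    props = edges.get((feed, "evicted", "EVICTED"), {"records": []})
    return [tuple(record.split("|")) for record in props["records"]]


evict_and_record("clicks", "clicks/2014-08-01", "2014-08-13T18:17Z", "wf-0001")
history = eviction_history("clicks")
```

`eviction_history` has to split every record on each lookup, which illustrates the parsing cost listed under Cons.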

*Approach 3:*
Create a common Evicted vertex and, when the eviction policy executes, add an edge labeled "evicted" from each identified feed instance vertex to it.

Pros:
- Simple to implement
- Retains all the details of evicted feed instances for historical queries

Cons:
- Storage and performance issues as the graph DB keeps growing
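
Approach 3 might look like the following on a toy in-memory graph. The "EVICTED" vertex id, the "instance-of" label and the function names are hypothetical; nothing here is Falcon's API.

```python
# Toy in-memory graph; every name here is hypothetical, not Falcon's API.
vertices = {
    "clicks": {"type": "feed-entity"},
    "clicks/2014-08-01": {"type": "feed-instance"},
    "clicks/2014-08-02": {"type": "feed-instance"},
    "EVICTED": {"type": "evicted"},   # the common Evicted vertex
}
edges = [
    ("clicks/2014-08-01", "instance-of", "clicks"),
    ("clicks/2014-08-02", "instance-of", "clicks"),
]


def mark_evicted(instance_ids):
    """Approach 3: keep every vertex, add an "evicted" edge to the common vertex."""
    for vid in instance_ids:
        edges.append((vid, "evicted", "EVICTED"))


def evicted_instances():
    """Evicted instances are the sources of the "evicted" edges."""
    return [src for src, label, dst in edges if label == "evicted"]


mark_evicted(["clicks/2014-08-01"])
```

No vertex is removed, so the vertex count never shrinks and every eviction adds one more edge, which is the growth concern listed under Cons.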

*Approach 4:*
When the retention policy executes, add an "evicted" property to each identified feed instance vertex. Do cleanup based on a time limit, which ought to be available, to keep the graph DB from growing and causing storage/performance issues [FALCON-335|https://issues.apache.org/jira/browse/FALCON-335].

Pros:
- Retains all the details of evicted feed instances for historical queries

Cons:
- Storage and performance issues as the graph DB keeps growing
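
A minimal sketch of Approach 4 with a time-limited cleanup, again on a toy in-memory structure. It assumes the "evicted" property stores the eviction time in epoch seconds and that the time limit is passed in as `keep_seconds`; both are assumptions made for illustration, as are the function names.

```python
# Toy in-memory vertex store; every name here is hypothetical, not Falcon's API.
vertices = {
    "clicks/2014-08-01": {"type": "feed-instance"},
    "clicks/2014-08-02": {"type": "feed-instance"},
}


def mark_evicted(instance_ids, now):
    """Approach 4: flag the vertex instead of deleting it."""
    for vid in instance_ids:
        # Assumption: the property value is the eviction time in epoch seconds.
        vertices[vid]["evicted"] = now


def purge_expired(now, keep_seconds):
    """Cleanup pass: delete vertices evicted at least keep_seconds ago."""
    expired = [
        vid for vid, props in vertices.items()
        if "evicted" in props and now - props["evicted"] >= keep_seconds
    ]
    for vid in expired:
        del vertices[vid]
    return expired


mark_evicted(["clicks/2014-08-01"], now=1000)
too_early = purge_expired(now=1010, keep_seconds=100)   # still within the limit
purged = purge_expired(now=1100, keep_seconds=100)      # limit reached
```

Until the cleanup pass runs, the evicted vertex stays available for historical queries; afterwards it is gone, just as in Approach 1.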

In addition, the decision to purge the vertices can be based on user input on whether to preserve the history. In that case multiple approaches have to be implemented.
Instead of deleting vertices right away, there can be a time limit after which the DB cleanup is done.

Approach 4 has been identified as a feasible solution. Please comment if you have any concerns or input.

Thanks!




> Process lineage information for Retention policies
> --------------------------------------------------
>
>                 Key: FALCON-594
>                 URL: https://issues.apache.org/jira/browse/FALCON-594
>             Project: Falcon
>          Issue Type: Sub-task
>            Reporter: Sowmya Ramesh
>            Assignee: Sowmya Ramesh
>
> Falcon currently addresses process executions and not data lifecycle policies. This task should address adding this information.


