Posted to dev@falcon.apache.org by "Srikanth Sundarrajan (JIRA)" <ji...@apache.org> on 2014/03/02 05:06:19 UTC

[jira] [Commented] (FALCON-288) Persist lineage information into a persistent store

    [ https://issues.apache.org/jira/browse/FALCON-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917279#comment-13917279 ] 

Srikanth Sundarrajan commented on FALCON-288:
---------------------------------------------

Why do we need a user node attached to the cluster vertex? That relation isn't very useful and is likely to be misleading as well.
{code}
public void addClusterEntity(Cluster clusterEntity) {
...
+        addUser(clusterVertex);
{code}

addVertex() checks for the existence of the vertex; however, no similar check is done for edges. On every restart you might create redundant edges between vertex pairs, at least for all elements of the entity graph.
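
One way to guard against this is to look for an existing edge with the same endpoints and label before adding a new one. The following is only a minimal sketch on a toy in-memory graph, not the graph API used by the patch; `IdempotentEdgeGraph`, `Edge` and `addEdgeIfAbsent` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory graph; a hypothetical stand-in for the real graph store.
class IdempotentEdgeGraph {
    // An edge is identified here by (from, label, to).
    static final class Edge {
        final String from;
        final String label;
        final String to;

        Edge(String from, String label, String to) {
            this.from = from;
            this.label = label;
            this.to = to;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    // Returns the existing edge if one with the same endpoints and label
    // is already present; otherwise creates and returns a new one.
    Edge addEdgeIfAbsent(String from, String label, String to) {
        for (Edge e : edges) {
            if (e.from.equals(from) && e.label.equals(label) && e.to.equals(to)) {
                return e;
            }
        }
        Edge created = new Edge(from, label, to);
        edges.add(created);
        return created;
    }

    int edgeCount() {
        return edges.size();
    }
}
```

Replaying the same entity definitions after a restart then leaves the edge count unchanged. In a real store the linear scan would presumably be replaced by a lookup over the edges incident to the source vertex.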

In code snippets similar to this, it might be useful not to assume that the default edge label is "output", but to actually check for it and throw an assertion error otherwise. It generally gets very hard to debug issues related to graph sanity as the graph gets larger, since there are no relational, unique or property value constraints available in most graph implementations.
{code}
+    public void addProcessFeedEdge(Vertex processVertex, Vertex feedVertex, String edgeLabel) {
+        if (edgeLabel.equals(FEED_PROCESS_EDGE_LABEL)) {
+            feedVertex.addEdge(edgeLabel, processVertex);
+        } else {
+            processVertex.addEdge(edgeLabel, feedVertex);
+        }
+    }
{code}
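
A stricter variant of the method above could look roughly like the sketch below. It uses plain strings instead of graph vertices so that it stands alone; the class name `StrictEdgeLabels`, the constant `PROCESS_FEED_EDGE_LABEL` and both label values are assumptions, since the patch's actual constants are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class StrictEdgeLabels {
    // Assumed label values; the real constants in the patch may differ.
    static final String FEED_PROCESS_EDGE_LABEL = "input";
    static final String PROCESS_FEED_EDGE_LABEL = "output";

    // Records edges as "from -label-> to" strings for illustration.
    final List<String> edges = new ArrayList<>();

    void addProcessFeedEdge(String processVertex, String feedVertex, String edgeLabel) {
        if (FEED_PROCESS_EDGE_LABEL.equals(edgeLabel)) {
            // Feed is consumed by the process: edge goes feed -> process.
            edges.add(feedVertex + " -" + edgeLabel + "-> " + processVertex);
        } else if (PROCESS_FEED_EDGE_LABEL.equals(edgeLabel)) {
            // Process produces the feed: edge goes process -> feed.
            edges.add(processVertex + " -" + edgeLabel + "-> " + feedVertex);
        } else {
            // Fail fast instead of silently treating unknown labels as "output".
            throw new AssertionError("Unexpected edge label: " + edgeLabel);
        }
    }
}
```

With this shape, a mistyped or newly introduced label fails at the point of insertion rather than surfacing later as a malformed graph.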

This is going to be a little tricky. If you leave vertices behind even after all their incident edges are removed, the database is going to grow monotonically and cause performance issues down the line. One technique that I have used in the past with graph databases is, upon edge removal, to check whether the vertex is left with no edges and, if so, to delete the vertex as well. A few gotchas with respect to this particular graph are
1. Entity elements aren't to be removed
2. Convenience relations may be added to instance vertices, which aren't to be considered when counting remaining edges. 
These can be achieved by tagging the vertices and edges with appropriate properties.
{code}
+    public void removeEdge(Vertex fromVertex, Object toVertexName, String edgeLabel) {
...
+                // remove the edge and not the vertex since instances are pointing to this vertex
...
{code}
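
The cleanup-on-removal idea described above, including the two gotchas, might be sketched as follows. This is a toy in-memory graph, not the patch's graph API; `OrphanCleanupGraph`, the `entity` flag on vertices and the `convenience` flag on edges are hypothetical stand-ins for the tagging properties mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrphanCleanupGraph {
    static final class Vertex {
        final String name;
        final boolean entity; // tagged entity vertices are never cleaned up

        Vertex(String name, boolean entity) {
            this.name = name;
            this.entity = entity;
        }
    }

    static final class Edge {
        final Vertex from;
        final String label;
        final Vertex to;
        final boolean convenience; // ignored when counting remaining edges

        Edge(Vertex from, String label, Vertex to, boolean convenience) {
            this.from = from;
            this.label = label;
            this.to = to;
            this.convenience = convenience;
        }
    }

    private final Map<String, Vertex> vertices = new HashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    Vertex addVertex(String name, boolean entity) {
        return vertices.computeIfAbsent(name, n -> new Vertex(n, entity));
    }

    void addEdge(Vertex from, String label, Vertex to, boolean convenience) {
        edges.add(new Edge(from, label, to, convenience));
    }

    // Removes the matching edge(s), then drops either endpoint if orphaned.
    void removeEdge(Vertex from, String label, Vertex to) {
        edges.removeIf(e -> e.from == from && e.to == to && e.label.equals(label));
        deleteIfOrphaned(from);
        deleteIfOrphaned(to);
    }

    private void deleteIfOrphaned(Vertex v) {
        if (v.entity) {
            return; // gotcha 1: entity elements aren't to be removed
        }
        for (Edge e : edges) {
            if (!e.convenience && (e.from == v || e.to == v)) {
                return; // still referenced by a non-convenience edge
            }
        }
        // Gotcha 2: only convenience edges (if any) remain; drop them with
        // the vertex. This sketch does not cascade to their other endpoints.
        edges.removeIf(e -> e.from == v || e.to == v);
        vertices.remove(v.name);
    }

    boolean hasVertex(String name) {
        return vertices.containsKey(name);
    }

    int edgeCount() {
        return edges.size();
    }
}
```

For example, removing the only lineage edge between a feed instance and a process instance would delete both instance vertices, along with any convenience edges hanging off them, while entity vertices stay in place.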

It is reasonable to leave graph elements behind after an entity is deleted, to allow historical queries. However, some cleanup based on a time limit ought to be available. This is required even for active entities. It might also be worth giving the user an option to delete an entity along with its historical data.
{code}
+    @Override
+    public void onRemove(Entity entity) throws FalconException {
+        // do nothing, we'd leave the deleted entities as-is for historical purposes
+        // should we mark 'em as deleted?
+    }
{code}
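
To make the suggestion concrete, here is a rough sketch of soft deletion combined with time-based cleanup and an optional purge of history. It is not based on the patch's classes; `LineageRetention`, `Element`, the `purgeHistory` flag and `purgeOlderThan` are all hypothetical, and timestamps are plain epoch milliseconds for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

class LineageRetention {
    static final class Element {
        final String entityName;    // entity this graph element belongs to
        final long createdAtMillis; // creation time, epoch milliseconds
        boolean deleted;            // soft-delete marker

        Element(String entityName, long createdAtMillis) {
            this.entityName = entityName;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Element> elements = new HashMap<>();

    void add(String id, String entityName, long createdAtMillis) {
        elements.put(id, new Element(entityName, createdAtMillis));
    }

    // By default, keep the elements for historical queries but mark them
    // as deleted. With purgeHistory set, drop them immediately instead.
    void onRemove(String entityName, boolean purgeHistory) {
        if (purgeHistory) {
            elements.values().removeIf(e -> e.entityName.equals(entityName));
            return;
        }
        for (Element e : elements.values()) {
            if (e.entityName.equals(entityName)) {
                e.deleted = true;
            }
        }
    }

    // Time-based cleanup; applies to active and deleted entities alike.
    // Returns the number of elements removed.
    int purgeOlderThan(long cutoffMillis) {
        int before = elements.size();
        elements.values().removeIf(e -> e.createdAtMillis < cutoffMillis);
        return before - elements.size();
    }

    boolean contains(String id) {
        return elements.containsKey(id);
    }

    boolean isDeleted(String id) {
        Element e = elements.get(id);
        return e != null && e.deleted;
    }
}
```

A periodic job could call `purgeOlderThan` with a cutoff derived from a configured retention period, while the delete API could expose the purge flag to the user.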

Is the motivation for adding classification & groups relationships to every instance to provide a "WHAT-WAS" view of the feed instance? Is that a more standard ask? The current model is generic enough to provide both "WHAT-WAS" and "WHAT-IS", but at a higher cost. If that is a required feature, we can leave it as is.
{code}
+    public void addFeedInstances(String[] feedNames, String[] feedInstancePaths,
...
+            addDataClassification(feed.getTags(), feedInstance);
+            addGroups(feed.getGroups(), feedInstance);
{code}

Why is workflowInstance a separate node in the graph and not a set of properties on the process instance? I can imagine this being useful in re-run scenarios, but I don't see that run relationship being captured.

There are so many relationships being created that it might be very useful to test each of these functions independently.


> Persist lineage information into a persistent store
> ---------------------------------------------------
>
>                 Key: FALCON-288
>                 URL: https://issues.apache.org/jira/browse/FALCON-288
>             Project: Falcon
>          Issue Type: Sub-task
>    Affects Versions: 0.5
>            Reporter: Venkatesh Seetharam
>            Assignee: Venkatesh Seetharam
>              Labels: lineage
>         Attachments: Dependency Graph.png, FALCON-288-Hive-Review.patch, FALCON-288-review-v1.patch, FALCON-288-review.patch, FALCON-288-v1.patch, Lineage Over Dependency.png
>
>
> Need to evaluate the store - rdbms vs graph db. Leaning towards latter since the data is hierarchical.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)