Posted to user@atlas.apache.org by "Rao, Chaitra" <Ch...@intuit.com> on 2021/04/01 17:49:17 UTC

[Atlas] - Need inputs on Tag Propagation

Hi All,

We have a use case that creates lineage between two Hive tables.
This works well as long as the input table of the lineage process has no classification (tag) associated with it.

However, we have seen cases in production where the input Hive table has classifications and is also associated with 40,000 other lineage processes.

Another point to note is that classification propagation is modelled as ‘ONE_TO_TWO’.

The failing case is as follows: when the 40,001st lineage process is created between the input and output tables, because tag propagation is set to ‘ONE_TO_TWO’, propagation also starts running across the other 40,000 lineage processes that the input Hive table is associated with.

We are on Atlas 2.0, and in this version the operation did not complete even after 10 hours.

On further investigation, we found that the Atlas 2.0 code uses the Gremlin query “TAG_PROPAGATION_IMPACTED_INSTANCES_FOR_REMOVAL”, and this query is responsible for fetching the 40,000 lineage processes.

We also saw that in later versions of Atlas the Gremlin query has been replaced with an in-memory traversal, but it still traverses all 40,000 processes (https://issues.apache.org/jira/browse/ATLAS-3563).

However, what we want in our use case is to propagate the classification only to the new output Hive table created in the lineage workflow, instead of iterating over the 40,000 older processes.
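
To make the fan-out concrete, here is a minimal sketch in plain Java. It is a toy model, not Atlas code: the class name, the `impactedFrom` and `buildLineage` helpers, and the vertex names are all hypothetical. It assumes only that propagation follows edges from the input table through each process to its output, and it counts how many vertices a full traversal from the tagged input table would visit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Atlas classes or methods.
final class PropagationFanOut {

    // Breadth-first walk over the edges a tag would propagate along.
    // Returns every vertex reachable from 'start', excluding 'start' itself.
    static Set<String> impactedFrom(Map<String, List<String>> edges, String start) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String vertex = queue.poll();
            for (String next : edges.getOrDefault(vertex, List.of())) {
                if (!next.equals(start) && visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    // One tagged input table feeding 'processCount' processes,
    // each of which writes to its own output table.
    static Map<String, List<String>> buildLineage(int processCount) {
        Map<String, List<String>> edges = new HashMap<>();
        List<String> processes = new ArrayList<>();
        for (int i = 1; i <= processCount; i++) {
            String process = "process_" + i;
            processes.add(process);
            edges.put(process, List.of("output_" + i));
        }
        edges.put("input_table", processes);
        return edges;
    }

    public static void main(String[] args) {
        // 40,000 existing processes plus the newly created one.
        Map<String, List<String>> edges = buildLineage(40_001);
        // Every process and every output table is visited: prints 80002.
        System.out.println(impactedFrom(edges, "input_table").size());
    }
}
```

In this model, a traversal that starts at the input table always visits every process and output, no matter which process was just added. That is the cost we would like to avoid.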

To achieve this, we have made the following code change in DeleteHandlerV1.

Before

final List<AtlasVertex> propagatedEntityVertices = CollectionUtils.isNotEmpty(classificationVertices)
        ? graphHelper.getIncludedImpactedVerticesWithReferences(toVertex, getRelationshipGuid(edge))
        : null;

After

final List<AtlasVertex> propagatedEntityVertices = new ArrayList<>();
propagatedEntityVertices.add(toVertex);

We wanted to know whether you see any side effects if we take this route, where we do not fetch the impacted vertices at all.

It would be very helpful if you could please shed some light on this.

Many Thanks,
Chaitra

Re: [Atlas] - Need inputs on Tag Propagation

Posted by Radhika Kundam <ra...@gmail.com>.

Hi Chaitra,

When creating a new lineage process between a source table and a destination
table, the expected behaviour is that only the source table's tags are
propagated to the new process, and existing processes should not be impacted.
If you are seeing any discrepancy with this expected behaviour, please provide
more details about the use case and the steps to reproduce. Also provide the
Atlas version you are referring to.

When a propagation-enabled classification/tag is added to an entity, that tag
should be propagated to all impacted vertices.
With your code changes, tag propagation will not proceed any further to the
impacted vertices, which is not the expected behavior.
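
As a rough illustration of that side effect, the sketch below uses a toy graph in plain Java. It is not Atlas code: the class, the method names and the vertex names are hypothetical. It assumes the modified code hands only toVertex to propagation, and lists the downstream vertices that a full traversal would have reached but the shortcut leaves untagged.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Atlas classes or methods.
final class TruncatedPropagation {

    // Full traversal: toVertex plus everything reachable downstream of it.
    static Set<String> reachableIncluding(Map<String, List<String>> edges, String toVertex) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(toVertex);
        queue.add(toVertex);
        while (!queue.isEmpty()) {
            for (String next : edges.getOrDefault(queue.poll(), List.of())) {
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    // Vertices the full traversal tags but a "toVertex only" list does not.
    static Set<String> missedByShortcut(Map<String, List<String>> edges, String toVertex) {
        Set<String> missed = new LinkedHashSet<>(reachableIncluding(edges, toVertex));
        missed.remove(toVertex);
        return missed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
                "new_process", List.of("output_table"),
                "output_table", List.of("downstream_process"),
                "downstream_process", List.of("downstream_table"));
        // Prints [output_table, downstream_process, downstream_table]
        System.out.println(missedByShortcut(edges, "new_process"));
    }
}
```

In this toy model, everything past the first vertex stays untagged whenever the output table already feeds other lineage.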

If the use case is to disable tag propagation to impacted vertices, then you
can uncheck the "Propagate" option while adding the tag to the entity.
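
If the tag is added programmatically rather than through the UI, the same choice can probably be made in the request body. The sketch below is a minimal example under stated assumptions: it assumes the v2 REST endpoint POST /api/atlas/v2/entity/guid/{guid}/classifications and a "propagate" field on the classification, so please verify both against your Atlas version. The class name, base URL, GUID and tag name are placeholders, and authentication is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative helper, not part of Atlas. The endpoint path and the
// "propagate" field are assumptions to check against your Atlas version.
final class NonPropagatingTagRequest {

    // Builds a JSON array holding one classification.
    // Naive string building: assumes typeName needs no JSON escaping.
    static String classificationBody(String typeName, boolean propagate) {
        return "[{\"typeName\":\"" + typeName + "\",\"propagate\":" + propagate + "}]";
    }

    // Builds (but does not send) the request that attaches the tag with
    // propagation switched off. Authentication headers are omitted.
    static HttpRequest build(String atlasBaseUrl, String entityGuid, String typeName) {
        URI uri = URI.create(atlasBaseUrl + "/api/atlas/v2/entity/guid/" + entityGuid + "/classifications");
        return HttpRequest.newBuilder()
                .uri(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(classificationBody(typeName, false)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        HttpRequest request = build("http://localhost:21000", "example-guid", "PII");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(classificationBody("PII", false));
    }
}
```

The request would then be sent with java.net.http.HttpClient, using whatever credentials your deployment requires.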

Thanks,
Radhika

On Sun, Apr 11, 2021 at 10:21 PM Ashutosh Mestry <am...@apache.org> wrote:

> Please give us some more time to get back to you.
>
> ~ ashutosh

Re: [Atlas] - Need inputs on Tag Propagation

Posted by Ashutosh Mestry <am...@apache.org>.

Please give us some more time to get back to you.

~ ashutosh
