You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@atlas.apache.org by Deep Singh <de...@gmail.com> on 2020/12/11 20:01:10 UTC

Review Request 73081: ATLAS-4076: Creating an asset from existing asset adds duplicate Atlas audit entries

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/
-----------------------------------------------------------

Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and Sarath Subramanian.


Bugs: ATLAS-4076
    https://issues.apache.org/jira/browse/ATLAS-4076


Repository: atlas


Description
-------

Observations:
=============
Have a hive table and attach classification to it on Atlas. Enable propagation on the attached classification.
When you drive a new table from this hive table, the new table will have the propagated classification, as expected.
However, the entity audits of the newly derived table has multiple "Propagated Classification Added" enteries. 

If table derivation is done using Hive Beeline, there are 5 such entries per propagated classification.
Using Spark-shell, 3 such entries were observed per propagated classification.

Expected behaviour is to have just 1 entry per propagated classification.

Analysis:
=========
After detecting relationship and creating relationship edge, the propagated enteties(classifications) are notified to entityChangeListner through entityChangeNotifier. However details of the propagated enteties are not passed directly to notifier, but through request context (buffered into addedPropagation list). 

After processing every edge, AtlasRelationshipStore manager sends notification to entityChangeListner, which simply gets all the items in request context buffer list. 

In this issue, Hive sends and event which has multiple relationships, and only one relationship has propagated entities, but due to multiple notifications(which is correct) same buffer list is processed multipletimes (which is wrong).

Following are the list of created relationships 
Created relationship edge from [hive_table] --> [hive_storagedesc] using edge label: [__hive_table.sd] 
Created relationship edge from [hive_table] --> [hive_column] using edge label: [__hive_table.columns] 
Created relationship edge from [hive_table] --> [hive_table_ddl] using edge label: [r:hive_table_ddl_queries] 
Created relationship edge from [hive_table] --> [hive_db] using edge label: [__hive_table.db] 
Created relationship edge from [hive_process] --> [hive_process_execution] using edge label: [r:hive_process_process_executions] 
Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.outputs]
Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.inputs]
===================================================================================================
Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.outputs] 
Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.inputs] 
Created relationship edge from [hive_column_lineage] --> [hive_process] using edge label: [__hive_column_lineage.query] 

In the above list the highlited one has propagated classificatin, but subscequent 3 relationships sends 3 more notifications, resulting 3 extra entries for same classification in entity audits.

At the end entityChangeNotifier, while processing mutated entities, explicetly notify for any pending propagated entities and once again buffer list in request context is processed. Resulting in 4th extra entry in audits.

Fix:
====

One option was to send the details of propagated entities directly to notifier and not rely on the request context. It required lot of code change.
Other option was to clear the buffer in the request context after processing it in entityChangeNotifier.

This review request is with the second aproach.


Diffs
-----

  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
  server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 


Diff: https://reviews.apache.org/r/73081/diff/1/


Testing
-------

Manual testing was done using both hive and spark.
precommit test were success
https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console


Thanks,

Deep Singh


Re: Review Request 73081: ATLAS-4076: Creating an asset from existing asset adds duplicate Atlas audit entries

Posted by Deep Singh <de...@gmail.com>.

> On Dec. 15, 2020, 5:13 p.m., Ashutosh Mestry wrote:
> > repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java
> > Lines 362 (patched)
> > <https://reviews.apache.org/r/73081/diff/1/?file=2243171#file2243171line362>
> >
> >     Refactor: Can you move the clearAddedPropagations to notifyProagatedEntities.

No, It is here in this method where propagations are extracted from context. Therefore only this method must clear the extracted propagatinos in the context.


> On Dec. 15, 2020, 5:13 p.m., Ashutosh Mestry wrote:
> > repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java
> > Lines 364 (patched)
> > <https://reviews.apache.org/r/73081/diff/1/?file=2243171#file2243171line364>
> >
> >     Will removing all proapgations after notification is done work?

Yes, I guess so. Because these propagations are kept in context precisely to pass them to notifier. Once processing is done we must clear it from context.


- Deep


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/#review222342
-----------------------------------------------------------


On Dec. 11, 2020, 9:47 p.m., Deep Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73081/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2020, 9:47 p.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-4076
>     https://issues.apache.org/jira/browse/ATLAS-4076
> 
> 
> Repository: atlas
> 
> 
> Description
> -------
> 
> Observations:
> =============
> Have a hive table and attach classification to it on Atlas. Enable propagation on the attached classification.
> When you drive a new table from this hive table, the new table will have the propagated classification, as expected.
> However, the entity audits of the newly derived table has multiple "Propagated Classification Added" enteries. 
> 
> If table derivation is done using Hive Beeline, there are 5 such entries per propagated classification.
> Using Spark-shell, 3 such entries were observed per propagated classification.
> 
> Expected behaviour is to have just 1 entry per propagated classification.
> 
> Analysis:
> =========
> After detecting relationship and creating relationship edge, the propagated enteties(classifications) are notified to entityChangeListner through entityChangeNotifier. However details of the propagated enteties are not passed directly to notifier, but through request context (buffered into addedPropagation list). 
> 
> After processing every edge, AtlasRelationshipStore manager sends notification to entityChangeListner, which simply gets all the items in request context buffer list. 
> 
> In this issue, Hive sends event which has multiple relationships, and only one relationship has propagated entities, but due to multiple notifications(which is correct) same buffer list is processed multipletimes (which is wrong).
> 
> Following are the list of created relationships 
> Created relationship edge from [hive_table] --> [hive_storagedesc] using edge label: [__hive_table.sd] 
> Created relationship edge from [hive_table] --> [hive_column] using edge label: [__hive_table.columns] 
> Created relationship edge from [hive_table] --> [hive_table_ddl] using edge label: [r:hive_table_ddl_queries] 
> Created relationship edge from [hive_table] --> [hive_db] using edge label: [__hive_table.db] 
> Created relationship edge from [hive_process] --> [hive_process_execution] using edge label: [r:hive_process_process_executions] 
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.outputs]
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.inputs]
> ===================================================================================================
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.outputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.inputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_process] using edge label: [__hive_column_lineage.query] 
> 
> In the above list the highlited one has propagated classificatin, but subscequent 3 relationships sends 3 more notifications, resulting 3 extra entries for same classification in entity audits.
> 
> At the end entityChangeNotifier, while processing mutated entities, explicetly notify for any pending propagated entities and once again buffer list in request context is processed. Resulting in 4th extra entry in audits.
> 
> Fix:
> ====
> 
> One option was to send the details of propagated entities directly to notifier and not rely on the request context. It required lot of code change.
> Other option was to clear the buffer in the request context after processing it in entityChangeNotifier.
> 
> This review request is with the second aproach.
> 
> 
> Diffs
> -----
> 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
>   server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 
> 
> 
> Diff: https://reviews.apache.org/r/73081/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing was done using both hive and spark.
> precommit test were success
> https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console
> 
> 
> Thanks,
> 
> Deep Singh
> 
>


Re: Review Request 73081: ATLAS-4076: Creating an asset from existing asset adds duplicate Atlas audit entries

Posted by Ashutosh Mestry via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/#review222342
-----------------------------------------------------------




repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java
Lines 362 (patched)
<https://reviews.apache.org/r/73081/#comment311407>

    Refactor: Can you move the clearAddedPropagations to notifyProagatedEntities.



repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java
Lines 364 (patched)
<https://reviews.apache.org/r/73081/#comment311408>

    Will removing all proapgations after notification is done work?


- Ashutosh Mestry


On Dec. 11, 2020, 9:47 p.m., Deep Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73081/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2020, 9:47 p.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-4076
>     https://issues.apache.org/jira/browse/ATLAS-4076
> 
> 
> Repository: atlas
> 
> 
> Description
> -------
> 
> Observations:
> =============
> Have a hive table and attach classification to it on Atlas. Enable propagation on the attached classification.
> When you drive a new table from this hive table, the new table will have the propagated classification, as expected.
> However, the entity audits of the newly derived table has multiple "Propagated Classification Added" enteries. 
> 
> If table derivation is done using Hive Beeline, there are 5 such entries per propagated classification.
> Using Spark-shell, 3 such entries were observed per propagated classification.
> 
> Expected behaviour is to have just 1 entry per propagated classification.
> 
> Analysis:
> =========
> After detecting relationship and creating relationship edge, the propagated enteties(classifications) are notified to entityChangeListner through entityChangeNotifier. However details of the propagated enteties are not passed directly to notifier, but through request context (buffered into addedPropagation list). 
> 
> After processing every edge, AtlasRelationshipStore manager sends notification to entityChangeListner, which simply gets all the items in request context buffer list. 
> 
> In this issue, Hive sends event which has multiple relationships, and only one relationship has propagated entities, but due to multiple notifications(which is correct) same buffer list is processed multipletimes (which is wrong).
> 
> Following are the list of created relationships 
> Created relationship edge from [hive_table] --> [hive_storagedesc] using edge label: [__hive_table.sd] 
> Created relationship edge from [hive_table] --> [hive_column] using edge label: [__hive_table.columns] 
> Created relationship edge from [hive_table] --> [hive_table_ddl] using edge label: [r:hive_table_ddl_queries] 
> Created relationship edge from [hive_table] --> [hive_db] using edge label: [__hive_table.db] 
> Created relationship edge from [hive_process] --> [hive_process_execution] using edge label: [r:hive_process_process_executions] 
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.outputs]
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.inputs]
> ===================================================================================================
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.outputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.inputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_process] using edge label: [__hive_column_lineage.query] 
> 
> In the above list the highlited one has propagated classificatin, but subscequent 3 relationships sends 3 more notifications, resulting 3 extra entries for same classification in entity audits.
> 
> At the end entityChangeNotifier, while processing mutated entities, explicetly notify for any pending propagated entities and once again buffer list in request context is processed. Resulting in 4th extra entry in audits.
> 
> Fix:
> ====
> 
> One option was to send the details of propagated entities directly to notifier and not rely on the request context. It required lot of code change.
> Other option was to clear the buffer in the request context after processing it in entityChangeNotifier.
> 
> This review request is with the second aproach.
> 
> 
> Diffs
> -----
> 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
>   server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 
> 
> 
> Diff: https://reviews.apache.org/r/73081/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing was done using both hive and spark.
> precommit test were success
> https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console
> 
> 
> Thanks,
> 
> Deep Singh
> 
>


Re: Review Request 73081: ATLAS-4076: Creating an asset from existing asset adds duplicate Atlas audit entries

Posted by Ashutosh Mestry via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/#review222341
-----------------------------------------------------------


Ship it!




Ship It!

- Ashutosh Mestry


On Dec. 11, 2020, 9:47 p.m., Deep Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73081/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2020, 9:47 p.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-4076
>     https://issues.apache.org/jira/browse/ATLAS-4076
> 
> 
> Repository: atlas
> 
> 
> Description
> -------
> 
> Observations:
> =============
> Have a hive table and attach classification to it on Atlas. Enable propagation on the attached classification.
> When you drive a new table from this hive table, the new table will have the propagated classification, as expected.
> However, the entity audits of the newly derived table has multiple "Propagated Classification Added" enteries. 
> 
> If table derivation is done using Hive Beeline, there are 5 such entries per propagated classification.
> Using Spark-shell, 3 such entries were observed per propagated classification.
> 
> Expected behaviour is to have just 1 entry per propagated classification.
> 
> Analysis:
> =========
> After detecting relationship and creating relationship edge, the propagated enteties(classifications) are notified to entityChangeListner through entityChangeNotifier. However details of the propagated enteties are not passed directly to notifier, but through request context (buffered into addedPropagation list). 
> 
> After processing every edge, AtlasRelationshipStore manager sends notification to entityChangeListner, which simply gets all the items in request context buffer list. 
> 
> In this issue, Hive sends event which has multiple relationships, and only one relationship has propagated entities, but due to multiple notifications(which is correct) same buffer list is processed multipletimes (which is wrong).
> 
> Following are the list of created relationships 
> Created relationship edge from [hive_table] --> [hive_storagedesc] using edge label: [__hive_table.sd] 
> Created relationship edge from [hive_table] --> [hive_column] using edge label: [__hive_table.columns] 
> Created relationship edge from [hive_table] --> [hive_table_ddl] using edge label: [r:hive_table_ddl_queries] 
> Created relationship edge from [hive_table] --> [hive_db] using edge label: [__hive_table.db] 
> Created relationship edge from [hive_process] --> [hive_process_execution] using edge label: [r:hive_process_process_executions] 
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.outputs]
> Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.inputs]
> ===================================================================================================
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.outputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.inputs] 
> Created relationship edge from [hive_column_lineage] --> [hive_process] using edge label: [__hive_column_lineage.query] 
> 
> In the above list the highlited one has propagated classificatin, but subscequent 3 relationships sends 3 more notifications, resulting 3 extra entries for same classification in entity audits.
> 
> At the end entityChangeNotifier, while processing mutated entities, explicetly notify for any pending propagated entities and once again buffer list in request context is processed. Resulting in 4th extra entry in audits.
> 
> Fix:
> ====
> 
> One option was to send the details of propagated entities directly to notifier and not rely on the request context. It required lot of code change.
> Other option was to clear the buffer in the request context after processing it in entityChangeNotifier.
> 
> This review request is with the second aproach.
> 
> 
> Diffs
> -----
> 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
>   server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 
> 
> 
> Diff: https://reviews.apache.org/r/73081/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing was done using both hive and spark.
> precommit test were success
> https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console
> 
> 
> Thanks,
> 
> Deep Singh
> 
>


Re: Review Request 73081: ATLAS-4076: Creating an asset from existing asset adds duplicate Atlas audit entries

Posted by Deep Singh <de...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73081/
-----------------------------------------------------------

(Updated Dec. 11, 2020, 9:47 p.m.)


Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nikhil Bonte, and Sarath Subramanian.


Bugs: ATLAS-4076
    https://issues.apache.org/jira/browse/ATLAS-4076


Repository: atlas


Description (updated)
-------

Observations:
=============
Have a hive table and attach classification to it on Atlas. Enable propagation on the attached classification.
When you drive a new table from this hive table, the new table will have the propagated classification, as expected.
However, the entity audits of the newly derived table has multiple "Propagated Classification Added" enteries. 

If table derivation is done using Hive Beeline, there are 5 such entries per propagated classification.
Using Spark-shell, 3 such entries were observed per propagated classification.

Expected behaviour is to have just 1 entry per propagated classification.

Analysis:
=========
After detecting relationship and creating relationship edge, the propagated enteties(classifications) are notified to entityChangeListner through entityChangeNotifier. However details of the propagated enteties are not passed directly to notifier, but through request context (buffered into addedPropagation list). 

After processing every edge, AtlasRelationshipStore manager sends notification to entityChangeListner, which simply gets all the items in request context buffer list. 

In this issue, Hive sends event which has multiple relationships, and only one relationship has propagated entities, but due to multiple notifications(which is correct) same buffer list is processed multipletimes (which is wrong).

Following are the list of created relationships 
Created relationship edge from [hive_table] --> [hive_storagedesc] using edge label: [__hive_table.sd] 
Created relationship edge from [hive_table] --> [hive_column] using edge label: [__hive_table.columns] 
Created relationship edge from [hive_table] --> [hive_table_ddl] using edge label: [r:hive_table_ddl_queries] 
Created relationship edge from [hive_table] --> [hive_db] using edge label: [__hive_table.db] 
Created relationship edge from [hive_process] --> [hive_process_execution] using edge label: [r:hive_process_process_executions] 
Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.outputs]
Created relationship edge from [hive_process] --> [hive_table] using edge label: [__Process.inputs]
===================================================================================================
Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.outputs] 
Created relationship edge from [hive_column_lineage] --> [hive_column] using edge label: [__Process.inputs] 
Created relationship edge from [hive_column_lineage] --> [hive_process] using edge label: [__hive_column_lineage.query] 

In the above list the highlited one has propagated classificatin, but subscequent 3 relationships sends 3 more notifications, resulting 3 extra entries for same classification in entity audits.

At the end entityChangeNotifier, while processing mutated entities, explicetly notify for any pending propagated entities and once again buffer list in request context is processed. Resulting in 4th extra entry in audits.

Fix:
====

One option was to send the details of propagated entities directly to notifier and not rely on the request context. It required lot of code change.
Other option was to clear the buffer in the request context after processing it in entityChangeNotifier.

This review request is with the second aproach.


Diffs
-----

  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityChangeNotifier.java 32ad65e7a 
  server-api/src/main/java/org/apache/atlas/RequestContext.java 32ffddde1 


Diff: https://reviews.apache.org/r/73081/diff/1/


Testing
-------

Manual testing was done using both hive and spark.
precommit test were success
https://ci-builds.apache.org/job/Atlas/job/PreCommit-ATLAS-Build-Test/263/console


Thanks,

Deep Singh