You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@atlas.apache.org by "Radhika Kundam (Jira)" <ji...@apache.org> on 2023/05/17 17:34:00 UTC

[jira] [Commented] (ATLAS-4746) hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive

    [ https://issues.apache.org/jira/browse/ATLAS-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723554#comment-17723554 ] 

Radhika Kundam commented on ATLAS-4746:
---------------------------------------

[~umesh.padashetty] 

Two different issues exists as per the description.
 # process/process_executions were getting created when there is no scope of lineage creation. 
 # With ATLAS-4734 fix, Atlas started linking the process to the entity, hence it started failing.

As part of this jira, I am fixing the issue of linking the process when there is no need of lineage which was introduced with ATLAS-4734

But to fix the first issue, we need to analyze further why process/process_executions were getting created in first place when there is no need of lineage. To track that issue created other Jira# ATLAS-4753

> hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-4746
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4746
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.3.0
>            Reporter: Umesh Padashetty
>            Assignee: Radhika Kundam
>            Priority: Critical
>         Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png, Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19 PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
>
>
> Queries ran:
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc;
> insert into test_hive_lineage_4 values ('qwer', '2');
> update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
> As you can see, these are simple DML queries, and not DDL
> We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT, DELETE, and UPDATE){*} queries NOR should we be tracking the audits. 
> Jiras via which DML operations audits were skipped:
>  * https://issues.apache.org/jira/browse/ATLAS-3188
>  * https://issues.apache.org/jira/browse/ATLAS-3198
> But all the issues were related to audits and not the lineage. In all these cases, lineage was not generated for the DML UPDATE query
> But observing that we are now capturing lineage for simple DML Update query
> Relationship after running
> {code:java}
> create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
> !Screenshot 2023-05-02 at 6.28.18 PM.png!
> !Screenshot 2023-05-02 at 6.28.34 PM.png!
> As seen, there is no lineage generated. Good so far 
> Then I ran
> {code:java}
> insert into test_hive_lineage_4 values ('qwer', '2'); {code}
> No lineage was generated. Good so far 
> !Screenshot 2023-05-02 at 6.47.19 PM.png!
> Then I ran 
> {code:java}
> update test_hive_lineage_4 set name = 'vwxy' where id = 2;  {code}
> This immediately generated a hive_process and a hive_process_execution
> Interestingly, hive_process with the following name was generated. As you can see, it has DELETE in the process name, when in reality this was an UPDATE DML. Another cause of concern?
> {code:java}
> QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000 {code}
> !Screenshot 2023-05-02 at 6.29.05 PM.png!
>   !Screenshot 2023-05-02 at 6.29.10 PM.png!
> I then ran the same update query 100+ times, it created 100+ UNIQUE (timestamp delimited) hive_process_executions
> !Screenshot 2023-05-02 at 6.50.16 PM.png!
> This is a disaster since every UPDATE query now generates a process_execution. 
> Customers can run 1000s of update queries, which are mostly of no use for atlas, but this issue is now leading to the generation of 1000s of process_executions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)