You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@atlas.apache.org by "Umesh Padashetty (Jira)" <ji...@apache.org> on 2023/05/02 13:44:00 UTC

[jira] [Created] (ATLAS-4746) hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive

Umesh Padashetty created ATLAS-4746:
---------------------------------------

             Summary: hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive
                 Key: ATLAS-4746
                 URL: https://issues.apache.org/jira/browse/ATLAS-4746
             Project: Atlas
          Issue Type: Bug
    Affects Versions: 2.3.0
            Reporter: Umesh Padashetty
         Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png, Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19 PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png

Queries ran:
{code:java}
create table test_hive_lineage_4 (name string, id int) stored as orc;

insert into test_hive_lineage_4 values ('qwer', '2');

update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
As you can see, these are simple DML queries, and not DDL

We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT, DELETE, and UPDATE){*} queries NOR should we be tracking the audits. 

Jiras via which DML operations audits were skipped:
 * https://issues.apache.org/jira/browse/ATLAS-3188
 * https://issues.apache.org/jira/browse/ATLAS-3198

But all the issues were related to audits and not the lineage. In all these cases, lineage was not generated for the DML UPDATE query

But observing that we are now capturing lineage for simple DML Update query

Relationship after running
{code:java}
create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
!Screenshot 2023-05-02 at 6.28.18 PM.png!

!Screenshot 2023-05-02 at 6.28.34 PM.png!

As seen, there is no lineage generated. Good so far 

Then I ran
{code:java}
insert into test_hive_lineage_4 values ('qwer', '2'); {code}
No lineage was generated. Good so far 

!Screenshot 2023-05-02 at 6.47.19 PM.png!

Then I ran 
{code:java}
update test_hive_lineage_4 set name = 'vwxy' where id = 2;  {code}
This immediately generated a hive_process and a hive_process_execution

Interestingly, hive_process with the following name was generated. As you can see, it has DELETE in the process name, when in reality this was an UPDATE DML. Another cause of concern?
{code:java}
QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000 {code}
!Screenshot 2023-05-02 at 6.29.05 PM.png!

  !Screenshot 2023-05-02 at 6.29.10 PM.png!

I then ran the same update query 100+ times, it created 100+ UNIQUE (timestamp delimited) hive_process_executions

!Screenshot 2023-05-02 at 6.50.16 PM.png!

This is a disaster since every UPDATE query now generates a process_execution. 

Customers can run 1000s of update queries, which are mostly of no use for atlas, but this issue is now leading to the generation of 1000s of process_executions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)