You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@atlas.apache.org by "Umesh Padashetty (Jira)" <ji...@apache.org> on 2023/05/02 13:44:00 UTC
[jira] [Created] (ATLAS-4746) hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive
Umesh Padashetty created ATLAS-4746:
---------------------------------------
Summary: hive_process and hive_process_execution (lineage) being generated for simple DML UPDATE queries run via hive
Key: ATLAS-4746
URL: https://issues.apache.org/jira/browse/ATLAS-4746
Project: Atlas
Issue Type: Bug
Affects Versions: 2.3.0
Reporter: Umesh Padashetty
Attachments: Screenshot 2023-05-02 at 6.28.18 PM.png, Screenshot 2023-05-02 at 6.28.34 PM.png, Screenshot 2023-05-02 at 6.29.05 PM.png, Screenshot 2023-05-02 at 6.29.10 PM.png, Screenshot 2023-05-02 at 6.47.19 PM.png, Screenshot 2023-05-02 at 6.50.16 PM.png
Queries ran:
{code:java}
create table test_hive_lineage_4 (name string, id int) stored as orc;
insert into test_hive_lineage_4 values ('qwer', '2');
update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
As you can see, these are simple DML queries, and not DDL
We should NOT be tracking lineage for any of the DML ({*}SELECT, INSERT, DELETE, and UPDATE){*} queries NOR should we be tracking the audits.
Jiras via which DML operations audits were skipped:
* https://issues.apache.org/jira/browse/ATLAS-3188
* https://issues.apache.org/jira/browse/ATLAS-3198
But all the issues were related to audits and not the lineage. In all these cases, lineage was not generated for the DML UPDATE query
But observing that we are now capturing lineage for simple DML Update query
Relationship after running
{code:java}
create table test_hive_lineage_4 (name string, id int) stored as orc; {code}
!Screenshot 2023-05-02 at 6.28.18 PM.png!
!Screenshot 2023-05-02 at 6.28.34 PM.png!
As seen, there is no lineage generated. Good so far
Then I ran
{code:java}
insert into test_hive_lineage_4 values ('qwer', '2'); {code}
No lineage was generated. Good so far
!Screenshot 2023-05-02 at 6.47.19 PM.png!
Then I ran
{code:java}
update test_hive_lineage_4 set name = 'vwxy' where id = 2; {code}
This immediately generated a hive_process and a hive_process_execution
Interestingly, hive_process with the following name was generated. As you can see, it has DELETE in the process name, when in reality this was an UPDATE DML. Another cause of concern?
{code:java}
QUERY:default.test_hive_lineage_4@cm:1683032252000->:DELETE:default.test_hive_lineage_4@cm:1683032252000 {code}
!Screenshot 2023-05-02 at 6.29.05 PM.png!
!Screenshot 2023-05-02 at 6.29.10 PM.png!
I then ran the same update query 100+ times, it created 100+ UNIQUE (timestamp delimited) hive_process_executions
!Screenshot 2023-05-02 at 6.50.16 PM.png!
This is a disaster since every UPDATE query now generates a process_execution.
Customers can run 1000s of update queries, which are mostly of no use for atlas, but this issue is now leading to the generation of 1000s of process_executions
--
This message was sent by Atlassian Jira
(v8.20.10#820010)