Posted to dev@atlas.apache.org by "Madhan Neethiraj (JIRA)" <ji...@apache.org> on 2018/11/19 21:51:00 UTC

[jira] [Created] (ATLAS-2975) Hive hook generates duplicate column_lineage entities

Madhan Neethiraj created ATLAS-2975:
---------------------------------------

             Summary: Hive hook generates duplicate column_lineage entities
                 Key: ATLAS-2975
                 URL: https://issues.apache.org/jira/browse/ATLAS-2975
             Project: Atlas
          Issue Type: Bug
          Components: atlas-intg
    Affects Versions: 1.1.0, 0.8.3, 1.0.0
            Reporter: Madhan Neethiraj
            Assignee: Madhan Neethiraj


The Hive hook is expected to create one column-lineage entity for each column in the output table. However, when multiple partitions are involved, the hook might generate multiple column-lineage entities for each output column: one entity per partition. Depending on the number of partitions, this can result in a large number of duplicate column-lineage entities. Such duplicate entities should be avoided.

Here is sample HiveQL to reproduce this issue:

{noformat}
CREATE TABLE visitors(name STRING, dob DATE) PARTITIONED BY (yob INT);
CREATE TABLE visitors_log(name STRING, dob DATE);

INSERT INTO TABLE visitors_log VALUES('John',  '1980-08-08'),
                                     ('Jack',  '1980-09-09'),
                                     ('Kevin', '1990-10-10'),
                                     ('Ken',   '1990-11-11'),
                                     ('Larry', '1995-12-12');

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE visitors PARTITION(yob) SELECT name, dob, YEAR(dob) yob FROM visitors_log;
{noformat}

In the above case, columns visitors.name and visitors.dob will each have 3 column-lineage entities, one for each of the partitions 1980, 1990 and 1995.
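
To illustrate the expected behavior, here is a minimal Python sketch of collapsing per-partition lineage records into one entry per output column. It is not the Atlas Hive hook code (which is written in Java) and not the actual fix. The record layout and the names raw_lineage and dedupe_column_lineage are assumptions made up for this example.

```python
from collections import OrderedDict

# Hypothetical lineage records for the INSERT above, one per
# (output column, partition) pair: (output column, partition, input column).
raw_lineage = [
    ("visitors.name", "yob=1980", "visitors_log.name"),
    ("visitors.dob",  "yob=1980", "visitors_log.dob"),
    ("visitors.name", "yob=1990", "visitors_log.name"),
    ("visitors.dob",  "yob=1990", "visitors_log.dob"),
    ("visitors.name", "yob=1995", "visitors_log.name"),
    ("visitors.dob",  "yob=1995", "visitors_log.dob"),
]


def dedupe_column_lineage(records):
    """Return one lineage entry per output column, ignoring the partition."""
    merged = OrderedDict()
    for output_col, _partition, input_col in records:
        inputs = merged.setdefault(output_col, [])
        # Record each distinct input column only once per output column.
        if input_col not in inputs:
            inputs.append(input_col)
    return merged


lineage = dedupe_column_lineage(raw_lineage)
for output_col, inputs in lineage.items():
    print(output_col, "<-", ", ".join(inputs))
```

Keying on the output column alone, rather than on the output column plus partition, reduces the six records in this sample to two entries: one for visitors.name and one for visitors.dob.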



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)