Posted to dev@atlas.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/11/20 00:02:00 UTC
[jira] [Commented] (ATLAS-2975) Hive hook generates duplicate column_lineage entities
[ https://issues.apache.org/jira/browse/ATLAS-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692456#comment-16692456 ]
ASF subversion and git services commented on ATLAS-2975:
--------------------------------------------------------
Commit ed795dc4c10ef56999ff57fa67739a1e126a2ccb in atlas's branch refs/heads/master from [~madhan@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=atlas.git;h=ed795dc ]
ATLAS-2975: updated Hive hook to avoid duplicate column-lineage entities; also updated Atlas server to skip duplicate column-lineage entities
> Hive hook generates duplicate column_lineage entities
> -----------------------------------------------------
>
> Key: ATLAS-2975
> URL: https://issues.apache.org/jira/browse/ATLAS-2975
> Project: Atlas
> Issue Type: Bug
> Components: atlas-intg
> Affects Versions: 1.0.0, 0.8.3, 1.1.0
> Reporter: Madhan Neethiraj
> Assignee: Madhan Neethiraj
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: ATLAS-2975-master.patch
>
>
> The Hive hook is expected to create one column-lineage entity for each column in the output table. However, when multiple partitions are involved, the Hive hook might generate multiple column-lineage entities for each output column, one entity per partition. This can result in a large number of duplicate column-lineage entities, depending on the number of partitions. Such duplicate entities should be avoided.
> Here is sample HiveQL to reproduce this issue:
> {noformat}
> CREATE TABLE visitors(name STRING, dob DATE) PARTITIONED BY (yob INT);
> CREATE TABLE visitors_log(name STRING, dob DATE);
> INSERT INTO TABLE visitors_log VALUES('John', '1980-08-08'),
> ('Jack', '1980-09-09'),
> ('Kevin', '1990-10-10'),
> ('Ken', '1990-11-11'),
> ('Larry', '1995-12-12');
> SET hive.exec.dynamic.partition.mode=nonstrict;
> INSERT INTO TABLE visitors PARTITION(yob) SELECT name, dob, YEAR(dob) yob FROM visitors_log;
> {noformat}
> In the above case, columns visitors.name and visitors.dob will each have 3 input lineage entities, one for each partition: 1980, 1990 and 1995.
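
The duplication described in the quoted issue, and one way such duplicates could be collapsed, can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from the commit: the `ColumnLineage` class, the `dedupe_column_lineage` helper, and the choice of (output column, input columns) as the identity key are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnLineage:
    """Hypothetical stand-in for a column-lineage entity (not an Atlas class)."""
    output_column: str     # e.g. "visitors.name"
    input_columns: tuple   # e.g. ("visitors_log.name",)
    partition: str         # partition whose write produced this entity


def dedupe_column_lineage(entities):
    """Keep the first entity seen for each (output column, input columns) pair.

    Assumption: two entities are duplicates when they describe the same
    output column derived from the same input columns, whatever the partition.
    """
    seen = set()
    unique = []
    for entity in entities:
        key = (entity.output_column, entity.input_columns)
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


# Simulate the repro: one entity per output column per partition written.
partitions = ["yob=1980", "yob=1990", "yob=1995"]
raw = [
    ColumnLineage(f"visitors.{col}", (f"visitors_log.{col}",), part)
    for part in partitions
    for col in ("name", "dob")
]

unique = dedupe_column_lineage(raw)
print(len(raw), len(unique))
```

With the three partitions from the repro, the simulated input holds one entity per column per partition, and the helper reduces that to one entity per output column, which matches the behaviour the issue description says is expected.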
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)