Posted to issues@hive.apache.org by "Yao Guangdong (Jira)" <ji...@apache.org> on 2021/12/29 09:27:00 UTC

[jira] [Updated] (HIVE-25836) Tez union all operation may cause duplicate data

     [ https://issues.apache.org/jira/browse/HIVE-25836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yao Guangdong updated HIVE-25836:
---------------------------------
    Summary: Tez union all operation may cause duplicate data  (was: Tez union operation may cause duplicate data)

> Tez union all operation may cause duplicate data
> ------------------------------------------------
>
>                 Key: HIVE-25836
>                 URL: https://issues.apache.org/jira/browse/HIVE-25836
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.0, 2.3.8
>            Reporter: Yao Guangdong
>            Priority: Critical
>
> When a query uses a union all operation on Tez, duplicate data can appear in some cases. This happens because the Tez union all operation can generate subdirectories inside the table or partition directory. Each subdirectory is named with a number, and the result data files are stored inside it. If speculative execution launches a speculative task, that task's result file is also written to the subdirectory. When the job finishes, the Hive client deletes the files left by duplicate tasks, but it checks only one directory level for them. Because the subdirectories exist, the duplicate task files inside them are not deleted, and duplicate data results.
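
The behaviour described above can be illustrated with a small sketch. This is not Hive's code: the function names, the `<task id>_<attempt id>` file-name pattern, and the rule of keeping the largest file per task are all assumptions made for illustration. It only shows why a check limited to one directory level misses duplicate files that sit inside the numbered union subdirectories, while a recursive check finds them.

```python
import os
import re
import tempfile

# Assumed file-name pattern for this sketch: <task id>_<attempt id>,
# e.g. 000000_0 and 000000_1 for two attempts of the same task.
ATTEMPT_FILE = re.compile(r"^(\d+)_(\d+)$")


def duplicate_attempt_files(directory, recursive):
    """Return paths of files from redundant task attempts under `directory`.

    Within each directory, one file per task id is kept (the largest one,
    an arbitrary rule chosen for this sketch); the others are reported.
    Subdirectories are visited only when `recursive` is True.
    """
    duplicates = []
    best = {}  # task id -> (size, path) of the file kept so far
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            if recursive:
                duplicates += duplicate_attempt_files(path, recursive)
            continue
        match = ATTEMPT_FILE.match(entry)
        if not match:
            continue
        task_id = match.group(1)
        size = os.path.getsize(path)
        if task_id not in best:
            best[task_id] = (size, path)
        elif size > best[task_id][0]:
            # The new file wins; the previously kept one becomes a duplicate.
            duplicates.append(best[task_id][1])
            best[task_id] = (size, path)
        else:
            duplicates.append(path)
    return duplicates


def build_example(root):
    """Create a partition directory with numbered union subdirectories.

    Subdirectory 1 holds two attempts of task 000000 (one speculative);
    subdirectory 2 holds a single attempt.
    """
    layout = [("1", "000000_0"), ("1", "000000_1"), ("2", "000000_0")]
    for sub, name in layout:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        with open(os.path.join(root, sub, name), "w") as handle:
            handle.write("rows")


if __name__ == "__main__":
    partition_dir = tempfile.mkdtemp()
    build_example(partition_dir)
    # One-level check: only sees the subdirectories, so finds nothing.
    print(duplicate_attempt_files(partition_dir, recursive=False))
    # Recursive check: finds the redundant attempt inside subdirectory 1.
    print(duplicate_attempt_files(partition_dir, recursive=True))
```

In this sketch the one-level check returns an empty list because the partition directory contains only subdirectories, which matches the situation the report describes.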



--
This message was sent by Atlassian Jira
(v8.20.1#820001)