Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/11/29 07:08:00 UTC

[jira] [Work logged] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez

     [ https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=829549&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-829549 ]

ASF GitHub Bot logged work on HIVE-23891:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Nov/22 07:07
            Start Date: 29/Nov/22 07:07
    Worklog Time Spent: 10m 
      Work Description: dengzhhu653 commented on PR #1294:
URL: https://github.com/apache/hive/pull/1294#issuecomment-1330176420

   Hello @georgepachitariu, could you please submit another PR against the latest master? Thank you!




Issue Time Tracking
-------------------

    Worklog Id:     (was: 829549)
    Time Spent: 2.5h  (was: 2h 20m)

> Using UNION sql clause and speculative execution can cause file duplication in Tez
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-23891
>                 URL: https://issues.apache.org/jira/browse/HIVE-23891
>             Project: Hive
>          Issue Type: Bug
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23891.1.patch
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Hello,
> this can happen in the following scenario:
>  - the execution engine is Tez;
>  - speculative execution is on;
>  - the query inserts into a table and its last step is a UNION SQL clause.
> The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating, Hive doesn't take that extra layer into account: it deduplicates only at the folder level and not the files inside those folders.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
>     select * from union_first_part
> union all
>     select * from union_second_part;
> {code}
> The folder structure afterwards will be like this (a possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).
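
The deduplication idea described in the issue can be illustrated with a minimal sketch. This is not Hive's actual implementation: the function name `dedupe` is hypothetical, file names are assumed to follow the `<taskId>_<attemptId>` pattern seen in the listing, and keeping the highest attempt number is an arbitrary choice made only for illustration. The point is that files must be grouped per task inside each `HIVE_UNION_SUBDIR_*` folder, which a check that stops at the folder level never does.

```python
from collections import defaultdict
from posixpath import basename, dirname


def dedupe(paths):
    """Keep one file per (directory, task id); return (kept, dropped), both sorted.

    Assumption: names look like '<taskId>_<attemptId>'. Keeping the highest
    attempt is an illustrative rule, not necessarily what Hive does.
    """
    groups = defaultdict(list)
    for path in paths:
        task_id, _, attempt = basename(path).partition("_")
        # Group by the file's own directory, so each union subdirectory
        # is deduplicated independently of the others.
        groups[(dirname(path), task_id)].append((int(attempt), path))

    kept, dropped = [], []
    for candidates in groups.values():
        candidates.sort()
        kept.append(candidates[-1][1])
        dropped.extend(path for _, path in candidates[:-1])
    return sorted(kept), sorted(dropped)


# The example listing from the issue description, paths left truncated as given.
files = [
    ".../union_all/HIVE_UNION_SUBDIR_1/000000_0",
    ".../union_all/HIVE_UNION_SUBDIR_1/000000_1",
    ".../union_all/HIVE_UNION_SUBDIR_2/000000_1",
]
kept, dropped = dedupe(files)
print("kept:", kept)
print("dropped:", dropped)
```

In this sketch the two attempts of task `000000` in `HIVE_UNION_SUBDIR_1` collapse to one file, while the file in `HIVE_UNION_SUBDIR_2` is untouched because it belongs to a different branch of the UNION.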



--
This message was sent by Atlassian Jira
(v8.20.10#820010)