Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/11/27 00:45:00 UTC

[jira] [Work logged] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez

     [ https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=517204&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517204 ]

ASF GitHub Bot logged work on HIVE-23891:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Nov/20 00:44
            Start Date: 27/Nov/20 00:44
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on pull request #1294:
URL: https://github.com/apache/hive/pull/1294#issuecomment-734518903


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 517204)
    Time Spent: 2h 10m  (was: 2h)

> Using UNION sql clause and speculative execution can cause file duplication in Tez
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-23891
>                 URL: https://issues.apache.org/jira/browse/HIVE-23891
>             Project: Hive
>          Issue Type: Bug
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-23891.1.patch
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Hello,
> this can happen when all of the following hold:
>  - the execution engine is Tez;
>  - speculative execution is on;
>  - the query inserts into a table and its last step is a UNION SQL clause.
> The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating, Hive does not take that extra layer into account: it deduplicates at the folder level only and never checks the files inside those subdirectories.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
>     select * from union_first_part
> union all
>     select * from union_second_part;
> {code}
> Afterwards, the folder structure can look like this (one possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).
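
The effect described above can be illustrated with a small sketch. It is not Hive's code and does not reproduce the attached patch. It assumes that output files are named `<taskId>_<attemptId>`, as in the listing above, and that two files in the same directory with the same task id are duplicates produced by speculative attempts. The function `find_duplicates`, its `max_depth` parameter, and the rule of keeping the lowest attempt are hypothetical choices made for illustration only.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming scheme: "<taskId>_<attemptId>", as in the listing above.
TASK_FILE = re.compile(r"^(\d+)_(\d+)$")


def find_duplicates(paths, max_depth):
    """Return files that duplicate another attempt of the same task.

    paths     -- file paths relative to the table directory
    max_depth -- directory levels below the table directory to inspect
                 (0 = only files directly in the table directory)
    """
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        depth = len(path.parts) - 1  # number of parent directories
        match = TASK_FILE.match(path.name)
        if depth > max_depth or match is None:
            continue  # too deep for this check, or not a task output file
        task_id, attempt = match.group(1), int(match.group(2))
        groups[(path.parent, task_id)].append((attempt, raw))

    duplicates = []
    for attempts in groups.values():
        attempts.sort()
        # Keep the lowest attempt; this is an arbitrary choice for the sketch.
        duplicates.extend(raw for _, raw in attempts[1:])
    return sorted(duplicates)


files = [
    "HIVE_UNION_SUBDIR_1/000000_0",
    "HIVE_UNION_SUBDIR_1/000000_1",
    "HIVE_UNION_SUBDIR_2/000000_1",
]
shallow = find_duplicates(files, max_depth=0)  # ignores the UNION subdirectories
deep = find_duplicates(files, max_depth=1)     # also checks one level deeper
```

With `max_depth=0` the check never looks inside the `HIVE_UNION_SUBDIR_*` folders, so nothing is flagged. With `max_depth=1` the second attempt in `HIVE_UNION_SUBDIR_1` is reported, while the single file in `HIVE_UNION_SUBDIR_2` is kept because it has no sibling from the same task.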



--
This message was sent by Atlassian Jira
(v8.3.4#803005)