Posted to dev@hive.apache.org by "George Pachitariu (Jira)" <ji...@apache.org> on 2020/07/21 17:08:00 UTC
[jira] [Created] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez
George Pachitariu created HIVE-23891:
----------------------------------------
Summary: Using UNION sql clause and speculative execution can cause file duplication in Tez
Key: HIVE-23891
URL: https://issues.apache.org/jira/browse/HIVE-23891
Project: Hive
Issue Type: Bug
Reporter: George Pachitariu
Assignee: George Pachitariu
Hello,
This can happen in the following scenario:
- the execution engine is Tez;
- speculative execution is on;
- the query inserts into a table and its last step is a UNION sql clause.
The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating, Hive doesn't take that extra layer into account: it only deduplicates the folders, not the files inside them.
So for a query like this:
{code:sql}
insert overwrite table union_all
select * from union_first_part
union all
select * from union_second_part;
{code}
Afterwards, the folder structure can look like this (one possible example):
{code:java}
.../union_all/HIVE_UNION_SUBDIR_1/000000_0
.../union_all/HIVE_UNION_SUBDIR_1/000000_1
.../union_all/HIVE_UNION_SUBDIR_2/000000_1
{code}
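To make it easier to see which entries in the listing above are duplicates, here is a minimal Python sketch (not Hive code). It assumes the file names follow a `<task id>_<attempt number>` pattern, so that `000000_0` and `000000_1` in the same folder are two attempts of the same task. The helper name `group_by_task` and the relative paths are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_task(paths):
    """Group attempt numbers by (parent folder, task id).

    Assumes names look like '<task id>_<attempt number>'.
    """
    groups = defaultdict(list)
    for p in map(PurePosixPath, paths):
        task_id, _, attempt = p.name.rpartition("_")
        groups[(str(p.parent), task_id)].append(int(attempt))
    return dict(groups)


# Relative paths mirroring the example listing (hypothetical prefix omitted).
listing = [
    "union_all/HIVE_UNION_SUBDIR_1/000000_0",
    "union_all/HIVE_UNION_SUBDIR_1/000000_1",
    "union_all/HIVE_UNION_SUBDIR_2/000000_1",
]

groups = group_by_task(listing)
# Any (folder, task id) pair with more than one attempt is a duplicate.
duplicated = {k: sorted(v) for k, v in groups.items() if len(v) > 1}
print(duplicated)
```

Under that naming assumption, only `HIVE_UNION_SUBDIR_1` holds a duplicated task; the single file in `HIVE_UNION_SUBDIR_2` is a separate output and must be kept.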
The attached patch increases the number of folder levels that Hive checks recursively for duplicates when there is a UNION in Tez.
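The effect of checking one more folder level can be sketched in Python as follows. This is an illustration of the idea only, not the patch itself: the function `dedup`, its `levels` parameter, and the "keep the highest attempt number" rule are assumptions made for the example, and Hive's real selection rule may differ. File names are again assumed to follow `<task id>_<attempt number>`.

```python
import os
import tempfile


def dedup(directory, levels):
    """Remove duplicate attempt files, descending `levels` folder levels.

    levels=0 only inspects files directly inside `directory`.
    Keeps the highest attempt number per task id (illustrative policy only).
    Returns the list of removed paths.
    """
    removed = []
    best = {}  # task id -> (attempt number, path) of the file kept so far
    # sorted() materializes the listing, so deleting files below is safe.
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_dir():
            if levels > 0:
                removed += dedup(entry.path, levels - 1)
            continue
        task_id, _, attempt = entry.name.rpartition("_")
        candidate = (int(attempt), entry.path)
        kept = best.get(task_id)
        if kept is None:
            best[task_id] = candidate
            continue
        loser = min(kept, candidate)
        best[task_id] = max(kept, candidate)
        os.remove(loser[1])
        removed.append(loser[1])
    return removed


# Recreate the example layout in a temporary directory.
root = tempfile.mkdtemp()
for rel in (
    "HIVE_UNION_SUBDIR_1/000000_0",
    "HIVE_UNION_SUBDIR_1/000000_1",
    "HIVE_UNION_SUBDIR_2/000000_1",
):
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

shallow = dedup(root, levels=0)  # sees only the two subfolders, removes nothing
deep = dedup(root, levels=1)     # descends into the UNION subfolders
print(shallow, deep)
```

With `levels=0` the table folder contains only subfolders, so no file is ever compared and the duplicate survives. With one extra level, the two attempts in `HIVE_UNION_SUBDIR_1` are compared and one of them is removed.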
Feel free to reach out if you have any questions :).