Posted to issues@hive.apache.org by "Ádám Szita (Jira)" <ji...@apache.org> on 2022/04/04 15:37:00 UTC

[jira] [Commented] (HIVE-26110) bulk insert into partitioned table creates lots of files in iceberg

    [ https://issues.apache.org/jira/browse/HIVE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516906#comment-17516906 ] 

Ádám Szita commented on HIVE-26110:
-----------------------------------

Thanks for finding this [~rbalamohan]. I was able to reproduce this locally, and found what was most probably missing from my HIVE-25975 patch that introduced this feature:

The explain plan lacked this line for the Reduce Sink operator:
{code:java}
Map-reduce partition columns: _col23 (type: bigint) {code}
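For contrast, a complete Reduce Sink section should carry the partition columns right next to the key expressions. The snippet below is what I'd expect for this repro (illustrative, not copied from an actual plan):
{code:java}
Reduce Output Operator
  key expressions: _col23 (type: bigint)
  sort order: +
  Map-reduce partition columns: _col23 (type: bigint)
{code}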
So although the SortedDynPartitionOptimizer has taken care of setting the Iceberg table's partition column as a KEY column, it didn't mark it as a *partition* column in Map/Reduce terms. (I probably missed this because of the confusing terminology: as I see it, this is not a Hive table partition, but an MR job term for adjusting the shuffle phase.) This made every reducer write some share of the rows irrespective of the key distribution. The rows within each reducer task were nevertheless sorted on the key column - otherwise the ClusteredWriter class would have thrown an exception, as you mentioned earlier - so we ended up with
{code:java}
1 <= n <= reducerCount {code}
files for each partition.
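To illustrate why the partition columns matter: the shuffle picks the target reducer by hashing the Map-reduce partition columns, so once the Iceberg partition column is among them, all rows of one Iceberg partition land on the same reducer. A minimal sketch of that routing contract, mirroring Hadoop's HashPartitioner semantics (the class name and sample values are made up for illustration):
{code:java}
// Minimal sketch of MR-style shuffle routing; mirrors the logic of
// Hadoop's HashPartitioner#getPartition. Class name and sample values
// are invented for illustration.
public class ShuffleRoutingSketch {

  static int reducerFor(Object partitionKey, int numReducers) {
    // Mask the sign bit so the modulo result is never negative
    return (partitionKey.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  public static void main(String[] args) {
    int reducers = 4;
    // Rows sharing a partition key always route to the same reducer,
    // so each Iceberg partition gets written by exactly one task:
    System.out.println(reducerFor(20220404L, reducers));
    System.out.println(reducerFor(20220404L, reducers)); // same reducer
  }
}
{code}
Without that routing, each reducer receives rows from many partitions and opens a file per partition it sees, which is exactly the 1 <= n <= reducerCount blow-up above.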

Anyhow, what we probably need is to add a partCols.add() invocation beside [https://github.com/apache/hive/pull/3060/files#diff-b28bcf13b1a3e2d73139df50ed102fae4625032ea73337aa1c295bb3069e1499R651]; doing so solved the issue in my local repro. A rough sketch of the idea follows below.
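This sketch only models the one extra bookkeeping step; the variable names are assumptions for illustration, not the actual SortedDynPartitionOptimizer fields:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Rough sketch only: models the missing bookkeeping step, not the
// actual optimizer code. Names are invented for illustration.
public class PartColsFixSketch {
  public static void main(String[] args) {
    List<Integer> keyPositions = new ArrayList<>();       // ReduceSink KEY columns
    List<Integer> partitionPositions = new ArrayList<>(); // Map-reduce partition columns

    int icebergPartColPos = 23; // e.g. _col23 from the explain plan above

    keyPositions.add(icebergPartColPos);       // what HIVE-25975 already does
    partitionPositions.add(icebergPartColPos); // the missing partCols.add() step
  }
}
{code}
With that in place, the "Map-reduce partition columns" line should show up again in the Reduce Sink section of the explain plan.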

> bulk insert into partitioned table creates lots of files in iceberg
> -------------------------------------------------------------------
>
>                 Key: HIVE-26110
>                 URL: https://issues.apache.org/jira/browse/HIVE-26110
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> E.g., create the web_returns table from TPC-DS in Iceberg format and try to copy the data over from the regular table, along the lines of "insert into web_returns_iceberg as select * from web_returns".
> This inserts the data correctly, however there are a lot of files present in each partition. IMO, the dynamic sort optimisation isn't working correctly, and this causes records not to be grouped in the final phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)