Posted to issues@hive.apache.org by "Ádám Szita (Jira)" <ji...@apache.org> on 2022/03/04 11:29:00 UTC

[jira] [Resolved] (HIVE-25975) Optimize ClusteredWriter for bucketed Iceberg tables

     [ https://issues.apache.org/jira/browse/HIVE-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ádám Szita resolved HIVE-25975.
-------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Committed to master. Thanks for the thorough reviews from [~pvary] and [~Marton Bod] 

> Optimize ClusteredWriter for bucketed Iceberg tables
> ----------------------------------------------------
>
>                 Key: HIVE-25975
>                 URL: https://issues.apache.org/jira/browse/HIVE-25975
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The first version of the ClusteredWriter in Hive-Iceberg will be lenient for bucketed tables: records do not need to be ordered by bucket value, and the writer simply closes its current file and opens a new one whenever it receives an out-of-order record.
> This is suboptimal in the long term because it creates many small files. Spark uses a UDF to compute the bucket value for each record, so it can order the records by bucket value and achieve optimal clustering.
> The proposed change adds a new UDF that uses Iceberg's bucket transformation function to produce bucket values from constants or any column input. The UDF supports every type that Iceberg's bucket transform supports, except UUID.
> This UDF is then used in SortedDynPartitionOptimizer to sort data during write if the target Iceberg table has bucket transform partitioning.
> To enable this, Hive has been extended so that storage handlers can define custom sorting expressions, which are passed to the FileSink operator's DynPartContext during dynamic partitioning writes.
> The lenient version of ClusteredWriter in patched-iceberg-core has been removed, as it is no longer needed once this feature is in place.
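
To illustrate why ordering records by bucket value matters, here is a minimal Java sketch. It is not the code from the patch: the class and method names (BucketSortSketch, bucket, countFiles, sortByBucket) are hypothetical, and the bucket function is a stand-in based on String.hashCode(), whereas Iceberg's bucket transform uses a 32-bit Murmur3 hash. The sketch only models a lenient writer that opens a new file whenever the bucket value changes, and compares the file count for unsorted and bucket-sorted input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class BucketSortSketch {

    // Hypothetical stand-in for a bucket function. Iceberg's real bucket
    // transform hashes the value with 32-bit Murmur3; String.hashCode() is
    // used here only to keep the sketch self-contained.
    static int bucket(String value, int numBuckets) {
        return Math.floorMod(value.hashCode(), numBuckets);
    }

    // Models a lenient clustered writer: every time the incoming record's
    // bucket differs from the previous one, the current file is closed and
    // a new file is opened. Returns the number of files opened.
    static int countFiles(List<String> records, int numBuckets) {
        int files = 0;
        int currentBucket = -1; // no file open yet
        for (String record : records) {
            int b = bucket(record, numBuckets);
            if (b != currentBucket) {
                files++;
                currentBucket = b;
            }
        }
        return files;
    }

    // Returns a copy of the records ordered by bucket value, which is the
    // effect a bucket-based sort expression has on the writer's input.
    static List<String> sortByBucket(List<String> records, int numBuckets) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((String r) -> bucket(r, numBuckets)));
        return sorted;
    }

    public static void main(String[] args) {
        int numBuckets = 3;
        List<String> records =
            Arrays.asList("a", "b", "c", "a", "b", "c", "a", "b", "c");

        // Unsorted input: the bucket changes on every record.
        System.out.println(countFiles(records, numBuckets)); // prints 9

        // Sorted by bucket: one file per distinct bucket.
        System.out.println(countFiles(sortByBucket(records, numBuckets), numBuckets)); // prints 3
    }
}
```

With the input sorted by bucket value, the modelled writer opens at most one file per bucket, which is the clustering behaviour the description attributes to the sort step.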



--
This message was sent by Atlassian Jira
(v8.20.1#820001)