You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Naveen Gangam (Jira)" <ji...@apache.org> on 2021/12/20 15:22:00 UTC
[jira] [Resolved] (HIVE-25800) loadDynamicPartitions in Hive.java should not load all partitions of a managed table
[ https://issues.apache.org/jira/browse/HIVE-25800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Naveen Gangam resolved HIVE-25800.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Fix has been pushed to master. Thank you for the patch [~sourabh912] and the review [~lpinter]
> loadDynamicPartitions in Hive.java should not load all partitions of a managed table
> -------------------------------------------------------------------------------------
>
> Key: HIVE-25800
> URL: https://issues.apache.org/jira/browse/HIVE-25800
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Reporter: Sourabh Goyal
> Assignee: Sourabh Goyal
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> HIVE-20661 added an improvement in loadDynamicPartitions() api in Hive.java to not add partitions one by one in HMS. As part of that improvement, following code was introduced:
> {code:java}
> // fetch all the partitions matching the part spec using the partition iterable
> // this way the maximum batch size configuration parameter is considered
> PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, partSpec,
> conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 300));
> Iterator<Partition> iterator = partitionIterable.iterator();
> // Match valid partition path to partitions
> while (iterator.hasNext()) {
> Partition partition = iterator.next();
> partitionDetailsMap.entrySet().stream()
> .filter(entry -> entry.getValue().fullSpec.equals(partition.getSpec()))
> .findAny().ifPresent(entry -> {
> entry.getValue().partition = partition;
> entry.getValue().hasOldPartition = true;
> });
> } {code}
> The above code fetches all the existing partitions for a table from HMS and compare that dynamic partitions list to decide old and new partitions to be added to HMS (in batches). The call to fetch all partitions has introduced a performance regression for tables with large number of partitions (of the order of 100K).
>
> This is fixed for external tables in https://issues.apache.org/jira/browse/HIVE-25178. However for ACID tables there is an open Jira(HIVE-25187). Until we have an appropriate fix in HIVE-25187, we can apply the following:
> Skip fetching all partitions. Instead, in the threadPool which loads each partition individually, call get_partition() to check if the partition already exists in HMS or not.
> This will introduce additional getPartition() call for every partition to be loaded dynamically but removes fetching all existing partitions for a table.
> I believe this is fine since for tables with small number of existing partitions in HMS - getPartitions() won't add too much overhead but for tables with large number of existing partitions, it will certainly avoid getting all partitions from HMS
> cc - [~lpinter] [~ngangam]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)