Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/12/13 17:44:00 UTC

[jira] [Work logged] (HIVE-25800) loadDynamicPartitions in Hive.java should not load all partitions of a managed table

     [ https://issues.apache.org/jira/browse/HIVE-25800?focusedWorklogId=695244&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-695244 ]

ASF GitHub Bot logged work on HIVE-25800:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Dec/21 17:43
            Start Date: 13/Dec/21 17:43
    Worklog Time Spent: 10m 
      Work Description: sourabh912 opened a new pull request #2868:
URL: https://github.com/apache/hive/pull/2868


   
   ### What changes were proposed in this pull request?
   HIVE-20661 improved the loadDynamicPartitions() API in Hive.java so that partitions are no longer
   added to HMS one by one. That change fetches all existing partitions of the table from HMS and
   compares them with the dynamic partitions list to decide which partitions are old and which are new
   and must be added to HMS (in batches). The call to fetch all partitions has introduced a performance
   regression for tables with a large number of partitions (on the order of 100K).
   
   This is fixed for external tables in HIVE-25178. For ACID tables, however, HIVE-25187 is still open.
   Until we have an appropriate fix in HIVE-25187, we can skip fetching all partitions. Instead, in the
   thread pool that loads each partition individually, call getPartition() to check whether the partition
   already exists in HMS. This adds one getPartition() call for every partition to be loaded dynamically,
   but it no longer fetches all existing partitions of the table.
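
The per-partition check described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the pull request: `PartitionLookup`, `Classification` and `classify` are hypothetical names, partition specs are reduced to plain strings, and how the real metastore client signals a missing partition is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerPartitionLookupSketch {

  /** Hypothetical stand-in for a metastore lookup; this is NOT Hive's real client API. */
  interface PartitionLookup {
    /** Returns true if a partition with this spec is already registered. */
    boolean exists(String partitionSpec) throws Exception;
  }

  /** Outcome of checking each dynamically loaded partition individually. */
  static final class Classification {
    final Set<String> existing = ConcurrentHashMap.newKeySet();
    final Set<String> toCreate = ConcurrentHashMap.newKeySet();
  }

  /**
   * Issues one lookup per loaded partition from a thread pool, instead of
   * listing every partition of the table up front.
   */
  static Classification classify(List<String> loadedSpecs, PartitionLookup lookup, int threads)
      throws Exception {
    Classification result = new Classification();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Void>> futures = new ArrayList<>();
      for (String spec : loadedSpecs) {
        Callable<Void> task = () -> {
          if (lookup.exists(spec)) {
            result.existing.add(spec);
          } else {
            result.toCreate.add(spec);
          }
          return null;
        };
        futures.add(pool.submit(task));
      }
      // Wait for every task; a failed lookup surfaces here as an ExecutionException.
      for (Future<Void> future : futures) {
        future.get();
      }
    } finally {
      pool.shutdown();
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    // Pretend these two partitions are already registered in the metastore.
    Set<String> registered = new HashSet<>(Arrays.asList("ds=2021-12-11", "ds=2021-12-12"));
    Classification c =
        classify(Arrays.asList("ds=2021-12-12", "ds=2021-12-13"), registered::contains, 4);
    System.out.println("existing=" + c.existing + " toCreate=" + c.toCreate);
  }
}
```

The number of lookups in this sketch grows with the number of partitions being loaded rather than with the number of partitions already in the table, which is the trade-off the description above relies on.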
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   This change improves existing logic, so it relies on the existing tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 695244)
    Remaining Estimate: 0h
            Time Spent: 10m

> loadDynamicPartitions in Hive.java should not load all partitions of a managed table 
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-25800
>                 URL: https://issues.apache.org/jira/browse/HIVE-25800
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>            Reporter: Sourabh Goyal
>            Assignee: Sourabh Goyal
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HIVE-20661 improved the loadDynamicPartitions() API in Hive.java so that partitions are no longer added to HMS one by one. As part of that improvement, the following code was introduced:
> {code:java}
> // fetch all the partitions matching the part spec using the partition iterable
> // this way the maximum batch size configuration parameter is considered
> PartitionIterable partitionIterable = new PartitionIterable(Hive.get(), tbl, partSpec,
>           conf.getInt(MetastoreConf.ConfVars.BATCH_RETRIEVE_MAX.getVarname(), 300));
> Iterator<Partition> iterator = partitionIterable.iterator();
> // Match valid partition path to partitions
> while (iterator.hasNext()) {
>   Partition partition = iterator.next();
>   partitionDetailsMap.entrySet().stream()
>           .filter(entry -> entry.getValue().fullSpec.equals(partition.getSpec()))
>           .findAny().ifPresent(entry -> {
>             entry.getValue().partition = partition;
>             entry.getValue().hasOldPartition = true;
>           });
> }
> {code}
> The above code fetches all existing partitions of the table from HMS and compares them with the dynamic partitions list to decide which partitions are old and which are new and must be added to HMS (in batches). The call to fetch all partitions has introduced a performance regression for tables with a large number of partitions (on the order of 100K).
>  
> This is fixed for external tables in https://issues.apache.org/jira/browse/HIVE-25178. For ACID tables, however, there is an open Jira (HIVE-25187). Until we have an appropriate fix in HIVE-25187, we can apply the following:
> Skip fetching all partitions. Instead, in the thread pool that loads each partition individually, call get_partition() to check whether the partition already exists in HMS.
> This introduces an additional getPartition() call for every partition to be loaded dynamically, but removes the fetch of all existing partitions for the table.
> I believe this is fine: for tables with a small number of existing partitions in HMS, the getPartition() calls won't add much overhead, and for tables with a large number of existing partitions, the change avoids fetching all partitions from HMS.
> cc - [~lpinter] [~ngangam] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)