
Posted to issues@hive.apache.org by "mahesh kumar behera (Jira)" <ji...@apache.org> on 2021/05/13 07:31:00 UTC

[jira] [Commented] (HIVE-24663) Batch process in ColStatsProcessor

    [ https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343755#comment-17343755 ] 

mahesh kumar behera commented on HIVE-24663:
--------------------------------------------

The slowness originally reported is caused by the way column stats are processed at HMS. The stats are updated one by one at HMS using JDO connections. This resulted in performance issues, because JDO does a lot of conversion. So the proper fix is to batch the processing into single SQL statements and execute them using direct SQL.
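
To illustrate the idea only, here is a minimal sketch of splitting a large list of per-partition stats rows into fixed-size batches and building one parameterized multi-row INSERT per batch. It is not the actual HMS change: the class name, the table name STATS_DEMO, the column names and the batch size are all hypothetical, and the multi-row VALUES syntax is assumed to be supported by the backing database (it is in PostgreSQL and MySQL).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names below are illustrative, not the HMS schema or API.
class BatchedStatsSketch {

    // Split items into consecutive sublists of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Build one parameterized INSERT that carries rowCount rows,
    // so a whole batch goes to the database in a single statement.
    static String buildInsert(String table, List<String> columns, int rowCount) {
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        String row = "(" + placeholders + ")";
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rowCount, row));
    }

    public static void main(String[] args) {
        // Stand-in for the stats of many partitions (hypothetical data).
        List<Integer> partitionIds = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            partitionIds.add(i);
        }
        List<String> columns = List.of("PART_ID", "COLUMN_NAME", "NUM_NULLS");
        for (List<Integer> batch : partition(partitionIds, 10)) {
            String sql = buildInsert("STATS_DEMO", columns, batch.size());
            System.out.println(batch.size() + " rows -> " + sql.length() + " chars of SQL");
        }
    }
}
```

Each generated statement would then be bound and executed once per batch over a plain JDBC connection, instead of issuing one JDO update per partition and column.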

> Batch process in ColStatsProcessor
> ----------------------------------
>
>                 Key: HIVE-24663
>                 URL: https://issues.apache.org/jira/browse/HIVE-24663
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: mahesh kumar behera
>            Priority: Major
>              Labels: performance
>
> When a large number of partitions (>20K) is processed, ColStatsProcessor runs into DB issues.
> {{ db.setPartitionColumnStatistics(request);}} gets stuck for hours, and in some cases Postgres stops processing.
> It would be good to introduce small batches for stats gathering in ColStatsProcessor instead of a single bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199



--
This message was sent by Atlassian Jira
(v8.3.4#803005)