Posted to issues@hive.apache.org by "Prasanth Jayachandran (JIRA)" <ji...@apache.org> on 2017/11/01 18:53:00 UTC

[jira] [Commented] (HIVE-17935) Turn on hive.optimize.sort.dynamic.partition by default

    [ https://issues.apache.org/jira/browse/HIVE-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234570#comment-16234570 ] 

Prasanth Jayachandran commented on HIVE-17935:
----------------------------------------------

Note that this might cause a performance regression for some jobs. Jobs whose partition columns have on the order of tens of distinct values may regress, because without this optimization they can run as map-only jobs, whereas this feature forces a reducer stage even for small jobs. In some cases reducer deduplication can bring gains, but where the optimization adds an extra reducer stage and the partition count is small, the job will slow down. The optimization is really beneficial when there are many partitions, which can otherwise cause queries to OOM or create GC pressure. In all cases it also results in an optimal file structure (concurrent writers for ORC can produce too many small stripes per file, which is suboptimal). So the optimization has both advantages and drawbacks. Ideally the optimizer would decide during planning whether to enable it, based on column stats from the source table. cc/ [~ashutoshc]
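
To make the last point concrete, here is a minimal sketch of what such a planning-time decision could look like. It is not Hive's optimizer code: the function name, the threshold of 100 open writers, and the choice to fall back to sorting when stats are missing are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical cutoff: how many record writers one task is assumed to be
# able to keep open at once. A real planner would derive this from memory
# settings rather than hard-coding it.
MAX_OPEN_WRITERS = 100


def should_sort_dynamic_partitions(
    estimated_partition_count: Optional[int],
    max_open_writers: int = MAX_OPEN_WRITERS,
) -> bool:
    """Decide whether to add the sort + reducer stage for a dynamic-partition insert.

    estimated_partition_count is the number of distinct partition values
    estimated from column stats on the source table, or None if no stats exist.
    """
    if estimated_partition_count is None:
        # Assumption: with no stats, prefer the memory-safe (sorted) plan.
        return True
    # Few partitions: a map-only job can hold all writers open, so skip the
    # extra reducer stage. Many partitions: sort so each reducer keeps only
    # one writer open at a time.
    return estimated_partition_count > max_open_writers
```

With these assumed values, a job writing about 20 partitions would stay map-only, while a job writing thousands of partitions would take the sorted path.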

> Turn on hive.optimize.sort.dynamic.partition by default
> -------------------------------------------------------
>
>                 Key: HIVE-17935
>                 URL: https://issues.apache.org/jira/browse/HIVE-17935
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Andrew Sherman
>            Assignee: Andrew Sherman
>            Priority: Major
>         Attachments: HIVE-17935.1.patch, HIVE-17935.2.patch
>
>
> The config option hive.optimize.sort.dynamic.partition is an optimization for Hive’s dynamic partitioning feature. It was originally implemented in [HIVE-6455|https://issues.apache.org/jira/browse/HIVE-6455]. With this optimization, the dynamic partition columns and bucketing columns (in the case of bucketed tables) are sorted before being fed to the reducers. Since the partitioning and bucketing columns are sorted, each reducer can keep only one record writer open at any time, thereby reducing the memory pressure on the reducers. There were some early problems with this optimization and it was disabled by default in HiveConf in [HIVE-8151|https://issues.apache.org/jira/browse/HIVE-8151]. Since then, setting hive.optimize.sort.dynamic.partition=true has been used to solve problems where dynamic partitioning produces (1) too many small files on HDFS, which is bad for the cluster and can increase overhead for future Hive queries over those partitions, and (2) OOM issues in the map tasks because they are trying to write to 100 different files simultaneously.
> It now seems that the feature is probably mature enough that it can be enabled by default.
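
The one-writer-per-reducer claim in the description can be illustrated with a small simulation. This is only a sketch and not Hive code: rows are modeled as (partition_key, value) tuples, a "writer" is just a set entry, and both function names are hypothetical.

```python
from typing import Any, Iterable, Tuple

Row = Tuple[str, Any]  # (partition key, payload); an assumed toy row shape


def peak_open_writers_unsorted(rows: Iterable[Row]) -> int:
    """Rows arrive in arbitrary order, so a writer opened for a partition
    must stay open until the task ends (more rows for it may still come)."""
    open_writers = set()
    peak = 0
    for key, _ in rows:
        open_writers.add(key)
        peak = max(peak, len(open_writers))
    return peak


def peak_open_writers_sorted(rows: Iterable[Row]) -> int:
    """Rows arrive sorted by partition key, so when the key changes the
    previous writer can be closed before the next one is opened."""
    current = None
    peak = 0
    for key, _ in sorted(rows, key=lambda r: r[0]):
        if key != current:
            current = key  # close the old writer, open the new one
            peak = 1       # never more than one writer open at a time
    return peak


# 20 rows spread round-robin over 5 partitions.
rows = [("ds=2017-01-0%d" % (i % 5 + 1), i) for i in range(20)]
unsorted_peak = peak_open_writers_unsorted(rows)
sorted_peak = peak_open_writers_sorted(rows)
```

In this toy input the unsorted path ends up holding one writer per distinct partition, while the sorted path never holds more than one, which is the memory saving the description refers to.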



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)