Posted to issues@hive.apache.org by "Zoltan Haindrich (JIRA)" <ji...@apache.org> on 2018/04/20 14:50:00 UTC

[jira] [Comment Edited] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

    [ https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445471#comment-16445471 ] 

Zoltan Haindrich edited comment on HIVE-18743 at 4/20/18 2:49 PM:
------------------------------------------------------------------

[~akolb] there was an ACID-related ticket which landed just before I had seen the end of that ticket; since it added a lot of ifs everywhere, I have to re-interpret a lot of things...
so we are better off having at least this fix.
+1; test failures are not related
note: why are you removing the previous versions of your patch? Please don't do that... I know it might look tidier, but the comments will lose their context, and by re-using patch#01 you may confuse a reviewer who has already seen your ticket and remembers that it had 1 patch.


was (Author: kgyrtkirk):
[~akolb] there was an ACID-related ticket which landed just before I had seen the end of that ticket; since it added a lot of ifs everywhere, I have to re-interpret a lot of things...
so we are better off having at least this fix.
+1; I'm checking if there are any related test failures
note: why are you removing the previous versions of your patch? Please don't do that... I know it might look tidier, but the comments will lose their context, and by re-using patch#01 you may confuse a reviewer who has already seen your ticket and remembers that it had 1 patch.

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-18743
>                 URL: https://issues.apache.org/jira/browse/HIVE-18743
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Kolbasov
>            Priority: Major
>         Attachments: HIVE-18743.01.patch
>
>
> When hive.stats.autogather=true, the Metastore lists all files under the table directory to populate basic stats like file counts and sizes. This file listing operation can be very expensive, particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is intended to selectively prevent this stats collection. Unfortunately, this table property is checked *after* the expensive file listing operation, so DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>                                              boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside this call, after wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
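
To illustrate the ordering problem described above, here is a minimal, self-contained sketch of checking the flag before listing any files. It is not the patch attached to this ticket and does not use the real Metastore classes: the class StatsGuardSketch, the plain Map standing in for table parameters, and the Supplier standing in for the file listing are all hypothetical stand-ins chosen so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the Metastore logic; not the real MetaStoreUtils.
final class StatsGuardSketch {
    // Property name taken from the ticket; where it lives in Hive is not modelled here.
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /**
     * Returns true if basic stats were written into tableParams.
     * listFileSizes models the expensive listing (for example on S3) and is
     * only invoked once every cheap check has passed.
     */
    static boolean updateTableStatsFast(Map<String, String> tableParams,
                                        boolean partitioned,
                                        Supplier<long[]> listFileSizes) {
        if (partitioned) {
            return false; // as in the quoted snippet: only unpartitioned tables
        }
        // Cheap check first: bail out before touching the filesystem.
        if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
            return false;
        }
        long[] sizes = listFileSizes.get(); // the expensive call happens here
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        tableParams.put("numFiles", Long.toString(sizes.length));
        tableParams.put("totalSize", Long.toString(total));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(DO_NOT_UPDATE_STATS, "true");
        boolean updated = updateTableStatsFast(params, false, () -> {
            throw new IllegalStateException("listing should have been skipped");
        });
        System.out.println("stats updated: " + updated);
    }
}
```

With the flag set, the supplier is never invoked, so the cost of the listing is avoided entirely; without it, the listing runs once and the basic stats are filled in.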



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)