You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/01/12 09:02:42 UTC

[GitHub] [incubator-iceberg] alexey-fin opened a new pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

alexey-fin opened a new pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue merged pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574062407
 
 
   @aokolnychyi i mean this function org.apache.hadoop.hive.metastore.MetaStoreUtils#updateUnpartitionedTableStatsFast(org.apache.hadoop.hive.metastore.api.Database, org.apache.hadoop.hive.metastore.api.Table, org.apache.hadoop.hive.metastore.Warehouse, boolean, boolean)
   
   we use hive 2.3.2 and in our case function org.apache.hadoop.hive.metastore.MetaStoreUtils#requireCalStats has a workaround to prevent this via environmentContext param DO_NOT_UPDATE_STATS, unfortunately iceberg calls alterTable with null environmentContext, so what we did is override HiveAlterHandler (hive allows it to be pluggable via configuration)
   
   @rdsr in hive 1.2.1 the function requireCalStats doesn't have this workaround unfortunately

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574184804
 
 
   @omalley @rdblue @rdsr, any ideas on how to avoid updating stats in HMS 1.2.1?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdsr commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574300807
 
 
   Seems like a recursive call
   ```
   /**
      * @param table
      * @return array of FileStatus objects corresponding to the files making up the passed
      * unpartitioned table
      */
     public FileStatus[] getFileStatusesForUnpartitionedTable(Database db, Table table)
         throws MetaException {
       startWarehouseFunction("getFileStatusesForUnpartitionedTable", ": db=" + db.getName() + " : table=" + table.getTableName());
       Path tablePath = getTablePath(db, table.getTableName());
       try {
         FileSystem fileSys = tablePath.getFileSystem(conf);
         return HiveStatsUtils.getFileStatusRecurse(tablePath, -1, fileSys);
       } catch (IOException ioe) {
         MetaStoreUtils.logAndThrowMetaException(ioe);
       } finally {
         endWarehouseFunction("getFileStatusesForUnpartitionedTable");
       }
       return null;
     }```
   
   But I'm in favor of this PR as we'll fix the Metastore bug ourselves if performance is a penalty for us.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574313780
 
 
   Guys, in hive 1.2.1 this is really bizzare, look at the function getTablePath :     return getDnsPath(new Path(getDatabasePath(db), tableName.toLowerCase()));
   it IGNORES the real table location (if it is not default) and just goes to warehouse_dir/tableName
   so unless you use the default warehouse location it scans nothing!
   
   on the other hand in hive 2.3.2:
           Path tablePath = this.getDnsPath(new Path(table.getSd().getLocation()));
   
   so this PR fixes things for hive 2.3.2, in hive 1.2.1 you cant really disable the scan but you can work
   around it by providing custom location

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-573395627
 
 
   Keep in mind hive 1.2 does not allow to disable the scan this way, how we wanna deal with it?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#discussion_r366306052
 
 

 ##########
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
 ##########
 @@ -163,7 +165,10 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
 
       if (base != null) {
         metaClients.run(client -> {
-          client.alter_table(database, tableName, tbl);
+          client.alter_table(database, tableName, tbl,
 
 Review comment:
   Nit: Iceberg uses 4 spaces for continued indentation and `ImmutableMap` from Guava.
   
   What about this?
   
   ```
   EnvironmentContext envContext = new EnvironmentContext(
       ImmutableMap.of(StatsSetupConst.DO_NOT_UPDATE_STATS, StatsSetupConst.TRUE)
   );
   client.alter_table(database, tableName, tbl, envContext);
   
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574049574
 
 
   @alexey-fin, do you mean [this bug](https://jira.apache.org/jira/browse/HIVE-18743)?
   
   @alexey-fin, could you point to code in HMS that triggers a recursive scan on alter table? I did a quick look but didn't find such a case if it's an external table. I do see such calls in alter partition, for example.
   
   It also seems that we can set `hive.stats.autogather` to false to prevent stats collection if HMS is used only by Iceberg.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574281563
 
 
   It looks like HIVE-18743 makes it so we can't avoid this, but it looks like this is probably only listing the table root, which contains only 2 entries: the `data` and `metadata` folders. This doesn't cause a recursive listing, does it?
   
   If that's the case, then I think we should use the change here to fix later versions and not worry about it for 1.2.1.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr edited a comment on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdsr edited a comment on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574300807
 
 
   Seems like a recursive call
   ```
   /**
      * @param table
      * @return array of FileStatus objects corresponding to the files making up the passed
      * unpartitioned table
      */
     public FileStatus[] getFileStatusesForUnpartitionedTable(Database db, Table table)
         throws MetaException {
       startWarehouseFunction("getFileStatusesForUnpartitionedTable", ": db=" + db.getName() + " : table=" + table.getTableName());
       Path tablePath = getTablePath(db, table.getTableName());
       try {
         FileSystem fileSys = tablePath.getFileSystem(conf);
         return HiveStatsUtils.getFileStatusRecurse(tablePath, -1, fileSys);
       } catch (IOException ioe) {
         MetaStoreUtils.logAndThrowMetaException(ioe);
       } finally {
         endWarehouseFunction("getFileStatusesForUnpartitionedTable");
       }
       return null;
     }
   ```
   
   But I'm in favor of this PR as we'll fix the Metastore bug ourselves if performance is a penalty for us.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdsr commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-573816862
 
 
   > Keep in mind hive 1.2 does not allow to disable the scan this way, how we wanna deal with it?
   
   Hi @alexey-fin do you mean this disabling of stats collection is not part of Hive-1.2. I do see to classes [e.g org.apache.hadoop.hive.common.StatsSetupConst] in my hive-common-1.2.1 jar

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574455055
 
 
   Okay, sounds like 1.2.1 is just broken, but we can at least make this change to fix other versions. Let's get this in and we can see about fixing 1.2.1 if anyone has ideas later. Thanks @alexey-fin! Nice work.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574143282
 
 
   @alexey-fin, I overlooked the second if condition in `HiveAlterHandler$alterTable`. Indeed, HMS seems to call `MetaStoreUtils` to update props. 
   
   It also looks like you are correct and [HIVE-18743](https://jira.apache.org/jira/browse/HIVE-18743) is not fixed in 1.2.1. Will setting `hive.stats.autogather` to false in hive.xml help in 1.2.1?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] alexey-fin commented on a change in pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
alexey-fin commented on a change in pull request #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#discussion_r366310571
 
 

 ##########
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
 ##########
 @@ -163,7 +165,10 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
 
       if (base != null) {
         metaClients.run(client -> {
-          client.alter_table(database, tableName, tbl);
+          client.alter_table(database, tableName, tbl,
 
 Review comment:
   Sure thing, sorry

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi edited a comment on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574184804
 
 
   @omalley @rdblue @rdsr, any ideas on how to avoid scanning the table dir in HMS 1.2.1?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir

Posted by GitBox <gi...@apache.org>.
alexey-fin commented on issue #734: HIVE: On alter table prevent hive from recursively scanning the complete table dir
URL: https://github.com/apache/incubator-iceberg/pull/734#issuecomment-574153375
 
 
   @aokolnychyi Unfortunately it seems like it's impossible to avoid the call to org.apache.hadoop.hive.metastore.Warehouse#getFileStatusesForUnpartitionedTable in hive 1.2.1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org