You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/05 06:49:40 UTC

[GitHub] [iceberg] kevinjqliu opened a new pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

kevinjqliu opened a new pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670


   `HiveTableOperations` does not update HMS `transient_lastDdlTime` parameter on snapshot write. The param is updated once only on table creation.
   
   For context, we currently query the `transient_lastDdlTime` param directly from HMS to determine the freshness of tables. This is faster than reading all the metadata files and get the timestamp for each iceberg table's most recent snapshot.
   
   This PR removes the `transient_lastDdlTime`/`hive_metastoreConstants.DDL_TIME` param when setting HMS table parameters in `HiveTableOperations`. By doing so, the [HMS client will update the field in its alter table function](https://github.com/apache/hive/blob/cceee0a61a75274520178cd31cad26e3b1a25b12/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java#L6081-L6086). This is what the [Hive Engine is doing for alter table](https://github.com/apache/hive/blob/a8e50734e0460e506f1762fbe0f628bcb444b8f5/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L841-L845).
   
   Alternatively, we can explicitly set the timestamp in `HiveTableOperations` instead of relying on HMS client implementation. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kevinjqliu commented on pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kevinjqliu commented on pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#issuecomment-1004489465


   hi @kbendick , @rdblue happy new year! 
   I want to follow up on this PR. Is it feasible to merge this PR with just the remove parameter change in `HiveTableOperations` and take out the integration test? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r764414396



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -362,6 +362,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
     // remove any props from HMS that are no longer present in Iceberg table props
     obsoleteProps.forEach(parameters::remove);
 
+    // remove any props that are controlled by HMS so it can handle them
+    parameters.remove(hive_metastoreConstants.DDL_TIME);

Review comment:
       Is this method called for `INSERT` as well? Maybe updating this every time is not entirely correct. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r766824970



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -362,6 +362,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
     // remove any props from HMS that are no longer present in Iceberg table props
     obsoleteProps.forEach(parameters::remove);
 
+    // remove any props that are controlled by HMS so it can handle them
+    parameters.remove(hive_metastoreConstants.DDL_TIME);

Review comment:
       Theoretically it should not. DDL_TIME should be updated only with DDL operations.
   Insert should not affect this, but if the alter table updating the statistics is incorrect, then it might.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r764416376



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -362,6 +362,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
     // remove any props from HMS that are no longer present in Iceberg table props
     obsoleteProps.forEach(parameters::remove);
 
+    // remove any props that are controlled by HMS so it can handle them
+    parameters.remove(hive_metastoreConstants.DDL_TIME);

Review comment:
       Probably adding some test where we create a table and later insert to the table, and we make sure that the insert does not change the Ddl time




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kevinjqliu commented on pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kevinjqliu commented on pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#issuecomment-1005095805


   >  @kevinjqliu: Did you have time to check if the DDL time is changed when we only inserting to a Hive table and do not change the mertadata?
   
   @pvary
   seems like insert to the table changes the DDL time according to this.
   https://lists.apache.org/thread/44rr7q3wb2pfp7b5prgnb6vt2of1t41c


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#issuecomment-1005556427


   > > @kevinjqliu: Did you have time to check if the DDL time is changed when we only inserting to a Hive table and do not change the mertadata?
   > 
   > @pvary seems like insert to the table changes the DDL time according to this. https://lists.apache.org/thread/44rr7q3wb2pfp7b5prgnb6vt2of1t41c
   
   @kevinjqliu: So Hive basically handles `DDL_TIME` incorrectly and updates it whenever there is some change to the table (maybe caused by the update of the table statistics). I would still consider this as a Hive bug and would be very cautious building features around this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r762610090



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -362,6 +362,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
     // remove any props from HMS that are no longer present in Iceberg table props
     obsoleteProps.forEach(parameters::remove);
 
+    // remove the DDL_TIME so it gets refreshed

Review comment:
       Nit: you might consider formatting this comment more generically like the one above. Something like `// remove any props that are controlled by HMS so it can handle them`. That way it's also clear where to add more properties like that in the future.
   
   Up to you though. There's nothing wrong per-se with the current comment. Just this has come up a few times on Slack and I'm not sure if there's other properties that might fall into this category.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kevinjqliu commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kevinjqliu commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r763595653



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -647,6 +647,8 @@ public void testIcebergAndHmsTableProperties() throws Exception {
     Assert.assertEquals(HiveIcebergInputFormat.class.getName(), hmsTable.getSd().getInputFormat());
     Assert.assertEquals(HiveIcebergOutputFormat.class.getName(), hmsTable.getSd().getOutputFormat());
     Assert.assertEquals(HiveIcebergSerDe.class.getName(), hmsTable.getSd().getSerdeInfo().getSerializationLib());
+    // Add 1 second delay to reflect DDL_TIME change in HMS
+    Thread.sleep(1000);

Review comment:
       > Sleeping for 1 second really adds up and makes tests slow... what about always setting the property to the last snapshot timestamp? 
   
   I agree sleeping for 1 second is not ideal. Wasn't sure about a good alternative. Currently, the test runs fast enough that the table creation and subsequent snapshot write all happen in the same unix timestamp. So I can't write a test that shows the timestamp had changed, something like (table creation timestamp < write timestamp) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r762610090



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
##########
@@ -362,6 +362,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, Map<St
     // remove any props from HMS that are no longer present in Iceberg table props
     obsoleteProps.forEach(parameters::remove);
 
+    // remove the DDL_TIME so it gets refreshed

Review comment:
       Nit: you might consider formatting this comment more generically like the one above. Something like `// remove any props that are controlled by HMS so it can handle them`. That way it's also clear where to add more properties like that in the future.
   
   Up to you though. There's nothing wrong with the current comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kevinjqliu commented on pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
kevinjqliu commented on pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#issuecomment-987526058


   > Some people might have implemented the HMS interface, but changed minor things like the timestamp precision (as an example) on the server side. Leaving it to the metastore's implementation seems ideal to me.
   
   sounds good, I'll defer to HMS to handle this param. The Hive Engine also removes the param and defers to HMS.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#issuecomment-1004633455


   > Also, in case you weren't aware, you [can mark your work GitHub account or any other account you've also used in the repo as a co-author](https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors#creating-co-authored-commits-on-the-command-line). Plenty of people do it so their work is reflected across both accounts (if that's important to you- though definitely not a requirement).
   
   @kevinjqliu: Did you have time to check if the DDL time is changed when we only inserting to a Hive table and do not change the mertadata?
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3670: HiveTableOperations: update HMS transient_lastDdlTime param on snapshot writes

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3670:
URL: https://github.com/apache/iceberg/pull/3670#discussion_r763448386



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -647,6 +647,8 @@ public void testIcebergAndHmsTableProperties() throws Exception {
     Assert.assertEquals(HiveIcebergInputFormat.class.getName(), hmsTable.getSd().getInputFormat());
     Assert.assertEquals(HiveIcebergOutputFormat.class.getName(), hmsTable.getSd().getOutputFormat());
     Assert.assertEquals(HiveIcebergSerDe.class.getName(), hmsTable.getSd().getSerdeInfo().getSerializationLib());
+    // Add 1 second delay to reflect DDL_TIME change in HMS
+    Thread.sleep(1000);

Review comment:
       Sleeping for 1 second really adds up and makes tests slow. Instead, what about always setting the property to the last snapshot timestamp? That way you get more accurate information -- the timestamp of the last snapshot -- and you have something that you can also validate in a test using `Assert.equals`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org