You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/06/24 15:36:56 UTC

[GitHub] [hive] pvary opened a new pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

pvary opened a new pull request #2427:
URL: https://github.com/apache/hive/pull/2427


   ### What changes were proposed in this pull request?
   
   - Introduce a new configuration property: `iceberg.hive.keep.stats`. If this property is set then we keep the statistics at a new Iceberg commit. Otherwise we invalidate the stats as we can not make sure that they are correct.
   - Fix a NullPointerExeption in HiveIcebergMetaHook which happens when we are storing stats.
   - Adds a new unit tests, and enhances the check that the stat values are accurate
   
   Also contains HIVE-25276 as a base.
   
   ### Why are the changes needed?
   When someone modifies the Iceberg table outside of Hive then we should make sure that the stats are invalid
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Extra unit test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684017511



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2420,11 +2464,36 @@ private String timestampAfterSnapshot(Table table, int snapshotPosition) {
     return simpleDateFormat.format(new Date(time));
   }
 
-  private void checkColStat(String tableName, String colName) {
+  private void checkColStat(String tableName, String colName, boolean accurate) {
     List<Object[]> rows = shell.executeStatement("DESCRIBE " + tableName + " " + colName);
 
     Assert.assertEquals(2, rows.size());
     Assert.assertEquals(StatsSetupConst.COLUMN_STATS_ACCURATE, rows.get(1)[0]);
+    Assert.assertEquals(accurate, !rows.get(1)[1].toString().matches("\\{\\}\\s*"));

Review comment:
       Can you add a small comment explaining the regex here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary merged pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
pvary merged pull request #2427:
URL: https://github.com/apache/hive/pull/2427


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684012632



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2235,12 +2245,13 @@ public void testStatWithPartitionedInsert() {
     String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
     shell.executeStatement(insert);
 
-    checkColStat("customers", "customer_id");
-    checkColStat("customers", "first_name");
+    checkColStat("customers", "customer_id", true);
+    checkColStat("customers", "first_name", true);

Review comment:
       nit: why not use `identifier.name()` here too?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684151918



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -764,7 +764,7 @@ public void testIcebergAndHmsTableProperties() throws Exception {
     Assert.assertEquals(expectedIcebergProperties, icebergTable.properties());
 
     if (Catalogs.hiveCatalog(shell.getHiveConf(), tableProperties)) {
-      Assert.assertEquals(10, hmsParams.size());
+      Assert.assertEquals(11, hmsParams.size());

Review comment:
       It was an empty stat. `{}`
   This was not a big issue, but found a better way. Let's see if this breaks anything :)

##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2235,12 +2245,13 @@ public void testStatWithPartitionedInsert() {
     String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
     shell.executeStatement(insert);
 
-    checkColStat("customers", "customer_id");
-    checkColStat("customers", "first_name");
+    checkColStat("customers", "customer_id", true);
+    checkColStat("customers", "first_name", true);

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684015292



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2377,6 +2392,35 @@ public void testAsOfWithJoins() throws IOException, InterruptedException {
     Assert.assertEquals(8, rows.size());
   }
 
+  @Test
+  public void testStatsRemoved() throws IOException {
+    Assume.assumeTrue("Only HiveCatalog can remove stats which become obsolete",
+        testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, true);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id", true);
+    checkColStatMinMaxValue(identifier.name(), "customer_id", 0, 2);
+
+    // Create a Catalog where the KEEP_HIVE_STATS is false
+    shell.metastore().hiveConf().set(HiveTableOperations.KEEP_HIVE_STATS, StatsSetupConst.FALSE);
+    TestTables nonHiveTestTables = HiveIcebergStorageHandlerTestUtils.testTables(shell, testTableType, temp);
+    Table nonHiveTable = nonHiveTestTables.loadTable(identifier);
+
+    // Append data to the table through a this non-Hive Catalog

Review comment:
       It's still Hive Catalog though, isn't it? I thought we are simulating more like another compute engine, where the keep.stats=false




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684010406



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -764,7 +764,7 @@ public void testIcebergAndHmsTableProperties() throws Exception {
     Assert.assertEquals(expectedIcebergProperties, icebergTable.properties());
 
     if (Catalogs.hiveCatalog(shell.getHiveConf(), tableProperties)) {
-      Assert.assertEquals(10, hmsParams.size());
+      Assert.assertEquals(11, hmsParams.size());

Review comment:
       Do we know what the new param is?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684152648



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2377,6 +2392,35 @@ public void testAsOfWithJoins() throws IOException, InterruptedException {
     Assert.assertEquals(8, rows.size());
   }
 
+  @Test
+  public void testStatsRemoved() throws IOException {
+    Assume.assumeTrue("Only HiveCatalog can remove stats which become obsolete",
+        testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, true);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id", true);
+    checkColStatMinMaxValue(identifier.name(), "customer_id", 0, 2);
+
+    // Create a Catalog where the KEEP_HIVE_STATS is false
+    shell.metastore().hiveConf().set(HiveTableOperations.KEEP_HIVE_STATS, StatsSetupConst.FALSE);
+    TestTables nonHiveTestTables = HiveIcebergStorageHandlerTestUtils.testTables(shell, testTableType, temp);
+    Table nonHiveTable = nonHiveTestTables.loadTable(identifier);
+
+    // Append data to the table through a this non-Hive Catalog

Review comment:
       Fixed the comment.
   Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2427: HIVE-25286: Set stats to inaccurate when an Iceberg table is modified outside Hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2427:
URL: https://github.com/apache/hive/pull/2427#discussion_r684153220



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -2420,11 +2464,36 @@ private String timestampAfterSnapshot(Table table, int snapshotPosition) {
     return simpleDateFormat.format(new Date(time));
   }
 
-  private void checkColStat(String tableName, String colName) {
+  private void checkColStat(String tableName, String colName, boolean accurate) {
     List<Object[]> rows = shell.executeStatement("DESCRIBE " + tableName + " " + colName);
 
     Assert.assertEquals(2, rows.size());
     Assert.assertEquals(StatsSetupConst.COLUMN_STATS_ACCURATE, rows.get(1)[0]);
+    Assert.assertEquals(accurate, !rows.get(1)[1].toString().matches("\\{\\}\\s*"));

Review comment:
       Added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org