You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/06/22 14:02:22 UTC

[GitHub] [hive] pvary opened a new pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

pvary opened a new pull request #2419:
URL: https://github.com/apache/hive/pull/2419


   ### What changes were proposed in this pull request?
   Allow column stat generation when we can commit in a move task
   
   ### Why are the changes needed?
   So we have column statistics after inserting to an Iceberg table
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added unit tests
   
   Also added #2359 as it is required for this patch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672441145



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Otherwise the tests are failing, because with stats turned on we generate 2 tasks instead of 1 (change of the execution plans which contain a stage)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672164500



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       Is this needed here? Or if we use it only for the validation inside `spec()`, then maybe remove the local variable, or better yet, extract the validation logic into a different method we can call here?

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {

Review comment:
       Where was this part before that we saved it into the session conf?

##########
File path: iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_read.q.out
##########
@@ -129,17 +129,17 @@ Stage-0
     Stage-1
       Reducer 2 vectorized
       File Output Operator [FS_11]
-        Select Operator [SEL_10] (rows=1 width=564)
+        Select Operator [SEL_10] (rows=1 width=372)

Review comment:
       Out of curiosity: do you know why the width has descreased so much?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r673140199



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       Right, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658090073



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");

Review comment:
       Ok, thanks, just wanted to doube-check my understanding.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658743168



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithInsertOverwrite() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,

Review comment:
       Added Partitioned CTAS




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658972546



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/create/CreateTableDesc.java
##########
@@ -804,9 +801,18 @@ public Table toTable(HiveConf conf) throws HiveException {
       }
     }
 
-    if (getCols() != null) {
-      tbl.setFields(getCols());
+    Optional<List<FieldSchema>> cols = Optional.of(getCols());
+    Optional<List<FieldSchema>> partCols = Optional.of(getPartCols());
+
+    if (storageHandler !=null && storageHandler.alwaysUnpartitioned()) {

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672435549



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       This is moved from line 236. We need it to be set, but we have to do it after we got the correct spec

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {

Review comment:
       This is for migrating tables from non-Iceberg tables to Iceberg tables. Previously we just depended on the partition cols, from now on we need to have the data in the `SessionState` instead. So we put that there

##########
File path: iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_read.q.out
##########
@@ -129,17 +129,17 @@ Stage-0
     Stage-1
       Reducer 2 vectorized
       File Output Operator [FS_11]
-        Select Operator [SEL_10] (rows=1 width=564)
+        Select Operator [SEL_10] (rows=1 width=372)

Review comment:
       TBH I am not sure, but I expect that has something to do with the new statistics

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Otherwise the tests are failing, because with stats turned on we generate 2 tasks instead of 1 (change of the execution plans which contain a stage)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672164500



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       Is this needed here? Or if we use it only for the validation inside `spec()`, then maybe remove the local variable, or better yet, extract the validation logic into a different method we can call here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658813552



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/create/CreateTableDesc.java
##########
@@ -804,9 +801,18 @@ public Table toTable(HiveConf conf) throws HiveException {
       }
     }
 
-    if (getCols() != null) {
-      tbl.setFields(getCols());
+    Optional<List<FieldSchema>> cols = Optional.of(getCols());
+    Optional<List<FieldSchema>> partCols = Optional.of(getPartCols());
+
+    if (storageHandler !=null && storageHandler.alwaysUnpartitioned()) {

Review comment:
       nit: `!= null`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] szlta commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
szlta commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672325104



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Why is this required here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary merged pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary merged pull request #2419:
URL: https://github.com/apache/hive/pull/2419


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658808816



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,104 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithInsertOverwrite() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, true);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithPartitionedInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+    PartitionSpec spec = PartitionSpec.builderFor(HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA)
+        .identity("last_name").build();
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA, spec,
+        fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat("customers", "customer_id");
+    checkColStat("customers", "first_name");
+  }
+
+  @Test
+  public void testStatWithCTAS() {
+    Assume.assumeTrue(HiveIcebergSerDe.CTAS_EXCEPTION_MSG, testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    shell.executeStatement("CREATE TABLE source (id bigint, name string) PARTITIONED BY (dept string) STORED AS ORC");
+    shell.executeStatement(testTables.getInsertQuery(
+        HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, TableIdentifier.of("default", "source"), false));
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    shell.executeStatement(String.format(
+        "CREATE TABLE target STORED BY ICEBERG %s TBLPROPERTIES ('%s'='%s') AS SELECT * FROM source",
+        testTables.locationForCreateTableSQL(TableIdentifier.of("default", "target")),
+        TableProperties.DEFAULT_FILE_FORMAT, fileFormat));
+
+    checkColStat("target", "id");
+  }
+
+  @Test
+  public void testStatWithPartitionedCTAS() {
+    Assume.assumeTrue(HiveIcebergSerDe.CTAS_EXCEPTION_MSG, testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    shell.executeStatement("CREATE TABLE source (id bigint, name string) PARTITIONED BY (dept string) STORED AS ORC");
+    shell.executeStatement(testTables.getInsertQuery(
+        HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, TableIdentifier.of("default", "source"), false));
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+

Review comment:
       nit: extra new line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658739899



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java
##########
@@ -71,13 +71,21 @@
   private Context origCtx;
   
   public ColumnStatsAutoGatherContext(SemanticAnalyzer sa, HiveConf conf,
-      Operator<? extends OperatorDesc> op, Table tbl, Map<String, String> partSpec,
+      Operator<? extends OperatorDesc> op, Table origTbl, Map<String, String> partSpec,
       boolean isInsertInto, Context ctx) throws SemanticException {
     super();
     this.sa = sa;
     this.conf = conf;
     this.op = op;
-    this.tbl = tbl;
+    try {
+      this.tbl = origTbl.copy();
+    } catch (HiveException he) {
+      throw new SemanticException(he);
+    }
+    if (tbl.getStorageHandler().alwaysUnpartitioned()) {

Review comment:
       Can the storage handler be null?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658743756



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java
##########
@@ -71,13 +71,21 @@
   private Context origCtx;
   
   public ColumnStatsAutoGatherContext(SemanticAnalyzer sa, HiveConf conf,
-      Operator<? extends OperatorDesc> op, Table tbl, Map<String, String> partSpec,
+      Operator<? extends OperatorDesc> op, Table origTbl, Map<String, String> partSpec,
       boolean isInsertInto, Context ctx) throws SemanticException {
     super();
     this.sa = sa;
     this.conf = conf;
     this.op = op;
-    this.tbl = tbl;
+    try {
+      this.tbl = origTbl.copy();
+    } catch (HiveException he) {
+      throw new SemanticException(he);
+    }
+    if (tbl.getStorageHandler().alwaysUnpartitioned()) {

Review comment:
       Not sure, so I have added the `null` check




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658973349



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,104 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithInsertOverwrite() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, true);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithPartitionedInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+    PartitionSpec spec = PartitionSpec.builderFor(HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA)
+        .identity("last_name").build();
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA, spec,
+        fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat("customers", "customer_id");
+    checkColStat("customers", "first_name");
+  }
+
+  @Test
+  public void testStatWithCTAS() {
+    Assume.assumeTrue(HiveIcebergSerDe.CTAS_EXCEPTION_MSG, testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    shell.executeStatement("CREATE TABLE source (id bigint, name string) PARTITIONED BY (dept string) STORED AS ORC");
+    shell.executeStatement(testTables.getInsertQuery(
+        HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, TableIdentifier.of("default", "source"), false));
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    shell.executeStatement(String.format(
+        "CREATE TABLE target STORED BY ICEBERG %s TBLPROPERTIES ('%s'='%s') AS SELECT * FROM source",
+        testTables.locationForCreateTableSQL(TableIdentifier.of("default", "target")),
+        TableProperties.DEFAULT_FILE_FORMAT, fileFormat));
+
+    checkColStat("target", "id");
+  }
+
+  @Test
+  public void testStatWithPartitionedCTAS() {
+    Assume.assumeTrue(HiveIcebergSerDe.CTAS_EXCEPTION_MSG, testTableType == TestTables.TestTableType.HIVE_CATALOG);
+
+    shell.executeStatement("CREATE TABLE source (id bigint, name string) PARTITIONED BY (dept string) STORED AS ORC");
+    shell.executeStatement(testTables.getInsertQuery(
+        HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, TableIdentifier.of("default", "source"), false));
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658972416



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/create/CreateTableDesc.java
##########
@@ -804,9 +801,18 @@ public Table toTable(HiveConf conf) throws HiveException {
       }
     }
 
-    if (getCols() != null) {
-      tbl.setFields(getCols());
+    Optional<List<FieldSchema>> cols = Optional.of(getCols());

Review comment:
       Thanks!
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672181449



##########
File path: iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_read.q.out
##########
@@ -129,17 +129,17 @@ Stage-0
     Stage-1
       Reducer 2 vectorized
       File Output Operator [FS_11]
-        Select Operator [SEL_10] (rows=1 width=564)
+        Select Operator [SEL_10] (rows=1 width=372)

Review comment:
       Out of curiosity: do you know why the width has descreased so much?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672180879



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {

Review comment:
       Where was this part before that we saved it into the session conf?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672439401



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {

Review comment:
       This is for migrating tables from non-Iceberg tables to Iceberg tables. Previously we just depended on the partition cols, from now on we need to have the data in the `SessionState` instead. So we put that there




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658085550



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithInsertOverwrite() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,

Review comment:
       My thought process was the following:
   - We need a basic test - so created unpartitioned insert
   - We need to test partitioned tables (what happens with the partition columns) - so created partitioned test (not sure that this is strictly needed)
   - We need to test IOW - so created IOW
   
   I am open to discussions either way




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] szlta commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
szlta commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672325104



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Why is this required here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658062987



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");

Review comment:
       Are the columns stats gathered for `first_name` and `last_name` as well, we're just saving on the number of describe calls?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672435549



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       This is moved from line 236. We need it to be set, but we have to do it after we got the correct spec




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672439783



##########
File path: iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_read.q.out
##########
@@ -129,17 +129,17 @@ Stage-0
     Stage-1
       Reducer 2 vectorized
       File Output Operator [FS_11]
-        Select Operator [SEL_10] (rows=1 width=564)
+        Select Operator [SEL_10] (rows=1 width=372)

Review comment:
       TBH I am not sure, but I expect that has something to do with the new statistics




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658084643



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");

Review comment:
       They are basically the same, and generated as well.
   Only in partitioned case did I went for checking the partition columns too. Just to be sure




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658743466



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");

Review comment:
       Added to a partitioned CTAS test




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658806362



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/table/create/CreateTableDesc.java
##########
@@ -804,9 +801,18 @@ public Table toTable(HiveConf conf) throws HiveException {
       }
     }
 
-    if (getCols() != null) {
-      tbl.setFields(getCols());
+    Optional<List<FieldSchema>> cols = Optional.of(getCols());

Review comment:
       It seems like `getCols()` can return null based on the previous null check. This would fail here as well, unless wrap it using `Optional.ofNullable()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] szlta commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
szlta commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672325104



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Why is this required here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r672435549



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {
+          throw new MetaException("Query state attached to Session state must be not null. " +
+              "Partition transform metadata cannot be saved.");
+        }
         hmsTable.getSd().getCols().addAll(hmsTable.getPartitionKeys());
         hmsTable.setPartitionKeysIsSet(false);
       }
+      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, hmsTable);

Review comment:
       This is moved from line 236. We need it to be set, but we have to do it after we got the correct spec

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -233,15 +237,21 @@ public void preAlterTable(org.apache.hadoop.hive.metastore.api.Table hmsTable, E
       preAlterTableProperties.tableLocation = sd.getLocation();
       preAlterTableProperties.format = sd.getInputFormat();
       preAlterTableProperties.schema = schema(catalogProperties, hmsTable);
-      preAlterTableProperties.spec = spec(conf, preAlterTableProperties.schema, catalogProperties, hmsTable);
       preAlterTableProperties.partitionKeys = hmsTable.getPartitionKeys();
 
       context.getProperties().put(HiveMetaHook.ALLOW_PARTITION_KEY_CHANGE, "true");
       // If there are partition keys specified remove them from the HMS table and add them to the column list
-      if (hmsTable.isSetPartitionKeys()) {
+      if (hmsTable.isSetPartitionKeys() && !hmsTable.getPartitionKeys().isEmpty()) {
+        List<PartitionTransformSpec> spec = PartitionTransform.getPartitionTransformSpec(hmsTable.getPartitionKeys());
+        if (!SessionStateUtil.addResource(conf, hive_metastoreConstants.PARTITION_TRANSFORM_SPEC, spec)) {

Review comment:
       This is for migrating tables from non-Iceberg tables to Iceberg tables. Previously we just depended on the partition cols, from now on we need to have the data in the `SessionState` instead. So we put that there

##########
File path: iceberg/iceberg-handler/src/test/results/positive/vectorized_iceberg_read.q.out
##########
@@ -129,17 +129,17 @@ Stage-0
     Stage-1
       Reducer 2 vectorized
       File Output Operator [FS_11]
-        Select Operator [SEL_10] (rows=1 width=564)
+        Select Operator [SEL_10] (rows=1 width=372)

Review comment:
       TBH I am not sure, but I expect that has something to do with the new statistics

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezOutputCommitter.java
##########
@@ -122,6 +122,7 @@ private IDriver getDriverWithCommitter(String committerClass) {
     conf.setVar(HiveConf.ConfVars.HIVE_AUTHORIZATION_MANAGER,
         "org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory");
     conf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVESTATSCOLAUTOGATHER, false);

Review comment:
       Otherwise the tests are failing, because with stats turned on we generate 2 tasks instead of 1 (change of the execution plans which contain a stage)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2419: HIVE-25276: Enable automatic statistics generation for Iceberg tables

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2419:
URL: https://github.com/apache/hive/pull/2419#discussion_r658061851



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -1314,6 +1314,85 @@ public void testScanTableCaseInsensitive() throws IOException {
     Assert.assertArrayEquals(new Object[] {1L, "Bob", "Green"}, rows.get(1));
   }
 
+  @Test
+  public void testStatWithInsert() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+        PartitionSpec.unpartitioned(), fileFormat, ImmutableList.of());
+
+    if (testTableType != TestTables.TestTableType.HIVE_CATALOG) {
+      // If the location is set and we have to gather stats, then we have to update the table stats now
+      shell.executeStatement("ANALYZE TABLE " + identifier + " COMPUTE STATISTICS FOR COLUMNS");
+    }
+
+    String insert = testTables.getInsertQuery(HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS, identifier, false);
+    shell.executeStatement(insert);
+
+    checkColStat(identifier.name(), "customer_id");
+  }
+
+  @Test
+  public void testStatWithInsertOverwrite() {
+    TableIdentifier identifier = TableIdentifier.of("default", "customers");
+
+    shell.setHiveSessionValue(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, true);
+    testTables.createTable(shell, identifier.name(), HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,

Review comment:
       If we have test cases for unpartitioned insert, partitioned insert, unpartitioned IOW, should we have a test case for partitioned IOW as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org