You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/08/18 17:25:00 UTC

[jira] [Work logged] (HIVE-23887) Reset table level basic/column stats during import.

     [ https://issues.apache.org/jira/browse/HIVE-23887?focusedWorklogId=472060&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-472060 ]

ASF GitHub Bot logged work on HIVE-23887:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Aug/20 17:24
            Start Date: 18/Aug/20 17:24
    Worklog Time Spent: 10m 
      Work Description: sankarh commented on a change in pull request #1370:
URL: https://github.com/apache/hive/pull/1370#discussion_r472338632



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2452,9 +2448,16 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
         skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
         newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
       }
-      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
+
+      //if hive.stats.autogather = false or resetStatistics = true then

Review comment:
       The comment can answer the "why" rather than "what". The below if statement is self explanatory.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnExIm.java
##########
@@ -389,21 +390,56 @@ public void testImportPartitioned() throws Exception {
 
   @Test
   public void testImportPartitionedOrc() throws Exception {
+    //clear and drop table T,Tstage
     runStatementOnDriver("drop table if exists T");
     runStatementOnDriver("drop table if exists Tstage");
-    runStatementOnDriver("create table T (a int, b int) partitioned by (p int) stored" +
-        " as orc tblproperties('transactional'='true')");
-    //Tstage is the target table
-    runStatementOnDriver("create table Tstage (a int, b int) partitioned by (p int) stored" +
-        " as orc tblproperties('transactional'='true')");
+
+    //create target table - T
+    runStatementOnDriver("create table T (a int, b int) partitioned by (p int) stored"
+        + " as orc tblproperties('transactional'='true')");
+
+    //create source table - Tstage
+    runStatementOnDriver("create table Tstage (a int, b int) partitioned by (p int) stored"
+        + " as orc tblproperties('transactional'='true')");
+
     //this creates an ORC data file with correct schema under table root
     runStatementOnDriver("insert into Tstage values(1,2,10),(3,4,11),(5,6,12)");
-    final int[][] rows = {{3}};
-    //now we have an archive with 3 partitions
+    final int[][] rows = { { 3 } };
+
+    //check Partitions statistics
+    List<String> rsTstagePartitionsProperties = runStatementOnDriver("show partitions Tstage");
+    for (String rsTstagePartition : rsTstagePartitionsProperties) {
+      List<String> rsPartitionProperties =
+          runStatementOnDriver("describe formatted Tstage partition(" + rsTstagePartition + ")");
+      Assert.assertEquals("COLUMN_STATS_ACCURATE of partition " + rsTstagePartition + " of Tstage table", true,
+          rsPartitionProperties.contains("\tCOLUMN_STATS_ACCURATE\t{\\\"BASIC_STATS\\\":\\\"true\\\"}"));
+      Assert.assertEquals(" of partition " + rsTstagePartition + " of Tstage table", true,
+          rsPartitionProperties.contains("\tnumRows             \t1                   "));
+    }
+
+    //now we have an archive Tstage with 3 partitions
     runStatementOnDriver("export table Tstage to '" + getWarehouseDir() + "/1'");
 
     //load T
     runStatementOnDriver("import table T from '" + getWarehouseDir() + "/1'");
+
+    //check basic stats in tblproperties of T
+    List<String> rsTProperties = runStatementOnDriver("show tblproperties T");
+    Assert.assertEquals("COLUMN_STATS_ACCURATE of T table", false,

Review comment:
       I think, stats are false as table T was already created. Can we check if it is set to false when import create the table.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2452,9 +2448,16 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
         skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
         newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
       }
-      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
+
+      //if hive.stats.autogather = false or resetStatistics = true then
+      //clear partition column stats and set basic stats to false

Review comment:
       Nit: Add single space after "//"

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2452,9 +2448,16 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
         skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
         newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
       }
-      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
+
+      //if hive.stats.autogather = false or resetStatistics = true then
+      //clear partition column stats and set basic stats to false
+      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER) || resetStatistics) {

Review comment:
       Shall check resetStatistics first as it take less cpu cycles in positive flow.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -70,6 +70,7 @@
 
 import com.google.common.collect.ImmutableList;
 
+import org.antlr.runtime.misc.Stats;

Review comment:
       Unused import.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnExIm.java
##########
@@ -566,4 +602,44 @@ public void testMMExportAborted() throws Exception {
         TestTxnCommands2.stringifyValues(data), rs);
 
   }
-}
+
+  @Test
+  public void testImportOrc() throws Exception {
+    //Clear and Drop T and Tstage if exist
+    runStatementOnDriver("drop table if exists T");
+    runStatementOnDriver("drop table if exists Tstage");
+
+    //create target table - T
+    runStatementOnDriver("create table T (a int, b int) stored" + " as orc tblproperties('transactional'='true')");
+
+    //Create source table - Tstage
+    runStatementOnDriver("create table Tstage (a int, b int) stored" + " as orc tblproperties('transactional'='true')");
+
+    //this creates an ORC data file with correct schema under table root
+    runStatementOnDriver("insert into Tstage values(1,2),(3,4),(5,6)");
+    final int[][] rows = { { 3 } };
+
+    //check Tstage statistics
+    List<String> rsTStageProperties = runStatementOnDriver("show tblproperties Tstage");
+    Assert.assertEquals("COLUMN_STATS_ACCURATE of Tstage table", true,
+        rsTStageProperties.contains("COLUMN_STATS_ACCURATE\t{\"BASIC_STATS\":\"true\"}"));
+    Assert.assertEquals("numRows of Tstage table", true, rsTStageProperties.contains("numRows\t3"));
+    Assert.assertEquals("numFiles of Tstage table", true, rsTStageProperties.contains("numFiles\t1"));
+
+    //now we have an archive Tstage table
+    runStatementOnDriver("export table Tstage to '" + getWarehouseDir() + "/1'");
+
+    //load T
+    runStatementOnDriver("import table T from '" + getWarehouseDir() + "/1'");
+
+    //check basic stats in tblproperties T
+    List<String> rsTProperties = runStatementOnDriver("show tblproperties T");

Review comment:
       Can test the import with hive.stats.autogather=true and see if stats are accurate for both partitioned and non-partitioned tables.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2452,9 +2448,16 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
         skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
         newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
       }
-      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
+
+      //if hive.stats.autogather = false or resetStatistics = true then
+      //clear partition column stats and set basic stats to false
+      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER) || resetStatistics) {
+        LOG.debug(

Review comment:
       Nit: Use upper-case for first character of log message.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -2452,9 +2448,16 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map<String, St
         skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
         newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
       }
-      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
+
+      //if hive.stats.autogather = false or resetStatistics = true then
+      //clear partition column stats and set basic stats to false
+      if (!this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER) || resetStatistics) {
+        LOG.debug(
+            "clear partition column statistics and setting basic stats to false for " + newTPart.getCompleteName());
+        StatsSetupConst.clearColumnStatsState(newTPart.getParameters());

Review comment:
       clearColumnStatsState call is redundant as setBasicStatsState(false) will remove the column stats as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 472060)
    Time Spent: 1h 20m  (was: 1h 10m)

> Reset table level basic/column stats during import.
> ---------------------------------------------------
>
>                 Key: HIVE-23887
>                 URL: https://issues.apache.org/jira/browse/HIVE-23887
>             Project: Hive
>          Issue Type: Bug
>          Components: Import/Export, Statistics
>            Reporter: Ashish Sharma
>            Assignee: Ashish Sharma
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> While doing "export table db.table to '/import/table' " column stats are not dumped but import doesn't reset the flag which leads to incorrect stats.
> Reset columns stats while import to force Imported to recalculate the Columns stats



--
This message was sent by Atlassian Jira
(v8.3.4#803005)