You are viewing a plain text version of this content. The canonical link for it is here.

Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/05/27 03:37:11 UTC

[GitHub] [hive] jcamachor opened a new pull request #1034: HIVE-23530

jcamachor opened a new pull request #1034:
URL: https://github.com/apache/hive/pull/1034


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431191795



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java
##########
@@ -61,6 +61,7 @@
  */
 @Description(name = "compute_stats",
       value = "_FUNC_(x) - Returns the statistical summary of a set of primitive type values.")
+@Deprecated

Review comment:
       Some tests use `compute_stats` directly. I have created a different JIRA https://issues.apache.org/jira/browse/HIVE-23558 to remove the UDAF and potentially rewriting/removing those tests.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431192485



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeBitVector.java
##########
@@ -0,0 +1,561 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.udf.generic;
+
+import org.apache.hadoop.hive.common.classification.InterfaceAudience;
+import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimator;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.stats.ColStatsProcessor.ColumnStatsType;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+import org.apache.hadoop.hive.serde2.io.DateWritableV2;
+import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BinaryObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DateObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.HiveDecimalObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.TimestampObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.io.BytesWritable;
+
+import static org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getEmptyNumDistinctValueEstimator;
+import static org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator;
+
+/**
+ * GenericUDAFComputeBitVector. This UDAF replicates part of the functionality
+ * that was in GenericUDAFComputeStats previously, which is deprecated now.
+ * In particular, it will compute a bit vector using the algorithm provided

Review comment:
       Updated the javadoc.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r432188236



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java
##########
@@ -127,15 +131,16 @@ public int process(Hive db, Table tbl) throws Exception {
             LOG.debug("Because {} is infinite or NaN, we skip stats.", columnName, e);
           }
         }
+        pos += columnStatsFields.size();

Review comment:
       I want to circle back on this. Finally I had to undo this change since the exception above breaks the flow and you need to know the position in the outer method, it is not sufficient to rely on the iterator. I think the workflow could be slightly different if the exception would be caught in the internal methods; maybe worth tackling at some point.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431161927



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
##########
@@ -467,6 +467,8 @@
     system.registerGenericUDAF("context_ngrams", new GenericUDAFContextNGrams());
 
     system.registerGenericUDAF("compute_stats", new GenericUDAFComputeStats());
+    system.registerGenericUDF("ndv_compute_bit_vector", GenericUDFNDVComputeBitVector.class);
+    system.registerGenericUDAF("compute_bit_vector", new GenericUDAFComputeBitVector());

Review comment:
       I got feedback from Gopal about these names (I used something similar to your suggestions first) and his reasoning to use these is that they should not clash with any function that a user has previously added (same for the comment you left above). That is why they have names that are understandable by us but not necessarily straightforward. In any case, these are not user-facing functions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431213790



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java
##########
@@ -127,15 +131,16 @@ public int process(Hive db, Table tbl) throws Exception {
             LOG.debug("Because {} is infinite or NaN, we skip stats.", columnName, e);
           }
         }
+        pos += columnStatsFields.size();

Review comment:
       Changed the method to use iterators.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431216259



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java
##########
@@ -304,15 +287,199 @@ public static String genRewrittenQuery(Table tbl, List<String> colNames, HiveCon
     }
 
     String rewrittenQuery = rewrittenQueryBuilder.toString();
-    rewrittenQuery = new VariableSubstitution(new HiveVariableSource() {
-      @Override
-      public Map<String, String> getHiveVariable() {
-        return SessionState.get().getHiveVariables();
-      }
-    }).substitute(conf, rewrittenQuery);
+    rewrittenQuery = new VariableSubstitution(
+        () -> SessionState.get().getHiveVariables()).substitute(conf, rewrittenQuery);
     return rewrittenQuery;
   }
 
+  private static void genComputeStats(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      int pos, String columnName, TypeInfo typeInfo) throws SemanticException {
+    Preconditions.checkArgument(typeInfo.getCategory() == Category.PRIMITIVE);
+    ColumnStatsType columnStatsType =
+        ColumnStatsType.getColumnStatsType((PrimitiveTypeInfo) typeInfo);
+    // The first column is always the type
+    // The rest of columns will depend on the type itself
+    int size = columnStatsType.getColumnStats().size() - 1;
+    for (int i = 0; i < size; i++) {
+      ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(i);
+      appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+          columnName, pos);
+      rewrittenQueryBuilder.append(", ");

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] kgyrtkirk commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r430947360



##########
File path: accumulo-handler/src/test/results/positive/accumulo_single_sourced_multi_insert.q.out
##########
@@ -96,16 +96,16 @@ STAGE PLANS:
                   outputColumnNames: key, value
                   Statistics: Num rows: 55 Data size: 9405 Basic stats: COMPLETE Column stats: COMPLETE
                   Group By Operator
-                    aggregations: compute_stats(key, 'hll'), compute_stats(value, 'hll')
+                    aggregations: max(length(key)), avg(COALESCE(length(key),0)), count(CASE WHEN (key is null) THEN (1) ELSE (null) END), compute_bit_vector(key, 'hll'), max(length(value)), avg(COALESCE(length(value),0)), count(CASE WHEN (value is null) THEN (1) ELSE (null) END), compute_bit_vector(value, 'hll')

Review comment:
       note: `count(value)` would automatically ignore `null`-s for free - nullcount can then be calculated `numRows - count(value)` or `count(1) - count(value)`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
##########
@@ -467,6 +467,8 @@
     system.registerGenericUDAF("context_ngrams", new GenericUDAFContextNGrams());
 
     system.registerGenericUDAF("compute_stats", new GenericUDAFComputeStats());
+    system.registerGenericUDF("ndv_compute_bit_vector", GenericUDFNDVComputeBitVector.class);
+    system.registerGenericUDAF("compute_bit_vector", new GenericUDAFComputeBitVector());

Review comment:
       ...using the name "bit_vector" is what we used to describe these things - however I think it would be better to give it a more decent name - now that it gets introduced as an udf as well.
   
   some ideas which came to my mind - for naming it:
   * keep using `bit_vector` because it's already widespread in the codebase
   * `stat_ndv_estimate`
   * `hive_ndv_estimate`
   * `ndv_estimate` 
   
   and use
   * `estimate` or some keyword for the one doing the evaluation
   * `aggregate` or `sketch` for the aggregator fn
   
     `stat_ndv_aggregate` or `stat_ndv_sketch`

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
##########
@@ -467,6 +467,8 @@
     system.registerGenericUDAF("context_ngrams", new GenericUDAFContextNGrams());
 
     system.registerGenericUDAF("compute_stats", new GenericUDAFComputeStats());
+    system.registerGenericUDF("ndv_compute_bit_vector", GenericUDFNDVComputeBitVector.class);

Review comment:
       the name of this method is a bit misleading to me; it looks to me this is the "estimate" method 
   
   

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java
##########
@@ -304,15 +287,199 @@ public static String genRewrittenQuery(Table tbl, List<String> colNames, HiveCon
     }
 
     String rewrittenQuery = rewrittenQueryBuilder.toString();
-    rewrittenQuery = new VariableSubstitution(new HiveVariableSource() {
-      @Override
-      public Map<String, String> getHiveVariable() {
-        return SessionState.get().getHiveVariables();
-      }
-    }).substitute(conf, rewrittenQuery);
+    rewrittenQuery = new VariableSubstitution(
+        () -> SessionState.get().getHiveVariables()).substitute(conf, rewrittenQuery);
     return rewrittenQuery;
   }
 
+  private static void genComputeStats(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      int pos, String columnName, TypeInfo typeInfo) throws SemanticException {
+    Preconditions.checkArgument(typeInfo.getCategory() == Category.PRIMITIVE);
+    ColumnStatsType columnStatsType =
+        ColumnStatsType.getColumnStatsType((PrimitiveTypeInfo) typeInfo);
+    // The first column is always the type
+    // The rest of columns will depend on the type itself
+    int size = columnStatsType.getColumnStats().size() - 1;
+    for (int i = 0; i < size; i++) {
+      ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(i);
+      appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+          columnName, pos);
+      rewrittenQueryBuilder.append(", ");
+    }
+    ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(size);
+    appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+        columnName, pos);
+  }
+
+  private static void appendStatsField(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      ColumnStatsField columnStatsField, ColumnStatsType columnStatsType,
+      String columnName, int pos) throws SemanticException {
+    switch (columnStatsField) {
+    case COLUMN_TYPE:
+      appendColumnType(rewrittenQueryBuilder, conf, columnStatsType, pos);

Review comment:
       I think it would probably make sense to organize these things around the column types - on the other side we have type based stuff next to each other...here we have calc based split
   but it might not worth the effort..

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java
##########
@@ -127,15 +131,16 @@ public int process(Hive db, Table tbl) throws Exception {
             LOG.debug("Because {} is infinite or NaN, we skip stats.", columnName, e);
           }
         }
+        pos += columnStatsFields.size();

Review comment:
       all these methods are very wierd... because they just read 1 element at a time..but the position control is placed "outside" ... (here)
   
   instead of starting a loop `for(int i=0;)` stuff; passing a columnindex iterator might make everything more readable...
   if the iterator is not empty - a new column statistic clould be parsed from it

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeBitVector.java
##########
@@ -0,0 +1,561 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.udf.generic;
+
+import org.apache.hadoop.hive.common.classification.InterfaceAudience;
+import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimator;
+import org.apache.hadoop.hive.common.type.HiveDecimal;
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.stats.ColStatsProcessor.ColumnStatsType;
+import org.apache.hadoop.hive.ql.util.JavaDataModel;
+import org.apache.hadoop.hive.serde2.io.DateWritableV2;
+import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BinaryObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DateObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.HiveDecimalObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.TimestampObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.io.BytesWritable;
+
+import static org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getEmptyNumDistinctValueEstimator;
+import static org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator;
+
+/**
+ * GenericUDAFComputeBitVector. This UDAF replicates part of the functionality
+ * that was in GenericUDAFComputeStats previously, which is deprecated now.
+ * In particular, it will compute a bit vector using the algorithm provided

Review comment:
       because there will be no way to use `compute_stats` anymore: I think this apidoc should start with something like "This UDAF will compute a bit vector"
   

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java
##########
@@ -130,8 +130,10 @@ public void insertTableValuesAnalyzePipeline() throws SemanticException {
         partSpec.put(partKey, null);
       }
     }
+    List<String> colNames = Utilities.getColumnNamesFromFieldSchema(tbl.getCols());
+    List<String> colTypes = ColumnStatsSemanticAnalyzer.getColumnTypes(tbl, colNames);
     String command = ColumnStatsSemanticAnalyzer.genRewrittenQuery(
-        tbl, Utilities.getColumnNamesFromFieldSchema(tbl.getCols()), conf, partSpec, isPartitionStats, true);
+        tbl, colNames, colTypes, conf, partSpec, isPartitionStats, true);

Review comment:
       I think it would be better to pass the `tbl.getCols()` instead of the 2 lists
   
   the `getColumnTypes` method also removes some column names silently...it would be better to have that method also work on a list of FieldSchema entries

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java
##########
@@ -304,15 +287,199 @@ public static String genRewrittenQuery(Table tbl, List<String> colNames, HiveCon
     }
 
     String rewrittenQuery = rewrittenQueryBuilder.toString();
-    rewrittenQuery = new VariableSubstitution(new HiveVariableSource() {
-      @Override
-      public Map<String, String> getHiveVariable() {
-        return SessionState.get().getHiveVariables();
-      }
-    }).substitute(conf, rewrittenQuery);
+    rewrittenQuery = new VariableSubstitution(
+        () -> SessionState.get().getHiveVariables()).substitute(conf, rewrittenQuery);
     return rewrittenQuery;
   }
 
+  private static void genComputeStats(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      int pos, String columnName, TypeInfo typeInfo) throws SemanticException {
+    Preconditions.checkArgument(typeInfo.getCategory() == Category.PRIMITIVE);
+    ColumnStatsType columnStatsType =
+        ColumnStatsType.getColumnStatsType((PrimitiveTypeInfo) typeInfo);
+    // The first column is always the type
+    // The rest of columns will depend on the type itself
+    int size = columnStatsType.getColumnStats().size() - 1;
+    for (int i = 0; i < size; i++) {
+      ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(i);
+      appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+          columnName, pos);
+      rewrittenQueryBuilder.append(", ");

Review comment:
       instead of duplicating the loop body; a conditional for the "append" might be better; it's easier to write than describe; I'm thinking something like:
   ```
   for(a : x) {
   if(!first) append(",");
   body();
   }
   ```

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java
##########
@@ -61,6 +61,7 @@
  */
 @Description(name = "compute_stats",
       value = "_FUNC_(x) - Returns the statistical summary of a set of primitive type values.")
+@Deprecated

Review comment:
       I don't think this class should be retained - it will have no real use anymore




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431176481



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsAutoGatherContext.java
##########
@@ -130,8 +130,10 @@ public void insertTableValuesAnalyzePipeline() throws SemanticException {
         partSpec.put(partKey, null);
       }
     }
+    List<String> colNames = Utilities.getColumnNamesFromFieldSchema(tbl.getCols());
+    List<String> colTypes = ColumnStatsSemanticAnalyzer.getColumnTypes(tbl, colNames);
     String command = ColumnStatsSemanticAnalyzer.genRewrittenQuery(
-        tbl, Utilities.getColumnNamesFromFieldSchema(tbl.getCols()), conf, partSpec, isPartitionStats, true);
+        tbl, colNames, colTypes, conf, partSpec, isPartitionStats, true);

Review comment:
       The underlying `genRewrittenQuery` method may work on a subset of columns in the table, that is why it was receiving column names / types separately. I have made the `protected static` method work directly with the table and extract the column names and types from it, and added corresponding comments to it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431189978



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java
##########
@@ -304,15 +287,199 @@ public static String genRewrittenQuery(Table tbl, List<String> colNames, HiveCon
     }
 
     String rewrittenQuery = rewrittenQueryBuilder.toString();
-    rewrittenQuery = new VariableSubstitution(new HiveVariableSource() {
-      @Override
-      public Map<String, String> getHiveVariable() {
-        return SessionState.get().getHiveVariables();
-      }
-    }).substitute(conf, rewrittenQuery);
+    rewrittenQuery = new VariableSubstitution(
+        () -> SessionState.get().getHiveVariables()).substitute(conf, rewrittenQuery);
     return rewrittenQuery;
   }
 
+  private static void genComputeStats(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      int pos, String columnName, TypeInfo typeInfo) throws SemanticException {
+    Preconditions.checkArgument(typeInfo.getCategory() == Category.PRIMITIVE);
+    ColumnStatsType columnStatsType =
+        ColumnStatsType.getColumnStatsType((PrimitiveTypeInfo) typeInfo);
+    // The first column is always the type
+    // The rest of columns will depend on the type itself
+    int size = columnStatsType.getColumnStats().size() - 1;
+    for (int i = 0; i < size; i++) {
+      ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(i);
+      appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+          columnName, pos);
+      rewrittenQueryBuilder.append(", ");
+    }
+    ColumnStatsField columnStatsField = columnStatsType.getColumnStats().get(size);
+    appendStatsField(rewrittenQueryBuilder, conf, columnStatsField, columnStatsType,
+        columnName, pos);
+  }
+
+  private static void appendStatsField(StringBuilder rewrittenQueryBuilder, HiveConf conf,
+      ColumnStatsField columnStatsField, ColumnStatsType columnStatsType,
+      String columnName, int pos) throws SemanticException {
+    switch (columnStatsField) {
+    case COLUMN_TYPE:
+      appendColumnType(rewrittenQueryBuilder, conf, columnStatsType, pos);

Review comment:
       This is kind of misleading because `column_type` does not always match the actual column type, e.g., BYTE, SHORT, INT or BIGINT are all mapped to the same column stats type (LONG). However, I did not want to change the internal name that was used before in this patch too. I renamed the enum to `COLUMN_STATS_TYPE` to add some more clarity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] asfgit closed pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

asfgit closed pull request #1034:
URL: https://github.com/apache/hive/pull/1034


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] jcamachor commented on a change in pull request #1034: HIVE-23530

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #1034:
URL: https://github.com/apache/hive/pull/1034#discussion_r431351279



##########
File path: accumulo-handler/src/test/results/positive/accumulo_single_sourced_multi_insert.q.out
##########
@@ -96,16 +96,16 @@ STAGE PLANS:
                   outputColumnNames: key, value
                   Statistics: Num rows: 55 Data size: 9405 Basic stats: COMPLETE Column stats: COMPLETE
                   Group By Operator
-                    aggregations: compute_stats(key, 'hll'), compute_stats(value, 'hll')
+                    aggregations: max(length(key)), avg(COALESCE(length(key),0)), count(CASE WHEN (key is null) THEN (1) ELSE (null) END), compute_bit_vector(key, 'hll'), max(length(value)), avg(COALESCE(length(value),0)), count(CASE WHEN (value is null) THEN (1) ELSE (null) END), compute_bit_vector(value, 'hll')

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org