Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2022/09/30 10:40:48 UTC

[GitHub] [hive] zabetak commented on a diff in pull request #3628: HIVE-26320: Deserialize Parquet VARCHAR and CHAR types appropriately

zabetak commented on code in PR #3628:
URL: https://github.com/apache/hive/pull/3628#discussion_r984418472


##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java:
##########
@@ -481,12 +485,36 @@ protected BytesWritable convert(Binary binary) {
   },
   ESTRING_CONVERTER(String.class) {
     @Override
-    PrimitiveConverter getConverter(final PrimitiveType type, final int index, final ConverterParent parent, TypeInfo hiveTypeInfo) {
+    PrimitiveConverter getConverter(final PrimitiveType type, final int index, final ConverterParent parent,
+        TypeInfo hiveTypeInfo) {
+      // If we have type information, we should return properly typed strings. However, there are a variety
+      // of code paths that do not provide the typeInfo in those cases we default to Text. This idiom is also
+      // followed by for example the BigDecimal converter in which if there is no type information,
+      // it defaults to the widest representation
+      if (hiveTypeInfo != null) {
+        String typeName = hiveTypeInfo.getTypeName().toLowerCase();
+        if (typeName.startsWith(serdeConstants.CHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveCharWritable>(type, parent, index) {
+            @Override
+              protected HiveCharWritable convert(Binary binary) {
+                return new HiveCharWritable(binary.getBytes(), ((CharTypeInfo) hiveTypeInfo).getLength());
+              }
+          };
+        } else if (typeName.startsWith(serdeConstants.VARCHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveVarcharWritable>(type, parent, index) {
+            @Override
+              protected HiveVarcharWritable convert(Binary binary) {
+                return new HiveVarcharWritable(binary.getBytes(), ((VarcharTypeInfo) hiveTypeInfo).getLength());
+              }
+          };
+        }
+      }

Review Comment:
   I think the most common way to branch on a primitive type is to use its category. It seems more robust and efficient than matching on the type name.
   ```java
   if (hiveTypeInfo instanceof PrimitiveTypeInfo) {
     PrimitiveTypeInfo t = (PrimitiveTypeInfo) hiveTypeInfo;
     switch (t.getPrimitiveCategory()) {
       case VARCHAR:
         break;
       case CHAR:
         break;
     }
   }
   ```
   I see that in this class they rely heavily on the type name and serdeConstants, so I am also OK with the current code as it is.
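   
   For illustration, a rough sketch (my own adaptation of the diff above, untested) of how the category-based dispatch could replace the `startsWith` checks in `getConverter`:
   ```java
   if (hiveTypeInfo instanceof PrimitiveTypeInfo) {
     switch (((PrimitiveTypeInfo) hiveTypeInfo).getPrimitiveCategory()) {
       case CHAR:
         return new BinaryConverter<HiveCharWritable>(type, parent, index) {
           @Override
           protected HiveCharWritable convert(Binary binary) {
             return new HiveCharWritable(binary.getBytes(), ((CharTypeInfo) hiveTypeInfo).getLength());
           }
         };
       case VARCHAR:
         return new BinaryConverter<HiveVarcharWritable>(type, parent, index) {
           @Override
           protected HiveVarcharWritable convert(Binary binary) {
             return new HiveVarcharWritable(binary.getBytes(), ((VarcharTypeInfo) hiveTypeInfo).getLength());
           }
         };
       default:
         break; // fall through to the Text converter below
     }
   }
   ```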



##########
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java:
##########
@@ -212,23 +216,59 @@ public void testGetInt64NanosAdjustedToUTCTimestampConverter() throws Exception
   }
 
   @Test
-  public void testGetTextConverter() throws Exception {
+  public void testGetTextConverterForString() throws Exception {
     PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
-        .as(LogicalTypeAnnotation.stringType()).named("value");
-    Writable writable = getWritableFromBinaryConverter(new VarcharTypeInfo(), primitiveType,
-        Binary.fromString("this_is_a_value"));
-    Text textWritable = (Text) writable;
-    assertEquals("this_is_a_value", textWritable.toString());
+            .as(LogicalTypeAnnotation.stringType()).named("value");
+    String value = "this_is_a_value";
+    Text textWritable = (Text) getWritableFromBinaryConverter(stringTypeInfo, primitiveType,
+            Binary.fromString(value));
+    // we should get what we put in
+    assertEquals(value, textWritable.toString());
+  }
+
+  @Test
+  public void testGetTextConverterForChar() throws Exception {
+    PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
+            .as(LogicalTypeAnnotation.stringType()).named("value");
+    String value = "this_is_a_value";
+
+    HiveCharWritable textWritable = (HiveCharWritable) getWritableFromBinaryConverter(
+            new CharTypeInfo(value.length() + 2), primitiveType, Binary.fromString(value));
+    // check that it enforces the length (it should be padded)
+    assertEquals(value + "  ", textWritable.toString());
+
+    textWritable = (HiveCharWritable) getWritableFromBinaryConverter(
+            new CharTypeInfo(6), primitiveType, Binary.fromString(value));
+    // check that it enforces the length (it should truncate)
+    assertEquals(value.substring(0, 6), textWritable.toString());

Review Comment:
   nit: The test could easily be split in two, and proper names would avoid the need for the extra comments:
   * testGetTextConverterForCharPadsValueWithSpacesTillLen
   * testGetTextConverterForCharTruncatesValueExceedingLen
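   
   Roughly, reusing the setup and the `getWritableFromBinaryConverter` helper from the diff above (untested sketch):
   ```java
   @Test
   public void testGetTextConverterForCharPadsValueWithSpacesTillLen() throws Exception {
     PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
         .as(LogicalTypeAnnotation.stringType()).named("value");
     String value = "this_is_a_value";
     // char(len + 2) leaves room for two trailing pad spaces
     HiveCharWritable writable = (HiveCharWritable) getWritableFromBinaryConverter(
         new CharTypeInfo(value.length() + 2), primitiveType, Binary.fromString(value));
     assertEquals(value + "  ", writable.toString());
   }

   @Test
   public void testGetTextConverterForCharTruncatesValueExceedingLen() throws Exception {
     PrimitiveType primitiveType = Types.optional(PrimitiveTypeName.BINARY)
         .as(LogicalTypeAnnotation.stringType()).named("value");
     String value = "this_is_a_value";
     // char(6) is shorter than the value, so it must be truncated
     HiveCharWritable writable = (HiveCharWritable) getWritableFromBinaryConverter(
         new CharTypeInfo(6), primitiveType, Binary.fromString(value));
     assertEquals(value.substring(0, 6), writable.toString());
   }
   ```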



##########
ql/src/test/queries/clientpositive/pointlookup.q:
##########
@@ -127,3 +127,39 @@ or inOutputOpt.key = null;
 drop table orOutput;
 drop table inOutput;
 drop table inOutputOpt;
+
+-- test case(s) for HIVE-26320 for ORC
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;
+CREATE EXTERNAL TABLE hive2623_orc(kob varchar(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_orc VALUES('BB',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='BB' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_orc;
+DROP TABLE hive2623_orc;
+
+CREATE EXTERNAL TABLE hive2623_char_orc(kob char(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_char_orc VALUES('B',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='B ' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_char_orc;
+DROP TABLE hive2623_char_orc;
+
+-- test case(s) for HIVE-26320 for PARQUET
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;

Review Comment:
   This is already set above; no need to set it twice.



##########
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java:
##########
@@ -481,12 +485,36 @@ protected BytesWritable convert(Binary binary) {
   },
   ESTRING_CONVERTER(String.class) {
     @Override
-    PrimitiveConverter getConverter(final PrimitiveType type, final int index, final ConverterParent parent, TypeInfo hiveTypeInfo) {
+    PrimitiveConverter getConverter(final PrimitiveType type, final int index, final ConverterParent parent,
+        TypeInfo hiveTypeInfo) {
+      // If we have type information, we should return properly typed strings. However, there are a variety
+      // of code paths that do not provide the typeInfo in those cases we default to Text. This idiom is also
+      // followed by for example the BigDecimal converter in which if there is no type information,
+      // it defaults to the widest representation
+      if (hiveTypeInfo != null) {
+        String typeName = hiveTypeInfo.getTypeName().toLowerCase();
+        if (typeName.startsWith(serdeConstants.CHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveCharWritable>(type, parent, index) {
+            @Override
+              protected HiveCharWritable convert(Binary binary) {
+                return new HiveCharWritable(binary.getBytes(), ((CharTypeInfo) hiveTypeInfo).getLength());
+              }
+          };
+        } else if (typeName.startsWith(serdeConstants.VARCHAR_TYPE_NAME)) {
+          return new BinaryConverter<HiveVarcharWritable>(type, parent, index) {
+            @Override
+              protected HiveVarcharWritable convert(Binary binary) {
+                return new HiveVarcharWritable(binary.getBytes(), ((VarcharTypeInfo) hiveTypeInfo).getLength());
+              }
+          };
+        }
+      }
+      // STRING type
       return new BinaryConverter<Text>(type, parent, index) {
         @Override
-        protected Text convert(Binary binary) {
-          return new Text(binary.getBytes());
-        }
+          protected Text convert(Binary binary) {
+            return new Text(binary.getBytes());
+          }

Review Comment:
   nit: broken indentation
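   
   For reference, the expected indentation (behavior unchanged):
   ```java
   return new BinaryConverter<Text>(type, parent, index) {
     @Override
     protected Text convert(Binary binary) {
       return new Text(binary.getBytes());
     }
   };
   ```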



##########
ql/src/test/queries/clientpositive/pointlookup.q:
##########
@@ -127,3 +127,39 @@ or inOutputOpt.key = null;
 drop table orOutput;
 drop table inOutput;
 drop table inOutputOpt;
+
+-- test case(s) for HIVE-26320 for ORC
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;
+CREATE EXTERNAL TABLE hive2623_orc(kob varchar(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_orc VALUES('BB',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='BB' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_orc;
+DROP TABLE hive2623_orc;
+
+CREATE EXTERNAL TABLE hive2623_char_orc(kob char(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_char_orc VALUES('B',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='B ' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_char_orc;
+DROP TABLE hive2623_char_orc;
+
+-- test case(s) for HIVE-26320 for PARQUET
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;
+
+CREATE EXTERNAL TABLE hive2623_parq(kob varchar(2), enhanced_type_code int) STORED AS PARQUET;
+INSERT INTO hive2623_parq VALUES('BB',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='BB' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_parq;
+DROP TABLE hive2623_parq;
+
+CREATE EXTERNAL TABLE hive2623_char_parq(kob char(2), enhanced_type_code int) STORED AS PARQUET;
+INSERT INTO hive2623_char_parq VALUES('B ',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='B ' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_char_parq;
+DROP TABLE hive2623_char_parq;
+
+CREATE EXTERNAL TABLE hive2623_int_parq(kob int, enhanced_type_code int) STORED AS PARQUET;
+INSERT INTO hive2623_int_parq VALUES(2,18),(23,18),(12,18);
+SELECT CASE WHEN ((kob=2 AND enhanced_type_code=18) OR (kob=23  AND enhanced_type_code=18)) THEN 1 ELSE 0 END AS logic_check FROM hive2623_int_parq;
+DROP TABLE hive2623_int_parq;

Review Comment:
   The bug was not really in the pointlookup optimization but in the IN function combined with structs and the Parquet format. We can certainly leave a test here with `CASE WHEN`, since that is how it was reported, but it may be a good idea to explicitly add a test case using the IN function in udf_in.q (or elsewhere).
   
   Something like:
   ```sql
   SELECT (kob, code) IN (('BB', 18), ('BC', 18)) FROM parq;
   ```



##########
ql/src/test/queries/clientpositive/pointlookup.q:
##########
@@ -127,3 +127,39 @@ or inOutputOpt.key = null;
 drop table orOutput;
 drop table inOutput;
 drop table inOutputOpt;
+
+-- test case(s) for HIVE-26320 for ORC
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;
+CREATE EXTERNAL TABLE hive2623_orc(kob varchar(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_orc VALUES('BB',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='BB' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check
+FROM hive2623_orc;

Review Comment:
   `hive2623` vs `hive26320`: intentional or a typo? The same goes for all the table names.



##########
serde/src/java/org/apache/hadoop/hive/serde2/io/HiveBaseCharWritable.java:
##########
@@ -26,10 +26,19 @@
 import org.apache.hive.common.util.HiveStringUtils;
 
 public abstract class HiveBaseCharWritable {
-  protected Text value = new Text();
+  protected Text value;
   protected int charLength = -1;
 
   public HiveBaseCharWritable() {
+    value = new Text();
+  }
+
+  public HiveBaseCharWritable(Text text) {
+    value = text;
+  }

Review Comment:
   Is this constructor used by anyone?



##########
ql/src/test/queries/clientpositive/pointlookup.q:
##########
@@ -127,3 +127,39 @@ or inOutputOpt.key = null;
 drop table orOutput;
 drop table inOutput;
 drop table inOutputOpt;
+
+-- test case(s) for HIVE-26320 for ORC
+SET hive.optimize.point.lookup=true;
+SET hive.optimize.point.lookup.min=2;
+CREATE EXTERNAL TABLE hive2623_orc(kob varchar(2), enhanced_type_code int) STORED AS ORC;
+INSERT INTO hive2623_orc VALUES('BB',18),('BC',18),('AB',18);
+SELECT CASE WHEN ((kob='BB' AND enhanced_type_code='18') OR (kob='BC' AND enhanced_type_code='18')) THEN 1 ELSE 0 END AS logic_check

Review Comment:
   I noticed that in this and some of the queries below we are using a character literal (`'18'`) instead of a numeric literal (`18`), even though `enhanced_type_code` is an int. I am not sure what the intent is here, but better to avoid making the test more complicated than necessary.



##########
serde/src/java/org/apache/hadoop/hive/serde2/io/HiveBaseCharWritable.java:
##########
@@ -26,10 +26,19 @@
 import org.apache.hive.common.util.HiveStringUtils;
 
 public abstract class HiveBaseCharWritable {
-  protected Text value = new Text();
+  protected Text value;
   protected int charLength = -1;
 
   public HiveBaseCharWritable() {
+    value = new Text();
+  }
+
+  public HiveBaseCharWritable(Text text) {
+    value = text;
+  }
+
+  public HiveBaseCharWritable(byte[] bytes) {
+    value = new Text(bytes);

Review Comment:
   We could add an additional `set` method in the subclasses and not touch this class at all (fewer changes in total). Is there a specific reason for using the constructor approach?
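   
   A rough sketch of that alternative (assuming the existing `set(String, int)` overloads on `HiveCharWritable`/`HiveVarcharWritable`; untested):
   ```java
   // Hypothetical byte[] entry point added to HiveCharWritable (and similarly
   // to HiveVarcharWritable) instead of new constructors on the base class.
   public void set(byte[] bytes, int maxLength) {
     // Decode as UTF-8 and reuse the existing overload, which already
     // pads/truncates the value to the declared length.
     set(new String(bytes, java.nio.charset.StandardCharsets.UTF_8), maxLength);
   }
   ```
   This does decode and re-encode the bytes, so if the constructors were chosen to avoid that cost, a comment explaining it would help.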



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

