You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/07/24 00:03:47 UTC

[GitHub] [incubator-pinot] jackjlli opened a new pull request #5746: Fix code to correctly extract value of multi-value column from avro file

jackjlli opened a new pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746


   ## Description
   This PR fixed the issue introduced from this PR: https://github.com/apache/incubator-pinot/pull/5238
   
   Basically the value of the multi-value column should be at the 1st level instead of digging into the deepest level. The issue is causing the segment creation fail to extract the correct value from the avro file.
   
   ## Tests Done
   Before the change:
   ```
   java.lang.ClassCastException: java.util.HashMap cannot be cast to java.lang.Number
   	at org.apache.pinot.core.data.recordtransformer.PinotDataType$11.toInteger(PinotDataType.java:392) ~[classes/:?]
   	at org.apache.pinot.core.data.recordtransformer.PinotDataType.toIntegerArray(PinotDataType.java:512) ~[classes/:?]
   	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:441) ~[classes/:?]
   	at org.apache.pinot.core.data.recordtransformer.PinotDataType$13.convert(PinotDataType.java:438) ~[classes/:?]
   	at org.apache.pinot.core.data.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:100) ~[classes/:?]
   	at org.apache.pinot.core.data.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:74) ~[classes/:?]
   	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:70) ~[classes/:?]
   	at org.apache.pinot.core.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:38) ~[classes/:?]
   	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:146) ~[classes/:?]
   	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:131) ~[classes/:?]
   	at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:93) ~[classes/:?]
   	at org.apache.pinot.tools.admin.command.CreateSegmentCommand.lambda$execute$0(CreateSegmentCommand.java:238) ~[classes/:?]
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_172]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
   	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
   ```
   
   After the change:
   ```
   2020/07/23 17:02:15.195 INFO [CreateSegmentCommand] [pool-2-thread-1] Successfully created segment: messageInsightsSendEventsDaily_20200722_20200722_0 at directory: ~/sampleData/output/segmentName_20200722_20200722_0
   2020/07/23 17:02:15.196 INFO [CreateSegmentCommand] [main] Successfully created 1 segments from data files: [~/sampleData/input/test.avro]
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460321430



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/PinotDataType.java
##########
@@ -470,6 +474,43 @@ public String toString(Object value) {
     }
   },
 
+  HASHMAP,

Review comment:
       We actually don't have to introduce this type (in future we should when we have Map) but now there is no need. 
   
   In the transform() method of DataTypeTransformer, we should do:
   
   ```
   if (values[0] instance of Map) {
       // and if schema says the column is primitive
       // then get the primitive type from Map using the function you have already implemented 
      ` `getPinotDataTypeFromHashMap``
        call dest.convert()
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
jackjlli commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459797366



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       This is following the conversion here. `GenericData.Record` is for fetching the value from a single value column. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] Jackie-Jiang commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460436632



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/PinotDataType.java
##########
@@ -632,4 +721,22 @@ public static PinotDataType getPinotDataType(FieldSpec fieldSpec) {
             "Unsupported data type: " + dataType + " in field: " + fieldSpec.getName());
     }
   }
+
+  public static PinotDataType getSpecificMapDataTypeFromMap(Map<Object, Object> map) {
+    Iterator<Object> iterator = map.values().iterator();
+    Object obj = iterator.next();
+    if (obj instanceof Integer) {
+      return PinotDataType.INTEGER_MAP;
+    } else if (obj instanceof Long) {
+      return PinotDataType.LONG_MAP;
+    } else if (obj instanceof Float) {
+      return PinotDataType.FLOAT_MAP;
+    } else if (obj instanceof Double) {
+      return PinotDataType.DOUBLE_MAP;
+    } else if (obj instanceof String) {
+      return PinotDataType.STRING_MAP;
+    } else {
+      throw new IllegalStateException(String.format("'%s' isn't supported in the hash map.", obj.getClass()));

Review comment:
       How about nested Map? How about single-entry map with multi-valued value (e.g. `Object[]`)?

##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java
##########
@@ -56,6 +56,7 @@
     MULTI_VALUE_TYPE_MAP.put(Float.class, PinotDataType.FLOAT_ARRAY);
     MULTI_VALUE_TYPE_MAP.put(Double.class, PinotDataType.DOUBLE_ARRAY);
     MULTI_VALUE_TYPE_MAP.put(String.class, PinotDataType.STRING_ARRAY);
+    MULTI_VALUE_TYPE_MAP.put(HashMap.class, PinotDataType.MAP);

Review comment:
       `HashMap` is not good enough, should use `instanceof Map` instead




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459795372



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       I'm not sure if this is a valid fix. According to the comment instruction,
   
   ```
   Converts the value to either a single value (string, number, bytebuffer), multi value (Object[]) or a Map
   Natively Pinot only understands single values and multi values.
   Map is useful only if some ingestion transform functions operates on it in the transformation layer.
   ```
   
   Don't we need to follow the above conversion in case of `GenericData.Record`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460317334



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java
##########
@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
         source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
         if (source == null) {
           source = PinotDataType.OBJECT_ARRAY;
+        } else if (source == PinotDataType.HASHMAP) {

Review comment:
       This should be done conditionally. In other words, consider two columns:
   
   col1 - primitive column defined as MV int/float/double/string (simple array of primitive types that Pinot supports).  Something like "CompaniesWorkedAt".
   ```
   [
      "val1",
       "val2",
        "val3"
   ]
   ```
   col2 - complex column defined as MV. Something like addresses which is an array (struct) or array (map).
   
   ```
   [
       {
           "k1" : "v1",
           "k2":  "v2",
            "k3": "v3"
      }
   ]
   ```
   
   The second is a complex column whereas is first is standard primitive. Now the AvroRecordExtractor and AvroUtils.convert() would have returned the second as an array of Map/HashMap for our sample data
   
   ```
   "dimension_***" : [ {
       "item_id" : {
         "string" : "some data"
       }
     }, {
       "item_id" : {
         "string" : "some data"
       }
     }
   
   ```
   Now since schema for "dimension_***" column indicates it is an array of primitives, we need to extract the actual value from each HashMap. In such cases, it is also true that map size would be 1 but that alone should not be the condition since map size could be 1 even if the object is actually a complex one. So we should convert from Map to primitive only  if the schema says so. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459804013



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       If GenericData.Record is supposed imply SV, then what is the distinction between the last two branches of this code
   
   ```
   public static Object convert(Object value) {
       if (value == null) {
         return null;
       }
       Object convertedValue;
       if (value instanceof Collection) {
         convertedValue = handleMultiValue((Collection) value);
       } else if (value instanceof Map) {
         convertedValue = handleMap((Map) value);
       } else if(value instanceof GenericData.Record) {
         convertedValue = handleGenericRecord((GenericData.Record) value);
       } else {
         convertedValue = RecordReaderUtils.convertSingleValue(value);
       }
       return convertedValue;
     }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459805263



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();

Review comment:
       @npawar 's comment https://github.com/apache/incubator-pinot/pull/5746/#issuecomment-663302060 is also aligned with my observation. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460322646



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java
##########
@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
         source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
         if (source == null) {
           source = PinotDataType.OBJECT_ARRAY;
+        } else if (source == PinotDataType.HASHMAP) {

Review comment:
       See this comment on what should be the algorithm here
   
   https://github.com/apache/incubator-pinot/pull/5746/#discussion_r460321430




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] Jackie-Jiang commented on pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#issuecomment-664682965


   Please take a look at #5760 which is a general fix for this issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli commented on pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
jackjlli commented on pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#issuecomment-663308658


   > The fix should be in the `DataTypeTransformer` instead of here.
   > In `DataTypeTransformer` we should be able to handle map properly
   
   I've thought about that. But `DataTypeTransformer` uses a generic class for all kinds of data types though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mayankshriv commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459793932



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       Is `record` guaranteed to be non-empty?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460317334



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/DataTypeTransformer.java
##########
@@ -87,6 +90,8 @@ public GenericRow transform(GenericRow record) {
         source = MULTI_VALUE_TYPE_MAP.get(values[0].getClass());
         if (source == null) {
           source = PinotDataType.OBJECT_ARRAY;
+        } else if (source == PinotDataType.HASHMAP) {

Review comment:
       This should be done conditionally. In other words, consider two columns:
   
   col1 - primitive column defined as MV int/float/double/string (simple array of primitive types that Pinot supports).  Something like "CompaniesWorkedAt".
   ```
   [
        "val1",
        "val2",
        "val3"
   ]
   ```
   col2 - complex column defined as MV. Something like addresses which is an array (struct) or array (map).
   
   ```
   [
       {
            "k1" : "v1",
            "k2":  "v2",
            "k3": "v3"
      }
   ]
   ```
   
   The second is a complex column whereas is first is standard primitive. Now the AvroRecordExtractor and AvroUtils.convert() would have returned the second as an array of Map/HashMap for our sample data
   
   ```
   "dimension_***" : [ {
       "item_id" : {
         "string" : "some data"
       }
     }, {
       "item_id" : {
         "string" : "some data"
       }
     }
   
   ```
   Now since schema for "dimension_***" column indicates it is an array of primitives, we need to extract the actual value from each HashMap. In such cases, it is also true that map size would be 1 but that alone should not be the condition since map size could be 1 even if the object is actually a complex one. So we should convert from Map to primitive only  if the schema says so. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460321729



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/PinotDataType.java
##########
@@ -632,4 +713,22 @@ public static PinotDataType getPinotDataType(FieldSpec fieldSpec) {
             "Unsupported data type: " + dataType + " in field: " + fieldSpec.getName());
     }
   }
+
+  public static PinotDataType getPinotDataTypeFromHashMap(Map<Object, Object> map) {

Review comment:
       Suggest calling it getPrimitivePinotDataTypeFromMap
   
   HashMap is an implementation detail so let's not use that. Map is better since it is also an interface and suitable to represent an abstract complex type




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459804440



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       Also, if the bug is the extraction of multi-value column, I don't quite follow how this fix is handling that. Shouldn't we be looking at handleMultiValue()?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459805052



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();

Review comment:
       It seems that the current implementation of this function is doing what it is supposed to.
   
   The caller chain is something like convert(value) -> handleGenericRecord(record). The latter is invoked by the former only if it determines there is nestedness (that the value passed to convert isn't a SV, MV or MAP. It is essentially a struct. So for that we need to recurse and call convert for each field inside GenericData.Record. The fix seems to be removing all of that code. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459804013



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       If GenericData.Record is supposed imply SV, then what is the distinction between the last two branches of this code
   
   `public static Object convert(Object value) {
       if (value == null) {
         return null;
       }
       Object convertedValue;
       if (value instanceof Collection) {
         convertedValue = handleMultiValue((Collection) value);
       } else if (value instanceof Map) {
         convertedValue = handleMap((Map) value);
       } else if(value instanceof GenericData.Record) {
         convertedValue = handleGenericRecord((GenericData.Record) value);
       } else {
         convertedValue = RecordReaderUtils.convertSingleValue(value);
       }
       return convertedValue;
     }`

##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       If GenericData.Record is supposed imply SV, then what is the distinction between the last two branches of this code
   
   `public static Object convert(Object value) {
       if (value == null) {
         return null;
       }
       Object convertedValue;
       if (value instanceof Collection) {
         convertedValue = handleMultiValue((Collection) value);
       } else if (value instanceof Map) {
         convertedValue = handleMap((Map) value);
       } else if(value instanceof GenericData.Record) {
         convertedValue = handleGenericRecord((GenericData.Record) value);
       } else {
         convertedValue = RecordReaderUtils.convertSingleValue(value);
       }
       return convertedValue;
     }```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli closed pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
jackjlli closed pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459806163



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       This is going to break the existing functionality of extracting a struct within a record




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] npawar commented on pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
npawar commented on pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#issuecomment-663302060


   @jackjlli this doesn't look like the right fix for the issue. For GenericRecord, we should be looking at each field recursively.
   
   Also the AvroRecordExtractorComplexTypesTest will fail with this change.
   
   The code you're introducing makes an assumption that for a field of type GenericRecord in the avro raw data, we should always only look at the first fiel. That's not a good assumption right? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
jackjlli commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459805384



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       This is the original code to handle SV:
   ```
     public static Object handleSingleValue(@Nullable Object value) {
       if (value == null) {
         return null;
       }
       if (value instanceof GenericData.Record) {
         return handleSingleValue(((GenericData.Record) value).get(0));
       }
       return value;
     }
   ```
   To determine whether it's SV or MV it's from the Pinot schema.
   
   And the method `handleMultiValue()` would also call `convert()` method for every sub-value for MV column.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
jackjlli commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r459794468



##########
File path: pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java
##########
@@ -353,16 +353,6 @@ public static Object handleMap(Map map) {
    * Handles the conversion of every field of the GenericRecord
    */
   private static Object handleGenericRecord(GenericData.Record record) {
-    List<Field> fields = record.getSchema().getFields();
-    if (fields.isEmpty()) {
-      return null;
-    }
-
-    Map<Object, Object> convertedMap = new HashMap<>();
-    for (Field field : fields) {
-      String fieldName = field.name();
-      convertedMap.put(fieldName, convert(record.get(fieldName)));
-    }
-    return convertedMap;
+    return record.get(0);

Review comment:
       Yes, otherwise the if statement `(value instanceof GenericData.Record)` would have failed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5746: Fix code to correctly extract value of multi-value column from avro file

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5746:
URL: https://github.com/apache/incubator-pinot/pull/5746#discussion_r460321430



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/data/recordtransformer/PinotDataType.java
##########
@@ -470,6 +474,43 @@ public String toString(Object value) {
     }
   },
 
+  HASHMAP,

Review comment:
       We actually don't have to introduce this type (in future we should when we have Map) but now there is no need. 
   
   In the transform() method of DataTypeTransformer, we should do:
   
   ```
   if (values[0] instance of Map) {
    // and if schema says the column is primitive
   // then get the primitive type from Map using the function you have already implemented `getPinotDataTypeFromHashMap`
   call dest.convert()
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org