Posted to commits@hudi.apache.org by "Zouxxyy (via GitHub)" <gi...@apache.org> on 2023/06/13 10:27:04 UTC

[GitHub] [hudi] Zouxxyy opened a new pull request, #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Zouxxyy opened a new pull request, #8955:
URL: https://github.com/apache/hudi/pull/8955

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low, medium, or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1227903585


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;
 
   public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException {
+    // get base schema
+    ParquetMetadata fileFooter =
+        ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
+    MessageType messageType = fileFooter.getFileMetaData().getSchema();
+    baseSchema = new AvroSchemaConverter(conf).convert(messageType);
+
     // if exists read columns, we need to filter columns.
     List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
     if (!readColNames.isEmpty()) {
-      // get base schema
-      ParquetMetadata fileFooter =
-          ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
-      MessageType messageType = fileFooter.getFileMetaData().getSchema();
-      baseSchema = new AvroSchemaConverter(conf).convert(messageType);
-      // filter schema for reading
-      final Schema filterSchema = Schema.createRecord(baseSchema.getName(),
-          baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(),
-          baseSchema.getFields().stream()
-              .filter(f -> readColNames.contains(f.name()))
-              .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
-              .collect(Collectors.toList()));
+      Schema filterSchema = HoodieAvroUtils.generateProjectionSchema(baseSchema, readColNames);

Review Comment:
   Can we write a test case for it?
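   A minimal sketch of what such a test could exercise, using a plain-Java stand-in for `HoodieAvroUtils.generateProjectionSchema` (the class `ProjectionSketch` and its shape are illustrative assumptions, not Hudi's implementation, which works on Avro `Schema` objects): project the requested columns in base-schema field order, and fail fast when a requested column is missing.

   ```java
   import java.util.ArrayList;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical stand-in for HoodieAvroUtils.generateProjectionSchema:
   // fields are modeled as plain names to show the behavior a test could
   // assert on.
   public class ProjectionSketch {
     static List<String> project(List<String> schemaFields, List<String> readCols) {
       Set<String> available = new HashSet<>(schemaFields);
       for (String col : readCols) {
         if (!available.contains(col)) {
           // Mirrors the "Field ... not found in log schema" error raised by Hudi.
           throw new IllegalArgumentException("Field " + col
               + " not found in log schema. Query cannot proceed! Derived Schema Fields: " + schemaFields);
         }
       }
       // The projection keeps base-schema field order, not the requested order.
       List<String> projected = new ArrayList<>();
       for (String field : schemaFields) {
         if (readCols.contains(field)) {
           projected.add(field);
         }
       }
       return projected;
     }

     public static void main(String[] args) {
       List<String> base = List.of("_row_key", "timestamp", "pii_col");
       System.out.println(project(base, List.of("timestamp", "_row_key"))); // [_row_key, timestamp]
     }
   }
   ```

   A real test would additionally assert on the exception type (`HoodieException`) and the projected Avro schema's fields.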





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590640122

   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * ccd68253c126283a7cccbf401c7e5c3247a1fef0 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822) 
   * f1437ab3a8dd88d71f3b3451be627b09bafdb865 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] CTTY commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "CTTY (via GitHub)" <gi...@apache.org>.
CTTY commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1624363521

   I'm seeing this failure when running unit tests; the failing test seems to have been added by this PR. Error message:
   ```
   Error:  Failures: 
   Error:    TestHoodieAvroUtils.testGenerateProjectionSchema:453 expected: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]> but was: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [_hoodie_commit_time, non_pii_col, _hoodie_partition_path, _row_key, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]>
   
   ```
   
   It looks like only the column order differs, but could you help me understand whether this is a valid failure or whether we should fix the test?




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590568232

   ## CI report:
   
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820) 
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 02ddd518f6451f4dd0f29a192a52a92504cc6221 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821) 
   * ccd68253c126283a7cccbf401c7e5c3247a1fef0 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1589049421

   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1591241960

   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * f1437ab3a8dd88d71f3b3451be627b09bafdb865 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824) 
   * 9bad575368afc21c302894db5f389407bbcf9265 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229355763


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   > What does `colTypes == null || colTypes.isEmpty()` mean, and why does it return true?
   
   This follows https://github.com/apache/hudi/pull/3391#discussion_r1006287106





[GitHub] [hudi] cdmikechen commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "cdmikechen (via GitHub)" <gi...@apache.org>.
cdmikechen commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229585351


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   In my [original PR HUDI-83](https://github.com/apache/hudi/pull/3391/files) I didn't declare the `baseSchema` variable and didn't modify the `getCurrentValue` method.
   In fact, I would like to know whether there is any problem, or any NPE, if we don't declare `baseSchema`.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1227971234


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   Here I just use a simple check (`contains("timestamp")`).
   Field:
   ```text
   ts1 array<timestamp>, 
   ts2 map<string, timestamp>, 
   ts3 struct<province:timestamp, city:string>
   ```
   and the corresponding type strings will be
   
   ```
   array<timestamp>
   map<string,timestamp>
   struct<province:timestamp,city:string>
   ```
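   A runnable sketch of that check (the class name `TimestampTypeCheck` is an assumption for illustration; the actual code calls `types.get(i).contains("timestamp")` on these Hive type strings):

   ```java
   // Hive renders nested column types as strings such as "array<timestamp>",
   // so a plain substring check also catches timestamps nested inside
   // array/map/struct types.
   public class TimestampTypeCheck {
     static boolean mayContainTimestamp(String hiveTypeString) {
       return hiveTypeString.contains("timestamp");
     }

     public static void main(String[] args) {
       System.out.println(mayContainTimestamp("array<timestamp>"));                       // true
       System.out.println(mayContainTimestamp("map<string,timestamp>"));                  // true
       System.out.println(mayContainTimestamp("struct<province:timestamp,city:string>")); // true
       System.out.println(mayContainTimestamp("struct<a:string,b:bigint>"));              // false
     }
   }
   ```

   Note the trade-off: a struct field merely *named* timestamp (e.g. `struct<timestamp:string>`) also matches, which is the edge case the `replace("timestamp:", "")` suggestion in this thread tries to exclude.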





[GitHub] [hudi] xicm commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "xicm (via GitHub)" <gi...@apache.org>.
xicm commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228996809


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   How about `types.get(i).replace("timestamp:", "").contains("timestamp")`? Are there any other cases? :)





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590498774

   ## CI report:
   
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820) 
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 02ddd518f6451f4dd0f29a192a52a92504cc6221 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1624527608

   > It looks like only the column order differs, but could you help me understand whether this is a valid failure or whether we should fix the test?
   
   Are you testing with Java 17? https://github.com/apache/hudi/pull/9136/files#top
   It seems that the iteration order of items in a set changed in Java 17. We can change the test case like this, if we need to support Java 17:
   
   from
   
   ```java
       assertEquals("Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: "
               + "[non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col,"
               + " _hoodie_commit_seqno, _hoodie_file_name, timestamp]",
           assertThrows(HoodieException.class, () ->
               HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage());
   ```
   to
   
   ```java
    assertTrue(assertThrows(HoodieException.class, () ->
        HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field")))
        .getMessage().contains("Field fake_field not found in log schema. Query cannot proceed!"));
   ```
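   An alternative, order-insensitive assertion (a sketch under the assumption that only the bracketed field list varies between JDKs; `OrderInsensitiveCheck` and `derivedFields` are illustrative names, not Hudi helpers): parse the "Derived Schema Fields" list out of the message and compare it as a set.

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Parses the bracketed "Derived Schema Fields: [...]" list from the error
   // message into a Set, so the comparison no longer depends on the JDK's
   // set iteration order.
   public class OrderInsensitiveCheck {
     static Set<String> derivedFields(String message) {
       int start = message.indexOf('[');
       int end = message.lastIndexOf(']');
       Set<String> fields = new HashSet<>();
       for (String field : message.substring(start + 1, end).split(",")) {
         fields.add(field.trim());
       }
       return fields;
     }

     public static void main(String[] args) {
       String java8Msg = "... Derived Schema Fields: [non_pii_col, _row_key, timestamp]";
       String java17Msg = "... Derived Schema Fields: [_row_key, non_pii_col, timestamp]";
       System.out.println(derivedFields(java8Msg).equals(derivedFields(java17Msg))); // true
     }
   }
   ```

   The `contains`-based fix above is simpler; this variant keeps checking the full field set if that coverage matters.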




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1592260189

   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17861) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229351990


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   What does `colTypes == null || colTypes.isEmpty()` mean, and why does it return true?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;
     }
+
+    ArrayList<TypeInfo> types = TypeInfoUtils.getTypeInfosFromTypeString(colTypes);
     List<String> names = getIOColumns(conf);
-    List<String> types = getIOColumnTypes(conf);
-    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+    return IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
+        .anyMatch(i -> typeContainsTimestamp(types.get(i)));
+  }
+
+  public static boolean typeContainsTimestamp(TypeInfo type) {
+    Category category = type.getCategory();

Review Comment:
   Can we move this method to `TypeInfoUtils`?





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229355763


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   > What does `colTypes == null || colTypes.isEmpty()` mean, and why does it return true?
   
   https://github.com/apache/hudi/pull/3391#discussion_r1006287106





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229615890


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   done





[GitHub] [hudi] xicm commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "xicm (via GitHub)" <gi...@apache.org>.
xicm commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228926554


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   Agreed. @cdmikechen, do you have any other concerns?





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1227971234


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   Here I just use a simple check (`contains("timestamp")`).
   
   Field:
   ```text
   ts1 array<timestamp>, 
   ts2 map<string, timestamp>, 
   ts3 struct<province:timestamp, city:string>
   ```
   the corresponding type strings will be
   
   ```
   array<timestamp>
   map<string,timestamp>
   struct<province:timestamp,city:string>
   ```
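
   The nested type strings above are why the final patch recurses over Hive's parsed `TypeInfo` tree rather than matching substrings. The sketch below (plain Java, hypothetical helper names, no Hive dependency — not the merged implementation) illustrates one false positive that a pure `contains("timestamp")` check admits, and how even a crude token-based check avoids it:

```java
import java.util.Arrays;

// Standalone sketch, not the merged implementation: the PR ultimately walks
// Hive's parsed TypeInfo tree. Helper names here are hypothetical.
public class TimestampTypeCheck {

  // First-iteration check from the PR: plain substring match.
  static boolean containsTimestampNaive(String hiveType) {
    return hiveType.contains("timestamp");
  }

  // Crude structural variant: split the type string on Hive's type-syntax
  // delimiters and require an exact "timestamp" token.
  static boolean containsTimestampToken(String hiveType) {
    return Arrays.stream(hiveType.split("[<>,:]"))
        .anyMatch("timestamp"::equals);
  }

  public static void main(String[] args) {
    // Nested types from the discussion above: both checks agree.
    System.out.println(containsTimestampNaive("array<timestamp>"));      // true
    System.out.println(containsTimestampToken("map<string,timestamp>")); // true
    System.out.println(containsTimestampToken("struct<province:timestamp,city:string>")); // true

    // Struct field names are embedded in the type string, so a field merely
    // *named* timestamp_col fools the substring check...
    System.out.println(containsTimestampNaive("struct<timestamp_col:string>")); // true
    // ...while the token check rejects it.
    System.out.println(containsTimestampToken("struct<timestamp_col:string>")); // false
  }
}
```

   Even the token check would misread a struct field literally named `timestamp`, which is why recursing over `TypeInfo` categories, as the merged code does, is the robust option.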





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229598004


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   > In my [origin PR HUDI-83](https://github.com/apache/hudi/pull/3391/files) I didn't declare the `baseSchema` variable and didn't modify the `getCurrentValue` method. In fact I would like to know if there is any problem or no NPE if we don't declare the `baseSchema`?
   
   I have tested that `baseSchema` needs to be used in `getCurrentValue`; otherwise, the result fields will be null, like this: https://github.com/apache/hudi/pull/7173#issuecomment-1316208189
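
   To see why keeping the full file schema matters, consider a minimal sketch (plain Java, hypothetical names — the real reader pads a Hadoop `ArrayWritable` using the Avro schema read from the parquet footer): Hive expects one slot per base column in each returned row, so values read through a projected schema must be placed back at their original positions, with unread columns left null:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch with hypothetical names; the real reader works with
// Avro Schema and Hadoop ArrayWritable objects.
public class ProjectionPadding {

  // Scatter values read through a projected schema back into the positions
  // the full (base) schema defines; unread columns stay null.
  static Object[] padToBaseSchema(List<String> baseCols,
                                  List<String> readCols,
                                  Object[] projectedValues) {
    Object[] row = new Object[baseCols.size()];
    for (int i = 0; i < readCols.size(); i++) {
      row[baseCols.indexOf(readCols.get(i))] = projectedValues[i];
    }
    return row;
  }

  public static void main(String[] args) {
    List<String> base = List.of("id", "ts1", "city");
    // Query `select id, ts1`: only two values come back from the reader.
    Object[] row = padToBaseSchema(base, List.of("id", "ts1"),
        new Object[]{1, "2021-12-25 12:01:01"});
    System.out.println(Arrays.toString(row)); // [1, 2021-12-25 12:01:01, null]
  }
}
```

   Without `baseSchema`, the reader cannot recover those original positions, which is how projected queries like `select id, ts1` end up returning null fields.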
   





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590454062

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799) 
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228950772


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;
 
   public HoodieAvroParquetReader(InputSplit inputSplit, Configuration conf) throws IOException {
+    // get base schema
+    ParquetMetadata fileFooter =
+        ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
+    MessageType messageType = fileFooter.getFileMetaData().getSchema();
+    baseSchema = new AvroSchemaConverter(conf).convert(messageType);
+
     // if exists read columns, we need to filter columns.
     List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
     if (!readColNames.isEmpty()) {
-      // get base schema
-      ParquetMetadata fileFooter =
-          ParquetFileReader.readFooter(conf, ((ParquetInputSplit) inputSplit).getPath(), ParquetMetadataConverter.NO_FILTER);
-      MessageType messageType = fileFooter.getFileMetaData().getSchema();
-      baseSchema = new AvroSchemaConverter(conf).convert(messageType);
-      // filter schema for reading
-      final Schema filterSchema = Schema.createRecord(baseSchema.getName(),
-          baseSchema.getDoc(), baseSchema.getNamespace(), baseSchema.isError(),
-          baseSchema.getFields().stream()
-              .filter(f -> readColNames.contains(f.name()))
-              .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
-              .collect(Collectors.toList()));
+      Schema filterSchema = HoodieAvroUtils.generateProjectionSchema(baseSchema, readColNames);

Review Comment:
   done





[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229365125


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;
     }
+
+    ArrayList<TypeInfo> types = TypeInfoUtils.getTypeInfosFromTypeString(colTypes);
     List<String> names = getIOColumns(conf);
-    List<String> types = getIOColumnTypes(conf);
-    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+    return IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
+        .anyMatch(i -> typeContainsTimestamp(types.get(i)));
+  }
+
+  public static boolean typeContainsTimestamp(TypeInfo type) {
+    Category category = type.getCategory();

Review Comment:
   Ignore this; I didn't notice that `TypeInfoUtils` is a Hive class.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229372864


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   I'm confused too, @cdmikechen





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1230323622


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   @cdmikechen Have you ever tested `select id, ts1 from test_ts_1`? It will return null if we don't use `baseSchema`.
   Below is my full test; feel free to try it:
   
   ```sql
   -- spark-sql
   create table test_ts_1(
     id int, 
     ts1 timestamp)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_1
   SELECT 1,
   cast ('2021-12-25 12:01:01' as timestamp);
   
   create table test_ts_2(
     id int, 
     ts1 array<timestamp>, 
     ts2 map<string, timestamp>, 
     ts3 struct<province:timestamp, city:string>)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_2
   SELECT 1,
   array(cast ('2021-12-25 12:01:01' as timestamp)),
   map('key', cast ('2021-12-25 12:01:01' as timestamp)),
   struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');
   
   -- hive
   select * from test_ts_1;
   select id from test_ts_1;
   select ts1 from test_ts_1;
   select id, ts1 from test_ts_1;
   select count(*) from test_ts_1;
   
   select * from test_ts_2;
   select id from test_ts_2;
   select ts1 from test_ts_2;
   select id, ts1 from test_ts_2;
   select count(*) from test_ts_2;
   ```
   
   CC @danny0405 @xicm 





[GitHub] [hudi] danny0405 merged pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #8955:
URL: https://github.com/apache/hudi/pull/8955




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1591623650

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835",
       "triggerID" : "9bad575368afc21c302894db5f389407bbcf9265",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>






[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1592024916

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835",
       "triggerID" : "9bad575368afc21c302894db5f389407bbcf9265",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844",
       "triggerID" : "1591629538",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590577021

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 02ddd518f6451f4dd0f29a192a52a92504cc6221 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821) 
   * ccd68253c126283a7cccbf401c7e5c3247a1fef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822) 
   * f1437ab3a8dd88d71f3b3451be627b09bafdb865 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1227968015


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))

Review Comment:
   @xicm Can you explain when `types.isEmpty()` will be true?



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   Here I just use a simple check (`contains("timestamp")`).
   Fields:
   ```text
     ts1 array<timestamp>, 
     ts2 map<string, timestamp>, 
     ts3 struct<province:timestamp, city:string>
   ```
   The corresponding type strings will be:
   
   ```
   array<timestamp>
   map<string,timestamp>
   struct<province:timestamp,city:string>
   ```
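The point above can be sketched as a tiny standalone check (hypothetical helper, not Hudi code): the rendered Hive type strings of the nested fields all contain `timestamp`, so a plain substring test catches timestamps nested inside arrays, maps, and structs.

```java
public class NestedTimestampCheck {

    // Naive check discussed in the review: any occurrence of "timestamp"
    // in the Hive type string, nested or not.
    public static boolean containsTimestamp(String hiveTypeString) {
        return hiveTypeString.contains("timestamp");
    }

    public static void main(String[] args) {
        String[] nested = {
            "array<timestamp>",
            "map<string,timestamp>",
            "struct<province:timestamp,city:string>"
        };
        for (String t : nested) {
            // All three print "true": the substring test finds the nested type.
            System.out.println(t + " -> " + containsTimestamp(t));
        }
    }
}
```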



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   @xicm Here I think it should return false directly, what do you think.





[GitHub] [hudi] cdmikechen commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "cdmikechen (via GitHub)" <gi...@apache.org>.
cdmikechen commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229552284


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   As far as I remember, some of the Hive test cases at that time were not configured correctly during test initialization, so to stay compatible with those errors I didn't return directly.
   If CI passes without problems now, I think it should be OK.





[GitHub] [hudi] cdmikechen commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "cdmikechen (via GitHub)" <gi...@apache.org>.
cdmikechen commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1230281580


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   @Zouxxyy 
   I'm a bit confused; I remember doing scenario testing against Hive when I first made the changes (about 1 year ago), including `count(*)`.
   I don't know whether some subsequent new feature or PR has affected this, so I'll run another test later this week. Although we have added a separate class to handle `timestamp` types, my original intention was to reuse the native Hive/Hadoop code paths as much as possible for the other fields; otherwise it would be costly for us to maintain.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229615890


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   return false now





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1589338401

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] xicm commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "xicm (via GitHub)" <gi...@apache.org>.
xicm commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228926159


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))

Review Comment:
   https://github.com/apache/hudi/pull/3391#discussion_r1006287106





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229372864


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   @cdmikechen





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229591534


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   > In such case, the timestamp can be read correctly anyway?
   
   yes





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1592401345

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835",
       "triggerID" : "9bad575368afc21c302894db5f389407bbcf9265",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844",
       "triggerID" : "1591629538",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17861",
       "triggerID" : "1592256057",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17861) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1230323622


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   @cdmikechen Have you ever tested `select id, ts1 from test_ts_1`? It will return null if we don't use `baseSchema`.
   Below is my full test; feel free to try it:
   
   ```sql
   -- spark-sql
   create table test_ts_1(
     id int, 
     ts1 timestamp)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_1
   SELECT 1,
   cast ('2021-12-25 12:01:01' as timestamp);
   
   create table test_ts_2(
     id int, 
     ts1 array<timestamp>, 
     ts2 map<string, timestamp>, 
     ts3 struct<province:timestamp, city:string>)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_2
   SELECT 1,
   array(cast ('2021-12-25 12:01:01' as timestamp)),
   map('key', cast ('2021-12-25 12:01:01' as timestamp)),
   struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');
   
   -- hive
   select * from test_ts_1;
   select id from test_ts_1;
   select ts1 from test_ts_1;
   select id, ts1 from test_ts_1;
   select count(*) from test_ts_1;
   
   select * from test_ts_2;
   select id from test_ts_2;
   select ts1 from test_ts_2;
   select id, ts1 from test_ts_2;
   select count(*) from test_ts_2;
   ```
   
   CC @danny0405 @xicm 





[GitHub] [hudi] xicm commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "xicm (via GitHub)" <gi...@apache.org>.
xicm commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228996809


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   How about `types.get(i).replace("timestamp:", "").contains("timestamp")`? Is there any other case? Hard-coded, though. :)
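For illustration only (hypothetical helper names, not Hudi code), the suggested pre-step strips `timestamp:` occurrences, where `timestamp` is a struct *field name* rather than a type, before searching. This avoids the false positive a plain `contains` check has on a type string like `struct<timestamp:string,city:string>`.

```java
public class TimestampTypeMatch {

    // Plain substring check: misfires when a struct field is *named* "timestamp".
    public static boolean naive(String hiveType) {
        return hiveType.contains("timestamp");
    }

    // Suggested refinement from the review: drop "timestamp:" (a field name,
    // not a type) before looking for the timestamp type itself.
    public static boolean refined(String hiveType) {
        return hiveType.replace("timestamp:", "").contains("timestamp");
    }

    public static void main(String[] args) {
        String fieldNamedTimestamp = "struct<timestamp:string,city:string>";
        System.out.println(naive(fieldNamedTimestamp));    // true (false positive)
        System.out.println(refined(fieldNamedTimestamp));  // false
        System.out.println(refined("struct<province:timestamp,city:string>")); // true
    }
}
```

This is still a heuristic tied to Hive's exact type-string rendering; a recursive walk over parsed `TypeInfo` objects would be the robust alternative.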





[GitHub] [hudi] cdmikechen commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "cdmikechen (via GitHub)" <gi...@apache.org>.
cdmikechen commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229585351


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   In my [original PR HUDI-83](https://github.com/apache/hudi/pull/3391/files) I didn't declare the `baseSchema` variable or modify the `getCurrentValue` method.
   In fact, I would like to know whether there is any problem if we don't declare `baseSchema`.





[GitHub] [hudi] xicm commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "xicm (via GitHub)" <gi...@apache.org>.
xicm commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229017168


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   ```java
   ArrayList<TypeInfo> typeInfos = TypeInfoUtils.getTypeInfosFromTypeString("struct<timestamp:string,city:string>");
   if (typeInfos.get(0) instanceof StructTypeInfo) {
     boolean hasTimestamp = ((StructTypeInfo) typeInfos.get(0)).getAllStructFieldTypeInfos().stream()
         .anyMatch(typeInfo -> typeInfo.getTypeName().equals("timestamp"));
   }
   ```
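
   The one-level check above could be generalized to arbitrarily nested types. Here is a self-contained sketch of the idea, using a toy type tree in place of Hive's `TypeInfo` hierarchy (the `Type` class and its fields below are illustrative stand-ins, not Hive API):

   ```java
   import java.util.Arrays;
   import java.util.List;

   public class TimestampTypeCheck {
     // Toy stand-in for Hive's TypeInfo: a type name plus nested field/element types.
     static final class Type {
       final String name;
       final List<Type> children;
       Type(String name, Type... children) {
         this.name = name;
         this.children = Arrays.asList(children);
       }
     }

     // Recursively true if this type, or any nested type, is timestamp.
     static boolean containsTimestamp(Type t) {
       return t.name.equals("timestamp")
           || t.children.stream().anyMatch(TimestampTypeCheck::containsTimestamp);
     }
   }
   ```

   The real implementation would walk `StructTypeInfo`, `ListTypeInfo`, and `MapTypeInfo` children the same way.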





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590459255

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799) 
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1592256057

   @hudi-bot run azure




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590506386

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820) 
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 02ddd518f6451f4dd0f29a192a52a92504cc6221 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229336409


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   When can `readCols` be empty?





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229344501


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   > When can `readCols` be empty?
   
   As far as I know, queries such as `count(*)` don't need to read any columns.
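
   A minimal sketch of the guard under discussion, with the Hadoop `Configuration` plumbing stripped out (the two lists stand in for what `getReadColumnNames` and `getIOColumnTypes` would return; this is an illustration of the intent, not the PR's exact code):

   ```java
   import java.util.List;

   public class SupportTimestampSketch {
     // No projected columns (e.g. count(*)) means no timestamp values are read,
     // so the special parquet-avro reader is not needed.
     static boolean supportTimestamp(List<String> readCols, List<String> ioColumnTypes) {
       if (readCols.isEmpty()) {
         return false;
       }
       // Simplified: the real code matches each read column against its declared type.
       return ioColumnTypes.contains("timestamp");
     }
   }
   ```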





[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228978004


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   Let's try to solve it.





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590464243

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799) 
   * de5ed6c39038469ee83aabb6b0c68137b24ee118 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820) 
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] cdmikechen commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "cdmikechen (via GitHub)" <gi...@apache.org>.
cdmikechen commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1230281580


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   @Zouxxyy 
   I'm a bit confused: I remember doing scenario testing against Hive when I first made these changes (about a year ago), including `count(*)` and queries on specific fields.
   I don't know whether some subsequent feature or PR has affected this; I'll run another test later this week. Although we added a separate class to handle `timestamp` types, my original intention was to reuse the original Hive/Hadoop code paths as much as possible for other fields; otherwise it would be costly for us to maintain going forward.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1228950416


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   A false positive can occur here, e.g. with `struct<timestamp:string,city:string>`, where a field is named `timestamp` but no column is of timestamp type. Handling complex structures properly is a bit involved, though. Should we just handle the strings hackily instead?
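
   The problem is easy to demonstrate: a plain substring match fires on a field that is merely *named* `timestamp`, even though no column is of timestamp type:

   ```java
   public class SubstringFalsePositive {
     public static void main(String[] args) {
       // The field is named "timestamp", but its type is string.
       String colType = "struct<timestamp:string,city:string>";
       boolean naive = colType.contains("timestamp");
       System.out.println(naive); // true, even though no column has timestamp type
     }
   }
   ```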





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229358031


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;
     }
+
+    ArrayList<TypeInfo> types = TypeInfoUtils.getTypeInfosFromTypeString(colTypes);
     List<String> names = getIOColumns(conf);
-    List<String> types = getIOColumnTypes(conf);
-    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+    return IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
+        .anyMatch(i -> typeContainsTimestamp(types.get(i)));
+  }
+
+  public static boolean typeContainsTimestamp(TypeInfo type) {
+    Category category = type.getCategory();

Review Comment:
   > Can we move this method to `TypeInfoUtils`.
   
   A new class? ok





[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229364113


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;

Review Comment:
   To be conservative, we should return false.





[GitHub] [hudi] danny0405 commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229559387


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");

Review Comment:
   In that case, can the timestamp still be read correctly anyway?





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1591259974

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835",
       "triggerID" : "9bad575368afc21c302894db5f389407bbcf9265",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * f1437ab3a8dd88d71f3b3451be627b09bafdb865 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824) 
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1591629538

   @hudi-bot run azure




[GitHub] [hudi] danny0405 commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1589015459

   @xicm Hi, can you help with the review ?




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1589062023

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fc4b1395758027be5e1614a0547029202c8a9a3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229043602


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +112,21 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
     }
     List<String> names = getIOColumns(conf);
     List<String> types = getIOColumnTypes(conf);
     return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+        .anyMatch(i -> types.get(i).contains("timestamp"));

Review Comment:
   @danny0405 @xicm done





[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1590858988

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * f1437ab3a8dd88d71f3b3451be627b09bafdb865 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1229358031


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java:
##########
@@ -112,22 +120,53 @@ public static List<Pair<String,String>> getIOColumnNameAndTypes(Configuration co
   /**
    * If schema contains timestamp columns, this method is used for compatibility when there is no timestamp fields.
    *
-   * <p>We expect 3 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
+   * <p>We expect 2 cases to use parquet-avro reader {@link org.apache.hudi.hadoop.avro.HoodieAvroParquetReader} to read timestamp column:
    *
    * <ol>
    *   <li>Read columns contain timestamp type;</li>
    *   <li>Empty original columns;</li>
-   *   <li>Empty read columns but existing original columns contain timestamp type.</li>
    * </ol>
    */
   public static boolean supportTimestamp(Configuration conf) {
     List<String> readCols = Arrays.asList(getReadColumnNames(conf));
     if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");
+      return false;
+    }
+
+    String colTypes = conf.get(IOConstants.COLUMNS_TYPES, "");
+    if (colTypes == null || colTypes.isEmpty()) {
+      return true;
     }
+
+    ArrayList<TypeInfo> types = TypeInfoUtils.getTypeInfosFromTypeString(colTypes);
     List<String> names = getIOColumns(conf);
-    List<String> types = getIOColumnTypes(conf);
-    return types.isEmpty() || IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
-        .anyMatch(i -> types.get(i).equals("timestamp"));
+    return IntStream.range(0, names.size()).filter(i -> readCols.contains(names.get(i)))
+        .anyMatch(i -> typeContainsTimestamp(types.get(i)));
+  }
+
+  public static boolean typeContainsTimestamp(TypeInfo type) {
+    Category category = type.getCategory();

Review Comment:
   > Can we move this method to `TypeInfoUtils`.
   
   A new class? 
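
   The recursion discussed above can be sketched with a standalone stand-in for Hive's `TypeInfo` tree (the real code dispatches on `Category` for LIST/MAP/STRUCT/PRIMITIVE; here the class name and structure are hypothetical): a type "contains timestamp" if it is a timestamp leaf or any nested child type does.

   ```java
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical stand-in for Hive's TypeInfo tree: a leaf holds a primitive
   // type name; a complex type holds its child types (a list's element type,
   // a map's key/value types, a struct's field types all become children).
   public class TypeNode {
       final String primitive;        // non-null only for primitive leaf types
       final List<TypeNode> children; // non-empty only for complex types

       TypeNode(String primitive) {
           this.primitive = primitive;
           this.children = List.of();
       }

       TypeNode(TypeNode... children) {
           this.primitive = null;
           this.children = Arrays.asList(children);
       }

       // Recursive check in the spirit of typeContainsTimestamp in the diff.
       static boolean containsTimestamp(TypeNode t) {
           if (t.primitive != null) {
               return t.primitive.equals("timestamp");
           }
           return t.children.stream().anyMatch(TypeNode::containsTimestamp);
       }

       public static void main(String[] args) {
           // struct<province:timestamp, city:string>
           TypeNode struct = new TypeNode(new TypeNode("timestamp"), new TypeNode("string"));
           System.out.println(containsTimestamp(struct)); // true
           // array<int>
           TypeNode arr = new TypeNode(new TypeNode("int"));
           System.out.println(containsTimestamp(arr));    // false
       }
   }
   ```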





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8955:
URL: https://github.com/apache/hudi/pull/8955#discussion_r1230323622


##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java:
##########
@@ -50,24 +51,20 @@ public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
   private Schema baseSchema;

Review Comment:
   @cdmikechen 
   Have you tested `select id, ts1 from test_ts_1`? It will return null if we don't use `baseSchema`.
   
   Below is my full test; feel free to try it:
   
   ```sql
   -- spark-sql
   create table test_ts_1(
     id int, 
     ts1 timestamp)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_1
   SELECT 1,
   cast ('2021-12-25 12:01:01' as timestamp);
   
   create table test_ts_2(
     id int, 
     ts1 array<timestamp>, 
     ts2 map<string, timestamp>, 
     ts3 struct<province:timestamp, city:string>)
   using hudi
   tblproperties(
     type='mor', 
     primaryKey='id'
   );
   
   INSERT INTO test_ts_2
   SELECT 1,
   array(cast ('2021-12-25 12:01:01' as timestamp)),
   map('key', cast ('2021-12-25 12:01:01' as timestamp)),
   struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');
   
   -- hive
   select * from test_ts_1;
   select id from test_ts_1;
   select ts1 from test_ts_1;
   select id, ts1 from test_ts_1;
   select count(*) from test_ts_1;
   
   select * from test_ts_2;
   select id from test_ts_2;
   select ts1 from test_ts_2;
   select id, ts1 from test_ts_2;
   select count(*) from test_ts_2;
   ```
   
   CC @danny0405 @xicm 




[GitHub] [hudi] hudi-bot commented on pull request #8955: [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8955:
URL: https://github.com/apache/hudi/pull/8955#issuecomment-1591637426

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17799",
       "triggerID" : "fc4b1395758027be5e1614a0547029202c8a9a3e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17820",
       "triggerID" : "de5ed6c39038469ee83aabb6b0c68137b24ee118",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6c161440f1cc359d754c48cf2590303c11786e2b",
       "triggerType" : "PUSH"
     }, {
       "hash" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17821",
       "triggerID" : "02ddd518f6451f4dd0f29a192a52a92504cc6221",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17822",
       "triggerID" : "ccd68253c126283a7cccbf401c7e5c3247a1fef0",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17824",
       "triggerID" : "f1437ab3a8dd88d71f3b3451be627b09bafdb865",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835",
       "triggerID" : "9bad575368afc21c302894db5f389407bbcf9265",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9bad575368afc21c302894db5f389407bbcf9265",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844",
       "triggerID" : "1591629538",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 6c161440f1cc359d754c48cf2590303c11786e2b UNKNOWN
   * 9bad575368afc21c302894db5f389407bbcf9265 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17835) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17844) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

