Posted to commits@hudi.apache.org by "voonhous (via GitHub)" <gi...@apache.org> on 2023/01/27 10:35:38 UTC

[GitHub] [hudi] voonhous opened a new pull request, #7761: [MINOR] Standardise schema concepts

voonhous opened a new pull request, #7761:
URL: https://github.com/apache/hudi/pull/7761

   ### Change Logs
   
   Standardise the usage of variable names pertaining to schemas across Hudi. 
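   For example (pieced together from the review diff further down in this thread; exact call sites may differ), the kind of rename involved looks like this:
   
   ```java
   // Before: the name suggests this is the raw schema of the base file.
   InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());
   
   // After: the schema is really the result of merging the file schema with the
   // query schema, so it is now named accordingly.
   InternalSchema mergeSchema = internalSchemaManager.getMergeSchema(path.getName());
   ```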
   
   ### Impact
   
   No impact
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


[GitHub] [hudi] xiarixiaoyao commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1407917143

   @hudi-bot run azure


[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1406372636

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1408264379

   ## CI report:
   
   * e4d5c47999bb428d215fb0240187b415fe4327ce Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14763) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1407947263

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14666) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14756) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] voonhous commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090228713


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java:
##########
@@ -50,8 +50,7 @@ public static ClosableIterator<RowData> getParquetRecordIterator(
       Path path,
       long splitStart,
       long splitLength) throws IOException {
-    InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());

Review Comment:
   @trushev Thank you for helping to review this. I've reverted the changes and added more Javadocs. Please take a look and review it again.



[GitHub] [hudi] danny0405 commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1407264960

   @xiarixiaoyao Can you take a look if you have time?


[GitHub] [hudi] voonhous commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090132321


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/InternalSchemaManager.java:
##########
@@ -93,60 +93,62 @@ public InternalSchema getQuerySchema() {
     return querySchema;
   }
 
-  InternalSchema getFileSchema(String fileName) {
+  InternalSchema getMergeSchema(String fileName) {
     if (querySchema.isEmptySchema()) {
       return querySchema;
     }
     long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(fileName));
-    InternalSchema fileSchemaUnmerged = InternalSchemaCache.getInternalSchemaByVersionId(
+    InternalSchema fileSchema = InternalSchemaCache.getInternalSchemaByVersionId(
         commitInstantTime, tablePath, getHadoopConf(), validCommits);
-    if (querySchema.equals(fileSchemaUnmerged)) {
+    if (querySchema.equals(fileSchema)) {
       return InternalSchema.getEmptyInternalSchema();
     }
-    return new InternalSchemaMerger(fileSchemaUnmerged, querySchema, true, true).mergeSchema();
+    return new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
   }
 
   /**
-   * This method returns a mapping of columns that have type inconsistencies between the fileSchema and querySchema.
+   * This method returns a mapping of columns that have type inconsistencies between the mergeSchema and querySchema.
    * This is done by:
    * <li>1. Finding the columns with type changes</li>
    * <li>2. Get a map storing the index of these columns with type changes; Map of -> (colIdxInQueryFieldNames, colIdxInQuerySchema)</li>
    * <li>3. For each selectedField with type changes, build a castMap containing the cast/conversion details;
    * Map of -> (selectedPos, Cast([from] fileType, [to] queryType))</li>
    *
-   * @param fileSchema InternalSchema representation of the file's schema (acquired from commit/.schema metadata)
+   * @param mergeSchema InternalSchema representation of mergeSchema (prioritise use of fileSchemaType) that is used for reading base parquet files
    * @param queryFieldNames array containing the columns of a Hudi Flink table
    * @param queryFieldTypes array containing the field types of the columns of a Hudi Flink table
    * @param selectedFields array containing the index of the columns of interest required (indexes are based on queryFieldNames and queryFieldTypes)
    * @return a castMap containing the information of how to cast a selectedField from the fileType to queryType.
    *
    * @see CastMap
    */
-  CastMap getCastMap(InternalSchema fileSchema, String[] queryFieldNames, DataType[] queryFieldTypes, int[] selectedFields) {
+  CastMap getCastMap(InternalSchema mergeSchema, String[] queryFieldNames, DataType[] queryFieldTypes, int[] selectedFields) {
     Preconditions.checkArgument(!querySchema.isEmptySchema(), "querySchema cannot be empty");
-    Preconditions.checkArgument(!fileSchema.isEmptySchema(), "fileSchema cannot be empty");
+    Preconditions.checkArgument(!mergeSchema.isEmptySchema(), "mergeSchema cannot be empty");
 
     CastMap castMap = new CastMap();
     // map storing the indexes of columns with type changes Map of -> (colIdxInQueryFieldNames, colIdxInQuerySchema)
-    Map<Integer, Integer> posProxy = getPosProxy(fileSchema, queryFieldNames);
+    Map<Integer, Integer> posProxy = getPosProxy(mergeSchema, queryFieldNames);
     if (posProxy.isEmpty()) {
       // no type changes
       castMap.setFileFieldTypes(queryFieldTypes);
       return castMap;
     }
     List<Integer> selectedFieldList = IntStream.of(selectedFields).boxed().collect(Collectors.toList());
-    List<DataType> fileSchemaAsDataTypes = AvroSchemaConverter.convertToDataType(
-        AvroInternalSchemaConverter.convert(fileSchema, "tableName")).getChildren();
+    // mergeSchema is built with useColumnTypeFromFileSchema = true

Review Comment:
   Hmmm, I added this comment when tracing the code to remind myself WHY fetching `fileFieldTypes` from `mergeSchema` works.
   
   On top of that, from line 106, it is pretty clear that the code is building the `mergeSchema` with `useColumnTypeFromFileSchema = true`. 
   
   Moving this comment from line 138 to line 106 would change the intention behind the comment. 
   
   If this line is at line 106, it would be telling the readers "WHAT" the code is doing.
   If it is at line 138, it would be telling the readers "WHY" fetching file types from `mergeSchema` works.
   
   So, I don't really think moving this comment is necessary.
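   
   For anyone reading along, here is a minimal sketch of the two calls being discussed (both already appear in the diff above; one of the boolean arguments corresponds to `useColumnTypeFromFileSchema = true` per the comment, and the local variable names are only illustrative):
   
   ```java
   // getMergeSchema(): built with useColumnTypeFromFileSchema = true, so wherever a
   // column's type differs, the merged schema keeps the type from the *file* schema.
   InternalSchema mergeSchema =
       new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
   
   // getCastMap(): because of that flag, converting mergeSchema back to Flink
   // DataTypes yields the file-side field types, i.e. the "from" side of each cast.
   List<DataType> fileFieldTypes = AvroSchemaConverter.convertToDataType(
       AvroInternalSchemaConverter.convert(mergeSchema, "tableName")).getChildren();
   ```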



[GitHub] [hudi] xiarixiaoyao commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1407903228

   @voonhous 
   Thank you for your contribution.
   With this PR, our code is more readable.


[GitHub] [hudi] xiarixiaoyao commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1409863546

   The UT failure has nothing to do with this PR.
   


[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1408150600

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14666) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14756) 
   * e4d5c47999bb428d215fb0240187b415fe4327ce Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14763) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090126531


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/InternalSchemaManager.java:
##########
@@ -93,60 +93,62 @@ public InternalSchema getQuerySchema() {
     return querySchema;
   }
 
-  InternalSchema getFileSchema(String fileName) {
+  InternalSchema getMergeSchema(String fileName) {
     if (querySchema.isEmptySchema()) {
       return querySchema;
     }
     long commitInstantTime = Long.parseLong(FSUtils.getCommitTime(fileName));
-    InternalSchema fileSchemaUnmerged = InternalSchemaCache.getInternalSchemaByVersionId(
+    InternalSchema fileSchema = InternalSchemaCache.getInternalSchemaByVersionId(
         commitInstantTime, tablePath, getHadoopConf(), validCommits);
-    if (querySchema.equals(fileSchemaUnmerged)) {
+    if (querySchema.equals(fileSchema)) {
       return InternalSchema.getEmptyInternalSchema();
     }
-    return new InternalSchemaMerger(fileSchemaUnmerged, querySchema, true, true).mergeSchema();
+    return new InternalSchemaMerger(fileSchema, querySchema, true, true).mergeSchema();
   }
 
   /**
-   * This method returns a mapping of columns that have type inconsistencies between the fileSchema and querySchema.
+   * This method returns a mapping of columns that have type inconsistencies between the mergeSchema and querySchema.
    * This is done by:
    * <li>1. Finding the columns with type changes</li>
    * <li>2. Get a map storing the index of these columns with type changes; Map of -> (colIdxInQueryFieldNames, colIdxInQuerySchema)</li>
    * <li>3. For each selectedField with type changes, build a castMap containing the cast/conversion details;
    * Map of -> (selectedPos, Cast([from] fileType, [to] queryType))</li>
    *
-   * @param fileSchema InternalSchema representation of the file's schema (acquired from commit/.schema metadata)
+   * @param mergeSchema InternalSchema representation of mergeSchema (prioritise use of fileSchemaType) that is used for reading base parquet files
    * @param queryFieldNames array containing the columns of a Hudi Flink table
    * @param queryFieldTypes array containing the field types of the columns of a Hudi Flink table
    * @param selectedFields array containing the index of the columns of interest required (indexes are based on queryFieldNames and queryFieldTypes)
    * @return a castMap containing the information of how to cast a selectedField from the fileType to queryType.
    *
    * @see CastMap
    */
-  CastMap getCastMap(InternalSchema fileSchema, String[] queryFieldNames, DataType[] queryFieldTypes, int[] selectedFields) {
+  CastMap getCastMap(InternalSchema mergeSchema, String[] queryFieldNames, DataType[] queryFieldTypes, int[] selectedFields) {
     Preconditions.checkArgument(!querySchema.isEmptySchema(), "querySchema cannot be empty");
-    Preconditions.checkArgument(!fileSchema.isEmptySchema(), "fileSchema cannot be empty");
+    Preconditions.checkArgument(!mergeSchema.isEmptySchema(), "mergeSchema cannot be empty");
 
     CastMap castMap = new CastMap();
     // map storing the indexes of columns with type changes Map of -> (colIdxInQueryFieldNames, colIdxInQuerySchema)
-    Map<Integer, Integer> posProxy = getPosProxy(fileSchema, queryFieldNames);
+    Map<Integer, Integer> posProxy = getPosProxy(mergeSchema, queryFieldNames);
     if (posProxy.isEmpty()) {
       // no type changes
       castMap.setFileFieldTypes(queryFieldTypes);
       return castMap;
     }
     List<Integer> selectedFieldList = IntStream.of(selectedFields).boxed().collect(Collectors.toList());
-    List<DataType> fileSchemaAsDataTypes = AvroSchemaConverter.convertToDataType(
-        AvroInternalSchemaConverter.convert(fileSchema, "tableName")).getChildren();
+    // mergeSchema is built with useColumnTypeFromFileSchema = true

Review Comment:
   how about move this comment to line 106?



[GitHub] [hudi] trushev commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "trushev (via GitHub)" <gi...@apache.org>.
trushev commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090203187


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java:
##########
@@ -50,8 +50,7 @@ public static ClosableIterator<RowData> getParquetRecordIterator(
       Path path,
       long splitStart,
       long splitLength) throws IOException {
-    InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());

Review Comment:
   Why did you move this code into the else branch?
   `getFileSchema()` may return an empty schema, which allows us to avoid unnecessary creation of `ParquetSplitRecordIterator`:
   
   ```java
     if (querySchema.equals(fileSchema)) {
       return InternalSchema.getEmptyInternalSchema();
     }
   ```
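   
   A hypothetical sketch of the caller-side shape this enables (not the actual `RecordIterators` code; the branch bodies are placeholders):
   
   ```java
   InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());
   if (fileSchema.isEmptySchema()) {
     // query schema and file schema match: the plain Parquet iterator is enough,
     // so the ParquetSplitRecordIterator wrapper can be skipped entirely
   } else {
     // schema evolution detected: wrap the iterator so rows can be cast from the
     // file types to the query types
   }
   ```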



[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1408070065

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14666) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14756) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1408084801

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14666) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14756) 
   * e4d5c47999bb428d215fb0240187b415fe4327ce UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] xiarixiaoyao merged pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "xiarixiaoyao (via GitHub)" <gi...@apache.org>.
xiarixiaoyao merged PR #7761:
URL: https://github.com/apache/hudi/pull/7761


[GitHub] [hudi] trushev commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "trushev (via GitHub)" <gi...@apache.org>.
trushev commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090236238


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java:
##########
@@ -50,8 +50,7 @@ public static ClosableIterator<RowData> getParquetRecordIterator(
       Path path,
       long splitStart,
       long splitLength) throws IOException {
-    InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());

Review Comment:
   LGTM, thank you for the improvement



[GitHub] [hudi] hudi-bot commented on pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7761:
URL: https://github.com/apache/hudi/pull/7761#issuecomment-1406379303

   ## CI report:
   
   * 93e4876eb944c4b3976ac5422586ff35e8b7aef5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14666) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


[GitHub] [hudi] voonhous commented on a diff in pull request #7761: [MINOR] Standardise schema concepts on Flink Engine

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #7761:
URL: https://github.com/apache/hudi/pull/7761#discussion_r1090219771


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/RecordIterators.java:
##########
@@ -50,8 +50,7 @@ public static ClosableIterator<RowData> getParquetRecordIterator(
       Path path,
       long splitStart,
       long splitLength) throws IOException {
-    InternalSchema fileSchema = internalSchemaManager.getFileSchema(path.getName());

Review Comment:
   My bad, I failed to account for this case. Let me revert this and add a few more comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org