Posted to commits@hudi.apache.org by "prashantwason (via GitHub)" <gi...@apache.org> on 2023/04/18 06:06:29 UTC

[GitHub] [hudi] prashantwason opened a new pull request, #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

prashantwason opened a new pull request, #8484:
URL: https://github.com/apache/hudi/pull/8484

   [HUDI-6092] Reuse schema objects while deserializing log blocks.
   
   ### Change Logs
   
   1. Added a ConcurrentHashMap in HoodieDataBlock that maps each schema string to its parsed schema object.
   2. In HoodieHFileDataBlock and HoodieAvroDataBlock, use this map to retrieve the schema object rather than parsing the schema every time.
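
A minimal sketch of the caching idea in the change log, not the code from this PR. `SchemaCacheSketch`, `SchemaCacheDemo` and the `StringBuilder` stand-in for a parsed schema are hypothetical names chosen for illustration; with Avro, the parser function would presumably be something like `s -> new Schema.Parser().parse(s)`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical string-keyed cache. S stands in for the parsed schema type and
 * the parser function stands in for the (comparatively expensive) schema parser.
 */
final class SchemaCacheSketch<S> {
  private final ConcurrentHashMap<String, S> cache = new ConcurrentHashMap<>();
  private final Function<String, S> parser;

  SchemaCacheSketch(Function<String, S> parser) {
    this.parser = parser;
  }

  /** Returns the cached object for this schema string, parsing only on first use. */
  S get(String schemaString) {
    return cache.computeIfAbsent(schemaString, parser);
  }

  /** Number of distinct schema strings seen so far. */
  int size() {
    return cache.size();
  }
}

final class SchemaCacheDemo {
  public static void main(String[] args) {
    AtomicInteger parseCalls = new AtomicInteger();
    // Assumption: a StringBuilder plays the role of the parsed schema object.
    SchemaCacheSketch<StringBuilder> cache = new SchemaCacheSketch<>(s -> {
      parseCalls.incrementAndGet();
      return new StringBuilder(s);
    });

    String schema = "{\"type\":\"string\"}";
    StringBuilder first = cache.get(schema);
    StringBuilder second = cache.get(schema);

    // Both lookups return the same instance, so many log blocks with an
    // identical schema string would share one parsed object.
    System.out.println("same instance: " + (first == second));
    System.out.println("parse calls: " + parseCalls.get());
  }
}
```

Keying on the schema string keeps the lookup independent of which block supplied it; the trade-off is that the map holds one entry per distinct schema string for as long as the cache itself is reachable.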
   
   Also introduced try-with-resources blocks to auto-close resources that were being leaked.
   
   ### Impact
   
   Reduces memory consumption when reading log files with a very large number of log blocks, because blocks that carry the same schema string now share a single parsed schema object.
   
   ### Risk level (write none, low, medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8484:
URL: https://github.com/apache/hudi/pull/8484#discussion_r1169743276


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -195,14 +193,16 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> keys,
     List<String> sortedKeys = new ArrayList<>(keys);
     Collections.sort(sortedKeys);
 
-    final HoodieAvroHFileReader reader =
-             new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf));
+    try (HoodieAvroHFileReader reader =
+             new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf),
+             Option.of(getSchemaFromHeader()))) {
 
-    // Get writer's schema from the header
-    final ClosableIterator<HoodieRecord<IndexedRecord>> recordIterator =
-        fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema);
+      // Get writer's schema from the header
+      final ClosableIterator<HoodieRecord<IndexedRecord>> recordIterator =
+          fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema);
 
-    return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord<T>) data);
+      return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord<T>) data);
+    }

Review Comment:
   Is the iterator still valid once the outer reader has been closed?
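
The concern raised in this comment can be reproduced with a small stand-alone sketch. `SketchReader` and `EscapingIteratorDemo` are hypothetical classes, not Hudi code, and the sketch assumes an iterator that reads lazily through its reader; whether the real reader fails in the same way depends on its implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical reader: stands in for any closeable source that hands out lazy iterators. */
final class SketchReader implements AutoCloseable {
  private final List<String> rows;
  private boolean closed = false;

  SketchReader(List<String> rows) {
    this.rows = rows;
  }

  /** Lazy iterator: every call goes through the reader, so it fails once the reader is closed. */
  Iterator<String> iterator() {
    final Iterator<String> inner = rows.iterator();
    return new Iterator<String>() {
      @Override
      public boolean hasNext() {
        ensureOpen();
        return inner.hasNext();
      }

      @Override
      public String next() {
        ensureOpen();
        return inner.next();
      }
    };
  }

  private void ensureOpen() {
    if (closed) {
      throw new IllegalStateException("reader already closed");
    }
  }

  @Override
  public void close() {
    closed = true;
  }
}

final class EscapingIteratorDemo {
  /** Same shape as the diff under review: the iterator escapes the try block, the reader does not. */
  static Iterator<String> lookup(List<String> rows) {
    try (SketchReader reader = new SketchReader(rows)) {
      return reader.iterator();
    } // reader.close() runs here, before the caller ever touches the iterator
  }

  public static void main(String[] args) {
    Iterator<String> it = lookup(Arrays.asList("a", "b"));
    try {
      it.hasNext();
      System.out.println("iterator still usable");
    } catch (IllegalStateException e) {
      System.out.println("iterator unusable: " + e.getMessage());
    }
  }
}
```

In this sketch the first call on the returned iterator throws, because try-with-resources closes the reader as soon as `lookup` returns.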





[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1512538577

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 952ed0d4ba59dec11f517d3a55d73b4a8d0cff34 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1515822162

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484",
       "triggerID" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 952ed0d4ba59dec11f517d3a55d73b4a8d0cff34 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415) 
   * 04ccb55ceaf0d577a99bf05de9f681944c877d6e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 merged pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #8484:
URL: https://github.com/apache/hudi/pull/8484




[GitHub] [hudi] danny0405 commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1516527940

   @hudi-bot run azure




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1512549778

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 952ed0d4ba59dec11f517d3a55d73b4a8d0cff34 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1516222385

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484",
       "triggerID" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 04ccb55ceaf0d577a99bf05de9f681944c877d6e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1516573982

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484",
       "triggerID" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16505",
       "triggerID" : "1516527940",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 04ccb55ceaf0d577a99bf05de9f681944c877d6e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16505) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1518503019

   The test is flaky: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=16505&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a&l=717331




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1515812571

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 952ed0d4ba59dec11f517d3a55d73b4a8d0cff34 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415) 
   * 04ccb55ceaf0d577a99bf05de9f681944c877d6e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1512771603

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 952ed0d4ba59dec11f517d3a55d73b4a8d0cff34 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on a diff in pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8484:
URL: https://github.com/apache/hudi/pull/8484#discussion_r1172149905


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -195,14 +193,16 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> keys,
     List<String> sortedKeys = new ArrayList<>(keys);
     Collections.sort(sortedKeys);
 
-    final HoodieAvroHFileReader reader =
-             new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf));
+    try (HoodieAvroHFileReader reader =
+             new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf),
+             Option.of(getSchemaFromHeader()))) {
 
-    // Get writer's schema from the header
-    final ClosableIterator<HoodieRecord<IndexedRecord>> recordIterator =
-        fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema);
+      // Get writer's schema from the header
+      final ClosableIterator<HoodieRecord<IndexedRecord>> recordIterator =
+          fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema);
 
-    return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord<T>) data);
+      return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord<T>) data);
+    }

Review Comment:
   Probably not. I will back out the changes that introduce the try block.
   
   I think the intention is to return an iterator which will close the reader after iteration (hence the name ClosableIterator).
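
The intention described in this reply, an iterator that closes its reader once iteration is done, could look roughly like the following. This is a sketch under stated assumptions: `TrackedSource`, `SourceOwningIterator` and `OwningIteratorDemo` are hypothetical names and do not reproduce Hudi's `ClosableIterator` or `CloseableMappingIterator`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical closeable source; not a Hudi class. */
final class TrackedSource implements AutoCloseable {
  private final List<String> rows;
  private boolean closed = false;

  TrackedSource(List<String> rows) {
    this.rows = rows;
  }

  Iterator<String> rows() {
    return rows.iterator();
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

/** Iterator that takes ownership of its source and releases it when the iterator is closed. */
final class SourceOwningIterator implements Iterator<String>, AutoCloseable {
  private final TrackedSource source;
  private final Iterator<String> inner;

  SourceOwningIterator(TrackedSource source) {
    this.source = source;
    this.inner = source.rows();
  }

  @Override
  public boolean hasNext() {
    return inner.hasNext();
  }

  @Override
  public String next() {
    return inner.next();
  }

  @Override
  public void close() {
    source.close();
  }
}

final class OwningIteratorDemo {
  static SourceOwningIterator lookup(TrackedSource source) {
    // No try-with-resources here: the returned iterator is now responsible for closing the source.
    return new SourceOwningIterator(source);
  }

  public static void main(String[] args) {
    TrackedSource source = new TrackedSource(Arrays.asList("a", "b"));
    int count = 0;
    try (SourceOwningIterator it = lookup(source)) {
      while (it.hasNext()) {
        it.next();
        count++;
      }
      System.out.println("open during iteration: " + !source.isClosed());
    }
    System.out.println("rows read: " + count + ", closed afterwards: " + source.isClosed());
  }
}
```

With this shape the caller, not the lookup method, decides when the underlying source is released, typically by wrapping the iterator in its own try-with-resources.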





[GitHub] [hudi] hudi-bot commented on pull request #8484: [HUDI-6092] Reuse schema objects while deserializing log blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8484:
URL: https://github.com/apache/hudi/pull/8484#issuecomment-1517222042

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16415",
       "triggerID" : "952ed0d4ba59dec11f517d3a55d73b4a8d0cff34",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484",
       "triggerID" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "04ccb55ceaf0d577a99bf05de9f681944c877d6e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16505",
       "triggerID" : "1516527940",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 04ccb55ceaf0d577a99bf05de9f681944c877d6e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16484) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16505) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

