Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/16 02:26:38 UTC

[GitHub] [hudi] n3nash commented on a change in pull request #2440: [HUDI-1532] Fixed suboptimal implementation of a magic sequence search

n3nash commented on a change in pull request #2440:
URL: https://github.com/apache/hudi/pull/2440#discussion_r558771039



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
##########
@@ -274,19 +275,27 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
   }
 
   private long scanForNextAvailableBlockOffset() throws IOException {
+    // Make the buffer large enough to scan through the file as quickly as possible, especially if it is on S3/GCS.
+    // Using a smaller buffer incurs many more API calls, drastically increasing storage costs,
+    // and scanning large files could take days to complete.
+    byte[] dataBuf = new byte[1024 * 1024];

Review comment:
       @vburenin Just to confirm: by "a lot less additional overhead", do you mean the in-memory byte-copy operation that previously had to be done once per 6 bytes, versus once per 1MB in this implementation? (Since the byte-array comparison itself is the same.) Can we quantify this overhead?
   Additionally, what is the reasoning behind keeping it at 1MB?
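
   To make the tradeoff under discussion concrete, here is a minimal, self-contained sketch of a chunked magic-sequence scan. It is not the Hudi implementation: the `MagicScanSketch` class, the `scanForMagic` helper, and the `#HUDI#` placeholder magic bytes are all hypothetical, and it reads from a plain `InputStream` rather than an FSDataInputStream. It only illustrates why a large read buffer means far fewer read calls (relevant on S3/GCS), while the per-position byte-array comparison stays the same; the carry-over logic handles a magic sequence that straddles two chunk boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicScanSketch {
    // Hypothetical 6-byte magic marker, standing in for the real log-format magic.
    static final byte[] MAGIC = {'#', 'H', 'U', 'D', 'I', '#'};

    /**
     * Returns the offset (from the stream's starting position) of the first
     * occurrence of MAGIC, or -1 if not found. Reads the stream in bufSize
     * chunks, so remote storage sees roughly fileSize / bufSize read calls
     * instead of one call per candidate position.
     */
    static long scanForMagic(InputStream in, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        // Carry the last MAGIC.length - 1 bytes of each chunk so a magic
        // sequence straddling two chunks is still detected.
        byte[] carry = new byte[0];
        long consumed = 0; // total stream bytes read so far
        int n;
        while ((n = in.read(buf)) > 0) {
            byte[] window = new byte[carry.length + n];
            System.arraycopy(carry, 0, window, 0, carry.length);
            System.arraycopy(buf, 0, window, carry.length, n);
            for (int i = 0; i + MAGIC.length <= window.length; i++) {
                if (Arrays.equals(Arrays.copyOfRange(window, i, i + MAGIC.length), MAGIC)) {
                    // window starts at stream offset (consumed - carry.length)
                    return consumed - carry.length + i;
                }
            }
            consumed += n;
            int keep = Math.min(MAGIC.length - 1, window.length);
            carry = Arrays.copyOfRange(window, window.length - keep, window.length);
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[3000];
        // Place the magic so it straddles the 2048-byte chunk boundary.
        System.arraycopy(MAGIC, 0, data, 2047, MAGIC.length);
        System.out.println(scanForMagic(new ByteArrayInputStream(data), 2048)); // prints 2047
    }
}
```

   In this sketch the extra copying is the per-chunk `System.arraycopy` into `window` plus a small `copyOfRange` per compared position; the reviewer's question is exactly whether that per-position copy cost (tiny per call, but done millions of times for 6-byte reads over a large file) is what the PR eliminates.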




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org