
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/16 02:33:28 UTC

[GitHub] [hudi] vburenin commented on a change in pull request #2440: [HUDI-1532] Fixed suboptimal implementation of a magic sequence search

vburenin commented on a change in pull request #2440:
URL: https://github.com/apache/hudi/pull/2440#discussion_r558774265



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
##########
@@ -274,19 +275,27 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
   }
 
   private long scanForNextAvailableBlockOffset() throws IOException {
+    // Make the buffer large enough to scan through the file as quickly as possible, especially if it is on S3/GCS.
+    // Using a smaller buffer incurs many more API calls, drastically increasing the storage cost,
+    // and scanning through large files may take days to complete.
+    byte[] dataBuf = new byte[1024 * 1024];

Review comment:
       The buffered reader has to check several things to copy the right data, the readFully logic itself is not trivial, and the position is also modified each time it reads 6 bytes. Even without profiling, I bet the overhead is significant.
   1MB seems like a good number to me: not too much, not too little. In my past experience with FS IO, blocks larger than 1MB gave diminishing returns. The best number would be the one that matches the underlying block read size, but that depends on the reader, which can be any implementation.
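
   To illustrate the trade-off described above, here is a minimal sketch of a chunked marker search. It is not the code from this pull request: the class `MagicScanner`, the method `findNext`, and the marker value used in any call are hypothetical names chosen for the example, and it reads from a plain `java.io.InputStream` rather than the Hudi log reader. The idea is to issue one large read per chunk and carry the last `magic.length - 1` bytes over to the next chunk, so a marker that straddles a chunk boundary is still found.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Hudi: finds a byte marker using large chunked reads.
final class MagicScanner {

  // Returns the stream offset of the first occurrence of magic, or -1 if it is absent.
  static long findNext(InputStream in, byte[] magic, int bufSize) throws IOException {
    if (magic.length == 0 || bufSize < magic.length) {
      throw new IllegalArgumentException("bufSize must be at least magic.length (> 0)");
    }
    byte[] buf = new byte[bufSize];
    int carried = 0;     // bytes kept from the previous chunk, stored at the start of buf
    long bufStart = 0;   // stream offset that buf[0] corresponds to
    while (true) {
      // One large read per iteration instead of many tiny ones.
      int n = in.read(buf, carried, buf.length - carried);
      if (n < 0) {
        return -1;       // end of stream, marker not found
      }
      int filled = carried + n;
      int idx = indexOf(buf, filled, magic);
      if (idx >= 0) {
        return bufStart + idx;
      }
      // Keep the tail so a marker split across two chunks is not missed.
      int keep = Math.min(magic.length - 1, filled);
      System.arraycopy(buf, filled - keep, buf, 0, keep);
      bufStart += filled - keep;
      carried = keep;
    }
  }

  // Naive search for magic within the first len bytes of buf.
  private static int indexOf(byte[] buf, int len, byte[] magic) {
    outer:
    for (int i = 0; i + magic.length <= len; i++) {
      for (int j = 0; j < magic.length; j++) {
        if (buf[i + j] != magic[j]) {
          continue outer;
        }
      }
      return i;
    }
    return -1;
  }
}
```

   With a call such as `MagicScanner.findNext(stream, marker, 1024 * 1024)`, the number of read calls drops to roughly one per megabyte instead of one per few bytes, which is the cost the comment above is concerned with on object stores. The 1MB figure is the value suggested in this thread, not a measured optimum.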



