Posted to commits@hudi.apache.org by "prashantwason (via GitHub)" <gi...@apache.org> on 2023/04/20 23:06:00 UTC

[GitHub] [hudi] prashantwason opened a new pull request, #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

prashantwason opened a new pull request, #8526:
URL: https://github.com/apache/hudi/pull/8526

   [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
   
   ### Change Logs
   
   1. Removed the eager isBlockCorrupted check performed after reading the block size in HoodieLogFileReader.
   2. Added validation checks after reading each item (version, size, block type, content, etc.) from the log block.
   3. Added a unit test that generates various corruption scenarios and validates that the corrupted blocks are detected.
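
   The deferred-validation approach from points 1 and 2 can be sketched as follows. This is a toy illustration under assumed field layouts; the class and field names (tryParseBlock, MAX_REASONABLE_BLOCK_SIZE) are hypothetical, not Hudi's actual HoodieLogFileReader logic:

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.nio.ByteBuffer;

   // Minimal sketch of deferred validation: parse the block optimistically and
   // validate each field as it is read, instead of paying a seek + read +
   // re-seek up front to detect corruption. Any read failure (e.g. EOF from a
   // partially written block) or failed sanity check signals a corrupt block.
   // The field layout here is illustrative, not Hudi's actual log format.
   public class DeferredBlockCheck {
     static final long MAX_REASONABLE_BLOCK_SIZE = 1L << 30;
     static final int MAX_BLOCK_TYPE_ORDINAL = 10;

     // Returns true if the toy block header parses cleanly, false if corrupt.
     static boolean tryParseBlock(byte[] data) {
       try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
         long blockSize = in.readLong();          // 1. total size of the block
         if (blockSize < 0 || blockSize > MAX_REASONABLE_BLOCK_SIZE) {
           return false;                          // validate immediately, no seek
         }
         int version = in.readInt();              // 2. log format version
         if (version < 0) {
           return false;
         }
         int blockType = in.readInt();            // 3. block type
         if (blockType < 0 || blockType > MAX_BLOCK_TYPE_ORDINAL) {
           return false;
         }
         // Headers, content, and footers would be read and validated similarly.
         return true;
       } catch (IOException e) {                  // EOF mid-read => partial block
         return false;
       }
     }

     public static void main(String[] args) {
       byte[] good = ByteBuffer.allocate(16).putLong(16).putInt(1).putInt(2).array();
       System.out.println(tryParseBlock(good));       // true
       byte[] truncated = {1, 2, 3};                  // too short to parse
       System.out.println(tryParseBlock(truncated));  // false
     }
   }
   ```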
   
   ### Impact
   
   Improved performance when reading a log file over a high-latency connection or when a large number of log blocks exist.
   
   ### Risk level (write none, low, medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1577401953

   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610752719

   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145) 
   * 24df2fc7cb9b3d7e35543a4d2a72f59f61f441da UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1203417127


##########
hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java:
##########
@@ -2571,6 +2579,174 @@ public void testDataBlockFormatAppendAndReadWithProjectedSchema(
     }
   }
 
+  @Test
+  public void testCorruptBlocks() throws IOException, URISyntaxException, InterruptedException {
+    List<byte[]> corruptBlocks = createCorruptDataBlocksForTest();
+    int maxTotalBlocks = 20;
+    int runs = 10;
+    Writer writer =
+        HoodieLogFormat.newWriterBuilder().onParentPath(partitionPath).withFileExtension(HoodieLogFile.DELTA_EXTENSION)
+            .withFileId("test-fileid1").overBaseCommit("100").withFs(fs).build();
+    HoodieLogFile logFile = writer.getLogFile();
+    Path logPath = logFile.getPath();
+    fs = FSUtils.getFs(fs.getUri().toString(), fs.getConf());
+
+    for (; runs >= 0; runs--) {
+      int totalBlocksForThisRun = rand.nextInt(maxTotalBlocks) + 1;
+      fs.create(logPath).close();
+      boolean[] verify = new boolean[totalBlocksForThisRun];
+      List<List<IndexedRecord>> goodBlocks = new ArrayList<>();
+      for (int i = 0; i < totalBlocksForThisRun; i++) {
+        boolean pickCorruptBlock = rand.nextBoolean();
+        if (pickCorruptBlock) {
+          byte[] corruptBlock = corruptBlocks.get(rand.nextInt(corruptBlocks.size()));
+          FSDataOutputStream outputStream = fs.append(logPath);
+          outputStream.write(corruptBlock);
+          outputStream.close();
+        } else {
+          addGoodBlockToLogFile(goodBlocks);
+        }
+        verify[i] = pickCorruptBlock;
+      }
+
+      Reader reader = HoodieLogFormat.newReader(fs, logFile, SchemaTestUtil.getSimpleSchema());
+      int curr = 0;
+      int dataBlocksGenerated = 0;
+      int dataBlocksSeen = 0;
+      ByteArrayOutputStream bOStream = new ByteArrayOutputStream();
+      IOUtils.copyBytes(fs.open(logPath), bOStream, 4096);
+      try {
+        while (reader.hasNext()) {
+          HoodieLogBlockType generatedType = verify[curr] ? HoodieLogBlockType.CORRUPT_BLOCK : HoodieLogBlockType.AVRO_DATA_BLOCK;
+          HoodieLogBlock block = reader.next();
+
+          if (block.getBlockType().equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) {
+            HoodieAvroDataBlock blk = (HoodieAvroDataBlock) block;
+            assertEquals(getRecords(blk), goodBlocks.get(dataBlocksGenerated));
+          }
+
+          dataBlocksGenerated += (generatedType.equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) ? 1 : 0;
+          dataBlocksSeen += (block.getBlockType().equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) ? 1 : 0;
+
+          assertEquals(generatedType, block.getBlockType(), Arrays.toString(bOStream.toByteArray()));
+
+          curr++;
+        }
+        reader.close();
+      } catch (Exception e) {
+        System.out.println(Arrays.toString(bOStream.toByteArray()));

Review Comment:
   removed
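
The corruption-injection idea behind the test above can be sketched independently of Hudi: write a validly framed payload, flip a byte, and assert that validation flags it. The framing, checksum, and all names here are illustrative stand-ins, not Hudi's log block format:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Toy frame: [length:int][crc32:long][payload], so a single flipped byte in
// the payload is caught by the checksum and a truncated frame is caught by
// the length check.
public class CorruptionInjectionSketch {
  static byte[] frame(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload);
    ByteBuffer buf = ByteBuffer.allocate(4 + 8 + payload.length);
    buf.putInt(payload.length).putLong(crc.getValue()).put(payload);
    return buf.array();
  }

  static boolean isValid(byte[] framed) {
    if (framed.length < 12) {
      return false;                           // header itself is truncated
    }
    ByteBuffer buf = ByteBuffer.wrap(framed);
    int len = buf.getInt();
    if (len < 0 || len != framed.length - 12) {
      return false;                           // length does not match frame
    }
    long storedCrc = buf.getLong();
    byte[] payload = new byte[len];
    buf.get(payload);
    CRC32 crc = new CRC32();
    crc.update(payload);
    return crc.getValue() == storedCrc;       // checksum catches bit flips
  }

  public static void main(String[] args) {
    byte[] good = frame("hello log block".getBytes());
    System.out.println(isValid(good));        // true
    byte[] bad = good.clone();
    bad[bad.length - 1] ^= 0xFF;              // flip one payload byte
    System.out.println(isValid(bad));         // false
  }
}
```

As in the test above, interleaving such corrupt frames with good ones and checking what the reader reports per position exercises both the detection and the recovery path.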





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1211803695


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   hey Danny: regarding your suggestion to move the whole try-catch into a separate method, the readBlock() itself would not have much left, so we can probably leave it as is.
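
For readers following the diff, the step-8 reverse-pointer consistency check can be illustrated in isolation. The constant MAGIC_LEN and the class name are stand-ins (Hudi derives the magic length from magicBuffer):

```java
// Toy illustration of the step-8 check from the diff: the trailing long acts
// as a reverse pointer, so it must equal the block size plus the magic
// marker length, and the reader's current position must line up with
// blockStartPos + blockSize. Either mismatch indicates a partial write.
public class ReversePointerCheck {
  static final int MAGIC_LEN = 6;   // stand-in for magicBuffer.length

  static boolean isConsistent(long blockStartPos, long blockSize,
                              long currentPos, long logBlockLength) {
    return blockSize == (logBlockLength - MAGIC_LEN)
        && currentPos == (blockStartPos + blockSize);
  }

  public static void main(String[] args) {
    // Consistent: a 100-byte block starting at 0, trailer records 106.
    System.out.println(isConsistent(0, 100, 100, 106));  // true
    // Partial block: positions no longer line up with the declared size.
    System.out.println(isConsistent(0, 100, 90, 106));   // false
  }
}
```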



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   sg





[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1213452295


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   Moved the various exceptions to their own catch blocks to make the handling clear.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1603642074

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1603491241

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b14848a571950bddf85114655ccbf12f56d6d4da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713) 
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] codope commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1246115508


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java:
##########
@@ -264,8 +267,9 @@ public static Option<byte[]> tryReadContent(FSDataInputStream inputStream, Integ
 
     // TODO re-use buffer if stream is backed by buffer
     // Read the contents in memory
-    byte[] content = new byte[contentLength];
-    inputStream.readFully(content, 0, contentLength);
+    ValidationUtils.checkArgument(contentLength <= Integer.MAX_VALUE, String.format("Content length %d exceeds maximum value of %d", contentLength, Integer.MAX_VALUE));

Review Comment:
   What's the point of changing the `contentLength` from int to long type and then validating that it's less than Integer.MAX_VALUE?
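   The trade-off codope is asking about can be illustrated with a small stand-alone sketch (plain Java, not the actual Hudi code; the class and method names here are hypothetical). Reading the length as a `long` and validating it before the narrowing cast lets a corrupted size field fail loudly, whereas a bare `(int)` cast would silently truncate it:

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.nio.ByteBuffer;

   public class ContentLengthCheck {
     // Read the on-disk length as a long, then validate before casting,
     // so an out-of-range value is reported instead of wrapping.
     static int readContentLength(DataInputStream in) throws IOException {
       long contentLength = in.readLong();
       if (contentLength < 0 || contentLength > Integer.MAX_VALUE) {
         throw new IllegalArgumentException(
             "Content length " + contentLength + " is invalid; block is likely corrupt");
       }
       return (int) contentLength; // safe: range already validated
     }

     public static void main(String[] args) throws IOException {
       // A corrupted size field such as 2^35 would wrap to 0 under a bare (int) cast.
       byte[] corrupt = ByteBuffer.allocate(8).putLong(1L << 35).array();
       try {
         readContentLength(new DataInputStream(new ByteArrayInputStream(corrupt)));
         System.out.println("no error");
       } catch (IllegalArgumentException e) {
         System.out.println("rejected"); // prints "rejected"
       }
     }
   }
   ```

   Under this reading, the `long` type is only an intermediate: it exists so the validation can observe the raw 8-byte value before it is narrowed.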





[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1517047922

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1203418156


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   If we are here due to an IOException, it may or may not be a corrupt block; an IOException can occur for various reasons.
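   A hedged sketch of the distinction being raised (the interface and names below are illustrative stand-ins, not Hudi's actual API): on failure, seek back and re-run the corruption check, and rethrow when the check does not confirm corruption, so a transient I/O error is never silently reported as a corrupt block.

   ```java
   import java.io.IOException;

   public class CorruptBlockHandling {

     // Illustrative stand-in for the reader's seek + magic-marker scan.
     interface SeekableReader {
       void seek(long pos) throws IOException;
       boolean isBlockCorrupted(long blockSize) throws IOException;
     }

     // Returns true only when the failure is explained by a corrupt block;
     // otherwise rethrows the original exception (e.g. a transient network
     // error) instead of swallowing it as corruption.
     static boolean failureWasCorruption(SeekableReader reader, long blockStartPos,
                                         long blockSize, IOException cause) throws IOException {
       reader.seek(blockStartPos); // rewind to where the block began
       if (reader.isBlockCorrupted(blockSize)) {
         return true; // caller can now build a corrupt-block placeholder
       }
       throw cause; // not corruption: let the real I/O error surface
     }
   }
   ```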





[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1578139637

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * bd883a7ac2f3ede102b44e8696fa638266e1c3bc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1560468070

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610733461

   @hudi-bot run azure




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1578054233

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * bd883a7ac2f3ede102b44e8696fa638266e1c3bc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1577862258

   @prashantwason You may need to rebase with the latest master and force-push the branch.




[GitHub] [hudi] danny0405 commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1173284388


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   Guess lines 169 ~ 172 are the lines we wanna fix; can we just enclose that code snippet in the `try catch` block?
   It does not seem right that we also pull the data block construction from lines 212 ~ 246 into the `try catch`.
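   For illustration, here is a minimal, self-contained sketch of the narrow scoping being suggested, using a hypothetical toy block layout (a long size field followed by that many payload bytes) instead of Hudi's real format, and hypothetical names (`encode`, `readBlock`) that are not part of the PR. Only the size read is guarded and mapped to a "corrupt" result; later parsing stays outside the guard.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;

   public final class NarrowGuardSketch {

     // Toy serializer: [long size][payload bytes]. Stand-in for the real block.
     static byte[] encode(byte[] payload) {
       try {
         ByteArrayOutputStream bos = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bos);
         out.writeLong(payload.length);      // size field, as in step 1
         out.write(payload);                 // toy stand-in for steps 2..9
         return bos.toByteArray();
       } catch (IOException e) {
         throw new IllegalStateException(e); // cannot happen for in-memory streams
       }
     }

     static String readBlock(byte[] bytes) {
       DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
       long blockSize;
       try {
         blockSize = in.readLong();          // step 1: the only guarded read
       } catch (IOException e) {
         return "CORRUPT";                   // mirrors createCorruptBlock(...)
       }
       // Later steps run outside the narrow guard, as suggested above;
       // their failures are NOT silently mapped to a corrupt result.
       byte[] payload = new byte[(int) blockSize];
       try {
         in.readFully(payload);
       } catch (IOException e) {
         throw new IllegalStateException("unexpected failure past the guard", e);
       }
       return "OK:" + blockSize;
     }
   }
   ```
   
   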





[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1585045538

   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * bd883a7ac2f3ede102b44e8696fa638266e1c3bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629) 
   * b14848a571950bddf85114655ccbf12f56d6d4da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1572527934

   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610733530

   Rebased on latest master. 




[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1203416332


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   The reason lines 212 to 246 are within the try-catch is the scope of variables. If I moved those lines out of the try-catch, I would have to define all the variables (blockType, header, footer, content) before the try so that they remain in scope after the try-catch block ends.
   
   That seemed like ugly code to me:
   
   private HoodieLogBlock readBlock() throws IOException {
       long blockStartPos = 0;
       long blockSize = 0;
       HoodieLogBlockType blockType;
       HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc;
       Map<HeaderMetadataType, String> header;
   
       try {
            .....
            .....
        } catch()
             ...
        }
     
        switch(...)
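   A self-contained toy version of the structure actually chosen may make the contrast concrete. The format, names (`encode`, `readBlock`), and return strings here are hypothetical, not Hudi's API; the point is only that when each successful path returns from inside the try, the locals (blockType, len, payload) never need to be hoisted above it.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;

   public final class ScopeSketch {

     // Toy serializer: [int blockType][int length][payload bytes].
     static byte[] encode(int blockType, String body) {
       try {
         ByteArrayOutputStream bos = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bos);
         byte[] payload = body.getBytes(StandardCharsets.UTF_8);
         out.writeInt(blockType);
         out.writeInt(payload.length);
         out.write(payload);
         return bos.toByteArray();
       } catch (IOException e) {
         throw new IllegalStateException(e); // cannot happen in-memory
       }
     }

     // All locals live inside the try, and the blockType switch returns
     // from inside it -- no declarations hoisted above the try-catch.
     static String readBlock(byte[] bytes) {
       try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
         int blockType = in.readInt();
         int len = in.readInt();
         byte[] payload = new byte[len];
         in.readFully(payload);
         switch (blockType) {               // stand-in for the blockType switch
           case 1:
             return "DATA:" + new String(payload, StandardCharsets.UTF_8);
           case 2:
             return "DELETE:" + new String(payload, StandardCharsets.UTF_8);
           default:
             return "UNSUPPORTED";
         }
       } catch (IOException e) {
         return "CORRUPT";                  // stand-in for createCorruptBlock
       }
     }
   }
   ```
   
   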





[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1560473142

   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1213451923


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   Moved the various exceptions to their own catch blocks to make the handling clear.
   





[GitHub] [hudi] danny0405 closed pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 closed pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.
URL: https://github.com/apache/hudi/pull/8526




[GitHub] [hudi] nsivabalan commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1186576344


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   May I know why we check for a corrupt block here again? Can't we assume that if something is thrown, the block is indeed corrupt?
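
   A minimal stand-alone sketch of the deferred-check pattern under discussion (hypothetical class and field layout, not the actual Hudi reader): parse the block optimistically and only on failure fall into the branch where the real reader would seek back and scan for the next magic marker.

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;

   public class DeferredCorruptionCheck {

     /** Optimistically parse size, version, content length, and content. */
     static boolean parseBlock(byte[] bytes) {
       try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
         long blockSize = in.readLong();            // 1. total block size
         if (blockSize < 0) {
           throw new IOException("negative block size");
         }
         int version = in.readInt();                // 2. format version
         if (version < 0) {
           throw new IOException("negative version");
         }
         int contentLength = in.readInt();          // 3. content length
         if (contentLength < 0 || contentLength > blockSize) {
           throw new IOException("content length out of range");
         }
         byte[] content = new byte[contentLength];
         in.readFully(content);                     // 4. content bytes
         return true;
       } catch (IOException e) {
         // Only on failure pay the cost of classifying the block: in the real
         // reader this is where it seeks back to blockStartPos and scans for
         // the next magic marker before emitting a corrupt block.
         return false;
       }
     }

     /** Build a minimal well-formed block for demonstration. */
     static byte[] validBlockBytes() {
       try {
         ByteArrayOutputStream bos = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bos);
         byte[] content = {1, 2, 3, 4};
         out.writeLong(8 + 4 + 4 + content.length); // declared block size
         out.writeInt(1);                           // version
         out.writeInt(content.length);              // content length
         out.write(content);
         return bos.toByteArray();
       } catch (IOException e) {
         throw new RuntimeException(e);
       }
     }
   }
   ```

   The point of the re-check in the PR is that a thrown exception only says parsing failed at some offset; re-scanning from the block start confirms whether a partially written block (no magic marker where expected) is the cause.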



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   Yeah, I also feel we can keep lines 212 to 246 outside of the try/catch where we deduce corruptness.



##########
hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java:
##########
@@ -2571,6 +2579,174 @@ public void testDataBlockFormatAppendAndReadWithProjectedSchema(
     }
   }
 
+  @Test
+  public void testCorruptBlocks() throws IOException, URISyntaxException, InterruptedException {
+    List<byte[]> corruptBlocks = createCorruptDataBlocksForTest();
+    int maxTotalBlocks = 20;
+    int runs = 10;
+    Writer writer =
+        HoodieLogFormat.newWriterBuilder().onParentPath(partitionPath).withFileExtension(HoodieLogFile.DELTA_EXTENSION)
+            .withFileId("test-fileid1").overBaseCommit("100").withFs(fs).build();
+    HoodieLogFile logFile = writer.getLogFile();
+    Path logPath = logFile.getPath();
+    fs = FSUtils.getFs(fs.getUri().toString(), fs.getConf());
+
+    for (; runs >= 0; runs--) {
+      int totalBlocksForThisRun = rand.nextInt(maxTotalBlocks) + 1;
+      fs.create(logPath).close();
+      boolean[] verify = new boolean[totalBlocksForThisRun];
+      List<List<IndexedRecord>> goodBlocks = new ArrayList<>();
+      for (int i = 0; i < totalBlocksForThisRun; i++) {
+        boolean pickCorruptBlock = rand.nextBoolean();
+        if (pickCorruptBlock) {
+          byte[] corruptBlock = corruptBlocks.get(rand.nextInt(corruptBlocks.size()));
+          FSDataOutputStream outputStream = fs.append(logPath);
+          outputStream.write(corruptBlock);
+          outputStream.close();
+        } else {
+          addGoodBlockToLogFile(goodBlocks);
+        }
+        verify[i] = pickCorruptBlock;
+      }
+
+      Reader reader = HoodieLogFormat.newReader(fs, logFile, SchemaTestUtil.getSimpleSchema());
+      int curr = 0;
+      int dataBlocksGenerated = 0;
+      int dataBlocksSeen = 0;
+      ByteArrayOutputStream bOStream = new ByteArrayOutputStream();
+      IOUtils.copyBytes(fs.open(logPath), bOStream, 4096);
+      try {
+        while (reader.hasNext()) {
+          HoodieLogBlockType generatedType = verify[curr] ? HoodieLogBlockType.CORRUPT_BLOCK : HoodieLogBlockType.AVRO_DATA_BLOCK;
+          HoodieLogBlock block = reader.next();
+
+          if (block.getBlockType().equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) {
+            HoodieAvroDataBlock blk = (HoodieAvroDataBlock) block;
+            assertEquals(getRecords(blk), goodBlocks.get(dataBlocksGenerated));
+          }
+
+          dataBlocksGenerated += (generatedType.equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) ? 1 : 0;
+          dataBlocksSeen += (block.getBlockType().equals(HoodieLogBlockType.AVRO_DATA_BLOCK)) ? 1 : 0;
+
+          assertEquals(generatedType, block.getBlockType(), Arrays.toString(bOStream.toByteArray()));
+
+          curr++;
+        }
+        reader.close();
+      } catch (Exception e) {
+        System.out.println(Arrays.toString(bOStream.toByteArray()));

Review Comment:
   Let's remove the SOPs (System.out.println calls).



##########
hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java:
##########
@@ -2571,6 +2579,174 @@ public void testDataBlockFormatAppendAndReadWithProjectedSchema(
     }
   }
 
+  @Test
+  public void testCorruptBlocks() throws IOException, URISyntaxException, InterruptedException {

Review Comment:
   Good one. We should also add a test with a lot of rollback blocks. Not required in this patch, but we should try and add something in a follow-up when time permits.
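
   As a rough illustration of the kind of corruption variants such a test generates (hypothetical helper, not the PR's `createCorruptDataBlocksForTest`): truncate a serialized block at several offsets and flip bytes in the length field so each field's read path is exercised.

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   public class CorruptionVariants {

     /** Produce corrupted copies of a serialized block for negative tests. */
     static List<byte[]> variantsOf(byte[] block) {
       List<byte[]> variants = new ArrayList<>();
       // Truncations: cut inside the size field, the header region, and the tail.
       for (int cut : new int[] {4, 10, block.length - 1}) {
         if (cut > 0 && cut < block.length) {
           variants.add(Arrays.copyOf(block, cut));
         }
       }
       // Byte flip in the first byte of the length field so the declared
       // block size no longer matches the actual bytes on disk.
       byte[] badSize = block.clone();
       badSize[0] ^= (byte) 0xFF;
       variants.add(badSize);
       return variants;
     }
   }
   ```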



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1603487041

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b14848a571950bddf85114655ccbf12f56d6d4da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713) 
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1212724688


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   Yeah, I agree, I'm okay with it.





[GitHub] [hudi] danny0405 commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1212732181


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;
 
-    // 3. Read the block type for a log block
-    HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+    try {
+      blockStartPos = inputStream.getPos();
 
-    // 4. Read the header for a log block, if present
+      // 1 Read the total size of the block
+      blockSize = inputStream.readLong();
+
+      // We may have had a crash which could have written this block partially. We are deferring the check for corrupted block so as not to pay the
+      // penalty of doing seeks + read and then re-seeks. More aggressive checks after reading each item as well as a final corrupted check should ensure we
+      // find the corrupted block eventually.
+
+      // 2. Read the version for this log format
+      HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+
+      // 3. Read the block type for a log block
+      HoodieLogBlockType blockType = tryReadBlockType(nextBlockVersion);
+
+      // 4. Read the header for a log block, if present
+      Map<HeaderMetadataType, String> header =
+          nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 5. Read the content length for the content
+      // Fallback to full-block size if no content-length
+      // TODO replace w/ hasContentLength
+      long contentLength =
+          nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      checkArgument(contentLength >= 0, "Content Length should be greater than or equal to 0 " + contentLength);
+
+      // 6. Read the content or skip content based on IO vs Memory trade-off by client
+      long contentPosition = inputStream.getPos();
+      boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
+      Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+
+      // 7. Read footer if any
+      Map<HeaderMetadataType, String> footer =
+          nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+
+      // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
+      // log file in reverse
+      if (nextBlockVersion.hasLogBlockLength()) {
+        long currentPos = inputStream.getPos();
+        long logBlockLength = inputStream.readLong();
+        if (blockSize != (logBlockLength - magicBuffer.length) || currentPos != (blockStartPos + blockSize)) {
+          return createCorruptBlock(blockStartPos);
+        }
+      }
 
-    Map<HeaderMetadataType, String> header =
-        nextBlockVersion.hasHeader() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+      // 9. Read the log block end position in the log file
+      long blockEndPos = inputStream.getPos();
 
-    // 5. Read the content length for the content
-    // Fallback to full-block size if no content-length
-    // TODO replace w/ hasContentLength
-    int contentLength =
-        nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION ? (int) inputStream.readLong() : blockSize;
+      HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
+          new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
 
-    // 6. Read the content or skip content based on IO vs Memory trade-off by client
-    long contentPosition = inputStream.getPos();
-    boolean shouldReadLazily = readBlockLazily && nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION;
-    Option<byte[]> content = HoodieLogBlock.tryReadContent(inputStream, contentLength, shouldReadLazily);
+      switch (Objects.requireNonNull(blockType)) {
+        case AVRO_DATA_BLOCK:
+          if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
+            return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
+          } else {
+            return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+                getTargetReaderSchemaForBlock(), header, footer, keyField);
+          }
 
-    // 7. Read footer if any
-    Map<HeaderMetadataType, String> footer =
-        nextBlockVersion.hasFooter() ? HoodieLogBlock.getLogMetadata(inputStream) : null;
+        case HFILE_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    // 8. Read log block length, if present. This acts as a reverse pointer when traversing a
-    // log file in reverse
-    if (nextBlockVersion.hasLogBlockLength()) {
-      inputStream.readLong();
-    }
+          return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+              Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-    // 9. Read the log block end position in the log file
-    long blockEndPos = inputStream.getPos();
+        case PARQUET_DATA_BLOCK:
+          checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
+              String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
 
-    HoodieLogBlock.HoodieLogBlockContentLocation logBlockContentLoc =
-        new HoodieLogBlock.HoodieLogBlockContentLocation(hadoopConf, logFile, contentPosition, contentLength, blockEndPos);
-
-    switch (Objects.requireNonNull(blockType)) {
-      case AVRO_DATA_BLOCK:
-        if (nextBlockVersion.getVersion() == HoodieLogFormatVersion.DEFAULT_VERSION) {
-          return HoodieAvroDataBlock.getBlock(content.get(), readerSchema, internalSchema);
-        } else {
-          return new HoodieAvroDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
+          return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
               getTargetReaderSchemaForBlock(), header, footer, keyField);
-        }
-
-      case HFILE_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("HFile block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
-
-        return new HoodieHFileDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            Option.ofNullable(readerSchema), header, footer, enableRecordLookups, logFile.getPath());
 
-      case PARQUET_DATA_BLOCK:
-        checkState(nextBlockVersion.getVersion() != HoodieLogFormatVersion.DEFAULT_VERSION,
-            String.format("Parquet block could not be of version (%d)", HoodieLogFormatVersion.DEFAULT_VERSION));
+        case DELETE_BLOCK:
+          return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-        return new HoodieParquetDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc,
-            getTargetReaderSchemaForBlock(), header, footer, keyField);
+        case COMMAND_BLOCK:
+          return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
 
-      case DELETE_BLOCK:
-        return new HoodieDeleteBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
+        case CDC_DATA_BLOCK:
+          return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
 
-      case COMMAND_BLOCK:
-        return new HoodieCommandBlock(content, inputStream, readBlockLazily, Option.of(logBlockContentLoc), header, footer);
-
-      case CDC_DATA_BLOCK:
-        return new HoodieCDCDataBlock(inputStream, content, readBlockLazily, logBlockContentLoc, readerSchema, header, keyField);
-
-      default:
-        throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+        default:
+          throw new HoodieNotSupportedException("Unsupported Block " + blockType);
+      }
+    } catch (IOException | CorruptedLogFileException | IllegalArgumentException e) {
+      // check for corrupt block
+      inputStream.seek(blockStartPos);
+      if (isBlockCorrupted(blockSize)) {

Review Comment:
   Can we remove the catch for `IllegalArgumentException`, and check that the exception is an `IOException` before we do another check with `isBlockCorrupted(blockSize)`?
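
   A sketch of the narrower handling being suggested (hypothetical names; `BlockParser` stands in for the body of `readBlock()`): only an `IOException` triggers the corrupt-block fallback, while runtime exceptions such as `IllegalArgumentException` propagate to the caller unchanged.

   ```java
   import java.io.IOException;

   public class NarrowCatch {

     /** Stand-in for the block-parsing body of readBlock(). */
     interface BlockParser {
       String parse() throws IOException;
     }

     /**
      * Catch only IOException: on IO failure fall back to the corrupt-block
      * path; anything else (e.g. IllegalArgumentException) is not caught
      * here and surfaces to the caller.
      */
     static String readOrClassify(BlockParser parser) {
       try {
         return parser.parse();
       } catch (IOException e) {
         // Real reader: inputStream.seek(blockStartPos) then isBlockCorrupted(blockSize).
         return "CORRUPT_BLOCK";
       }
     }
   }
   ```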





[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1577385460

   @hudi-bot run azure




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1578734947

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * bd883a7ac2f3ede102b44e8696fa638266e1c3bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1585038930

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * bd883a7ac2f3ede102b44e8696fa638266e1c3bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629) 
   * b14848a571950bddf85114655ccbf12f56d6d4da UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610962764

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "24df2fc7cb9b3d7e35543a4d2a72f59f61f441da",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18146",
       "triggerID" : "24df2fc7cb9b3d7e35543a4d2a72f59f61f441da",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * 24df2fc7cb9b3d7e35543a4d2a72f59f61f441da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18146) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1560451763

   @hudi-bot run azure




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1517423738

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1517073967

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1585234947

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b14848a571950bddf85114655ccbf12f56d6d4da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] danny0405 commented on a diff in pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8526:
URL: https://github.com/apache/hudi/pull/8526#discussion_r1203488106


##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java:
##########
@@ -152,98 +153,107 @@ private void addShutDownHook() {
   // TODO : convert content and block length to long by using ByteBuffer, raw byte [] allows
   // for max of Integer size
   private HoodieLogBlock readBlock() throws IOException {
-    int blockSize;
-    long blockStartPos = inputStream.getPos();
-    try {
-      // 1 Read the total size of the block
-      blockSize = (int) inputStream.readLong();
-    } catch (EOFException | CorruptedLogFileException e) {
-      // An exception reading any of the above indicates a corrupt block
-      // Create a corrupt block by finding the next MAGIC marker or EOF
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // We may have had a crash which could have written this block partially
-    // Skip blockSize in the stream and we should either find a sync marker (start of the next
-    // block) or EOF. If we did not find either of it, then this block is a corrupted block.
-    boolean isCorrupted = isBlockCorrupted(blockSize);
-    if (isCorrupted) {
-      return createCorruptBlock(blockStartPos);
-    }
-
-    // 2. Read the version for this log format
-    HoodieLogFormat.LogFormatVersion nextBlockVersion = readVersion();
+    long blockStartPos = 0;
+    long blockSize = 0;

Review Comment:
   Does moving the whole try/catch block into a separate method make sense to you?
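
   For context, the lazy validation this PR describes can be sketched roughly as follows. This is a minimal, self-contained sketch, not Hudi's actual HoodieLogFileReader API; the class and helper names (LazyBlockCheck, isBlockWellFormed, header) are invented for illustration. Each header field is validated immediately after it is read, and an EOF mid-header is treated as a partially written (i.e. corrupt) block, so no extra seek to the next MAGIC marker is needed:

   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.DataInputStream;
   import java.io.DataOutputStream;
   import java.io.IOException;

   public class LazyBlockCheck {
     static final int MAX_VERSION = 1;      // illustrative bound, not Hudi's constant
     static final int NUM_BLOCK_TYPES = 8;  // illustrative bound, not Hudi's constant

     // Validate each header field right after reading it; an EOF or an
     // implausible value means the block was partially written -> corrupt.
     static boolean isBlockWellFormed(byte[] blockBytes, long bytesLeftInFile) {
       try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(blockBytes))) {
         long blockSize = in.readLong();
         if (blockSize <= 0 || blockSize > bytesLeftInFile) {
           return false; // size field is implausible
         }
         int version = in.readInt();
         if (version < 0 || version > MAX_VERSION) {
           return false; // unknown log format version
         }
         int blockType = in.readInt();
         return blockType >= 0 && blockType < NUM_BLOCK_TYPES; // valid type ordinal
       } catch (IOException e) {
         return false; // hit EOF mid-header: partial write
       }
     }

     // Helper to build a synthetic header for the demo.
     static byte[] header(long size, int version, int type) {
       try {
         ByteArrayOutputStream bos = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bos);
         out.writeLong(size);
         out.writeInt(version);
         out.writeInt(type);
         return bos.toByteArray();
       } catch (IOException e) {
         throw new RuntimeException(e);
       }
     }

     public static void main(String[] args) {
       System.out.println(isBlockWellFormed(header(64, 1, 2), 1024));  // well-formed header
       System.out.println(isBlockWellFormed(header(64, 99, 2), 1024)); // bad version -> corrupt
     }
   }
   ```

   Compared to the eager isBlockCorrupted check, validating fields as they arrive avoids the forward seek to the next block boundary and the seek back on the happy path, which is the latency win the PR description claims.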





[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610745082

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1610759115

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f2f4ddd192879cdc6a9c91aa2b2c5c6813ab490",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16511",
       "triggerID" : "1560451763",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "673f502686ebf316ab9f6ba802fd318e5c5bd613",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1577385460",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17629",
       "triggerID" : "bd883a7ac2f3ede102b44e8696fa638266e1c3bc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17713",
       "triggerID" : "b14848a571950bddf85114655ccbf12f56d6d4da",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061",
       "triggerID" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "09a1ea8789d509b6200018e60ec6911bea50bca7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145",
       "triggerID" : "1610733461",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "24df2fc7cb9b3d7e35543a4d2a72f59f61f441da",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18146",
       "triggerID" : "24df2fc7cb9b3d7e35543a4d2a72f59f61f441da",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 673f502686ebf316ab9f6ba802fd318e5c5bd613 UNKNOWN
   * 09a1ea8789d509b6200018e60ec6911bea50bca7 UNKNOWN
   * b70c66bc68aa1fb2adf1f5486e4b3e0c58aef8fa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18061) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18145) 
   * 24df2fc7cb9b3d7e35543a4d2a72f59f61f441da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18146) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] prashantwason commented on pull request #8526: [HUDI-6116] Optimize log block reading by removing seeks to check corrupted blocks.

Posted by "prashantwason (via GitHub)" <gi...@apache.org>.
prashantwason commented on PR #8526:
URL: https://github.com/apache/hudi/pull/8526#issuecomment-1572480457

   @danny0405 PTAL again.

