Posted to commits@hudi.apache.org by "Prashant Wason (Jira)" <ji...@apache.org> on 2023/04/20 21:33:00 UTC

[jira] [Created] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

Prashant Wason created HUDI-6116:
------------------------------------

             Summary: Optimize log block reading by removing seeks to check corrupted blocks
                 Key: HUDI-6116
                 URL: https://issues.apache.org/jira/browse/HUDI-6116
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


The code currently performs an eager isCorruptedCheck while reading log blocks, which requires a seek followed by a read. This invalidates the internal buffers of the open file stream on the log file and triggers a call to the DataNode to start a new blockReader.

The cost of the seek + read becomes apparent on cross-datacenter reads, or wherever the latency to the file is high. In such cases, a single RPC costs about 120ms (west coast to east coast) plus the cost of the RPC itself, so this seek is bad for performance.
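To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each eager check adds exactly one extra round trip per log block; the class and method names are hypothetical and are not part of Hudi.

```java
public class SeekCostEstimate {
    /**
     * Extra latency in milliseconds added by eager corruption checks,
     * ASSUMING one additional round trip per log block (an illustrative
     * simplification, not a measured property of Hudi).
     */
    static long extraLatencyMillis(int logBlocks, long roundTripMillis) {
        return (long) logBlocks * roundTripMillis;
    }

    public static void main(String[] args) {
        long roundTripMillis = 120; // cross-datacenter figure quoted in this issue
        for (int blocks : new int[] {1, 10, 50}) {
            System.out.println(blocks + " log blocks: "
                + extraLatencyMillis(blocks, roundTripMillis) + " ms of extra latency");
        }
    }
}
```

Under this assumption the overhead grows linearly with the number of log blocks, for example 10 blocks at 120ms each add 1200ms.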

Delaying the corruption check also brings benefits in low-latency environments, where we see read times drop from 5-8 sec to between 3 sec and under 500ms for moderately sized files of 250MB.
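The difference between an eager and a delayed check can be sketched with a toy model. Everything below is an assumption made for illustration: the block layout ([length][payload][length]), the CountingInput class, and both reader methods are hypothetical and do not reflect Hudi's actual log format or reader code. The sketch only shows why verifying a trailer in stream order avoids the seeks that a look-ahead check needs.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class LazyCorruptCheckSketch {
    /** Hypothetical in-memory stand-in for a remote file stream; counts seeks. */
    static final class CountingInput {
        private final ByteBuffer buf;
        int seeks = 0;

        CountingInput(byte[] data) { this.buf = ByteBuffer.wrap(data); }

        void seek(int pos) { seeks++; buf.position(pos); }
        int position() { return buf.position(); }
        int readInt() { return buf.getInt(); }
        byte[] readBytes(int n) { byte[] b = new byte[n]; buf.get(b); return b; }
        boolean hasRemaining() { return buf.hasRemaining(); }
    }

    /** Writes each block as [length][payload][length] (toy layout, not Hudi's). */
    static byte[] writeBlocks(List<byte[]> payloads) {
        int size = 0;
        for (byte[] p : payloads) size += 8 + p.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] p : payloads) {
            out.putInt(p.length);
            out.put(p);
            out.putInt(p.length);
        }
        return out.array();
    }

    /** Eager: seek ahead to verify the trailer, then seek back to read the payload. */
    static List<byte[]> readEager(CountingInput in) {
        List<byte[]> blocks = new ArrayList<>();
        while (in.hasRemaining()) {
            int len = in.readInt();
            int payloadStart = in.position();
            in.seek(payloadStart + len);          // seek 1: jump to the trailer
            if (in.readInt() != len) throw new IllegalStateException("corrupt block");
            in.seek(payloadStart);                // seek 2: jump back to the payload
            blocks.add(in.readBytes(len));
            in.readInt();                         // skip the trailer, already verified
        }
        return blocks;
    }

    /** Delayed: read strictly in order and verify the trailer when it is reached. */
    static List<byte[]> readDeferred(CountingInput in) {
        List<byte[]> blocks = new ArrayList<>();
        while (in.hasRemaining()) {
            int len = in.readInt();
            byte[] payload = in.readBytes(len);
            if (in.readInt() != len) throw new IllegalStateException("corrupt block");
            blocks.add(payload);
        }
        return blocks;
    }
}
```

In this toy model the eager reader issues two seeks per block and the delayed reader issues none, while both return the same payloads. A real reader would also need bounds handling for a corrupted length field, which the sketch omits.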

NOTE: The more log blocks there are to read, the greater the performance improvement.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)