Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/10/04 18:06:00 UTC

[jira] [Updated] (HUDI-1604) Fix archival max log size and potentially a bug in archival

     [ https://issues.apache.org/jira/browse/HUDI-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1604:
--------------------------------------
    Labels: sev:high sev:triage user-support-issues  (was: sev:triage user-support-issues)

> Fix archival max log size and potentially a bug in archival
> -----------------------------------------------------------
>
>                 Key: HUDI-1604
>                 URL: https://issues.apache.org/jira/browse/HUDI-1604
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Cleaner
>    Affects Versions: 0.7.0
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:high, sev:triage, user-support-issues
>
> Gist of the issue from Udit
>  
> I took a deeper look at this. For you this seems to be happening in the archival code path:
>  
> {{ at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
>  at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
>  at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
>  at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)}}
> {{HoodieTimelineArchiveLog}} needs to write log files containing commit records, similar to how log files are written for MOR tables. However, I notice a couple of issues in this code:
>  * The default maximum log block size of 256 MB, defined [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51], is not used by this class; it is applied only when writing MOR log blocks. As a result, nothing bounds the size of the block it writes, which can overflow {{ByteArrayOutputStream}}, whose maximum size is {{Integer.MAX_VALUE - 8}}. That appears to be what is happening in this scenario: an integer overflow along that code path inside {{ByteArrayOutputStream}}. So we need to apply the maximum block size here as well.
>  * In addition, I see a bug in the code [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302]: even after flushing the records to a file once the batch size of 10 (default) is reached, it does not clear the list and keeps accumulating records. This is logically wrong as well (records are duplicated across blocks), apart from the fact that it keeps increasing the size of the log file blocks being written.
> Reference: https://github.com/apache/hudi/issues/2408#issuecomment-758320870
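
The two points above (cap the block size, and clear the buffer after each flush) can be illustrated with a minimal sketch. This is not Hudi's actual code: the class {{BoundedBatchWriter}}, its fields, and its methods are hypothetical names chosen for illustration, and the "write" is replaced by recording how many records each block would have contained.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, not Hudi's HoodieTimelineArchiveLog.
 * Buffers records and flushes a block when either the record-count
 * limit or the byte-size limit is reached, clearing the buffer each time.
 */
class BoundedBatchWriter {
    private final int maxRecordsPerBlock;   // assumed analogue of the archival batch size
    private final long maxBytesPerBlock;    // assumed analogue of the max log block size
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;

    // Record count of every block flushed so far; kept only to make the sketch observable.
    private final List<Integer> flushedBlockSizes = new ArrayList<>();

    BoundedBatchWriter(int maxRecordsPerBlock, long maxBytesPerBlock) {
        if (maxRecordsPerBlock <= 0 || maxBytesPerBlock <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxRecordsPerBlock = maxRecordsPerBlock;
        this.maxBytesPerBlock = maxBytesPerBlock;
    }

    void add(byte[] record) {
        // Flush first if this record would push the current block past the byte cap.
        if (!pending.isEmpty() && pendingBytes + record.length > maxBytesPerBlock) {
            flush();
        }
        pending.add(record);
        pendingBytes += record.length;
        // Flush once the record-count limit is reached.
        if (pending.size() >= maxRecordsPerBlock) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Stand-in for writing one log block to storage.
        flushedBlockSizes.add(pending.size());
        // The step the issue reports as missing: reset the buffer so the
        // next block does not repeat records that were already written.
        pending.clear();
        pendingBytes = 0;
    }

    List<Integer> flushedBlockSizes() {
        return flushedBlockSizes;
    }
}
```

With a record limit of 10 and a generous byte cap, adding 25 small records and then calling {{flush()}} yields blocks of 10, 10, and 5 records. Without the {{pending.clear()}} call, the same input would produce blocks of 10, 20, and 25 records, which is the duplication and unbounded growth described in the issue.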


