Posted to commits@hudi.apache.org by "Manoj Govindassamy (Jira)" <ji...@apache.org> on 2022/02/01 20:20:00 UTC

[jira] [Commented] (HUDI-3301) MergedLogRecordReader inline reading should be stateless and thread safe

    [ https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485455#comment-17485455 ] 

Manoj Govindassamy commented on HUDI-3301:
------------------------------------------

Perf-related comments from [https://github.com/apache/hudi/pull/4352]:

Siva: I see we call HoodieBackedTableMetadata::getPartitionLatestMergedFileSlices and getPartitionFileSlices in a few places, and the calls could be repeated as well. Can we cache the return value based on the latest instant? If the latest instant has not changed, then latestFileSlice is not going to change, right? So we might as well use the cached copy if we have one.

 

Manoj: HoodieBackedTableMetadata::getRecordsByKeys() -> getPartitionFileSliceToKeysMapping() -> getPartitionLatestMergedFileSlices(). But we need to address HUDI-3301 before this. 
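
The caching idea Siva raises above could look roughly like the sketch below. It is an illustration only: InstantKeyedCache, its get() signature, and the use of a partition name plus an instant timestamp string as the cache key are assumptions made for this example, not existing Hudi code. The loader stands in for whatever call computes the latest file slices.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper (not a Hudi class): keeps one cached value per partition
// and reuses it for as long as the latest instant is unchanged.
final class InstantKeyedCache<V> {

    // Immutable (instant, value) pair, so both are always replaced together.
    private static final class Entry<V> {
        final String instant;
        final V value;

        Entry(String instant, V value) {
            this.instant = instant;
            this.value = value;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> byPartition = new ConcurrentHashMap<>();

    // Returns the cached value if it was computed for latestInstant;
    // otherwise runs the loader and caches its result under that instant.
    V get(String partition, String latestInstant, Supplier<V> loader) {
        Entry<V> entry = byPartition.compute(partition, (p, old) -> {
            if (old != null && old.instant.equals(latestInstant)) {
                return old; // instant unchanged: reuse the cached copy
            }
            return new Entry<V>(latestInstant, loader.get());
        });
        return entry.value;
    }
}
```

Because compute() is atomic per key, two concurrent callers for the same partition and instant would trigger only one load. Whether the cached slices themselves can be shared safely still depends on HUDI-3301 being addressed first, as noted above.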

> MergedLogRecordReader inline reading should be stateless and thread safe
> ------------------------------------------------------------------------
>
>                 Key: HUDI-3301
>                 URL: https://issues.apache.org/jira/browse/HUDI-3301
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: metadata
>            Reporter: Manoj Govindassamy
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today alters instance member fields and is not thread safe.
>  
> When inline reading is enabled, HoodieMetadataMergedLogRecordReader doesn't do a full read of the log and base files and doesn't fill in the ExternalSpillableMap records cache. Each getRecordsByKeys() call therefore re-reads the log and base files by design. The issue is that this reading alters the instance members, even though the filled-in records are relevant only to that request. Any concurrent getRecordsByKeys() call also modifies the same member variables, leading to an NPE.
>  
> To avoid this, a temporary fix that makes getRecordsByKeys() a synchronized method has been pushed to master. But this fix doesn't cover all use cases. We need to make the whole class stateless and thread safe for inline reading.
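
As a rough illustration of the stateless approach the issue description asks for, the sketch below keeps all per-request state in local variables, so concurrent calls never touch shared mutable fields. StatelessKeyLookup and its in-memory map are hypothetical stand-ins for the reader and for the log and base files; this is not how HoodieMetadataMergedLogRecordReader is implemented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an inline (non-full-scan) reader; not a Hudi class.
final class StatelessKeyLookup {

    // Read-only after construction; stands in for the files that are re-read per request.
    private final Map<String, String> storage;

    StatelessKeyLookup(Map<String, String> storage) {
        this.storage = Collections.unmodifiableMap(new HashMap<>(storage));
    }

    // Every call builds and returns its own result map. No instance field is
    // written, so the method needs no synchronization to be thread safe.
    Map<String, String> getRecordsByKeys(List<String> keys) {
        Map<String, String> result = new HashMap<>();
        for (String key : keys) {
            String value = storage.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }
}
```

Compared with a synchronized method, this design lets lookups run in parallel, and a result returned to one caller is never altered by another caller's request.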



--
This message was sent by Atlassian Jira
(v8.20.1#820001)