Posted to commits@hudi.apache.org by "Yue Zhang (Jira)" <ji...@apache.org> on 2022/02/17 03:27:00 UTC

[jira] [Comment Edited] (HUDI-3301) MergedLogRecordReader inline reading should be stateless and thread safe

    [ https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493623#comment-17493623 ] 

Yue Zhang edited comment on HUDI-3301 at 2/17/22, 3:26 AM:
-----------------------------------------------------------

Hi [~manojg] and [~guoyihua]
Just a quick thought on this problem: for now we cache the logScanner, which holds a shared `records` instance, at the <partition,slice> level, whether or not `enable.full.scan.log.files` is enabled.

So if we disable full scan and look up only the entries of interest, we will hit the concurrency issue.
How about making a small change here:
1. If full scan is enabled ==> we cache the scanner at the partition level
2. If full scan is disabled ==> we create a new scanner each time, assuming that the data set obtained by the partial read is not large. This may also avoid the concurrency issue at the root.
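
To make the two cases above concrete, here is a minimal sketch of the proposed policy. Everything in it is hypothetical: `ScannerProvider`, `scannerFor`, and the factory function are illustrative names, not Hudi classes or APIs, and the real scanner construction is replaced by a caller-supplied function.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch only; none of these names exist in Hudi.
final class ScannerProvider<S> {
    private final boolean fullScanEnabled;
    // Assumed factory that builds a scanner for a given partition name.
    private final Function<String, S> scannerFactory;
    // Used only when full scan is enabled.
    private final Map<String, S> cache = new ConcurrentHashMap<>();

    ScannerProvider(boolean fullScanEnabled, Function<String, S> scannerFactory) {
        this.fullScanEnabled = fullScanEnabled;
        this.scannerFactory = scannerFactory;
    }

    S scannerFor(String partition) {
        if (fullScanEnabled) {
            // Case 1: one cached scanner per partition, created at most once.
            return cache.computeIfAbsent(partition, scannerFactory);
        }
        // Case 2: a fresh scanner per lookup, so no state is shared between callers.
        return scannerFactory.apply(partition);
    }
}
```

With full scan disabled, each caller gets its own scanner, so concurrent lookups cannot overwrite each other's `records`; the trade-off is the cost of building a scanner on every call.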


was (Author: zhangyue19921010):
Hi [~manojg] and [~guoyihua]
Just a quick thought on this problem: for now we cache the logScanner, which holds a shared `records` instance, at the partition level, whether or not `enable.full.scan.log.files` is enabled.

So if we disable full scan and look up only the entries of interest, we will hit the concurrency issue.
How about making a small change here:
1. If full scan is enabled ==> we cache the scanner at the partition level
2. If full scan is disabled ==> we cache the scanner at the slice level, assuming that the data set obtained by the partial read is not large. This may also avoid the concurrency issue at the root.

> MergedLogRecordReader inline reading should be stateless and thread safe
> ------------------------------------------------------------------------
>
>                 Key: HUDI-3301
>                 URL: https://issues.apache.org/jira/browse/HUDI-3301
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: Manoj Govindassamy
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: HUDI-bug
>             Fix For: 0.11.0
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today alters instance member fields and is not thread safe.
>  
> When inline reading is enabled, HoodieMetadataMergedLogRecordReader doesn't do a full read of the log and base files and doesn't fill in the ExternalSpillableMap records cache. Each getRecordsByKeys() call will therefore re-read the log and base files by design. But the issue here is that this reading alters the instance members, and the filled-in records are relevant only for that request. Any concurrent getRecordsByKeys() call also modifies the same member variables, leading to an NPE.
>  
> To avoid this, a temporary fix of making getRecordsByKeys() a synchronized method has been pushed to master. But this fix doesn't solve all use cases. We need to make the whole class stateless and thread safe for inline reading.
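
As a rough illustration of the "stateless" direction described in the issue, the sketch below keeps all per-request state in a local map that is returned to the caller, rather than writing into a member field. It is not the real HoodieMetadataMergedLogRecordReader: `StatelessLookupReader` and its in-memory `logStore` are hypothetical stand-ins for the log and base files, and only the method name getRecordsByKeys() is taken from the issue text.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in; the real reader scans log and base files instead.
final class StatelessLookupReader {
    // Read-only stand-in for the data on storage; never modified after construction.
    private final Map<String, String> logStore;

    StatelessLookupReader(Map<String, String> logStore) {
        this.logStore = Collections.unmodifiableMap(new HashMap<>(logStore));
    }

    // Each call builds and returns its own result map, so concurrent calls
    // share no mutable state and need no synchronization.
    Map<String, Optional<String>> getRecordsByKeys(List<String> keys) {
        Map<String, Optional<String>> result = new HashMap<>();
        for (String key : keys) {
            result.put(key, Optional.ofNullable(logStore.get(key)));
        }
        return result;
    }
}
```

Compared with marking the method synchronized, this shape lets lookups proceed in parallel, since nothing written during one request is visible to another.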



--
This message was sent by Atlassian Jira
(v8.20.1#820001)