Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/11/17 13:33:00 UTC

[jira] [Commented] (HUDI-2750) Improve the incremental data files metadata more efficiently for streaming source

    [ https://issues.apache.org/jira/browse/HUDI-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445162#comment-17445162 ] 

Vinoth Chandar commented on HUDI-2750:
--------------------------------------

+1 on this. Dumping my thoughts here. When the start commit is far in the past, 2/3 can be more performant, since they already filter out the files that have been cleaned, etc. Reading the entire archived timeline log can be time-consuming.

I think we can index the timeline as well and support efficient range retrievals. But I am wondering why you think 2/3 are only suitable for full-history reads. Is it because the log files don't have the delta commit instant in their names today? With those instants in the file names (at least on object storage), we can figure out which files changed within any given interval, right?
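
To make the file-name idea concrete, here is a minimal sketch of filtering a directory listing by the instant time embedded in each name. It assumes base files are named <fileId>_<writeToken>_<instantTime>.parquet and that instant times are fixed-width timestamp strings, so plain string comparison orders them correctly. The function name files_in_interval and the sample names are hypothetical, and this is not Hudi's actual implementation. Log files are skipped here because, as noted above, their names do not carry the delta commit instant.

```python
import re
from typing import Iterable, List

# Assumed base-file layout: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE_PATTERN = re.compile(
    r"^(?P<file_id>.+)_(?P<write_token>[^_]+)_(?P<instant>\d+)\.parquet$"
)


def files_in_interval(
    file_names: Iterable[str], start_instant: str, end_instant: str
) -> List[str]:
    """Return the names whose embedded instant lies in (start_instant, end_instant]."""
    selected = []
    for name in file_names:
        match = BASE_FILE_PATTERN.match(name)
        if match is None:
            # Not a base file under the assumed layout (e.g. a log file); skip it.
            continue
        instant = match.group("instant")
        # Fixed-width timestamps compare correctly as strings.
        if start_instant < instant <= end_instant:
            selected.append(name)
    return selected
```

The start bound is exclusive so that a reader resuming from its last consumed instant does not see that instant's files again.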

Is this the gap?

> Improve the incremental data files metadata more efficiently for streaming source
> ---------------------------------------------------------------------------------
>
>                 Key: HUDI-2750
>                 URL: https://issues.apache.org/jira/browse/HUDI-2750
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Common Core
>            Reporter: Danny Chen
>            Priority: Major
>             Fix For: 0.11.0
>
>
> There are currently 3 ways to fetch the incremental data files for streaming reads:
> 1. Read the incremental commit metadata and resolve the data files to construct the incremental filesystem view
> 2. Scan the filesystem directly and filter the data files by start commit time, if consumption starts from the 'earliest' offset
> 3. For 2, there is a more efficient way: look up the metadata table if it is enabled
> However, these 3 ways are far from enough for production:
> for 1: there is a bottleneck when the start commit time is far in the past and the instants may have been archived; loading those metadata files takes too much time (more than 30 minutes in our production), which is unacceptable.
> for 2&3: they are only suitable for cases that read the full history plus the incremental data set.
> We should propose a way to look up the incremental data files for arbitrary instant time intervals, so that the filesystem view can be constructed efficiently.
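
One way to picture the lookup over arbitrary instant intervals is a sorted index of instants that supports range retrieval by binary search. The sketch below is only an illustration under stated assumptions: the class InstantIndex, its method files_between, and the in-memory mapping from instant to written files are all hypothetical and do not correspond to any existing Hudi API. It also ignores files later removed by cleaning, which a real implementation would have to filter out.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Set


class InstantIndex:
    """Hypothetical in-memory index: instant time -> files written at that instant."""

    def __init__(self, files_by_instant: Dict[str, List[str]]) -> None:
        self._files_by_instant = files_by_instant
        # Fixed-width timestamp strings sort chronologically.
        self._instants = sorted(files_by_instant)

    def files_between(self, start_instant: str, end_instant: str) -> Set[str]:
        """Return files written at instants in [start_instant, end_instant]."""
        lo = bisect_left(self._instants, start_instant)
        hi = bisect_right(self._instants, end_instant)
        result: Set[str] = set()
        # Only instants inside the requested range are visited.
        for instant in self._instants[lo:hi]:
            result.update(self._files_by_instant[instant])
        return result
```

Locating the range costs O(log n) in the number of indexed instants, after which only the instants inside the interval are read, rather than the whole archived timeline.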



--
This message was sent by Atlassian Jira
(v8.20.1#820001)