Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2021/11/07 16:56:00 UTC

[jira] [Comment Edited] (HUDI-2488) Support bootstrapping a single or more partitions in metadata table while regular writers and table services are in progress

    [ https://issues.apache.org/jira/browse/HUDI-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440043#comment-17440043 ] 

Vinoth Chandar edited comment on HUDI-2488 at 11/7/21, 4:55 PM:
----------------------------------------------------------------

I skimmed through the approach above. There are aspects I would like to avoid, especially the index building process trying to catch up to the writers; this is generally a hard model to pull off. (Also, let's please not use "bootstrap" anymore; it's very confusing, given we have something else called bootstrap.)

Here's my strawman proposal. 
 * We introduce a new action *indexing* which will denote the index build process.
 * From an external process, we issue a _CREATE INDEX ..._ or similar statement to add a new index to an existing table. This will add a `<instant_time>.indexing.requested` to the timeline, which contains the indexing plan. 
 ** From here on, the index building process will continue to build the index up to instant time *t*, where t is the latest completed instant time on the timeline without any "holes", i.e., pending async operations prior to it.
 ** The indexing process will write these out as base files within the corresponding metadata partition. 
 ** The metadata partition cannot be used while there are any pending indexing actions against it.
 * Any inflight writers (i.e., with instant time *t' > t*) will check for any new indexing requests on the timeline prior to preparing to commit (see the sketch after this list).
 ** Such writers will additionally add log entries corresponding to each such indexing request into the metadata partition.
 ** There is always a TOCTOU issue here: the inflight writer may not see an indexing request that was just added, and may proceed to commit without it. We correct for this during indexing action completion. In the common case this does not happen, and the design has liveness.
 * When the indexing process is about to complete, it will check all completed commit actions to ensure each of them added entries per its indexing plan, and otherwise simply abort.

 ** The corner case here is that the indexing check may not factor in an inflight writer just about to commit. But given that indexing takes some finite amount of time to go from requested to completed (or we can add a configurable artificial delay here, say 60 seconds), an inflight writer that is just about to commit concurrently has a very high chance of seeing the indexing plan and aborting itself.
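
To make the writer-side check concrete, here is a minimal sketch in Java. The types and method names (Timeline, pendingIndexingPlans, appendIndexLogEntries) are hypothetical placeholders for illustration, not the actual Hudi APIs:
{code:java}
import java.util.List;

// Hypothetical sketch only; these names do not mirror the real Hudi APIs.
class IndexingAwareWriter {

  interface Timeline {
    // Indexing plans still in REQUESTED or INFLIGHT state.
    List<IndexingPlan> pendingIndexingPlans();
  }

  static class IndexingPlan {
    String metadataPartition;  // e.g. "files"
    String requestedInstant;   // instant time of <instant_time>.indexing.requested
  }

  void preCommit(Timeline timeline, String commitInstant) {
    // Re-scan the timeline just before committing, so indexing requests
    // filed after this write (t' > t) started are still picked up.
    for (IndexingPlan plan : timeline.pendingIndexingPlans()) {
      appendIndexLogEntries(plan.metadataPartition, commitInstant);
    }
    // A TOCTOU window remains: a request filed between this scan and the
    // commit is missed; the indexing process detects this at completion
    // time and aborts.
  }

  void appendIndexLogEntries(String metadataPartition, String commitInstant) {
    // Placeholder: write log entries for this commit into the metadata
    // partition that is being indexed.
  }
}
{code}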

We could just introduce a lock for adding events to the timeline, and these races would vanish completely, while still providing great scalability and asynchrony for these processes. The above is an attempt at a practical solution without any extra locking.

We can also add a transition time to the timeline events; the timeline itself would then tell us which actions conflicted or ran concurrently with which other actions. E.g., by reading timeline events and their start and transition times, we would know that the rare race above actually occurred, and could either deem the metadata partition unusable (since it is missing some commit's indexing update) or have the next writer fix it forward. Fundamentally, we need to ensure there is some lightweight locking that can always be done, or move to a model of allowing inconsistency and resolving it using CRDT-like approaches.
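
As a rough illustration of the transition-time idea (a hypothetical type, for illustration only), two timeline actions can be treated as conflicting exactly when their [start, transition] intervals overlap:
{code:java}
// Hypothetical sketch: with start and transition times recorded, the
// timeline itself can tell whether two actions overlapped in time.
class TimelineEvent {
  long startTime;       // when the action was requested
  long transitionTime;  // when it transitioned to its completed state

  // Two actions are concurrent iff their [start, transition] intervals overlap.
  boolean concurrentWith(TimelineEvent other) {
    return this.startTime <= other.transitionTime
        && other.startTime <= this.transitionTime;
  }
}
{code}
A commit whose interval overlaps the indexing action's completion check is precisely the rare race above; detecting it lets us mark the metadata partition unusable or have the next writer repair it.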



> Support bootstrapping a single or more partitions in metadata table while regular writers and table services are in progress
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2488
>                 URL: https://issues.apache.org/jira/browse/HUDI-2488
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: sivabalan narayanan
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> For now, we have only the FILES partition in the metadata table, and our suggestion is to stop all processes and then restart them one by one with the metadata table enabled. The first process to start back up will invoke bootstrapping of the metadata table. 
>  
> But this may not work out well as we add more and more partitions to the metadata table. 
> We need to support bootstrapping one or more partitions in the metadata table while regular writers and table services are in progress. 
>  
>  
> Penning down my thoughts/idea. 
> I tried to find a way to get this done w/o adding an additional lock, but could not crack that. So, here is one way to support async bootstrap. 
>  
> Introduce a file called "available_partitions", stored as a special file under the metadata table. This file will contain the list of partitions that are available to apply updates from the data table. I.e., when we do synchronous updates from the data table to a metadata table with N partitions, we need to know which partitions are fully bootstrapped and ready to take updates; this file maintains that info. We can debate how to maintain this info (table props, a separate file, etc.), but for now let's say this file is the source of truth. The idea here is that any async bootstrap process will update this file with the newly bootstrapped partition once its bootstrap is fully complete, so that all other writers know which partitions to update. 
> And we need to introduce a metadata_lock as well. 
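>
> For concreteness, here is a minimal sketch of such an available_partitions helper (hypothetical file name, location, and format; not a committed design), assuming a newline-delimited list that is rewritten via an atomic rename:
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardCopyOption;
> import java.util.HashSet;
> import java.util.Set;
>
> // Hypothetical helper around a newline-delimited available_partitions file.
> class AvailablePartitions {
>   private final Path file; // e.g. <table>/.hoodie/metadata/available_partitions
>
>   AvailablePartitions(Path file) { this.file = file; }
>
>   Set<String> read() throws IOException {
>     return Files.exists(file) ? new HashSet<>(Files.readAllLines(file)) : new HashSet<>();
>   }
>
>   // Caller must hold metadata_lock. Write to a temp file and rename, so a
>   // crash never leaves a partially written list behind. ATOMIC_MOVE assumes
>   // POSIX/HDFS-style rename semantics.
>   void add(String partition) throws IOException {
>     Set<String> partitions = read();
>     partitions.add(partition);
>     Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
>     Files.write(tmp, partitions);
>     Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
>   }
> }
> {code}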
>  
> here is how writers and async bootstrap will pan out.
>  
> Regular writer or any async table service(compaction, etc): 
>     when changes are required to be applied to metadata table: // fyi. as of today this already happens within data table lock. 
>            Take metadata_lock
>                   read contents of available_partitions. 
>                   prep records and apply updates to metadata table.
>            release lock.
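>
> A minimal sketch of this writer-side flow (hypothetical names; in a real multi-process deployment the metadata_lock would come from a distributed lock provider, with an in-process ReentrantLock standing in here for brevity):
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.locks.ReentrantLock;
>
> // Hypothetical sketch of the regular writer / table service flow.
> class MetadataTableWriter {
>   private final ReentrantLock metadataLock = new ReentrantLock();
>   private final AvailablePartitions availablePartitions; // helper from the sketch above
>
>   MetadataTableWriter(AvailablePartitions availablePartitions) {
>     this.availablePartitions = availablePartitions;
>   }
>
>   void applyCommitToMetadataTable(String commitInstant) throws IOException {
>     metadataLock.lock();
>     try {
>       // Only partitions listed in available_partitions take direct updates;
>       // partitions still bootstrapping are caught up by the bootstrap process.
>       for (String partition : availablePartitions.read()) {
>         applyUpdates(partition, commitInstant);
>       }
>     } finally {
>       metadataLock.unlock();
>     }
>   }
>
>   void applyUpdates(String partition, String commitInstant) {
>     // Placeholder: prep records and apply updates for this partition.
>   }
> }
> {code}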
>  
> Async bootstrap process:
>      Start bootstrapping of a given partition (eg files) in metadata table.
>      Do it in a loop: the first iteration of bootstrap could take 10 mins, say; then catch up the new commits that happened during those 10 mins, which could take 1 min, for instance; and then go for another loop. 
>      Whenever the total bootstrap time for a round is ~1 min or less, the next round can be the final iteration.
>            During the final iteration, take the metadata_lock. // this lock should not be held for more than few secs. 
>                      apply any new commits that happened while last iteration of bootstrap was happening. 
>                      update "available_partitions" file with this partition info that got fully bootstrapped. 
>           release lock.
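>
> A sketch of the catch-up loop under these assumptions (continuing the hypothetical names above; applyCommitsSince is a placeholder that replays data-table commits after the given instant and returns the latest instant applied):
> {code:java}
> // Hypothetical sketch; lives alongside the metadataLock and
> // availablePartitions fields from the writer sketch above.
> void bootstrapPartition(String partition) throws IOException {
>   String lastSynced = null;
>   while (true) {
>     long start = System.currentTimeMillis();
>     lastSynced = applyCommitsSince(partition, lastSynced);
>     long elapsedMs = System.currentTimeMillis() - start;
>     // Each round only replays commits that arrived during the previous
>     // round, so rounds keep shrinking. Once a round fits in ~1 min, take
>     // the lock for one final, short catch-up round.
>     if (elapsedMs <= 60_000) {
>       metadataLock.lock();
>       try {
>         applyCommitsSince(partition, lastSynced);
>         availablePartitions.add(partition); // now safe for direct updates
>       } finally {
>         metadataLock.unlock();
>       }
>       return;
>     }
>   }
> }
> {code}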
>  
> metadata_lock: ensures that when async bootstrap is in its final stage, we do not miss any commit that is nearing completion. So we ought to take a lock to ensure we don't miss out on any commits: either async bootstrap will apply the update, or the actual writer itself will update directly once bootstrap is fully complete. 
>  
> Regarding "available_partitions": 
> I was looking for a way to know which partitions are fully ready to take direct updates from regular writers, and hence chose this approach. We could also create a temp partition (files_temp or something) while bootstrap is in progress and then rename it to the original partition name once bootstrap is fully complete. If we can reliably rename these partitions (i.e., once the files partition is visible, it is fully ready to take direct updates), we can take this route as well. 
> Here is how it might pan out w/ folder/partition renaming.
>  
> Regular writer or any async table service(compaction, etc): 
>     when changes are required to be applied to metadata table: // fyi. as of today this already happens within data table lock. 
>            Take metadata_lock
>                   list partitions in metadata table. ignore temp partitions.  
>                   prep records and apply updates to metadata table.
>            release lock.
>  
> Async bootstrap process:
>      Start bootstrapping a given partition (e.g. files) in the metadata table; create a temp folder for the partition that's getting bootstrapped (e.g. files_temp).
>      Do it in a loop: the first iteration of bootstrap could take 10 mins, say; then catch up the new commits that happened during those 10 mins, which could take 1 min, for instance; and then go for another loop. 
>      Whenever the total bootstrap time for a round is ~1 min or less, the next round can be the final iteration.
>            During the final iteration, take the metadata_lock. // this lock should not be held for more than few secs. 
>                      apply any new commits that happened while last iteration of bootstrap was happening. 
>                      rename files_temp to files. 
>           release lock.
> Note: we just need to ensure that the folder rename is atomic: on crash, the new folder is either fully intact or not present at all. The contents of the old folder do not matter. 
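>
> The final locked round of the rename variant could look like this sketch (same hypothetical names as above; Files.move with ATOMIC_MOVE assumes POSIX/HDFS-style atomic directory rename, which object stores do not provide):
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardCopyOption;
>
> // Hypothetical final step of the rename-based variant, continuing the
> // metadataLock and applyCommitsSince placeholders from the sketches above.
> void finalizeBootstrap(Path filesTemp, Path files, String lastSynced) throws IOException {
>   metadataLock.lock();
>   try {
>     applyCommitsSince("files_temp", lastSynced); // catch up under the lock
>     Files.move(filesTemp, files, StandardCopyOption.ATOMIC_MOVE);
>   } finally {
>     metadataLock.unlock();
>   }
> }
> {code}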
>  
> Failures: 
> a. If bootstrap fails midway, as long as "files" hasn't been created yet, we can delete files_temp and start all over again. 
> b. If bootstrap fails just after the rename, we should again be fine, except that the lock may not have been released. We need to ensure the metadata_lock is released; to tackle this, if a regular writer fails to acquire the metadata_lock, it will just proceed to listing partitions and applying updates. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)