Posted to commits@hudi.apache.org by "Hui An (Jira)" <ji...@apache.org> on 2022/10/17 07:18:00 UTC

[jira] [Commented] (HUDI-4432) Checkpoint management for multi-writer scenario

    [ https://issues.apache.org/jira/browse/HUDI-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618691#comment-17618691 ] 

Hui An commented on HUDI-4432:
------------------------------

Hey [~harsh1231], can we fix this soon? If you don't have time, I'm willing to fix it :), as it blocks many of our users who use multiple writers from upgrading to 0.12.1.

[~codope] Please correct me if I'm wrong :). From my understanding, we need to iterate over all commits in reverse order, read the checkpoint metadata stored under {{SINK_CHECKPOINT_KEY}}, and check whether its {{writerContext}} matches that of the running job. If it matches, compare the latest batchId with the current batchId; otherwise, move on to the next commit.
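
A rough sketch of that lookup, written in Python only for brevity and not taken from the Hudi code base. It assumes each commit's extra metadata is a plain dict, that the value under {{SINK_CHECKPOINT_KEY}} is a JSON string with hypothetical {{writerContext}} and {{batchId}} fields, and that a batch is skipped when its id is not greater than the last one this writer committed. The key value and both function names are placeholders for illustration.

```python
import json
from typing import Iterable, Mapping, Optional

# Placeholder value: only the constant name appears in this thread, not its value.
SINK_CHECKPOINT_KEY = "sink_checkpoint_key"


def latest_batch_id_for_writer(
    commits_newest_first: Iterable[Mapping[str, str]],
    writer_context: str,
) -> Optional[int]:
    """Walk commits from newest to oldest and return the batchId of the first
    checkpoint written by `writer_context`, or None if there is none."""
    for extra_metadata in commits_newest_first:
        raw = extra_metadata.get(SINK_CHECKPOINT_KEY)
        if raw is None:
            # Commit carries no sink checkpoint (e.g. written by another kind of writer).
            continue
        checkpoint = json.loads(raw)  # assumed JSON layout, see lead-in
        if checkpoint.get("writerContext") == writer_context:
            return int(checkpoint["batchId"])
        # Checkpoint belongs to a different writer: keep iterating.
    return None


def should_skip_batch(
    commits_newest_first: Iterable[Mapping[str, str]],
    writer_context: str,
    current_batch_id: int,
) -> bool:
    """Assumed comparison rule: skip the batch if this writer has already
    committed a batch with an id greater than or equal to the current one."""
    latest = latest_batch_id_for_writer(commits_newest_first, writer_context)
    return latest is not None and current_batch_id <= latest
```

With this shape, a commit from another writer landing on top of ours no longer hides our checkpoint, which is the problem with reading only the latest commit.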

> Checkpoint management for multi-writer scenario
> -----------------------------------------------
>
>                 Key: HUDI-4432
>                 URL: https://issues.apache.org/jira/browse/HUDI-4432
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Harshal Patil
>            Priority: Major
>             Fix For: 0.13.0
>
>
> Please check [https://github.com/apache/hudi/pull/6098/files#r923232330]
> ```
> Do we need to design/implement this similar to how Deltastreamer checkpointing is done? With Deltastreamer, it's feasible to run one writer with DS and another writer with the Spark datasource, and Deltastreamer will still be able to fetch the right checkpoint to resume from every time.
> Here I see we are fetching only the latest commit, so this may not work in multi-writer scenarios. Maybe we can create a follow-up ticket and work on it rather than expanding the scope of this patch.
> ```


