Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/27 09:21:51 UTC

[GitHub] [hudi] YannByron commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

YannByron commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1110771677

> Left some initial comments. I think the main decision here is whether to reuse the existing record-level commit metadata and build CDC on top of it, or to use a separate `.cdc` folder. Can you clarify what exactly is contained in the files under `.cdc`?

Sorry for leaving some points unclear in the RFC doc. Let me address them here, and I'll update the RFC later.

1. For COW tables, query efficiency is the main focus. If the CDC data has to be persisted, I definitely do not want to write out log files; I prefer to double-write instead. I will try to reuse the normal data files where possible to reduce the extra workload. To answer the question above: the `.cdc` folder will hold the files that we have to write out.

2. For MOR tables, we care about write efficiency. In my design, we don't have to write any extra data or files. When querying CDC for MOR, we need to merge the incremental data written in the log files with the base files to determine which records were deleted, which were updated (for those, we also need to find the previous values), and which were inserted.
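
To make the MOR case in point 2 concrete, here is a rough Python sketch of the classification step. It is only an illustration under simplifying assumptions, not Hudi code: `classify_changes` is a hypothetical helper, the base file and the log files are modeled as plain dictionaries keyed by record key, and a log entry of `None` is assumed to stand for a delete.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]
# (operation, record key, previous value, new value)
Change = Tuple[str, str, Optional[Record], Optional[Record]]


def classify_changes(
    base: Dict[str, Record],
    log: Dict[str, Optional[Record]],
) -> List[Change]:
    """Derive CDC rows by comparing log entries against the base file.

    base: record key -> record currently stored in the base file.
    log:  record key -> latest record from the log files, or None
          when the log marks the key as deleted (an assumption of
          this sketch).
    """
    changes: List[Change] = []
    for key, new in log.items():
        old = base.get(key)  # previous value, if the key already existed
        if new is None:
            # Only report a delete if there was something to delete.
            if old is not None:
                changes.append(("delete", key, old, None))
        elif old is None:
            changes.append(("insert", key, None, new))
        else:
            changes.append(("update", key, old, new))
    return changes
```

For example, with `base = {"a": {"v": 1}, "b": {"v": 2}}` and `log = {"a": {"v": 10}, "b": None, "c": {"v": 3}}`, the sketch reports an update for `a` (with its previous value), a delete for `b`, and an insert for `c`. Nothing extra is written at ingest time; the cost is paid at query time by the lookup against the base file.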

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.