Posted to commits@hudi.apache.org by "vinothchandar (via GitHub)" <gi...@apache.org> on 2023/04/04 15:16:51 UTC

[GitHub] [hudi] vinothchandar commented on issue #8222: [SUPPORT] Incremental read with MOR does not work as COW

vinothchandar commented on issue #8222:
URL: https://github.com/apache/hudi/issues/8222#issuecomment-1496165517

   @parisni To clarify the semantics a bit. An incremental query provides all the records that changed between a start and end commit time. If there are multiple writes (CoW) or multiple compactions (MoR) between queries, you would only see the latest record (per pre-combine logic) up to the compacted point, then log records after that. This is similar to the Kafka compacted topic [design](https://kafka.apache.org/documentation/#compaction), which bounds the "catch up" time for downstream jobs. If one wants every change record, i.e., multiple rows per key in the incremental query output for each change, that's what the CDC feature solves (right now it's supported only for CoW).
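   To illustrate the semantics above, here is a minimal, purely illustrative Python sketch (not Hudi code; all names are hypothetical): an incremental read over a commit range yields only the latest record per key, chosen by the pre-combine field, rather than one row per change.

```python
def incremental_read(writes, begin, end):
    """Return the latest value per key among commits in (begin, end],
    picking winners by the pre-combine timestamp (illustrative only)."""
    latest = {}
    for key, precombine_ts, value, commit_ts in writes:
        if begin < commit_ts <= end:
            prev = latest.get(key)
            # pre-combine logic: keep the record with the larger precombine_ts
            if prev is None or precombine_ts >= prev[0]:
                latest[key] = (precombine_ts, value, commit_ts)
    return {k: v[1] for k, v in latest.items()}

writes = [
    ("id1", 1, "a", 10),
    ("id1", 2, "b", 20),   # same key updated again in a later commit
    ("id2", 1, "x", 20),
]
print(incremental_read(writes, begin=0, end=30))
# -> {'id1': 'b', 'id2': 'x'}  (one row per key, not one row per change)
```

   A CDC-style query, by contrast, would emit both the `("id1", "a")` and `("id1", "b")` changes as separate rows.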
   
   As for this problem, the issue is that the reads are served out of the logs based on the commit time range, which is fine as long as we are just returning the latest committed records. In this case, there is a pre-combine field to respect, and that's not handled yet. The solution would be to perform a base + log merge first (which will consider the pre-combine field), then filter for the commit range (this increases the cost of the query, but will give you the same semantics).
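   The proposed fix can be sketched in plain Python (again illustrative, not the actual Hudi implementation; record layout and names are assumptions): merge base and log records by pre-combine first, and only then apply the commit-range filter. A late-arriving log record with an older pre-combine value then correctly loses to the base record instead of being returned.

```python
def merge_then_filter(base, log, begin, end):
    """Merge base-file and log records per key using the pre-combine
    field, then filter the merged result by commit range."""
    merged = {}
    for rec in base + log:
        prev = merged.get(rec["key"])
        if prev is None or rec["precombine"] >= prev["precombine"]:
            merged[rec["key"]] = rec
    # the incremental filter is applied only after the merge
    return [r for r in merged.values() if begin < r["commit_ts"] <= end]

base = [{"key": "id1", "precombine": 5, "val": "base", "commit_ts": 10}]
late = [{"key": "id1", "precombine": 3, "val": "late", "commit_ts": 20}]
# a log-only read over (15, 30] would wrongly return the "late" record;
# merging first drops it, because the base record wins on pre-combine
print(merge_then_filter(base, late, begin=15, end=30))
# -> []
```

   With a genuinely newer log record (pre-combine 7, say), the same function returns that record, so normal incremental results are unaffected.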
   
   How much of a blocker is this for your project? Knowing that will help us prioritize.
   
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org