
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/08 10:11:30 UTC

[GitHub] [hudi] xushiyan commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

xushiyan commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1120389506

I read through the RFC and Danny's design doc, and I also prefer option B, leveraging `_hoodie_operation`, which introduces less complexity and overhead on storage. I share the concern that the management and cost overhead of `.cdc/` might not justify the gain in read efficiency. For example, we've put a lot of effort into making the metadata table stable and in sync with the data table; that is a reference for what we might have to do for `.cdc/`. Storage-wise, as mentioned above, even if the fraction is small, the actual cost can be significant because the tables themselves are huge. Even if the storage size is acceptable, for cloud storage users the added API calls to save new objects incur extra charges regardless of object size. In update-heavy tables, this becomes impactful.
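To make the API-call point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration only: the function name, the one-extra-object-per-touched-file-per-commit model, and the example price are hypothetical and are not taken from the RFC or from any provider's actual pricing.

```python
def extra_put_cost(commits_per_day, files_touched_per_commit,
                   price_per_1000_puts, days=30):
    """Estimate the added request cost of writing extra change-log objects.

    Assumption (hypothetical model): every data file touched by a commit
    results in exactly one additional object written under `.cdc/`.
    Object size is deliberately ignored, since request charges are
    billed per call rather than per byte.
    """
    extra_objects = commits_per_day * files_touched_per_commit * days
    return extra_objects / 1000 * price_per_1000_puts


# Hypothetical update-heavy table: one commit per minute, 500 files
# touched per commit, and an assumed price of 0.005 per 1000 PUT requests.
monthly_cost = extra_put_cost(
    commits_per_day=1440,
    files_touched_per_commit=500,
    price_per_1000_puts=0.005,
)
print(f"extra objects per month: {1440 * 500 * 30}")
print(f"estimated extra request cost per month: {monthly_cost:.2f}")
```

The estimate grows linearly with commit frequency and with the number of files each commit touches, which is why update-heavy tables would feel it the most.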

On option B using `_hoodie_operation`, I agree some benchmarking would be very helpful. It may be worth putting more energy there to optimize the logic if needed. UX-wise, it fits better for users already running incremental query pipelines: they turn on a new config and get the CDC info.

In short, I prefer leveraging and improving what we already have in Hudi. Regardless of the design approach, this is a great initiative!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
