Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/02 19:59:54 UTC

[GitHub] [hudi] yihua commented on a change in pull request #3527: [HUDI-2347] Blog on improving marker mechanism

yihua commented on a change in pull request #3527:
URL: https://github.com/apache/hudi/pull/3527#discussion_r701382951



##########
File path: website/blog/2021-08-18-improving-marker-mechanism.md
##########
@@ -0,0 +1,65 @@
+---
+title: "Improving Marker Mechanism in Apache Hudi"
+excerpt: "We introduce a new marker mechanism leveraging the timeline server to address performance bottlenecks due to rate-limiting on cloud storage like AWS S3."
+author: yihua
+category: blog
+---
+Write operations in an Apache Hudi table use markers to efficiently identify the data files written to the file system.  In this blog, we dive into the design of the existing direct marker file mechanism and explain its performance problem on cloud storage like AWS S3.  We demonstrate how we improve the write performance with timeline-server-based markers.
+
+<!--truncate-->
+
+## What is a marker and why it’s needed in write operations on Hudi Table
+ 
+A **marker** in Hudi, such as a marker file with a unique filename, is a label indicating that a corresponding data file exists in the file system.  Each marker entry is composed of three parts: the data file name, the marker extension (`.marker`), and the I/O type (`CREATE`, `MERGE`, or `APPEND`).  For example, the marker `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet.marker.CREATE` indicates that the corresponding data file is `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet` and the I/O type is `CREATE`.  Before writing each data file, the Hudi write client first creates a marker in the file system.  Markers persist in the file system until the write client explicitly deletes them.  The write client deletes all markers when the commit succeeds.

Review comment:
       I reworded this a bit in a follow-up PR, using "marker" instead of "marker file" to keep the term general.
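
       For readers following the naming scheme in the paragraph under review, here is a minimal Python sketch of how a marker name splits into its three parts (data file name, `.marker` extension, I/O type).  The names `Marker`, `parse_marker_name`, and `to_marker_name` are hypothetical and used only for illustration; this is not Hudi's actual implementation or API.

```python
from typing import NamedTuple

# Values taken from the blog paragraph; the helper names below are hypothetical.
MARKER_EXTENSION = ".marker"
IO_TYPES = ("CREATE", "MERGE", "APPEND")


class Marker(NamedTuple):
    data_file_name: str
    io_type: str


def to_marker_name(data_file_name: str, io_type: str) -> str:
    """Build '<data file name>.marker.<IO type>' as described in the post."""
    if io_type not in IO_TYPES:
        raise ValueError(f"unknown I/O type: {io_type!r}")
    return f"{data_file_name}{MARKER_EXTENSION}.{io_type}"


def parse_marker_name(marker_name: str) -> Marker:
    """Split a marker name back into its data file name and I/O type."""
    # The I/O type is everything after the last dot.
    stem, _, io_type = marker_name.rpartition(".")
    if io_type not in IO_TYPES:
        raise ValueError(f"unknown I/O type in marker: {marker_name!r}")
    # What remains must end with the marker extension.
    if not stem.endswith(MARKER_EXTENSION):
        raise ValueError(f"missing {MARKER_EXTENSION} extension: {marker_name!r}")
    return Marker(stem[: -len(MARKER_EXTENSION)], io_type)


example = (
    "91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605"
    ".parquet.marker.CREATE"
)
marker = parse_marker_name(example)
print(marker.data_file_name)
print(marker.io_type)
```

       Splitting from the right keeps the sketch independent of any dots or underscores inside the data file name itself.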




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org