
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/27 13:18:03 UTC

[GitHub] [hudi] vinothchandar commented on a change in pull request #3515: [HUDI-2341] Adding blog on immutable data lakes

vinothchandar commented on a change in pull request #3515:
URL: https://github.com/apache/hudi/pull/3515#discussion_r697433025



##########
File path: website/blog/2021-08-20-immutable-data-lakes.md
##########
@@ -0,0 +1,73 @@
+---
+title: "Immutable data lakes using Apache Hudi"
+excerpt: "How to leverage Apache Hudi for your immutable or append-only data use case"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes, with different table types and config knobs to cater to a range of needs.
+We strive to listen to the community and build features based on its needs. From our interactions with the community, we 
+learned that there are quite a few use cases where Hudi is used for immutable or append-only data. This blog goes 
+over how to leverage Apache Hudi in building your data lake for such immutable or append-only data.
+<!--truncate-->
+
+# Immutable data
+Oftentimes, users route log entries to data lakes, where data is immutable. (Add some concrete 
+examples here). Once ingested, data won't be updated and can only be deleted. Also, deletes are most likely issued at 
+partition-level granularity (for example, delete partitions older than 1 week).
+
+# Immutable data lakes using Apache Hudi 
+Hudi has an efficient way to ingest data for such immutable use cases. The "bulk_insert" operation in Hudi is 
+commonly used for initial bootstrapping of data into Hudi, but it also fits the bill for such immutable or append-only 
+data. It is also known to be more performant than regular "insert"s or "upsert"s. 
+
+## Bulk_insert vs regular Inserts/Upserts
+With regular inserts and upserts, Hudi executes a few steps before data can be written to data files. For example, 
+index lookup, small file handling, etc. have to be performed before the actual write. But with bulk_insert, such overhead can 
+be avoided since the data is known to be immutable. 
+
+Here is an illustration of the steps involved in the different operations of interest. 
+
+![Inserts/Upserts](/assets/images/blog/immutable_datalakes/immutable_data_lakes1.jpeg)

Review comment:
       +1 let's stick to png 



