Posted to commits@hudi.apache.org by si...@apache.org on 2022/06/10 22:54:25 UTC

[hudi] branch asf-site updated: [DOCS] Adding FAQ on why hudi has record key requirement (#5839)

This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 63b0f757a0 [DOCS] Adding FAQ on why hudi has record key requirement (#5839)
63b0f757a0 is described below

commit 63b0f757a00c7b4669fbfc5a5c5a123d7b9f2942
Author: Sivabalan Narayanan <n....@gmail.com>
AuthorDate: Fri Jun 10 18:54:21 2022 -0400

    [DOCS] Adding FAQ on why hudi has record key requirement (#5839)
---
 website/docs/faq.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/website/docs/faq.md b/website/docs/faq.md
index be340e3b38..52137d1be5 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -67,6 +67,17 @@ When writing data into Hudi, you model the records like how you would on a key-v
 
 When querying/reading data, Hudi just presents itself as a json-like hierarchical table, the way everyone is used to querying with Hive/Spark/Presto over Parquet/Json/Avro. 
 
+### Why does Hudi require a key field to be configured?
+
+Hudi was designed to support fast record-level upserts and thus requires a key to identify whether an incoming record is
+an insert, an update, or a delete, and to process it accordingly. Additionally, Hudi automatically maintains indexes on this primary
+key, and for many use-cases like CDC, enforcing such primary key constraints is crucial for data quality. In this context,
+the pre-combine key helps reconcile multiple records with the same key within a single batch of input records. Even for append-only data
+streams, Hudi supports key-based de-duplication before inserting records. For example, at-least-once data integration
+systems like Kafka MirrorMaker can introduce duplicates during failures. Even for plain old batch pipelines, keys
+help eliminate duplication that could be caused by backfill pipelines, where it is commonly unclear which set of records
+needs to be re-written. We are actively working on making keys easier to use by only requiring them for upserts and/or automatically
+generating the key internally (much like RDBMS row_ids).
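+
+For illustration, here is a minimal sketch of how these keys are typically configured when writing a Spark DataFrame
+through the Hudi Spark datasource (the table name, storage path and column names below are purely illustrative,
+not part of this change):
+
+```scala
+// Assumed DataFrame `df` with columns: uuid (record key), ts (pre-combine field)
+// and partitionpath. The option names are standard Hudi write configurations.
+df.write.format("hudi").
+  option("hoodie.table.name", "my_hudi_table").
+  option("hoodie.datasource.write.recordkey.field", "uuid").        // identifies inserts vs. updates/deletes
+  option("hoodie.datasource.write.precombine.field", "ts").         // reconciles duplicate keys within a batch
+  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
+  option("hoodie.datasource.write.operation", "upsert").
+  // For append-only streams, inserts can still be de-duplicated on the key:
+  // option("hoodie.datasource.write.insert.drop.duplicates", "true").
+  mode("append").
+  save("s3a://my-bucket/my_hudi_table")
+```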
+
 ### Does Hudi support cloud storage/object stores?
 
 Yes. Generally speaking, Hudi is able to provide its functionality on any Hadoop FileSystem implementation and thus can read and write datasets on [Cloud stores](https://hudi.apache.org/docs/cloud) (Amazon S3 or Microsoft Azure or Google Cloud Storage). Over time, Hudi has also incorporated specific design aspects that make building Hudi datasets on the cloud easy, such as [consistency checks for s3](https://hudi.apache.org/docs/configurations#hoodieconsistencycheckenabled), Zero moves/r [...]