Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/20 09:33:06 UTC

[GitHub] [hudi] yanghua commented on a change in pull request #1992: [BLOG] Incremental processing on data lakes by vinoyang

yanghua commented on a change in pull request #1992:
URL: https://github.com/apache/hudi/pull/1992#discussion_r473611869



##########
File path: docs/_posts/2020-08-18-hudi-incremental-processing-on-data-lakes.md
##########
@@ -0,0 +1,275 @@
+---
+title: "Incremental Processing on the Data Lake"
+excerpt: "How Apache Hudi provides ability for incremental data processing."
+author: vinoyang
+category: blog
+---
+
+### NOTE: This article is a translation of the infoq.cn article, found [here](https://www.infoq.cn/article/CAgIDpfJBVcJHKJLSbhe), with minor edits
+
+Apache Hudi is a data lake framework that provides the ability to ingest, manage and query large analytical data sets on a distributed file system or cloud store. 
+Hudi entered the Apache Incubator in January 2019 and was promoted to a top-level Apache project in May 2020. This article mainly discusses the importance 
+of Hudi to the data lake from the perspective of "incremental processing". More information about Apache Hudi's framework functions, features, usage scenarios, and 
+latest developments can be found at QCon Global Software Development Conference (Beijing Station) 2020.

Review comment:
       `(Beijing Station)` -> `(Shanghai Station)`

##########
File path: docs/_posts/2020-08-18-hudi-incremental-processing-on-data-lakes.md
##########
@@ -0,0 +1,275 @@
+---
+title: "Incremental Processing on the Data Lake"
+excerpt: "How Apache Hudi provides ability for incremental data processing."
+author: vinoyang
+category: blog
+---
+
+### NOTE: This article is a translation of the infoq.cn article, found [here](https://www.infoq.cn/article/CAgIDpfJBVcJHKJLSbhe), with minor edits
+
+Apache Hudi is a data lake framework that provides the ability to ingest, manage and query large analytical data sets on a distributed file system or cloud store. 
+Hudi entered the Apache Incubator in January 2019 and was promoted to a top-level Apache project in May 2020. This article mainly discusses the importance 
+of Hudi to the data lake from the perspective of "incremental processing". More information about Apache Hudi's framework functions, features, usage scenarios, and 
+latest developments can be found at QCon Global Software Development Conference (Beijing Station) 2020.
+
+Throughout the development of big data technology, Hadoop has steadily seized the opportunities of this era and has become the de-facto standard for enterprises building big data infrastructure. 
+HDFS, the distributed file system that underpins the Hadoop ecosystem, has naturally become the standard interface for big data storage systems. In recent years, with the rise of 
+cloud-native architectures, we have seen a wave of newer models embracing low-cost cloud storage, and a number of data lake frameworks that are compatible with HDFS interfaces 
+while embracing cloud vendor storage have emerged in the industry as well. 
+
+However, we are still processing data in much the same way we did 10 years ago. This article looks at why that matters for the data lake from the perspective of "incremental processing".
+
+## Traditional data lakes lack the primitives for incremental processing
+
+In the era of the mobile Internet and the Internet of Things, late-arriving data is very common. 
+This involves two notions of time semantics: [event time and processing time](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/). 
+
+As the names suggest:
+
+ - **Event time:** the time when the event actually occurred;
+ - **Processing time:** the time when an event is observed (processed) in the system;
+
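+To make the distinction concrete, here is a minimal, engine-agnostic sketch (plain Scala; the event and timestamps are invented for illustration) of a record whose event time and processing time diverge:
+
+```scala
+import java.time.{Duration, Instant}
+
+// A click event that occurred on a device at `eventTime`, but that may only
+// reach the processing system some time later (e.g. the device was offline).
+case class ClickEvent(userId: String, eventTime: Instant)
+
+object EventVsProcessingTime extends App {
+  // Event time: the event actually happened 10 minutes ago.
+  val event = ClickEvent("user-42", Instant.now().minus(Duration.ofMinutes(10)))
+
+  // Processing time: the wall-clock time at which the system observes the event.
+  val processingTime = Instant.now()
+
+  // The gap between the two is the deviation discussed below.
+  val skew = Duration.between(event.eventTime, processingTime)
+  println(s"event time = ${event.eventTime}, processing time = $processingTime, skew = $skew")
+}
+```
+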
+Ideally, the event time and the processing time are the same, but in reality they often deviate from each other; this deviation is commonly called "time skew". 
+Whether for low-latency stream computing or ordinary batch processing, handling event time, processing time, and late data is a common and difficult problem. 
+In general, in order to ensure correctness, when we strictly follow "event time" semantics, late data triggers the 
+[recalculation of the time window](https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#late-elements-considerations) 
+(usually Hive partitions in batch processing), even though the results of these "windows" may already have been calculated or even shown to end users. 
+For recalculation, stream processing usually relies on a scalable key-value store, which works incrementally at the record/event level and is optimized 
+for point queries and updates. In data lakes, however, recalculation usually means rewriting the entire (immutable) Hive partition (or simply a folder in DFS) and 
+re-triggering the recalculation of cascading tasks that have consumed that Hive partition.
+
+While data lakes hold massive amounts of data, many long-tail businesses still have a strong demand for updating cold data. However, for a long time, 
+the data in a single partition of the data lake was designed to be non-updatable: if it needed to be updated, the entire partition had to be rewritten. 
+This seriously hurts the efficiency of the entire ecosystem. In terms of latency and resource utilization, these operations incur expensive overhead on Hadoop. 
+Moreover, this overhead usually cascades to the entire Hadoop data processing pipeline, which ultimately leads to an increase in latency of several hours.
+
+To address the two problems mentioned above, if the data lake supports fine-grained incremental processing, we can fold changes into existing Hive partitions 
+more effectively and provide a way for downstream table consumers to obtain only the changed data. Effective support for incremental processing can be decomposed into the 
+following two primitive operations:
+
+ - **Update insert (upsert):** Conceptually, rewriting the entire partition can be regarded as a very inefficient upsert operation, which eventually writes far more data than the 
+data that actually changed. Therefore, support for (bulk) upsert is considered a very important feature. [Google's Mesa](https://research.google/pubs/pub42851/) (Google's data warehouse system) also 
+describes several techniques that can be applied to rapid data ingestion scenarios.
+
+ - **Incremental consumption:** Although upsert solves the problem of quickly publishing new data to a partition, downstream data consumers do not know 
+ which data has changed since which point in the past. Usually, consumers can only discover the changed data by scanning the entire partition/table and 
+ recomputing all the data, which requires considerable time and resources. Therefore, we also need a mechanism to more efficiently obtain data records that 
+ have changed since the last time the partition was consumed.
+
+With the above two primitive operations, you can upsert a data set, incrementally consume from it, and create another (also incremental) data set, solving the two problems 
+we mentioned above and supporting many common cases, thereby enabling end-to-end incremental processing and reducing end-to-end latency. Combined with each other, these two primitives 
+unlock the ability to do stream/incremental processing on top of the DFS abstraction.
+
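+As a rough illustration of how these two primitives surface in Hudi's Spark datasource, here is a minimal sketch (the table name, paths and field names are hypothetical, and the exact option keys may differ across Hudi versions):
+
+```scala
+import org.apache.spark.sql.{SaveMode, SparkSession}
+
+object HudiIncrementalSketch extends App {
+  val spark = SparkSession.builder()
+    .appName("hudi-incremental-sketch")
+    .master("local[2]")
+    .getOrCreate()
+  import spark.implicits._
+
+  val basePath = "file:///tmp/hudi/user_events"  // hypothetical table location
+
+  // 1) Upsert: fold a batch of changed records into the existing table
+  //    instead of rewriting whole partitions.
+  val changes = Seq(("e1", "2020-08-18", 1597740000L, "click"))
+    .toDF("event_id", "event_date", "event_ts", "action")
+
+  changes.write
+    .format("org.apache.hudi")
+    .option("hoodie.table.name", "user_events")
+    .option("hoodie.datasource.write.recordkey.field", "event_id")
+    .option("hoodie.datasource.write.partitionpath.field", "event_date")
+    .option("hoodie.datasource.write.precombine.field", "event_ts")
+    .option("hoodie.datasource.write.operation", "upsert")
+    .mode(SaveMode.Append)
+    .save(basePath)
+
+  // 2) Incremental consumption: read only the records that changed after a
+  //    given commit instant, instead of scanning the whole table.
+  val incremental = spark.read
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.query.type", "incremental")
+    .option("hoodie.datasource.read.begin.instanttime", "20200818000000")
+    .load(basePath)
+
+  incremental.show(false)
+}
+```
+
+Commits whose instant time is later than the supplied begin instant show up in the incremental view, so a downstream job only touches the records that actually changed.
+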
+The storage scale of the data lake far exceeds that of the data warehouse. Although the two focus on different functions, 
+there is still a considerable intersection (of course, there are still disputes over their definitions and implementations, 
+which are not the topic this article tries to discuss). In any case, the data lake supports larger analytical data sets on cheaper storage, 
+so incremental processing is very important for it as well. Next, let's discuss the significance of incremental processing for the data lake.
+
+## The significance of incremental processing for the data lake
+
+### Streaming Semantics
+
+It has long been stated that there is a "[dualism](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)" 
+between the change log (that is, the "stream" in the conventional sense we understand) and the table.
+
+![dualism](/assets/images/blog/incr-processing/image4.jpg)
+
+The core of this discussion is: if there is a change log, you can apply these changes to reconstruct a data table and obtain its current state. Conversely, if you update a table, 
+you can record these changes and publish them all as a "change log" of the table's state. This interchangeable nature is called "stream table duality" for short.
+
+A more concrete way to understand "stream table duality": when a business system modifies the data in a MySQL table, MySQL records these changes as its Binlog. 
+If we publish this continuous Binlog (stream) to Kafka, let a downstream processing system subscribe to Kafka, and use a state store to gradually 
+accumulate intermediate results, then the current state of these intermediate results reflects the current snapshot of the table.
+
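+To make this pipeline concrete, here is a minimal, framework-agnostic sketch (plain Scala; the change log contents are invented) showing how folding a change log into a key-value state store reproduces the table's current snapshot:
+
+```scala
+// "Stream table duality": replaying a change log (stream) rebuilds the table's state.
+sealed trait Change
+case class Upsert(key: String, value: String) extends Change
+case class Delete(key: String) extends Change
+
+object StreamTableDuality extends App {
+  // A change log, e.g. what a MySQL Binlog published to Kafka would contain.
+  val changelog: Seq[Change] = Seq(
+    Upsert("user-1", "alice"),
+    Upsert("user-2", "bob"),
+    Upsert("user-1", "alice_updated"),
+    Delete("user-2")
+  )
+
+  // Folding the stream into a key-value state store yields the table snapshot.
+  val snapshot = changelog.foldLeft(Map.empty[String, String]) {
+    case (table, Upsert(k, v)) => table + (k -> v)
+    case (table, Delete(k))    => table - k
+  }
+
+  println(snapshot)  // Map(user-1 -> alice_updated): the table's current state
+}
+```
+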
+If the two primitives mentioned above that support incremental processing can be introduced to the data lake, the above pipeline, which can reflect the 
+"duality of flow table", is also applicable on the data lake. Based on the first primitive, the data lake can also ingest the Binlog log streams in Kafka, 

Review comment:
       `"duality of flow table"` -> `"stream table duality"`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org