Posted to commits@hudi.apache.org by da...@apache.org on 2022/01/12 09:06:27 UTC

[hudi] branch asf-site updated: [HUDI-3230] Add streaming read for flink document (#4571)

This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 7374326  [HUDI-3230] Add streaming read for flink document (#4571)
7374326 is described below

commit 7374326ccc160367c566e7052eab34f6b0ee556e
Author: Danny Chan <yu...@gmail.com>
AuthorDate: Wed Jan 12 17:05:35 2022 +0800

    [HUDI-3230] Add streaming read for flink document (#4571)
---
 website/docs/flink-quick-start-guide.md            | 11 ++++++-----
 website/docs/hoodie_deltastreamer.md               | 22 ++++++++++++++++++++--
 .../version-0.10.0/flink-quick-start-guide.md      | 11 ++++++-----
 .../version-0.10.0/hoodie_deltastreamer.md         | 22 ++++++++++++++++++++--
 4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index d5dd05d..323acad 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -4,15 +4,16 @@ toc: true
 last_modified_at: 2020-08-12T15:19:57+08:00
 ---
 
-This guide provides a document at Hudi's capabilities using Flink SQL. We can feel the unique charm of Flink stream computing engine on Hudi.
-Reading this guide, you can quickly start using Flink to write to(read from) Hudi, have a deeper understanding of configuration and optimization:
+This guide provides instructions for the Flink Hudi integration, showing how Flink brings the power of stream processing to Hudi.
+After reading this guide, you can quickly start using Flink on Hudi and learn the different modes for reading and writing Hudi tables with Flink:
 
 - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Flink Configuration](flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_configuration#table-options).
-- **Writing Data** : Flink supports different writing data use cases, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different querying data use cases, such as [Incremental Query](hoodie_deltastreamer#incremental-query), [Hive Query](syncing_metastore#flink-setup), [Presto Query](query_engine_setup#prestodb).
+- **Configuration** : For [Global Configuration](flink_configuration#global-configurations), set it up through `$FLINK_HOME/conf/flink-conf.yaml`. For per-job configuration, set it up through [Table Options](flink_configuration#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](hoodie_deltastreamer#streaming-query) and [Incremental Query](hoodie_deltastreamer#incremental-query).
 - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_configuration#memory-optimization) and [Write Rate Limit](flink_configuration#write-rate-limit).
 - **Optimization**: Offline compaction is supported [Offline Compaction](compaction#flink-offline-compaction).
+- **Query Engines**: Besides Flink, many other query engines are integrated with Hudi, such as [Hive Query](syncing_metastore#flink-setup) and [Presto Query](query_engine_setup#prestodb).
 
 ## Quick Start
 
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index a979788..f212f57 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -462,6 +462,24 @@ There are many use cases that user put the full history data set onto the messag
 |  -----------  | -------  | ------- | ------- |
 | `write.rate.limit` | `false` | `0` | Default disable the rate limit |
 
+### Streaming Query
+By default, the hoodie table is read as a batch query, that is, it reads the latest snapshot data set and returns. Turn on streaming read
+mode by setting option `read.streaming.enabled` to `true`. Set option `read.start-commit` to specify the read start offset; specify the
+value `earliest` if you want to consume the whole history data set.
+
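+As a minimal sketch (the table schema and `path` value are placeholders, not part of this change), a streaming read can be declared like this:
+
+```sql
+CREATE TABLE t1(
+  uuid VARCHAR(20),
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',                 -- placeholder table base path
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',  -- enable streaming read
+  'read.start-commit' = 'earliest'    -- consume from the earliest commit
+);
+
+-- the query keeps emitting new data as commits arrive
+SELECT * FROM t1;
+```
+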
+#### Options
+|  Option Name  | Required | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `read.streaming.enabled` | `false` | `false` | Specify `true` to enable streaming read |
+| `read.start-commit` | `false` | the latest commit | Start commit time in format 'yyyyMMddHHmmss'; use `earliest` to consume from the earliest commit |
+| `read.streaming.skip_compaction` | `false` | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) to avoid consuming duplicate records from the compaction instants; 2) when changelog mode is enabled, to consume only the change logs for correct semantics |
+| `clean.retain_commits` | `false` | `10` | The max number of commits to retain before cleaning. When changelog mode is enabled, tweak this option to adjust the change log retention time. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set to 5 minutes |
+
+:::note
+When option `read.streaming.skip_compaction` is turned on and the streaming reader lags behind by more than
+`clean.retain_commits` commits, data loss may occur.
+:::
+
 ### Incremental Query
 There are 3 use cases for incremental query:
 1. Streaming query: specify the start commit with option `read.start-commit`;
@@ -472,8 +490,8 @@ There are 3 use cases for incremental query:
 #### Options
 |  Option Name  | Required | Default | Remarks |
 |  -----------  | -------  | ------- | ------- |
-| `write.start-commit` | `false` | the latest commit | Specify `earliest` to consume from the start commit |
-| `write.end-commit` | `false` | the latest commit | -- |
+| `read.start-commit` | `false` | the latest commit | Specify `earliest` to consume from the start commit |
+| `read.end-commit` | `false` | the latest commit | -- |
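+
+As a hedged sketch (the table schema, path and commit times below are placeholder values), a bounded incremental read between two commits can be expressed as:
+
+```sql
+CREATE TABLE t1(
+  uuid VARCHAR(20),
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3)
+)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  -- example commit times in 'yyyyMMddHHmmss' format
+  'read.start-commit' = '20220101000000',
+  'read.end-commit' = '20220102000000'
+);
+
+-- batch query returning only the data committed between the two commits above
+SELECT * FROM t1;
+```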
 
 ## Kafka Connect Sink
 If you want to perform streaming ingestion into Hudi format similar to HoodieDeltaStreamer, but you don't want to depend on Spark,
diff --git a/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
index d5dd05d..323acad 100644
--- a/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.10.0/flink-quick-start-guide.md
@@ -4,15 +4,16 @@ toc: true
 last_modified_at: 2020-08-12T15:19:57+08:00
 ---
 
-This guide provides a document at Hudi's capabilities using Flink SQL. We can feel the unique charm of Flink stream computing engine on Hudi.
-Reading this guide, you can quickly start using Flink to write to(read from) Hudi, have a deeper understanding of configuration and optimization:
+This guide provides instructions for the Flink Hudi integration, showing how Flink brings the power of stream processing to Hudi.
+After reading this guide, you can quickly start using Flink on Hudi and learn the different modes for reading and writing Hudi tables with Flink:
 
 - **Quick Start** : Read [Quick Start](#quick-start) to get started quickly Flink sql client to write to(read from) Hudi.
-- **Configuration** : For [Flink Configuration](flink_configuration#global-configurations), sets up through `$FLINK_HOME/conf/flink-conf.yaml`. For per job configuration, sets up through [Table Option](flink_configuration#table-options).
-- **Writing Data** : Flink supports different writing data use cases, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
-- **Querying Data** : Flink supports different querying data use cases, such as [Incremental Query](hoodie_deltastreamer#incremental-query), [Hive Query](syncing_metastore#flink-setup), [Presto Query](query_engine_setup#prestodb).
+- **Configuration** : For [Global Configuration](flink_configuration#global-configurations), set it up through `$FLINK_HOME/conf/flink-conf.yaml`. For per-job configuration, set it up through [Table Options](flink_configuration#table-options).
+- **Writing Data** : Flink supports different modes for writing, such as [CDC Ingestion](hoodie_deltastreamer#cdc-ingestion), [Bulk Insert](hoodie_deltastreamer#bulk-insert), [Index Bootstrap](hoodie_deltastreamer#index-bootstrap), [Changelog Mode](hoodie_deltastreamer#changelog-mode) and [Append Mode](hoodie_deltastreamer#append-mode).
+- **Querying Data** : Flink supports different modes for reading, such as [Streaming Query](hoodie_deltastreamer#streaming-query) and [Incremental Query](hoodie_deltastreamer#incremental-query).
 - **Tuning** : For write/read tasks, this guide gives some tuning suggestions, such as [Memory Optimization](flink_configuration#memory-optimization) and [Write Rate Limit](flink_configuration#write-rate-limit).
 - **Optimization**: Offline compaction is supported [Offline Compaction](compaction#flink-offline-compaction).
+- **Query Engines**: Besides Flink, many other query engines are integrated with Hudi, such as [Hive Query](syncing_metastore#flink-setup) and [Presto Query](query_engine_setup#prestodb).
 
 ## Quick Start
 
diff --git a/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md b/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
index a979788..f212f57 100644
--- a/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.10.0/hoodie_deltastreamer.md
@@ -462,6 +462,24 @@ There are many use cases that user put the full history data set onto the messag
 |  -----------  | -------  | ------- | ------- |
 | `write.rate.limit` | `false` | `0` | Default disable the rate limit |
 
+### Streaming Query
+By default, the hoodie table is read as a batch query, that is, it reads the latest snapshot data set and returns. Turn on streaming read
+mode by setting option `read.streaming.enabled` to `true`. Set option `read.start-commit` to specify the read start offset; specify the
+value `earliest` if you want to consume the whole history data set.
+
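+As a minimal sketch (the table schema and `path` value are placeholders, not part of this change), a streaming read can be declared like this:
+
+```sql
+CREATE TABLE t1(
+  uuid VARCHAR(20),
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3),
+  `partition` VARCHAR(20)
+)
+PARTITIONED BY (`partition`)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',                 -- placeholder table base path
+  'table.type' = 'MERGE_ON_READ',
+  'read.streaming.enabled' = 'true',  -- enable streaming read
+  'read.start-commit' = 'earliest'    -- consume from the earliest commit
+);
+
+-- the query keeps emitting new data as commits arrive
+SELECT * FROM t1;
+```
+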
+#### Options
+|  Option Name  | Required | Default | Remarks |
+|  -----------  | -------  | ------- | ------- |
+| `read.streaming.enabled` | `false` | `false` | Specify `true` to enable streaming read |
+| `read.start-commit` | `false` | the latest commit | Start commit time in format 'yyyyMMddHHmmss'; use `earliest` to consume from the earliest commit |
+| `read.streaming.skip_compaction` | `false` | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) to avoid consuming duplicate records from the compaction instants; 2) when changelog mode is enabled, to consume only the change logs for correct semantics |
+| `clean.retain_commits` | `false` | `10` | The max number of commits to retain before cleaning. When changelog mode is enabled, tweak this option to adjust the change log retention time. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set to 5 minutes |
+
+:::note
+When option `read.streaming.skip_compaction` is turned on and the streaming reader lags behind by more than
+`clean.retain_commits` commits, data loss may occur.
+:::
+
 ### Incremental Query
 There are 3 use cases for incremental query:
 1. Streaming query: specify the start commit with option `read.start-commit`;
@@ -472,8 +490,8 @@ There are 3 use cases for incremental query:
 #### Options
 |  Option Name  | Required | Default | Remarks |
 |  -----------  | -------  | ------- | ------- |
-| `write.start-commit` | `false` | the latest commit | Specify `earliest` to consume from the start commit |
-| `write.end-commit` | `false` | the latest commit | -- |
+| `read.start-commit` | `false` | the latest commit | Specify `earliest` to consume from the start commit |
+| `read.end-commit` | `false` | the latest commit | -- |
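+
+As a hedged sketch (the table schema, path and commit times below are placeholder values), a bounded incremental read between two commits can be expressed as:
+
+```sql
+CREATE TABLE t1(
+  uuid VARCHAR(20),
+  name VARCHAR(10),
+  age INT,
+  ts TIMESTAMP(3)
+)
+WITH (
+  'connector' = 'hudi',
+  'path' = '${path}',
+  -- example commit times in 'yyyyMMddHHmmss' format
+  'read.start-commit' = '20220101000000',
+  'read.end-commit' = '20220102000000'
+);
+
+-- batch query returning only the data committed between the two commits above
+SELECT * FROM t1;
+```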
 
 ## Kafka Connect Sink
 If you want to perform streaming ingestion into Hudi format similar to HoodieDeltaStreamer, but you don't want to depend on Spark,