Posted to commits@hudi.apache.org by da...@apache.org on 2022/05/09 09:47:21 UTC

[hudi] branch asf-site updated: [HUDI-4063] Update the site doc for flink since release 0.11 (#5538)

This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 964ba2329b [HUDI-4063] Update the site doc for flink since release 0.11 (#5538)
964ba2329b is described below

commit 964ba2329b6c96902941254c4196f193cf543d02
Author: Danny Chan <yu...@gmail.com>
AuthorDate: Mon May 9 17:47:16 2022 +0800

    [HUDI-4063] Update the site doc for flink since release 0.11 (#5538)
---
 website/docs/compaction.md                                    |  4 +++-
 website/docs/hoodie_deltastreamer.md                          | 10 +++++-----
 website/versioned_docs/version-0.11.0/compaction.md           |  4 +++-
 website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md | 10 +++++-----
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index fe679f4ac9..9d73e31bd5 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -135,4 +135,6 @@ Offline compaction needs to submit the Flink task on the command line. The progr
 | `--path` | `true` | `--` | The path where the target table is stored on Hudi |
 | `--compaction-max-memory` | `false` | `100` | The index map size of log data during compaction, 100 MB by default. If you have enough memory, you can increase this parameter |
 | `--schedule` | `false` | `false` | Whether to execute the operation of scheduling the compaction plan. When the write process is still writing, turning on this parameter has a risk of losing data. Therefore, it must be ensured that no write tasks are currently writing data to this table when this parameter is turned on |
-| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
\ No newline at end of file
+| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
+| `--service` | `false` | `false` | Whether to start a monitoring service that checks for and schedules new compaction tasks at the configured interval. |
+| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. |
\ No newline at end of file
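
For context, the two new flags above are passed to the offline compactor program on the command line. A minimal sketch, assuming a locally built Flink bundle jar and a made-up table path (only `--service` and `--min-compaction-interval-seconds` come from the table above; the rest follows the existing offline compaction instructions):

```bash
# Run the offline compactor in service mode: instead of exiting after one round,
# it keeps checking for and scheduling new compaction work every 600 seconds.
./bin/flink run \
  -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink-bundle.jar \
  --path hdfs://namenode:8020/warehouse/hudi/t1 \
  --service \
  --min-compaction-interval-seconds 600
```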
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index 6f2c80d5cf..2efa2aa416 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -369,8 +369,6 @@ We recommend two ways for syncing CDC data into Hudi:
 
 :::note
 - If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly.
-- The MOR table can not handle DELETEs in event time sequence now, thus causing data loss. You better switch on the changelog mode through
-  option `changelog.enabled`.
 :::
 
 ### Bulk Insert
@@ -401,8 +399,8 @@ will rollover to the new file handle. Finally, `the number of files` >= [`write.
 |  -----------  | -------  | ------- | ------- |
 | `write.operation` | `true` | `upsert` | Setting as `bulk_insert` to open this function  |
 | `write.tasks`  |  `false`  | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
-| `write.bulk_insert.sort_by_partition` | `false`  | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
+| `write.bulk_insert.shuffle_input` | `false` | `true` | Whether to shuffle data according to the input field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
+| `write.bulk_insert.sort_input` | `false`  | `true` | Whether to sort data according to the input field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
 | `write.sort.memory` | `false` | `128` | Available managed memory of sort operator. default  `128` MB |
 
 ### Index Bootstrap
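
To make the rename concrete, here is a hedged Flink SQL sketch of a `bulk_insert` sink using the new option keys (the table name, schema and path are placeholders for illustration):

```sql
-- Hypothetical sink table; only the WITH options matter here.
CREATE TABLE hudi_bulk_sink (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  `partition` STRING
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://namenode:8020/warehouse/hudi/hudi_bulk_sink',  -- placeholder path
  'write.operation' = 'bulk_insert',
  'write.tasks' = '4',
  'write.bulk_insert.shuffle_input' = 'true',  -- was write.bulk_insert.shuffle_by_partition
  'write.bulk_insert.sort_input' = 'true',     -- was write.bulk_insert.sort_by_partition
  'write.sort.memory' = '128'
);
```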
@@ -495,7 +493,9 @@ value as `earliest` if you want to consume all the history data set.
 
 :::note
 When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur.
+`clean.retain_commits`, data loss may occur. Compaction keeps the original instant time as the per-record metadata,
+so the streaming reader would read and then skip whole base files whose logs have already been consumed. For efficiency,
+option `read.streaming.skip_compaction` is still suggested to be `true`.
 :::
 
 ### Incremental Query
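
As a rough companion to the note above, a streaming read is typically configured through table options like the following (sketch only; the table definition and path are made up, and `read.start-commit` is assumed to be the start-commit option name in this release):

```sql
-- Hypothetical MERGE_ON_READ table used only to show the streaming-read options.
CREATE TABLE hudi_stream_source (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://namenode:8020/warehouse/hudi/hudi_stream_source',  -- placeholder path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.streaming.check-interval' = '4',       -- poll for new commits every 4 seconds
  'read.streaming.skip_compaction' = 'true',   -- avoid re-reading base files whose logs were already consumed
  'read.start-commit' = 'earliest'             -- assumed option name; 'earliest' consumes the full history
);

-- Consume the table as an unbounded changelog stream:
SELECT * FROM hudi_stream_source;
```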
diff --git a/website/versioned_docs/version-0.11.0/compaction.md b/website/versioned_docs/version-0.11.0/compaction.md
index fe679f4ac9..9d73e31bd5 100644
--- a/website/versioned_docs/version-0.11.0/compaction.md
+++ b/website/versioned_docs/version-0.11.0/compaction.md
@@ -135,4 +135,6 @@ Offline compaction needs to submit the Flink task on the command line. The progr
 | `--path` | `true` | `--` | The path where the target table is stored on Hudi |
 | `--compaction-max-memory` | `false` | `100` | The index map size of log data during compaction, 100 MB by default. If you have enough memory, you can increase this parameter |
 | `--schedule` | `false` | `false` | Whether to execute the operation of scheduling the compaction plan. When the write process is still writing, turning on this parameter has a risk of losing data. Therefore, it must be ensured that no write tasks are currently writing data to this table when this parameter is turned on |
-| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
\ No newline at end of file
+| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
+| `--service` | `false` | `false` | Whether to start a monitoring service that checks for and schedules new compaction tasks at the configured interval. |
+| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. |
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md b/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
index 6f2c80d5cf..2efa2aa416 100644
--- a/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
@@ -369,8 +369,6 @@ We recommend two ways for syncing CDC data into Hudi:
 
 :::note
 - If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly.
-- The MOR table can not handle DELETEs in event time sequence now, thus causing data loss. You better switch on the changelog mode through
-  option `changelog.enabled`.
 :::
 
 ### Bulk Insert
@@ -401,8 +399,8 @@ will rollover to the new file handle. Finally, `the number of files` >= [`write.
 |  -----------  | -------  | ------- | ------- |
 | `write.operation` | `true` | `upsert` | Setting as `bulk_insert` to open this function  |
 | `write.tasks`  |  `false`  | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
-| `write.bulk_insert.sort_by_partition` | `false`  | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
+| `write.bulk_insert.shuffle_input` | `false` | `true` | Whether to shuffle data according to the input field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
+| `write.bulk_insert.sort_input` | `false`  | `true` | Whether to sort data according to the input field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
 | `write.sort.memory` | `false` | `128` | Available managed memory of sort operator. default  `128` MB |
 
 ### Index Bootstrap
@@ -495,7 +493,9 @@ value as `earliest` if you want to consume all the history data set.
 
 :::note
 When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur.
+`clean.retain_commits`, data loss may occur. Compaction keeps the original instant time as the per-record metadata,
+so the streaming reader would read and then skip whole base files whose logs have already been consumed. For efficiency,
+option `read.streaming.skip_compaction` is still suggested to be `true`.
 :::
 
 ### Incremental Query