Posted to commits@hudi.apache.org by bh...@apache.org on 2021/11/24 17:48:02 UTC

[hudi] branch asf-site updated: updated clustering and compaction docs to note that --instant-time is no longer a required parameter; added the new kafka-connect-sink reference to the streaming ingestion page (#4010)

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new c25db93  updated clustering and compaction docs to note that --instant-time is no longer a required parameter; added the new kafka-connect-sink reference to the streaming ingestion page (#4010)
c25db93 is described below

commit c25db93e39585f0ca38422cb2c8662a8dfddea73
Author: Kyle Weller <ky...@gmail.com>
AuthorDate: Wed Nov 24 10:47:47 2021 -0700

    updated clustering and compaction docs to note that --instant-time is no longer a required parameter; added the new kafka-connect-sink reference to the streaming ingestion page (#4010)
---
 website/docs/clustering.md                  |   6 +-
 website/docs/compaction.md                  |  54 +++--
 website/docs/hoodie_deltastreamer.md        |   5 +
 website/versioned_docs/version-0.9.0/cli.md | 357 ++++++++++++++++++++++++++++
 4 files changed, 400 insertions(+), 22 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index bccba93..58c237f 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -161,11 +161,11 @@ Users can leverage [HoodieClusteringJob](https://cwiki.apache.org/confluence/dis
 to set up 2-step asynchronous clustering.
 
 ### HoodieClusteringJob
-With the release of Hudi version 0.9.0, we can schedule as well as execute clustering in the same step. We just need to
-specify the `—mode` or `-m` option. There are three modes:
+By specifying the `scheduleAndExecute` mode, you can schedule and execute clustering in a single step.
+The mode is specified with the `--mode` or `-m` option. There are three modes (a sketch invocation follows the list):
 
 1. `schedule`: Make a clustering plan. This gives an instant which can be passed in execute mode.
-2. `execute`: Execute a clustering plan at given instant which means --instant-time is required here.
+2. `execute`: Execute a clustering plan at a particular instant. If no `--instant-time` is specified, HoodieClusteringJob will execute the plan for the earliest instant on the Hudi timeline.
 3. `scheduleAndExecute`: Make a clustering plan first and execute that plan immediately.
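+
+As an illustration, a minimal `scheduleAndExecute` invocation might look like the following (a sketch: the jar path, properties file, and table values are placeholders to substitute):
+
+```properties
+spark-submit \
+--class org.apache.hudi.utilities.HoodieClusteringJob \
+/path/to/hudi-utilities-bundle.jar \
+--props /path/to/clusteringjob.properties \
+--mode scheduleAndExecute \
+--base-path <base_path> \
+--table-name <table_name> \
+--spark-memory 1g
+```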
 
 Note that to run this job while the original writer is still running, please enable multi-writing:
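+
+A typical set of multi-writer properties is sketched below (the ZooKeeper lock provider is one of several options; see the concurrency control docs for alternatives):
+
+```properties
+hoodie.write.concurrency.mode=optimistic_concurrency_control
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
+hoodie.cleaner.policy.failed.writes=LAZY
+```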
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index c3110aa..70fba50 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -5,11 +5,7 @@ toc: true
 last_modified_at:
 ---
 
-For Merge-On-Read table, data is stored using a combination of columnar (e.g parquet) + row based (e.g avro) file formats.
-Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or
-asynchronously. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
-Hence, it makes sense to run compaction asynchronously without blocking ingestion.
-
+Compaction is executed asynchronously with Hudi by default.
 
 ## Async Compaction
 
@@ -19,15 +15,13 @@ Async Compaction is performed in 2 steps:
    slices** to be compacted. A compaction plan is finally written to Hudi timeline.
 1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
 
-
 ## Deployment Models
 
 There are a few ways to execute compactions asynchronously.
 
 ### Spark Structured Streaming
 
-With 0.6.0, we now have support for running async compactions in Spark
-Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the
+Compactions are scheduled and executed asynchronously inside the
 streaming job.  Async Compactions are enabled by default for structured streaming jobs
 on Merge-On-Read tables.
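+
+For example, a minimal structured streaming write in Scala might look like the following (a sketch: the option keys are standard Hudi datasource write configs, while paths and field names are placeholders):
+
+```scala
+// df is a streaming DataFrame; async compaction is enabled by default for
+// Merge-On-Read tables, shown explicitly here for clarity.
+df.writeStream
+  .format("hudi")
+  .option("hoodie.table.name", "trips")
+  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.datasource.compaction.async.enable", "true")
+  .option("checkpointLocation", "/tmp/hudi_checkpoint")
+  .outputMode("append")
+  .start("/tmp/hudi_trips")
+```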
 
@@ -74,22 +68,44 @@ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
 --continuous
 ```
 
+### Hudi Compactor Utility
+Hudi provides a standalone tool to execute specific compactions asynchronously. Below is an example; you can read more in the [deployment guide](/docs/deployment#compactions).
+
+Example:
+```properties
+spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+--class org.apache.hudi.utilities.HoodieCompactor \
+--base-path <base_path> \
+--table-name <table_name> \
+--schema-file <schema_file> \
+--instant-time <compaction_instant>
+```
+
+Note that the `--instant-time` parameter is now optional for the Hudi Compactor Utility. If the utility is run without `--instant-time`,
+it will execute the earliest scheduled compaction on the Hudi timeline.
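+
+For example, the same utility with `--instant-time` omitted picks up the earliest scheduled compaction:
+
+```properties
+spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+--class org.apache.hudi.utilities.HoodieCompactor \
+--base-path <base_path> \
+--table-name <table_name> \
+--schema-file <schema_file>
+```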
+
 ### Hudi CLI
-Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example
+Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example; you can read more in the [CLI guide](/docs/cli#compactions).
 
+Example:
 ```properties
 hudi:trips->compaction run --tableName <table_name> --parallelism <parallelism> --compactionInstant <InstantTime>
 ...
 ```
 
-### Hudi Compactor Script
-Hudi provides a standalone tool to also execute specific compactions asynchronously. Below is an example and you can read more in the [deployment guide](/docs/next/deployment#compactions)
+## Synchronous Compaction
+By default, compaction is run asynchronously.
 
-```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
---class org.apache.hudi.utilities.HoodieCompactor \
---base-path <base_path> \
---table-name <table_name> \
---instant-time <compaction_instant> \
---schema-file <schema_file>
-```
+If ingestion latency is important to you, you are most likely using Merge-On-Read tables.
+Merge-On-Read tables store data using a combination of columnar (e.g. parquet) and row-based (e.g. avro) file formats.
+Updates are logged to delta files and later compacted to produce new versions of columnar files.
+To keep ingestion latency low, Async Compaction is the default configuration.
+
+If immediate read performance of a new commit is important to you, or you want the simplicity of not managing separate compaction jobs,
+you may want Synchronous Compaction, which means that as a commit is written it is also compacted by the same job.
+
+Compaction is run synchronously by passing the flag `--disable-compaction` (meaning async compaction scheduling is disabled).
+When both ingestion and compaction run in the same Spark context, you can use resource allocation configuration
+in the DeltaStreamer CLI, such as `--delta-sync-scheduling-weight`,
+`--compact-scheduling-weight`, `--delta-sync-scheduling-minshare`, and `--compact-scheduling-minshare`,
+to control executor allocation between ingestion and compaction.
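+
+As a sketch, a synchronous-compaction DeltaStreamer launch might look like the following (the class name is the standard DeltaStreamer entry point; other required options are elided with `...`):
+
+```properties
+spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+--table-type MERGE_ON_READ \
+... \
+--disable-compaction
+```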
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index a97f1cb..c147500 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -209,3 +209,8 @@ A deltastreamer job can then be triggered as follows:
 ```
 
 Read more in depth about concurrency control in the [concurrency control concepts](/docs/concurrency_control) section
+
+## Hudi Kafka Connect Sink
+If you want to perform streaming ingestion into Hudi format, similar to HoodieDeltaStreamer, but don't want to depend on Spark,
+try out the new experimental release of the Hudi Kafka Connect Sink. Read the [ReadMe](https://github.com/apache/hudi/tree/master/hudi-kafka-connect)
+for full documentation.
\ No newline at end of file
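+
+A minimal sink configuration sketch is shown below. `connector.class`, `tasks.max`, and `topics` are standard Kafka Connect keys; the Hudi-specific property names here are illustrative assumptions, so consult the ReadMe for the exact keys:
+
+```properties
+name=hudi-sink
+connector.class=org.apache.hudi.connect.HoodieSinkConnector
+tasks.max=1
+topics=hudi-test-topic
+# illustrative placeholders; see the hudi-kafka-connect ReadMe for exact names
+target.base.path=file:///tmp/hoodie/hudi-test-topic
+target.table.name=hudi-test-topic
+```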
diff --git a/website/versioned_docs/version-0.9.0/cli.md b/website/versioned_docs/version-0.9.0/cli.md
new file mode 100644
index 0000000..64f0c1e
--- /dev/null
+++ b/website/versioned_docs/version-0.9.0/cli.md
@@ -0,0 +1,357 @@
+---
+title: CLI
+keywords: [hudi, cli]
+last_modified_at: 2021-08-18T15:59:57-04:00
+---
+
+Once hudi has been built, the shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the `basePath`, and
+we need this location in order to connect to a Hudi table. The Hudi library manages this table internally, using the `.hoodie` subfolder to track all metadata.
+
+To initialize a hudi table, use the following command.
+
+```java
+===================================================================
+*         ___                          ___                        *
+*        /\__\          ___           /\  \           ___         *
+*       / /  /         /\__\         /  \  \         /\  \        *
+*      / /__/         / /  /        / /\ \  \        \ \  \       *
+*     /  \  \ ___    / /  /        / /  \ \__\       /  \__\      *
+*    / /\ \  /\__\  / /__/  ___   / /__/ \ |__|     / /\/__/      *
+*    \/  \ \/ /  /  \ \  \ /\__\  \ \  \ / /  /  /\/ /  /         *
+*         \  /  /    \ \  / /  /   \ \  / /  /   \  /__/          *
+*         / /  /      \ \/ /  /     \ \/ /  /     \ \__\          *
+*        / /  /        \  /  /       \  /  /       \/__/          *
+*        \/__/          \/__/         \/__/    Apache Hudi CLI    *
+*                                                                 *
+===================================================================
+
+hudi->create --path /user/hive/warehouse/table1 --tableName hoodie_table_1 --tableType COPY_ON_WRITE
+.....
+```
+
+To see the description of hudi table, use the command:
+
+```java
+hudi:hoodie_table_1->desc
+18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
+    _________________________________________________________
+    | Property                | Value                        |
+    |========================================================|
+    | basePath                | ...                          |
+    | metaPath                | ...                          |
+    | fileSystem              | hdfs                         |
+    | hoodie.table.name       | hoodie_table_1               |
+    | hoodie.table.type       | COPY_ON_WRITE                |
+    | hoodie.archivelog.folder|                              |
+```
+
+Following is a sample command to connect to a Hudi table that contains uber trips.
+
+```java
+hudi:trips->connect --path /app/uber/trips
+
+16/10/05 23:20:37 INFO model.HoodieTableMetadata: All commits :HoodieCommits{commitList=[20161002045850, 20161002052915, 20161002055918, 20161002065317, 20161002075932, 20161002082904, 20161002085949, 20161002092936, 20161002105903, 20161002112938, 20161002123005, 20161002133002, 20161002155940, 20161002165924, 20161002172907, 20161002175905, 20161002190016, 20161002192954, 20161002195925, 20161002205935, 20161002215928, 20161002222938, 20161002225915, 20161002232906, 20161003003028, 201 [...]
+Metadata for table trips loaded
+```
+
+Once connected to the table, many other commands become available. The shell has contextual autocomplete help (press TAB). Below is a list of all commands, a few of which
+are reviewed in this section.
+
+```java
+hudi:trips->help
+* ! - Allows execution of operating system (OS) commands
+* // - Inline comment markers (start of line only)
+* ; - Inline comment markers (start of line only)
+* addpartitionmeta - Add partition metadata to a table, if not present
+* clear - Clears the console
+* cls - Clears the console
+* commit rollback - Rollback a commit
+* commits compare - Compare commits with another Hoodie table
+* commit showfiles - Show file level details of a commit
+* commit showpartitions - Show partition level details of a commit
+* commits refresh - Refresh the commits
+* commits show - Show the commits
+* commits sync - Compare commits with another Hoodie table
+* connect - Connect to a hoodie table
+* date - Displays the local date and time
+* exit - Exits the shell
+* help - List all commands usage
+* quit - Exits the shell
+* records deduplicate - De-duplicate a partition path contains duplicates & produce repaired files to replace with
+* script - Parses the specified resource file and executes its commands
+* stats filesizes - File Sizes. Display summary stats on sizes of files
+* stats wa - Write Amplification. Ratio of how many records were upserted to how many records were actually written
+* sync validate - Validate the sync by counting the number of records
+* system properties - Shows the shell's properties
+* utils loadClass - Load a class
+* version - Displays shell version
+
+hudi:trips->
+```
+
+
+### Inspecting Commits
+
+The task of upserting or inserting a batch of incoming records is known as a **commit** in Hudi. A commit provides basic atomicity guarantees such that only committed data is available for querying.
+Each commit has a monotonically increasing string/number called the **commit number**. Typically, this is the time at which we started the commit.
+
+To view some basic information about the last 10 commits,
+
+
+```java
+hudi:trips->commits show --sortBy "Total Bytes Written" --desc true --limit 10
+    ________________________________________________________________________________________________________________________________________________________________________
+    | CommitTime    | Total Bytes Written| Total Files Added| Total Files Updated| Total Partitions Written| Total Records Written| Total Update Records Written| Total Errors|
+    |=======================================================================================================================================================================|
+    ....
+    ....
+    ....
+```
+
+At the start of each write, Hudi also writes a .inflight commit to the .hoodie folder. You can use the timestamp there to estimate how long the commit has been inflight.
+
+
+```java
+$ hdfs dfs -ls /app/uber/trips/.hoodie/*.inflight
+-rw-r--r--   3 vinoth supergroup     321984 2016-10-05 23:18 /app/uber/trips/.hoodie/20161005225920.inflight
+```
+
+
+### Drilling Down to a specific Commit
+
+To understand how the writes spread across specific partitions,
+
+
+```java
+hudi:trips->commit showpartitions --commit 20161005165855 --sortBy "Total Bytes Written" --desc true --limit 10
+    __________________________________________________________________________________________________________________________________________
+    | Partition Path| Total Files Added| Total Files Updated| Total Records Inserted| Total Records Updated| Total Bytes Written| Total Errors|
+    |=========================================================================================================================================|
+     ....
+     ....
+```
+
+If you need file-level granularity, use the following:
+
+
+```java
+hudi:trips->commit showfiles --commit 20161005165855 --sortBy "Partition Path"
+    ________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File ID                             | Previous Commit| Total Records Updated| Total Records Written| Total Bytes Written| Total Errors|
+    |=======================================================================================================================================================|
+    ....
+    ....
+```
+
+
+### FileSystem View
+
+Hudi views each partition as a collection of file-groups, with each file-group containing a list of file-slices in commit order (see concepts).
+The commands below allow users to view the file-slices for a dataset.
+
+```java
+hudi:stock_ticks_mor->show fsview all
+ ....
+  _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta File Size| Delta Files |
+ |==============================================================================================================================================================================================================================================================================================================================================================================================================|
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|
+
+
+
+hudi:stock_ticks_mor->show fsview latest --partitionPath "2018/08/31"
+ ......
+ ___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________ [...]
+ | Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size - compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta Files - compaction unscheduled|
+ |========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================== [...]
+ | 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759| hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet| 432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile {hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]| [] |
+
+```
+
+
+### Statistics
+
+Since Hudi directly manages file sizes for tables on DFS, it is useful to get an overall picture:
+
+
+```java
+hudi:trips->stats filesizes --partitionPath 2016/09/01 --sortBy "95th" --desc true --limit 10
+    ________________________________________________________________________________________________
+    | CommitTime    | Min     | 10th    | 50th    | avg     | 95th    | Max     | NumFiles| StdDev  |
+    |===============================================================================================|
+    | <COMMIT_ID>   | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 93.9 MB | 2       | 2.3 KB  |
+    ....
+    ....
+```
+
+If a Hudi write starts taking much longer than usual, it is useful to check the write amplification for any sudden increases:
+
+
+```java
+hudi:trips->stats wa
+    __________________________________________________________________________
+    | CommitTime    | Total Upserted| Total Written| Write Amplification Factor|
+    |=========================================================================|
+    ....
+    ....
+```
+
+
+### Archived Commits
+
+In order to limit the growth of .commit files on DFS, Hudi archives older .commit files (subject to the cleaner policy) into a commits.archived file.
+This is a sequence file that contains a mapping from commitNumber => json, with raw information about the commit (the same information that is nicely rolled up above).
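+
+Since it is a sequence file, you can peek at its contents with standard Hadoop tooling, for example (the path pattern is hypothetical, following the table layout used above):
+
+```java
+$ hadoop fs -text /app/uber/trips/.hoodie/commits.archived* | head
+```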
+
+
+### Compactions
+
+To get an idea of the lag between compaction and writer applications, use the command below to list all
+pending compactions.
+
+```java
+hudi:trips->compactions show all
+     ___________________________________________________________________
+    | Compaction Instant Time| State    | Total FileIds to be Compacted|
+    |==================================================================|
+    | <INSTANT_1>            | REQUESTED| 35                           |
+    | <INSTANT_2>            | INFLIGHT | 27                           |
+```
+
+To inspect a specific compaction plan, use
+
+```java
+hudi:trips->compaction show --instant <INSTANT_1>
+    _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | Partition Path| File Id | Base Instant  | Data File Path                                    | Total Delta Files| getMetrics                                                                                                                    |
+    |================================================================================================================================================================================================================================================
+    | 2018/07/17    | <UUID>  | <INSTANT_1>   | viewfs://ns-default/.../../UUID_<INSTANT>.parquet | 1                | {TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=1230.0, TOTAL_LOG_FILES_SIZE=2.51255751E8, TOTAL_IO_WRITE_MB=991.0, TOTAL_IO_MB=2221.0}|
+
+```
+
+To manually schedule or run a compaction, use the commands below. These commands use the Spark launcher to perform compaction
+operations.
+
+**NOTE:** Make sure no other application is scheduling compaction for this table concurrently
+{: .notice--info}
+
+```java
+hudi:trips->help compaction schedule
+Keyword:                   compaction schedule
+Description:               Schedule Compaction
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              false
+   Default if specified:   '__NULL__'
+   Default if unspecified: '1G'
+
+* compaction schedule - Schedule Compaction
+```
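+
+For example, using the optional `sparkMemory` keyword shown above, a minimal scheduling call is:
+
+```java
+hudi:trips->compaction schedule --sparkMemory 1G
+```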
+
+```java
+hudi:trips->help compaction run
+Keyword:                   compaction run
+Description:               Run Compaction for given instant time
+ Keyword:                  tableName
+   Help:                   Table name
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  parallelism
+   Help:                   Parallelism for hoodie compaction
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  schemaFilePath
+   Help:                   Path for Avro schema file
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  sparkMemory
+   Help:                   Spark executor memory
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  retry
+   Help:                   Number of retries
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+ Keyword:                  compactionInstant
+   Help:                   Base path for the target hoodie table
+   Mandatory:              true
+   Default if specified:   '__NULL__'
+   Default if unspecified: '__NULL__'
+
+* compaction run - Run Compaction for given instant time
+```
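+
+Putting the mandatory keywords from the help output together, an invocation might look like the following (all values are placeholders):
+
+```java
+hudi:trips->compaction run --tableName trips --parallelism 100 --schemaFilePath /path/to/schema.avsc --sparkMemory 4G --retry 1 --compactionInstant <INSTANT_1>
+```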
+
+### Validate Compaction
+
+Validating a compaction plan: check that all the files necessary for compaction are present and valid.
+
+```java
+hudi:stock_ticks_mor->compaction validate --instant 20181005222611
+...
+
+   COMPACTION PLAN VALID
+
+    ___________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error|
+    |==========================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | true |      |
+
+
+
+hudi:stock_ticks_mor->compaction validate --instant 20181005222601
+
+   COMPACTION PLAN INVALID
+
+    _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+    | File Id                             | Base Instant Time| Base Data File                                                                                                                   | Num Delta Files| Valid| Error                                                                           |
+    |=====================================================================================================================================================================================================================================================================================================|
+    | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet| 1              | false| All log files specified in compaction operation is not present. Missing ....    |
+```
+
+**NOTE:** The following commands must be executed without any other writer/ingestion application running.
+{: .notice--warning}
+
+Sometimes, it becomes necessary to remove a fileId from a compaction plan in order to speed up or unblock a compaction
+operation. Any new log files written to this file after the compaction was scheduled will be safely renamed
+so that they are preserved. Hudi provides the following CLI command to support this:
+
+
+### Unscheduling Compaction
+
+```java
+hudi:trips->compaction unscheduleFileId --fileId <FileUUID>
+....
+No File renames needed to unschedule file from pending compaction. Operation successful.
+```
+
+In other cases, an entire compaction plan needs to be reverted. This is supported by the following CLI command:
+
+```java
+hudi:trips->compaction unschedule --compactionInstant <compactionInstant>
+.....
+No File renames needed to unschedule pending compaction. Operation successful.
+```
+
+### Repair Compaction
+
+The compaction unscheduling operations above can sometimes fail partially (e.g. DFS temporarily unavailable). With
+partial failures, the compaction operation can become inconsistent with the state of the file-slices. Running
+`compaction validate` will flag any invalid compaction operations. In these cases, the repair
+command comes to the rescue: it rearranges the file-slices so that there is no data loss and the file-slices are
+consistent with the compaction plan:
+
+```java
+hudi:stock_ticks_mor->compaction repair --instant 20181005222611
+......
+Compaction successfully repaired
+.....
+```