Posted to commits@hudi.apache.org by xu...@apache.org on 2022/04/18 03:32:57 UTC

[hudi] branch asf-site updated: [HUDI-3779] Update metadata table docs (#5332)

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new d153f24989 [HUDI-3779] Update metadata table docs (#5332)
d153f24989 is described below

commit d153f24989508dd87ba026b3813ccdedd8214a5d
Author: Y Ethan Guo <et...@gmail.com>
AuthorDate: Sun Apr 17 20:32:52 2022 -0700

    [HUDI-3779] Update metadata table docs (#5332)
---
 website/docs/metadata.md | 81 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 11 deletions(-)

diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index ec4b2eeb96..662f24c2a5 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -23,17 +23,76 @@ Running a TPCDS benchmark the p50 list latencies for a single folder scales ~lin
 Whereas listings from the Metadata Table will not scale linearly with file/object count and instead take about 100-500ms per read even for very large tables.
 Even better, the timeline server caches portions of the metadata (currently only for writers), and provides ~10ms performance for listings.
 
-## Enable Hudi Metadata Table
-The Hudi Metadata Table is not enabled by default. If you wish to turn it on you need to enable the following configuration:
+### Supporting Multi-Modal Index
 
-[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable)
+The multi-modal index can drastically improve lookup performance in the file index and reduce query latency through
+data skipping.  The bloom filter index, containing file-level bloom filters, facilitates key lookups and file pruning.
+The column stats index, containing statistics of all columns, improves file pruning based on key and column value
+ranges in both the writer and the reader, for example during query planning in Spark.  The multi-modal index is
+implemented as independent partitions in the metadata table, each containing one type of index.
+
+## Enable Hudi Metadata Table and Multi-Modal Index
+In 0.11.0, the metadata table with synchronous updates and metadata-table-based file listing are enabled by default.
+There are prerequisite configurations and steps described in [Deployment considerations](#deployment-considerations) for
+safely using this feature.  The metadata table and related file listing functionality can still be turned off by setting
+[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to `false`.  For 0.10.1 and prior releases, the
+metadata table is disabled by default, and you can turn it on by setting the same config to `true`.
+
+If you turn off the metadata table after enabling it, be sure to wait for a few commits so that the metadata table is
+fully cleaned up before re-enabling it.
+
+The multi-modal index was introduced in the 0.11.0 release and is disabled by default.  When the metadata table is
+enabled, you can enable the bloom filter index by setting `hoodie.metadata.index.bloom.filter.enable` to `true` and the
+column stats index by setting `hoodie.metadata.index.column.stats.enable` to `true`.  As of 0.11.0, data skipping to
+improve queries in Spark relies on the column stats index in the metadata table.  Enabling the metadata table and the
+column stats index is a prerequisite for enabling data skipping with `hoodie.enable.data.skipping`.
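+
+As an illustrative example, the configurations below (a sketch; adjust to your setup) enable the metadata table, both
+multi-modal indexes, and data skipping together:
+
+```properties
+# Metadata table (enabled by default in 0.11.0)
+hoodie.metadata.enable=true
+# Multi-modal indexes inside the metadata table
+hoodie.metadata.index.bloom.filter.enable=true
+hoodie.metadata.index.column.stats.enable=true
+# Data skipping for Spark queries, which relies on the column stats index
+hoodie.enable.data.skipping=true
+```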
 
 ## Deployment considerations
-Once you turn on the Hudi Metadata Table, ensure that all write and read operations enable the configuration above to 
-ensure the Metadata Table stays up to date.
-
-:::note
-If your current deployment model is single writer along with async table services (such as cleaning, clustering, compaction) 
-configured, then it is a must to have [lock providers configured](/docs/next/concurrency_control#enabling-multi-writing) 
-before turning on the metadata table.
-:::
\ No newline at end of file
+To ensure the metadata table stays up to date, all write operations on the same Hudi table need additional
+configurations besides the above, depending on the deployment model.  Before enabling the metadata table, all writers
+on the same table must be stopped.
+
+### Deployment Model A: Single writer with inline table services
+
+If your current deployment model is a single writer with all table services (cleaning, clustering, compaction) configured
+to run inline, such as Deltastreamer sync-once mode or Spark Datasource with default configs, no additional configuration
+is required.  After setting [`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to `true`, restarting
+the single writer is sufficient to safely enable the metadata table.
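+
+For reference, a minimal setup for this model (a sketch) needs only:
+
+```properties
+# Default in 0.11.0; must be set explicitly on 0.10.1 and earlier
+hoodie.metadata.enable=true
+```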
+
+### Deployment Model B: Single writer with async table services
+
+If your current deployment model is a single writer with async table services (such as cleaning, clustering, compaction)
+running in the same process, such as Deltastreamer continuous mode writing a MOR table, Spark streaming (where compaction
+is async by default), or your own job setup running async table services inside the same writer, you must configure
+optimistic concurrency control, a lock provider, and the lazy failed-writes clean policy as follows before enabling the
+metadata table.  This guarantees the proper behavior of
+[optimistic concurrency control](/docs/concurrency_control#enabling-multi-writing) when the metadata table is enabled.
+Failing to follow this configuration guide can lead to data loss.  Note that these configurations are required only if
+the metadata table is enabled in this deployment model.
+
+```properties
+hoodie.write.concurrency.mode=optimistic_concurrency_control
+hoodie.cleaner.policy.failed.writes=LAZY
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
+```
+
+If multiple writers in different processes are present, including one writer with async table services, please refer to
+[Deployment Model C: Multi-writer](#deployment-model-c-multi-writer) for configs, with the difference of using a
+distributed lock provider.  Note that running a separate compaction (`HoodieCompactor`) or clustering (`HoodieClusteringJob`)
+job apart from the ingestion writer counts as a multi-writer deployment, as the jobs do not run in the same process and
+therefore cannot rely on the in-process lock provider.
+
+### Deployment Model C: Multi-writer
+
+If your current deployment model is multi-writer with a lock provider and the other required configs below set for every
+writer, no additional configuration is required.  After stopping the writers to enable the metadata table, you can bring
+them back up sequentially.  Applying the proper configurations to only some of the writers can lead to data loss from
+the inconsistently configured writers, so ensure the metadata table is enabled across all writers.
+
+```properties
+hoodie.write.concurrency.mode=optimistic_concurrency_control
+hoodie.cleaner.policy.failed.writes=LAZY
+hoodie.write.lock.provider=<distributed-lock-provider-classname>
+```
+
+Note that there are three different distributed [lock providers available](/docs/concurrency_control#enabling-multi-writing)
+to choose from: `ZookeeperBasedLockProvider`, `HiveMetastoreBasedLockProvider`, and `DynamoDBBasedLockProvider`.
\ No newline at end of file
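+
+For example, a Zookeeper-based lock provider setup might look like the following (the hostnames, key, and path are
+placeholders; adjust them to your environment):
+
+```properties
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
+hoodie.write.lock.zookeeper.url=zk-host1,zk-host2,zk-host3
+hoodie.write.lock.zookeeper.port=2181
+hoodie.write.lock.zookeeper.lock_key=my_table
+hoodie.write.lock.zookeeper.base_path=/hudi/locks
+```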