Posted to commits@hudi.apache.org by yi...@apache.org on 2022/09/24 01:24:04 UTC

[hudi] branch asf-site updated: [HUDI-4884] Fixing faq for indexes with hudi and fixing docker demo for hive sync (#6730)

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 32b1d361ed [HUDI-4884] Fixing faq for indexes with hudi and fixing docker demo for hive sync (#6730)
32b1d361ed is described below

commit 32b1d361edc87dff7695557c45d3b0aef1b468b6
Author: Sivabalan Narayanan <n....@gmail.com>
AuthorDate: Fri Sep 23 18:23:59 2022 -0700

    [HUDI-4884] Fixing faq for indexes with hudi and fixing docker demo for hive sync (#6730)
    
    Co-authored-by: Y Ethan Guo <et...@gmail.com>
---
 website/docs/docker_demo.md                  |  6 ++++--
 website/docs/faq.md                          |  8 ++++++--
 website/versioned_docs/version-0.12.0/faq.md | 10 +++++++---
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index 4a390506c3..698aec5439 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -247,7 +247,8 @@ docker exec -it adhoc-2 /bin/bash
   --partitioned-by dt \
   --base-path /user/hive/warehouse/stock_ticks_cow \
   --database default \
-  --table stock_ticks_cow
+  --table stock_ticks_cow \
+  --partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
 .....
 2020-01-25 19:51:28,953 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_cow
 .....
@@ -260,7 +261,8 @@ docker exec -it adhoc-2 /bin/bash
   --partitioned-by dt \
   --base-path /user/hive/warehouse/stock_ticks_mor \
   --database default \
-  --table stock_ticks_mor
+  --table stock_ticks_mor \
+  --partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
 ...
 2020-01-25 19:51:51,066 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(129)) - Sync complete for stock_ticks_mor_ro
 ...
diff --git a/website/docs/faq.md b/website/docs/faq.md
index 40b2c0fca6..00ef95c2ec 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is us [...]
+ - *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
+ - *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is [...]
  - *HBaseIndex* : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
+ - *HoodieSimpleIndex (default)* : A simple index which reads interested fields (record key and partition path) from base files and joins with incoming records to find the tagged location.
+ - *HoodieGlobalSimpleIndex* : Global version of Simple Index, where in uniqueness is on record key across entire table. 
+ - *HoodieBucketIndex* : Each partition has statically defined buckets to which records are tagged with. Since locations are tagged via hashing mechanism, this index lookup will be very efficient. 
+ - *HoodieSparkConsistentBucketIndex* : This is also similar to Bucket Index. Only difference is that, data skews can be tackled by dynamically changing the bucket number.  
 
 You can implement your own index if you'd like, by subclassing the `HoodieIndex` class and configuring the index class name in configs. 
 
diff --git a/website/versioned_docs/version-0.12.0/faq.md b/website/versioned_docs/version-0.12.0/faq.md
index c06c58a612..43e80aea1e 100644
--- a/website/versioned_docs/version-0.12.0/faq.md
+++ b/website/versioned_docs/version-0.12.0/faq.md
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is us [...]
- - *HBaseIndex* : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
+- *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
+- *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is  [...]
+- *HBaseIndex* : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
+- *HoodieSimpleIndex (default)* : A simple index which reads interested fields (record key and partition path) from base files and joins with incoming records to find the tagged location.
+- *HoodieGlobalSimpleIndex* : Global version of Simple Index, where in uniqueness is on record key across entire table.
+- *HoodieBucketIndex* : Each partition has statically defined buckets to which records are tagged with. Since locations are tagged via hashing mechanism, this index lookup will be very efficient.
+- *HoodieSparkConsistentBucketIndex* : This is also similar to Bucket Index. Only difference is that, data skews can be tackled by dynamically changing the bucket number.
 
 You can implement your own index if you'd like, by subclassing the `HoodieIndex` class and configuring the index class name in configs.
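
Note on the HoodieBucketIndex entry added above: the FAQ says locations are tagged via a hashing mechanism, which is why the lookup needs no scan of existing files. Below is a minimal sketch of that idea only, not Hudi's implementation. The function names `bucket_for` and `tag_locations` are hypothetical, and MD5 is an assumption chosen purely to get a deterministic hash; Hudi's actual hash function and bucket layout may differ.

```python
import hashlib


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket id in [0, num_buckets).

    Hypothetical helper: MD5 is used only because it is deterministic
    across processes; it is not claimed to be what Hudi uses.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it
    # modulo the statically configured bucket count.
    return int.from_bytes(digest[:8], "big") % num_buckets


def tag_locations(record_keys, num_buckets):
    """Tag every incoming key with its bucket id, without reading any files."""
    return {key: bucket_for(key, num_buckets) for key in record_keys}


if __name__ == "__main__":
    # The same key always lands in the same bucket for a fixed bucket count.
    print(tag_locations(["key-1", "key-2", "key-1"], num_buckets=8))
```

Because the bucket id depends only on the key and the bucket count, changing the bucket count remaps keys. That is the limitation the FAQ's HoodieSparkConsistentBucketIndex entry refers to when it mentions changing the bucket number dynamically to handle data skew.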