Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/24 01:12:10 UTC

[GitHub] [hudi] yihua commented on a diff in pull request #6730: [HUDI-4884] Fixing faq for indexes with hudi and fixing docker demo for hive sync

yihua commented on code in PR #6730:
URL: https://github.com/apache/hudi/pull/6730#discussion_r979134260


##########
website/docs/faq.md:
##########
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
+ - *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
+ - *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
  - *HBaseIndex* : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
+ - *HoodieSimpleIndex (default)* : A simple index which reads interested fields(record key and partition path) from base files and joins with incoming records to find the tagged location.

Review Comment:
   nit: `interested fields(record key` -> `interested fields (record key` 



##########
website/docs/faq.md:
##########
@@ -245,9 +245,13 @@ The indexing component is a key part of the Hudi writing and it maps a given rec
 
 Hudi supports a few options for indexing as below
 
- - *HoodieBloomIndex (default)* : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
- - *HoodieGlobalBloomIndex* : The default indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
+ - *HoodieBloomIndex * : Uses a bloom filter and ranges information placed in the footer of parquet/base files (and soon log files as well)
+ - *HoodieGlobalBloomIndex* : The non global indexing only enforces uniqueness of a key inside a single partition i.e the user is expected to know the partition under which a given record key is stored. This helps the indexing scale very well for even [very large datasets](https://eng.uber.com/uber-big-data-platform/). However, in some cases, it might be necessary instead to do the de-duping/enforce uniqueness across all partitions and the global bloom index does exactly that. If this is used, incoming records are compared to files across the entire dataset and ensure a recordKey is only present in one partition.
  - *HBaseIndex* : Apache HBase is a key value store, typically found in close proximity to HDFS. You can also store the index inside HBase, which could be handy if you are already operating HBase.
+ - *HoodieSimpleIndex (default)* : A simple index which reads interested fields(record key and partition path) from base files and joins with incoming records to find the tagged location.
+ - *HoodieGlobalSimpleIndex* : Global version of Simple Index, where in uniqueness is on record key across entire table. 
+ - *HoodieBucketIndex* : Each partition has statically defined buckets to which records are tagged with. Since locations are tagged via hashing mechanism, this index look up will be very efficient. 

Review Comment:
   nit: `look up` -> `lookup`
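
For readers unfamiliar with why a hash-based lookup is cheap, the idea in the `HoodieBucketIndex` bullet above can be sketched roughly as follows. This is an illustrative sketch only, not Hudi's implementation: the function names, the bucket count of 8, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical, fixed per partition for this sketch


def bucket_for(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket id using a stable hash."""
    # zlib.crc32 is deterministic across processes, unlike the built-in
    # hash() on strings, which is randomized per interpreter run.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def tag_records(records, num_buckets: int = NUM_BUCKETS):
    """Group (partition_path, record_key) pairs by their tagged location.

    The location is computed from the key alone, so no existing file
    has to be read or joined against to find it.
    """
    tagged = {}
    for partition_path, record_key in records:
        location = (partition_path, bucket_for(record_key, num_buckets))
        tagged.setdefault(location, []).append(record_key)
    return tagged


if __name__ == "__main__":
    incoming = [("2022/09/24", "key-1"), ("2022/09/24", "key-2"), ("2022/09/23", "key-1")]
    for location, keys in tag_records(incoming).items():
        print(location, keys)
```

Because the location is a pure function of the key, tagging costs one hash per record, whereas the simple index described in the same list has to read key fields from base files and join them with the incoming records.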



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.