Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/11 12:49:34 UTC

[GitHub] [hudi] nsivabalan opened a new pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

nsivabalan opened a new pull request #2245:
URL: https://github.com/apache/hudi/pull/2245


   
   ## What is the purpose of the pull request
   
   Adding Hudi indexing mechanisms blog
   
   ## Brief change log
   
     - Adding Hudi indexing mechanisms blog
   
   ## Verify this pull request
   
   WIP PR. Will have to verify blog rendering once outline is agreed
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] vinothchandar commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r523850991



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 

Review comment:
       Please use "Apache Hudi" everywhere :)

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 

Review comment:
       Please add more motivation on why this is important from a use-case perspective. For example, an upstream database may be updated in random ways, and the downstream Hudi table needs to absorb those updates well.
   
   

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 

Review comment:
       It's also pluggable; we should mention that.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory

Review comment:
       This is not worth mentioning; it's just a test implementation.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 
+
+You could use “hoodie.index.type” to choose any of these indices. 
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different indexing schemes to cater to the needs of different workloads. So depending on one’s use-case, indexing schema can be chosen. 
+
+For eg: ……. 
+To Be filled
+
+Let's take a brief look at each of these indices.
+
+## 2. InMemory
+Stores an in memory hashmap of records to location mapping. Intended to be used for local testing. 
+
+## 3. Bloom
+Leverages bloom index stored with data files to find the location for the incoming records. This is the most commonly used Index in Hudi and is the default one. On a high level, this does a range pruning followed by bloom look up. So, if the record keys are laid out such that it follows some type of ordering like timestamps, then this will essentially cut down a lot of files to be looked up as bloom would have filtered out most of the files. But Range pruning is optional depending on your use-case. If your write batch is such that the records have no ordering in them (e.g uuid), but the pattern is such that mostly the recent partitions are updated with a long tail of updates/deletes to the older partitions, then still bloom index will be faster. But better to turn off range pruning as it just incurs the cost of checking w/o much benefit. 
+
+For instance, consider a list of file slices in a partition
+
+F1 : key_t0 to key_t10000
+F2 : key_t10001 to key_t20000
+F3 : key_t20001 to key_t30000
+F4 : key_t30001 to key_t40000
+F5 : key_t40001 to key_t50000
+
+So, when looking up records ranging from key_t25000 to key_t28000, bloom will filter every file slice except F3 with range pruning. 
+
+Here is a high level pseudocode used for this bloom:

Review comment:
       This is more like a list of steps than pseudocode.
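
For reference, the range-pruning step described in the quoted draft can be sketched in Python roughly as follows. The file names and key ranges come from the F1 to F5 example in the draft; `FileRange`, `prune_by_range`, and the use of plain integers in place of the `key_tN` strings are assumptions made for illustration and are not Hudi's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileRange:
    """Hypothetical stand-in for the min/max record key info kept per file."""
    name: str
    min_key: int
    max_key: int


def prune_by_range(files: List[FileRange], lo: int, hi: int) -> List[str]:
    """Return the names of files whose [min_key, max_key] overlaps [lo, hi]."""
    return [f.name for f in files if f.min_key <= hi and f.max_key >= lo]


# Key ranges from the draft's example, with key_tN simplified to the integer N.
files = [
    FileRange("F1", 0, 10000),
    FileRange("F2", 10001, 20000),
    FileRange("F3", 20001, 30000),
    FileRange("F4", 30001, 40000),
    FileRange("F5", 40001, 50000),
]

# Only files that survive pruning would need a bloom filter look-up.
candidates = prune_by_range(files, 25000, 28000)
print(candidates)
```

With keys that have no ordering (for example UUIDs), most file ranges would overlap most incoming keys, which is why the draft suggests turning range pruning off in that case.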





[GitHub] [hudi] garyli1019 commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r528295271



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 
+
+You could use “hoodie.index.type” to choose any of these indices. 
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different indexing schemes to cater to the needs of different workloads. So depending on one’s use-case, indexing schema can be chosen. 
+
+For eg: ……. 
+To Be filled
+
+Let's take a brief look at each of these indices.
+
+## 2. InMemory
+Stores an in memory hashmap of records to location mapping. Intended to be used for local testing. 
+
+## 3. Bloom
+Leverages bloom index stored with data files to find the location for the incoming records. This is the most commonly used Index in Hudi and is the default one. On a high level, this does a range pruning followed by bloom look up. So, if the record keys are laid out such that it follows some type of ordering like timestamps, then this will essentially cut down a lot of files to be looked up as bloom would have filtered out most of the files. But Range pruning is optional depending on your use-case. If your write batch is such that the records have no ordering in them (e.g uuid), but the pattern is such that mostly the recent partitions are updated with a long tail of updates/deletes to the older partitions, then still bloom index will be faster. But better to turn off range pruning as it just incurs the cost of checking w/o much benefit. 
+
+For instance, consider a list of file slices in a partition
+
+F1 : key_t0 to key_t10000
+F2 : key_t10001 to key_t20000
+F3 : key_t20001 to key_t30000
+F4 : key_t30001 to key_t40000
+F5 : key_t40001 to key_t50000
+
+So, when looking up records ranging from key_t25000 to key_t28000, bloom will filter every file slice except F3 with range pruning. 
+
+Here is a high level pseudocode used for this bloom:
+
+- Fetch interested partitions from incoming records
+- Load all file info (range info) for every partition. So, we have Map of <partition -> List<FileInfo> >
+- Find all file -> hoodie key pairs to be looked up.
+// For every <partition, record key> pairs, use index File filter to filter interested files. Index file filter will leverage file range info and trim down the files to be looked up. Hoodie has a tree map like structure for efficient index file filtering. 
+- Sort <file, hoodie key> pairs. 
+- Load each file and look up mapped keys to find the exact location for the record keys. 
+- Tag back location to incoming records. // this step is required for those newly inserted records in the incoming batch. 
+
+As you could see, first range pruning is done to cut down on files to be looked up. Following which actual bloom look up is done. By default this is the index type chosen. 
+
+## 4. Simple Index
+For a decent sized dataset, Simple index comes in handy. In the bloom index discussed above, hoodie reads the file twice. Once to load the file range info and again to load the bloom filter. So, this simple index simplifies if the data is within reasonable size. 
+
+- From incoming records, find Pair<record key, partition path>
+- Load interested fields (record keys, partition path and location) from all files and to find Pair<record key, partition path, location> for all entries in storage. 
+- Join above two outputs to find the location for all incoming records. 
+
+Since we load only interested fields from files and join directly w/ incoming records, this works pretty well for small scale data even when compared to bloom index. But at larger scale, this may deteriorate since all files are touched w/o any upfront trimming. 
+
+## 5. HBase
+Both bloom and simple index are implicit index. In other words, there is no explicit or external index files created/stored. But Hbase is an external index where record locations are stored and retrieved. This is straightforward as fetch location will do a get on hbase table and update location will update the records in hbase. 
+
+// talk about hbase configs? 
+
+## 6. UserDefinedIndex
+Hoodie also support user defined index. All you need to do is to implement “org.apache.hudi.index.SparkHoodieIndex”. You can use this config to set the user defined class name. If this value is set, this will take precedence over “hoodie.index.type”.
+
+## 7. Global versions 
+// Talk about Global versions ? 
+
+// Talk about Simple vs Dynamic Bloom Filter ?? 
+
+## 8. Bloom index
+As far as actual bloom filter is concerned (which is stored along with data file), Hoodie has two types, namely Simple and Dynamic. This can be configured using “hoodie.bloom.index.filter.type” config. 
+
+### 8.1. Simple
+Simple bloom filter is just the regular bloom filter as you might have seen elsewhere. Based on the input values set for num of entries and false positive probability, bloom allocates the bit size and proceeds accordingly. Configs of interest are “hoodie.index.bloom.num_entries” and “hoodie.index.bloom.fpp”. You can check the formula used to determine the size and hash functions here. This bloom is static in the sense that the configured fpp will be honored if the entries added to bloom do not surpass the num entries set. But if you keep adding more entries than what was configured, then fpp may not be honored since more entries fill up more buckets. 
+
+### 8.2. Dynamic
+Compared to simple, dynamic bloom as the name suggests is dynamic in nature. It grows relatively as the number of entries increases. Basically users are expected to set two configs, namely “hoodie.index.bloom.num_entries” and “hoodie.bloom.index.filter.dynamic.max.entries” apart from the fpp. Initially bloom is allocated only for “hoodie.index.bloom.num_entries”, but as the number of entries reaches this value, the bloom grows to increase to 2x. This proceeds until “hoodie.bloom.index.filter.dynamic.max.entries” is reached. So until the max value is reached fpp is guaranteed in this bloom type. Beyond that, fpp is not guaranteed similar to Simple bloom. In general this will be beneficial compared to Simple as it may not allocate a larger sized bloom unless or otherwise required. Especially if you don’t have control over your incoming traffic it may be an unnecessary overhead to allocate a larger sized bloom upfront and never get to add so many entries as configured. Because, reading a larger sized bloom will have some impact on your index look up performance. 
+
+ 

Review comment:
       IMO we could add the Flink State index as well
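
To make the sizing discussion in the quoted Simple bloom section concrete, here is a minimal Python sketch of the textbook bloom filter formulas. It assumes the standard relations between number of entries, false positive probability (fpp), bit size and hash count; the draft links to the formula Hudi uses without reproducing it, so Hudi's own computation may differ. The function names and the values 10000 and 0.01 are illustrative only.

```python
import math


def bloom_size_bits(num_entries: int, fpp: float) -> int:
    """Textbook bit size: m = -n * ln(p) / (ln 2)^2, rounded up."""
    return math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))


def num_hash_functions(size_bits: int, num_entries: int) -> int:
    """Textbook hash count: k = (m / n) * ln 2, rounded to at least 1."""
    return max(1, round(size_bits / num_entries * math.log(2)))


def expected_fpp(size_bits: int, num_hashes: int, inserted: int) -> float:
    """Approximate fpp after `inserted` keys: (1 - e^(-k * n / m))^k."""
    return (1.0 - math.exp(-num_hashes * inserted / size_bits)) ** num_hashes


# Illustrative inputs, not Hudi defaults.
num_entries, fpp = 10000, 0.01
m = bloom_size_bits(num_entries, fpp)
k = num_hash_functions(m, num_entries)

at_capacity = expected_fpp(m, k, num_entries)
overfilled = expected_fpp(m, k, 2 * num_entries)
print(m, k, at_capacity, overfilled)
```

The second estimate shows the effect the draft describes: once more keys are added than the filter was sized for, the false positive rate climbs well above the configured value, which is the case the dynamic filter is meant to postpone.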





[GitHub] [hudi] vinothchandar merged pull request #2245: Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

vinothchandar merged pull request #2245:
URL: https://github.com/apache/hudi/pull/2245


   



[GitHub] [hudi] nsivabalan commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r543739496



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs index to find and update the location of incoming records during write operations. To be specific, index assist in differentiating 
+inserts vs updates. This blog talks about different indices and when to each of them.
+
+Hudi dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations, one for partitioned dataset 
+and another for non-partitioned called as global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory

Review comment:
       Even though the blog covers only three of these, I have included InMemory here as well, just to be comprehensive.

##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs index to find and update the location of incoming records during write operations. To be specific, index assist in differentiating 
+inserts vs updates. This blog talks about different indices and when to each of them.
+
+Hudi dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations, one for partitioned dataset 
+and another for non-partitioned called as global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory
+- Bloom
+- Simple
+- Hbase
+
+You could use “hoodie.index.type” to choose any of these indices.
+
+## Different workloads
+Since data comes in at different volumes, velocity and has different access patterns, different indices could be used for different workloads. 
+Let’s walk through some of the typical workloads and see how to leverage Hudi index for such use-cases.
+
+### Fact table
+These are typical primary table in a dimensional model. It contains measures or quantitative figures and is used for analysis and decision making. 
+For eg, trip tables in case of ride-sharing, user buying and selling of shares, or any other similar use-case can be categorized as fact tables. 
+These tables are usually ever growing with random updates on most recent data with long tail of older data. In other words, most updates go into 
+the latest partitions with few updates going to older ones.
+
+![Fact table](/assets/images/blog/hudi-indexes/Hudi_Index_Blog_Fact_table.png)
+Figure showing the spread of updates for Fact table.
+
+Hudi "BLOOM" index is the way to go for these kinds of tables, since index look-up will prune a lot of data files. So, effectively actual look up will 
+happen only in a very few data files where the records are most likely present. This bloom index will also benefit a lot for use-cases where record 
+keys have some kind of ordering (timestamp) among them. File pruning will cut down a lot of data files to be looked up resulting in very fast look-up times.
+On a high level, bloom index does pruning based on ranges of data files, followed by bloom filter look up. Depending on the workload, this could 
+result in a lot of shuffling depending on the amount of data touched. Hudi is planning to support [record level indexing](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets?src=contextnavpagetreemode) 

Review comment:
       For now, I have added links to the RFCs. If you prefer to link JIRAs, can you help me with the right links (for all of them)? I looked for a secondary index JIRA, couldn't find one, and hence resorted to RFC links.





[GitHub] [hudi] garyli1019 commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

garyli1019 commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r528295097



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,93 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 
+
+You could use “hoodie.index.type” to choose any of these indices. 
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different indexing schemes to cater to the needs of different workloads. So depending on one’s use-case, indexing schema can be chosen. 
+
+For eg: ……. 
+To Be filled
+
+Let's take a brief look at each of these indices.
+
+## 2. InMemory
+Stores an in memory hashmap of records to location mapping. Intended to be used for local testing. 
+
+## 3. Bloom
+Leverages bloom index stored with data files to find the location for the incoming records. This is the most commonly used Index in Hudi and is the default one. On a high level, this does a range pruning followed by bloom look up. So, if the record keys are laid out such that it follows some type of ordering like timestamps, then this will essentially cut down a lot of files to be looked up as bloom would have filtered out most of the files. But Range pruning is optional depending on your use-case. If your write batch is such that the records have no ordering in them (e.g uuid), but the pattern is such that mostly the recent partitions are updated with a long tail of updates/deletes to the older partitions, then still bloom index will be faster. But better to turn off range pruning as it just incurs the cost of checking w/o much benefit. 
+
+For instance, consider a list of file slices in a partition
+
+F1 : key_t0 to key_t10000
+F2 : key_t10001 to key_t20000
+F3 : key_t20001 to key_t30000
+F4 : key_t30001 to key_t40000
+F5 : key_t40001 to key_t50000
+
+So, when looking up records ranging from key_t25000 to key_t28000, bloom will filter every file slice except F3 with range pruning. 
+
+Here is a high level pseudocode used for this bloom:
+
+- Fetch interested partitions from incoming records
+- Load all file info (range info) for every partition. So, we have Map of <partition -> List<FileInfo> >
+- Find all file -> hoodie key pairs to be looked up.
+// For every <partition, record key> pairs, use index File filter to filter interested files. Index file filter will leverage file range info and trim down the files to be looked up. Hoodie has a tree map like structure for efficient index file filtering. 
+- Sort <file, hoodie key> pairs. 
+- Load each file and look up mapped keys to find the exact location for the record keys. 
+- Tag back location to incoming records. // this step is required for those newly inserted records in the incoming batch. 
+
+As you could see, first range pruning is done to cut down on files to be looked up. Following which actual bloom look up is done. By default this is the index type chosen. 
+
+## 4. Simple Index
+For a decent sized dataset, Simple index comes in handy. In the bloom index discussed above, hoodie reads the file twice. Once to load the file range info and again to load the bloom filter. So, this simple index simplifies if the data is within reasonable size. 
+
+- From incoming records, find Pair<record key, partition path>
+- Load interested fields (record keys, partition path and location) from all files and to find Pair<record key, partition path, location> for all entries in storage. 
+- Join above two outputs to find the location for all incoming records. 
+
+Since we load only interested fields from files and join directly w/ incoming records, this works pretty well for small scale data even when compared to bloom index. But at larger scale, this may deteriorate since all files are touched w/o any upfront trimming. 
+
+## 5. HBase
+Both bloom and simple index are implicit index. In other words, there is no explicit or external index files created/stored. But Hbase is an external index where record locations are stored and retrieved. This is straightforward as fetch location will do a get on hbase table and update location will update the records in hbase. 
+
+// talk about hbase configs? 
+
+## 6. UserDefinedIndex
+Hoodie also support user defined index. All you need to do is to implement “org.apache.hudi.index.SparkHoodieIndex”. You can use this config to set the user defined class name. If this value is set, this will take precedence over “hoodie.index.type”.

Review comment:
       Should we use `HoodieIndex` instead of `SparkHoodieIndex`?
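
Separately, the three Simple index steps in the quoted draft can be sketched in Python as a plain key join. This is a minimal illustration under stated assumptions: `tag_locations`, the sample keys and the file ids are all hypothetical, and an in-memory dictionary stands in for the distributed join the draft describes.

```python
from typing import Dict, List, Optional, Tuple

# (record key, partition path), as in the first step of the draft.
Key = Tuple[str, str]


def tag_locations(incoming: List[Key], stored: Dict[Key, str]) -> Dict[Key, Optional[str]]:
    """Join incoming keys against stored (key -> location) entries.

    A location of None means the key was not found in storage, so the
    record would be treated as an insert rather than an update.
    """
    return {key: stored.get(key) for key in incoming}


# Hypothetical entries loaded from all files in storage (second step).
stored = {
    ("uuid-1", "2020/11/10"): "file-a",
    ("uuid-2", "2020/11/10"): "file-b",
}

# Hypothetical incoming batch (first step).
incoming = [("uuid-2", "2020/11/10"), ("uuid-9", "2020/11/11")]

# Third step: the join yields a location for every incoming record.
tagged = tag_locations(incoming, stored)
print(tagged)
```

Because every stored entry is loaded before the join, the cost grows with table size, which matches the draft's caveat that this approach may deteriorate at larger scale.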





[GitHub] [hudi] vinothchandar commented on a change in pull request #2245: Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r543773277



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.md
##########
@@ -0,0 +1,80 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## Introduction
+Hudi employs index to find and update the location of incoming records during write operations. To be specific, index assist in differentiating 
+inserts vs updates. This blog talks about different indices and when to each of them.
+
+Hudi dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations, one for partitioned dataset 
+and another for non-partitioned called as global index.
+
+These are the types of index supported by Hudi as of now.
+
+- InMemory

Review comment:
       Let's please remove InMemory as an option; it's just something that is used by tests at the moment.





[GitHub] [hudi] nsivabalan commented on pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#issuecomment-727590887


   @vinothchandar: I wanted to discuss an outline for this blog. Please review when you get a chance.



[GitHub] [hudi] nsivabalan commented on a change in pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on a change in pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#discussion_r523751401



##########
File path: docs/_posts/2020-11-11-hudi-indexing-mechanisms.mb
##########
@@ -0,0 +1,92 @@
+---
+title: "Apache Hudi Indexing mechanisms"
+excerpt: "Detailing different indexing mechanisms in Hudi and when to use each of them"
+author: sivabalan
+category: blog
+---
+
+
+## 1. Introduction
+Hoodie employs index to find and update the location of incoming records during write operations. Hoodie index is a very critical piece in Hoodie as it gives record level lookup support to Hudi for efficient write operations. This blog talks about different indices and when to use which one. 
+
+Hoodie dataset can be of two types in general, partitioned and non-partitioned. So, most index has two implementations one for partitioned dataset and another for non-partitioned called as global index. 
+
+These are the types of index supported by Hoodie as of now. 
+
+- InMemory
+- Bloom
+- Simple
+- Hbase 
+
+You could use “hoodie.index.type” to choose any of these indices. 
+
+### 1.1 Motivation
+Different workloads have different access patterns. Hudi supports different indexing schemes to cater to the needs of different workloads. So depending on one’s use-case, indexing schema can be chosen. 
+
+For eg: ……. 

Review comment:
       to be filled. 





[GitHub] [hudi] nsivabalan commented on pull request #2245: [WIP] Adding Hudi indexing mechanisms blog

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on pull request #2245:
URL: https://github.com/apache/hudi/pull/2245#issuecomment-745613194


   @vinothchandar : feel free to review the patch. 


