Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/18 19:13:03 UTC

[GitHub] [hudi] nsivabalan opened a new pull request #3497: [HUDI-2317] Adding virtual keys blog

nsivabalan opened a new pull request #3497:
URL: https://github.com/apache/hudi/pull/3497


   
   ## What is the purpose of the pull request
   
   - Adding virtual keys blog
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-901373998


   <img width="1181" alt="Screen Shot 2021-08-18 at 3 26 57 PM" src="https://user-images.githubusercontent.com/513218/129960313-b93d3bfe-1530-4e3d-bc29-1e7ac5f83e9d.png">
   <img width="1198" alt="Screen Shot 2021-08-18 at 3 27 08 PM" src="https://user-images.githubusercontent.com/513218/129960318-29f69100-fc02-430b-aa44-924c52c0cd3e.png">
   <img width="1195" alt="Screen Shot 2021-08-18 at 3 27 19 PM" src="https://user-images.githubusercontent.com/513218/129960320-17b572ef-293a-4930-9595-f5a86ac4f9a7.png">
   <img width="1201" alt="Screen Shot 2021-08-18 at 3 27 30 PM" src="https://user-images.githubusercontent.com/513218/129960322-76a9e2ea-129e-479f-9ea6-be23b7697267.png">
   <img width="1197" alt="Screen Shot 2021-08-18 at 3 27 42 PM" src="https://user-images.githubusercontent.com/513218/129960323-5ab56c55-7c65-4eab-af6c-8a75e33e8ed8.png">
   <img width="1203" alt="Screen Shot 2021-08-18 at 3 27 51 PM" src="https://user-images.githubusercontent.com/513218/129960324-0ade6b4e-1861-432a-bc2e-0d3cf8cbb666.png">
   <img width="1195" alt="Screen Shot 2021-08-18 at 3 28 00 PM" src="https://user-images.githubusercontent.com/513218/129960325-a3649e3c-67a6-42ed-a1d1-cdd1580ecc5c.png">
   <img width="1194" alt="Screen Shot 2021-08-18 at 3 28 09 PM" src="https://user-images.githubusercontent.com/513218/129960327-f577f053-9322-40f5-850d-22875651b26b.png">
   <img width="1187" alt="Screen Shot 2021-08-18 at 3 28 19 PM" src="https://user-images.githubusercontent.com/513218/129960328-55e39ea1-0cde-45a2-87c8-d9fec8cf19db.png">
   <img width="1192" alt="Screen Shot 2021-08-18 at 3 28 29 PM" src="https://user-images.githubusercontent.com/513218/129960329-0905c729-3ee5-4ca2-a91f-2f0bc72f47fa.png">
   <img width="1182" alt="Screen Shot 2021-08-18 at 3 28 40 PM" src="https://user-images.githubusercontent.com/513218/129960330-e47b15ef-f0ad-42f2-a296-f7a13ea1ccef.png">
   





[GitHub] [hudi] nsivabalan removed a comment on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
nsivabalan removed a comment on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-901373998


   





[GitHub] [hudi] pratyakshsharma commented on a change in pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
pratyakshsharma commented on a change in pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#discussion_r697826326



##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,291 @@
+---
+title: "Adding support for Virtual Keys in Hudi"
+excerpt: "Supporting Virtual keys in Hudi for reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types, config knobs to cater to everyone's need.
+Hudi adds per record metadata fields like `_hoodie_record_key`, `_hoodie_partition path`, `_hoodie_commit_time` which serves multiple purposes. 

Review comment:
       nit: _hoodie_partition path -> _hoodie_partition_path







[GitHub] [hudi] vingov commented on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
vingov commented on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-901476961


   nit. formatting tip:
   ```
   :::note
   Only `Append` mode is supported for delete operation.
   :::
   ```
   
   will format the text nicely like this:
   ![image](https://user-images.githubusercontent.com/1142498/129981336-7e135308-9d96-4ff7-a639-73323aa0d169.png)
   
   More formatting tips are [here](https://docusaurus.io/docs/next/markdown-features/admonitions)
   





[GitHub] [hudi] nsivabalan merged pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #3497:
URL: https://github.com/apache/hudi/pull/3497


   





[GitHub] [hudi] vingov commented on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
vingov commented on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-901464940


   Thanks for the blog @nsivabalan! This is super helpful for dbt integration! I was looking for this today and you already created a PR for it! :)





[GitHub] [hudi] nsivabalan commented on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-902770622


   @vingov : great to know we already have interest in this. After some rethought, I have decided to split this into two blogs, one for virtual keys and another for immutable data lake use-cases. 
   I read your comment about using Note in markdown style. My note section is actually bigger and hence have left it as is. 
   





[GitHub] [hudi] nsivabalan commented on pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#issuecomment-902769615


   <img width="1209" alt="Screen Shot 2021-08-20 at 11 16 18 AM" src="https://user-images.githubusercontent.com/513218/130255599-678cba2c-236a-42a9-a90b-0145cfe1560b.png">
   <img width="1189" alt="Screen Shot 2021-08-20 at 11 16 29 AM" src="https://user-images.githubusercontent.com/513218/130255604-c21dc0f1-7d29-4165-b71b-d7cd8bdcb37b.png">
   <img width="1188" alt="Screen Shot 2021-08-20 at 11 16 37 AM" src="https://user-images.githubusercontent.com/513218/130255608-c8ba9420-af26-43e0-830c-a28b88cdf45c.png">
   <img width="1202" alt="Screen Shot 2021-08-20 at 11 16 46 AM" src="https://user-images.githubusercontent.com/513218/130255611-4c88bf21-dc35-41f2-97a7-6e39bf0f9bef.png">
   <img width="1190" alt="Screen Shot 2021-08-20 at 11 16 56 AM" src="https://user-images.githubusercontent.com/513218/130255613-21302e70-a28b-4103-90c7-a77d0c5fdbe1.png">
   <img width="1185" alt="Screen Shot 2021-08-20 at 11 17 05 AM" src="https://user-images.githubusercontent.com/513218/130255614-20bdf19d-14b2-40a3-a47f-c3d7734593a5.png">
   <img width="1208" alt="Screen Shot 2021-08-20 at 11 17 14 AM" src="https://user-images.githubusercontent.com/513218/130255619-a7a700eb-e8c9-4281-9725-a520e7f613e5.png">
   <img width="1198" alt="Screen Shot 2021-08-20 at 11 17 24 AM" src="https://user-images.githubusercontent.com/513218/130255623-0372ab47-ec5f-457e-8e8d-d0030c702c3e.png">
   <img width="1194" alt="Screen Shot 2021-08-20 at 11 17 35 AM" src="https://user-images.githubusercontent.com/513218/130255627-af621594-c2a8-41e0-b36c-27ef46e32621.png">
   <img width="1196" alt="Screen Shot 2021-08-20 at 11 17 48 AM" src="https://user-images.githubusercontent.com/513218/130255630-303f56dc-dd48-4e88-b0bd-bf168cc65c4f.png">
   





[GitHub] [hudi] vinothchandar commented on a change in pull request #3497: [HUDI-2317] Adding virtual keys blog

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#discussion_r697543536



##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types and config knobs to cater to everyone's needs.
+Hudi adds per-record metadata such as the record key, partition path and commit time, which serve multiple purposes.
+They avoid re-computing the record key and partition path during merges, compaction and other table operations,
+and they also support incremental queries. But one of the repeated asks from the community is to leverage
+existing fields and not add extra meta fields. So Hudi is adding virtual key support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where Hudi meta fields are computed on demand from existing user
+fields for all records. In the regular path, these are computed once, stored as per-record metadata and re-used during
+various operations such as merging incoming records with those in storage, compaction, etc. Hudi also stores the commit time at
+the record level to support incremental queries. If you do not need incremental queries, you can start leveraging
+Hudi's virtual key support and keep using Hudi to build and manage your data lake while avoiding the storage
+overhead of per-record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the config below. When this config is set to false,
+Hudi uses virtual keys for the corresponding table. The default value is true, which means all
+meta fields are added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note: 
+Once virtual keys are enabled, they can't be disabled for a given Hudi table, because already stored records may not have
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled.
+Going back, however, is not feasible.
+Another constraint with virtual key support is that the key generator properties of a table cannot be changed during
+the lifecycle of that Hudi table.
+For instance, if you configure the record key to point to field5 for a few write batches and later switch to field10,
+it may not work well with a Hudi table that has virtual keys enabled.
+
+As is evident, record keys and partition paths have to be re-computed every time they are needed (merges, compaction,
+MOR snapshot reads). Hence we support only built-in key generators with virtual keys for the COW table type. In the case of
+MOR, we support only SimpleKeyGenerator (i.e. both the record key and the partition path have to refer
+to an existing user field) for now. A snapshot query on a Merge On Read table merges the base
+data file with records from delta log files in real time, so query latencies would shoot up if we were to support all the different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types: 
+Only the "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index
+types (BLOOM, etc.) in future releases.
+
+## Supported Operations
+The good news is that all existing operations are supported for a Hudi table with virtual keys, except incremental
+queries. This means cleaning, archiving, the metadata table, clustering, etc. can be enabled for a Hudi table with
+virtual keys enabled. So if your requirements fit this model, we recommend using virtual keys, as they reduce
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME.key(), tableName).
+  option("hoodie.populate.meta.fields", "false").
+  option("hoodie.index.type","SIMPLE").
+  mode(Overwrite).
+  save(basePath)
+```
+
+### Query

Review comment:
       are we just trying to show that queries work. If so, lets remove this?

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+## Code snippet

Review comment:
       can we use a much simpler quickstart example using some other schema. 

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator

Review comment:
       lets please use actual config names and values and avoid referring loosely to class names out of context. I would argue these are downright hostile for reader friendliness :) . 
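
As a rough illustration of the request above, the class names in the quoted hunk could be expressed as write options instead. This is a minimal sketch, not the blog's final text: `hoodie.populate.meta.fields` and `hoodie.index.type` come from the draft itself, while the key generator and table type option keys and the fully qualified class name are assumptions that should be checked against the Hudi configuration reference for the release in use.

```scala
// Sketch only: keys marked "assumed" do not appear in the quoted draft.
val virtualKeyOptions: Map[String, String] = Map(
  // From the draft: "false" turns meta-field population off, i.e. virtual keys on.
  "hoodie.populate.meta.fields" -> "false",
  // From the draft: only SIMPLE and GLOBAL_SIMPLE are supported in the first cut.
  "hoodie.index.type" -> "SIMPLE",
  // Assumed option key and fully qualified class name for SimpleKeyGenerator.
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
  // Assumed option key; MERGE_ON_READ is the table type limited to SimpleKeyGenerator.
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
)
```

In spark-shell, a map like this could be passed with `.options(virtualKeyOptions)` on the writer shown in the draft's insert example.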

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+### Query
+```
+val tripsSnapshotDF = spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*")
+//load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+```
+
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output

Review comment:
       can we remove this `Output` subheading?

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types and config knobs to cater to everyone's needs.
+Hudi adds per-record metadata such as the record key, partition path, and commit time, which serves multiple purposes.
+It avoids re-computing the record key and partition path during merges, compaction, and other table operations,
+and it also supports incremental queries. But one of the repeated asks from the community is to leverage
+existing fields rather than add additional meta fields. So, Hudi is adding virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from existing user
+fields for all records. In the regular path, these are computed once, stored as per-record metadata, and re-used during
+various operations such as merging incoming records with those in storage, compaction, etc. Hudi also stores the commit time at
+the record level to support incremental queries. If you do not need incremental query support, you can start leveraging
+Hudi's virtual key support and still use Hudi to build and manage your data lake, while reducing the storage
+overhead due to per-record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the config below. When it is set to `false`,
+Hudi will use virtual keys for the corresponding table. The default value for this config is `true`, which means all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note: 
+Once virtual keys are enabled, they can't be disabled for a given Hudi table, because already stored records may not have
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled;
+it is only going back that is not feasible.
+Another constraint with virtual key support is that the key generator properties for a given table cannot be changed
+over the lifecycle of that Hudi table.
+For instance, if you configure the record key to point to field5 for a few batches of writes and later switch to field10,
+it may not pan out well for a Hudi table where virtual keys are enabled.
+
+As is evident, record keys and partition paths will have to be re-computed every time they are needed (merges, compaction,
+MOR snapshot read). Hence we support only built-in key generators with virtual keys for the COW table type. In the case of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition path have to refer
+to an existing user field) for now. If we zoom into a Merge On Read table's snapshot query, Hudi does real-time merging of base
+data files with records from delta log files, and hence query latencies would shoot up if we were to support all the different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:

Review comment:
       do they need to be sections? Can we do a table? It's easy to convey these things in a table.

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types and config knobs to cater to everyone's needs.
+Hudi adds per-record metadata such as the record key, partition path, and commit time, which serves multiple purposes.
+It avoids re-computing the record key and partition path during merges, compaction, and other table operations,
+and it also supports incremental queries. But one of the repeated asks from the community is to leverage
+existing fields rather than add additional meta fields. So, Hudi is adding virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from existing user
+fields for all records. In the regular path, these are computed once, stored as per-record metadata, and re-used during
+various operations such as merging incoming records with those in storage, compaction, etc. Hudi also stores the commit time at
+the record level to support incremental queries. If you do not need incremental query support, you can start leveraging
+Hudi's virtual key support and still use Hudi to build and manage your data lake, while reducing the storage
+overhead due to per-record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the config below. When it is set to `false`,
+Hudi will use virtual keys for the corresponding table. The default value for this config is `true`, which means all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note: 
+Once virtual keys are enabled, they can't be disabled for a given Hudi table, because already stored records may not have
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled;
+it is only going back that is not feasible.
+Another constraint with virtual key support is that the key generator properties for a given table cannot be changed
+over the lifecycle of that Hudi table.
+For instance, if you configure the record key to point to field5 for a few batches of writes and later switch to field10,
+it may not pan out well for a Hudi table where virtual keys are enabled.
+
+As is evident, record keys and partition paths will have to be re-computed every time they are needed (merges, compaction,
+MOR snapshot read). Hence we support only built-in key generators with virtual keys for the COW table type. In the case of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition path have to refer
+to an existing user field) for now. If we zoom into a Merge On Read table's snapshot query, Hudi does real-time merging of base
+data files with records from delta log files, and hence query latencies would shoot up if we were to support all the different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types: 
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index
+types (BLOOM, etc.) in future releases.
+
+## Supported Operations
+The good news is that all existing operations are supported for a Hudi table with virtual keys except the incremental
+query support. This means cleaning, archiving, the metadata table, clustering, etc. can be enabled for a Hudi table with
+virtual keys enabled. So, if your requirements fit this model, we recommend using virtual keys, as they reduce
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME.key(), tableName).
+  option("hoodie.populate.meta.fields", "false").
+  option("hoodie.index.type","SIMPLE").
+  mode(Overwrite).
+  save(basePath)
+```
+
+### Query
+```
+val tripsSnapshotDF = spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*")
+//load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+```
+
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output
+```
++------------------+-------------------+-------------------+-------------+
+|              fare|          begin_lon|          begin_lat|           ts|
++------------------+-------------------+-------------------+-------------+
+| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1628951609798|
+| 93.56018115236618|0.14285051259466197|0.21624150367601136|1629012489526|
+| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1629163264651|
+| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1628701606278|
+|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1628787101240|
+| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1628802740084|
+|34.158284716382845|0.46157858450465483| 0.4726905879569653|1629018593339|
+| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1629131594334|
++------------------+-------------------+-------------------+-------------+
+```
+
+```
+spark.sql("select uuid, partitionpath, rider, driver, fare from  hudi_trips_snapshot").show(false)
+```
+
+#### Output
+```
++------------------------------------+------------------------------------+---------+----------+------------------+
+|uuid                                |partitionpath                       |rider    |driver    |fare              |
++------------------------------------+------------------------------------+---------+----------+------------------+
+|eb7819f1-6f04-429d-8371-df77620b9527|americas/united_states/san_francisco|rider-213|driver-213|27.79478688582596 |
+|37ea44f1-fda7-4ec4-84de-f43f5b5a4d84|americas/united_states/san_francisco|rider-213|driver-213|19.179139106643607|
+|aa601d6b-7cc5-4b82-9687-675d0081616e|americas/united_states/san_francisco|rider-213|driver-213|93.56018115236618 |
+|494bc080-881c-48be-8f8a-8f1739781816|americas/united_states/san_francisco|rider-213|driver-213|33.92216483948643 |
+|09573277-e1c1-4cdd-9b45-57176f184d4d|americas/united_states/san_francisco|rider-213|driver-213|64.27696295884016 |
+|c9b055ed-cd28-4397-9704-93da8b2e601f|americas/brazil/sao_paulo           |rider-213|driver-213|43.4923811219014  |
+|e707355a-b8c0-432d-a80f-723b93dc13a8|americas/brazil/sao_paulo           |rider-213|driver-213|66.62084366450246 |
+|d3c39c9e-d128-497a-bf3e-368882f45c28|americas/brazil/sao_paulo           |rider-213|driver-213|34.158284716382845|
+|159441b0-545b-460a-b671-7cc2d509f47b|asia/india/chennai                  |rider-213|driver-213|41.06290929046368 |
+|16031faf-ad8d-4968-90ff-16cead211d3c|asia/india/chennai                  |rider-213|driver-213|17.851135255091155|
++------------------------------------+------------------------------------+---------+----------+------------------+
+```
+
+```
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+```
+
+#### Output
+```
++-------------------+------------------+----------------------+---------+----------+------------------+
+|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
++-------------------+------------------+----------------------+---------+----------+------------------+
+|               null|              null|                  null|rider-213|driver-213|19.179139106643607|
+|               null|              null|                  null|rider-213|driver-213| 33.92216483948643|
+|               null|              null|                  null|rider-213|driver-213| 27.79478688582596|
+|               null|              null|                  null|rider-213|driver-213| 64.27696295884016|
+|               null|              null|                  null|rider-213|driver-213| 93.56018115236618|
+|               null|              null|                  null|rider-213|driver-213| 66.62084366450246|
+|               null|              null|                  null|rider-213|driver-213|  43.4923811219014|
+|               null|              null|                  null|rider-213|driver-213|34.158284716382845|
+|               null|              null|                  null|rider-213|driver-213|17.851135255091155|
+|               null|              null|                  null|rider-213|driver-213| 41.06290929046368|
++-------------------+------------------+----------------------+---------+----------+------------------+
+```
+Note: all meta fields are null in storage.

Review comment:
       these `Note:` style offhand comments actually interfere a fair bit with reading flow. :)

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types and config knobs to cater to everyone's needs.
+Hudi adds per-record metadata such as the record key, partition path, and commit time, which serves multiple purposes.
+It avoids re-computing the record key and partition path during merges, compaction, and other table operations,
+and it also supports incremental queries. But one of the repeated asks from the community is to leverage
+existing fields rather than add additional meta fields. So, Hudi is adding virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from existing user
+fields for all records. In the regular path, these are computed once, stored as per-record metadata, and re-used during
+various operations such as merging incoming records with those in storage, compaction, etc. Hudi also stores the commit time at
+the record level to support incremental queries. If you do not need incremental query support, you can start leveraging
+Hudi's virtual key support and still use Hudi to build and manage your data lake, while reducing the storage
+overhead due to per-record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the config below. When it is set to `false`,
+Hudi will use virtual keys for the corresponding table. The default value for this config is `true`, which means all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note: 
+Once virtual keys are enabled, they can't be disabled for a given Hudi table, because already stored records may not have
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled;
+it is only going back that is not feasible.
+Another constraint with virtual key support is that the key generator properties for a given table cannot be changed
+over the lifecycle of that Hudi table.
+For instance, if you configure the record key to point to field5 for a few batches of writes and later switch to field10,
+it may not pan out well for a Hudi table where virtual keys are enabled.
+
+As is evident, record keys and partition paths will have to be re-computed every time they are needed (merges, compaction,
+MOR snapshot read). Hence we support only built-in key generators with virtual keys for the COW table type. In the case of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition path have to refer
+to an existing user field) for now. If we zoom into a Merge On Read table's snapshot query, Hudi does real-time merging of base
+data files with records from delta log files, and hence query latencies would shoot up if we were to support all the different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types: 
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index
+types (BLOOM, etc.) in future releases.
+
+## Supported Operations
+The good news is that all existing operations are supported for a Hudi table with virtual keys except the incremental
+query support. This means cleaning, archiving, the metadata table, clustering, etc. can be enabled for a Hudi table with
+virtual keys enabled. So, if your requirements fit this model, we recommend using virtual keys, as they reduce
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME.key(), tableName).
+  option("hoodie.populate.meta.fields", "false").
+  option("hoodie.index.type","SIMPLE").
+  mode(Overwrite).
+  save(basePath)
+```
+
+### Query

Review comment:
       I feel we can just show that the fields are null and that incremental queries will fail. Why go over the entire quickstart? It feels like it adds little value while increasing the length of the blog.
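
       A trimmed-down snippet along these lines might look like the sketch below. It is only an illustration, not the text of the PR: it assumes a spark-shell session with the Hudi bundle on the classpath, and that `spark` and `basePath` are in scope from the Inserts snippet. The read options are given as plain string keys (`hoodie.datasource.query.type`, `hoodie.datasource.read.begin.instanttime`), and the begin instant `"000"` is an arbitrary "from the beginning" value. The exact failure mode of the incremental read is not shown in this thread, so the sketch only captures the outcome without asserting what it will be.

```scala
// spark-shell sketch; assumes `spark` and `basePath` from the Inserts snippet above.
import scala.util.Try

// Snapshot read: with hoodie.populate.meta.fields=false, the meta columns
// are expected to come back null while the user fields are intact.
val snapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
snapshotDF.
  select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path", "uuid").
  show(3, false)

// Incremental read: depends on the per-record commit time, which is not
// stored when virtual keys are enabled, so it is not expected to work.
// Wrapped in Try so the shell session survives whatever happens.
val incrementalAttempt = Try {
  spark.read.format("hudi").
    option("hoodie.datasource.query.type", "incremental").
    option("hoodie.datasource.read.begin.instanttime", "000").
    load(basePath).
    count()
}
println(s"incremental query attempt: $incrementalAttempt")
```

       Two short reads like this would cover both points (null meta fields, no incremental support) without walking through the whole quickstart.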



