Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/27 15:52:05 UTC

[GitHub] [hudi] vinothchandar commented on a change in pull request #3497: [HUDI-2317] Adding virtual keys blog

vinothchandar commented on a change in pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#discussion_r697543536



##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types and config knobs to cater to everyone's needs.
+Hudi adds per-record metadata such as the record key, partition path and commit time, which serves multiple purposes:
+it avoids re-computing the record key and partition path during merges, compaction and other table operations,
+and it also powers incremental queries. But one of the repeated asks from the community is to leverage
+existing fields instead of adding additional meta fields. So, Hudi is adding virtual key support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where the Hudi meta fields can be computed on demand from existing user
+fields for all records. On the regular path, these are computed once, stored as per-record metadata, and re-used during
+various operations such as merging incoming records with those in storage, compaction, etc. Hudi also stores the commit time at
+the record level to support incremental queries. If you do not need incremental queries, you can leverage
+Hudi's virtual key support and still use Hudi to build and manage your data lake, while reducing the storage
+overhead due to per-record metadata. 
+
+## Configurations
+Virtual keys can be enabled for a given table using the config below. Setting it to false enables
+virtual keys for the corresponding table. Its default value is true, which means all
+meta fields are added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note: 
+Once virtual keys are enabled for a table, they cannot be disabled again, because already stored records may not have 
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled; 
+it is only going back that is not feasible. 
+Another constraint with virtual key support is that the key generator properties for a given table cannot be changed over 
+the course of its lifecycle. 
+For instance, if you configure the record key to point to field5 for a few batches of writes and later switch it to field10, 
+things will not pan out well for a Hudi table with virtual keys enabled. 
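+
+One simple way to guard against this (just a sketch, not a Hudi-specific API) is to define the key generator
+configs once and reuse the same values on every write to the table (the class and field names below are illustrative):
+
+```
+// sketch: pin the key generator and key fields once, and pass the same options for every batch
+val keyOpts = Map(
+  "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.SimpleKeyGenerator",
+  "hoodie.datasource.write.recordkey.field"     -> "field5",
+  "hoodie.datasource.write.partitionpath.field" -> "field8"
+)
+// every subsequent batch reuses them, e.g. df.write.format("hudi").options(keyOpts). ...
+```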
+
+As is evident, record keys and partition paths have to be re-computed every time they are needed (merges, compaction, 
+MOR snapshot reads). Hence, for the COW table type we support only the built-in key generators with virtual keys. In the case of 
+MOR, we support only SimpleKeyGenerator (i.e. both the record key and the partition path have to refer 
+to an existing user field) for now. If we zoom into a Merge On Read table's snapshot query, Hudi does real-time merging of the base 
+data file with records from the delta log files, and hence query latencies would shoot up if we were to support all the different 
+types of key generators. 
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types: 
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index 
+(BLOOM, etc) in future releases. 
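+
+A sketch of how this is configured (the quickstart example below uses "SIMPLE"; "GLOBAL_SIMPLE" is set the same way):
+
+```
+// sketch: only the SIMPLE and GLOBAL_SIMPLE index types work with virtual keys in this release
+val indexOpts = Map("hoodie.index.type" -> "SIMPLE")
+```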
+
+## Supported Operations
+The good news is that all existing operations are supported for a Hudi table with virtual keys, except for incremental 
+queries. This means cleaning, archiving, the metadata table, clustering, etc. can all be enabled for a Hudi table with 
+virtual keys enabled. So, if your requirements fit this model, we would recommend using virtual keys, as they reduce 
+the storage overhead. 
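+
+For example (a sketch; these are standard Hudi write configs and are not specific to virtual keys), the metadata
+table and inline clustering can still be turned on for such a table:
+
+```
+// sketch: table services remain available on a table with virtual keys enabled
+val tableServiceOpts = Map(
+  "hoodie.metadata.enable"   -> "true",  // metadata table
+  "hoodie.clustering.inline" -> "true"   // inline clustering
+)
+// pass these along with the other write options, e.g. df.write.format("hudi").options(tableServiceOpts). ...
+```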
+
+## Code snippet
+We can walk through our quickstart and see how things play out when virtual keys are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
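+// note: "hoodie.populate.meta.fields" is set to false below, which enables virtual keys for this table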
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME.key(), tableName).
+  option("hoodie.populate.meta.fields", "false").
+  option("hoodie.index.type","SIMPLE").
+  mode(Overwrite).
+  save(basePath)
+```
+
+### Query

Review comment:
       are we just trying to show that queries work? If so, let's remove this.

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+## Code snippet

Review comment:
       can we use a much simpler quickstart example, using some other schema?

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator

Review comment:
       Let's please use actual config names and values and avoid referring loosely to class names out of context. I would argue these are downright hostile to reader friendliness. :)

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output

Review comment:
       can we remove this `Output` subheading?

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+### Supported Key Generators with CopyOnWrite(COW) table:

Review comment:
       Do they need to be sections? Can we do a table? It's easy to convey these things in a table.

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+### Query
+```
+val tripsSnapshotDF = spark.
+  read.
+  format("hudi").
+  load(basePath + "/*/*/*/*")
+// load(basePath) works if the table uses a "/partitionKey=partitionValue" folder structure, letting Spark auto-discover partitions
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+```
+
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output
+```
++------------------+-------------------+-------------------+-------------+
+|              fare|          begin_lon|          begin_lat|           ts|
++------------------+-------------------+-------------------+-------------+
+| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1628951609798|
+| 93.56018115236618|0.14285051259466197|0.21624150367601136|1629012489526|
+| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1629163264651|
+| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1628701606278|
+|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1628787101240|
+| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1628802740084|
+|34.158284716382845|0.46157858450465483| 0.4726905879569653|1629018593339|
+| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1629131594334|
++------------------+-------------------+-------------------+-------------+
+```
+
+```
+spark.sql("select uuid, partitionpath, rider, driver, fare from  hudi_trips_snapshot").show(false)
+```
+
+#### Output
+```
++------------------------------------+------------------------------------+---------+----------+------------------+
+|uuid                                |partitionpath                       |rider    |driver    |fare              |
++------------------------------------+------------------------------------+---------+----------+------------------+
+|eb7819f1-6f04-429d-8371-df77620b9527|americas/united_states/san_francisco|rider-213|driver-213|27.79478688582596 |
+|37ea44f1-fda7-4ec4-84de-f43f5b5a4d84|americas/united_states/san_francisco|rider-213|driver-213|19.179139106643607|
+|aa601d6b-7cc5-4b82-9687-675d0081616e|americas/united_states/san_francisco|rider-213|driver-213|93.56018115236618 |
+|494bc080-881c-48be-8f8a-8f1739781816|americas/united_states/san_francisco|rider-213|driver-213|33.92216483948643 |
+|09573277-e1c1-4cdd-9b45-57176f184d4d|americas/united_states/san_francisco|rider-213|driver-213|64.27696295884016 |
+|c9b055ed-cd28-4397-9704-93da8b2e601f|americas/brazil/sao_paulo           |rider-213|driver-213|43.4923811219014  |
+|e707355a-b8c0-432d-a80f-723b93dc13a8|americas/brazil/sao_paulo           |rider-213|driver-213|66.62084366450246 |
+|d3c39c9e-d128-497a-bf3e-368882f45c28|americas/brazil/sao_paulo           |rider-213|driver-213|34.158284716382845|
+|159441b0-545b-460a-b671-7cc2d509f47b|asia/india/chennai                  |rider-213|driver-213|41.06290929046368 |
+|16031faf-ad8d-4968-90ff-16cead211d3c|asia/india/chennai                  |rider-213|driver-213|17.851135255091155|
++------------------------------------+------------------------------------+---------+----------+------------------+
+```
+
+```
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()
+```
+
+#### Output
+```
++-------------------+------------------+----------------------+---------+----------+------------------+
+|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
++-------------------+------------------+----------------------+---------+----------+------------------+
+|               null|              null|                  null|rider-213|driver-213|19.179139106643607|
+|               null|              null|                  null|rider-213|driver-213| 33.92216483948643|
+|               null|              null|                  null|rider-213|driver-213| 27.79478688582596|
+|               null|              null|                  null|rider-213|driver-213| 64.27696295884016|
+|               null|              null|                  null|rider-213|driver-213| 93.56018115236618|
+|               null|              null|                  null|rider-213|driver-213| 66.62084366450246|
+|               null|              null|                  null|rider-213|driver-213|  43.4923811219014|
+|               null|              null|                  null|rider-213|driver-213|34.158284716382845|
+|               null|              null|                  null|rider-213|driver-213|17.851135255091155|
+|               null|              null|                  null|rider-213|driver-213| 41.06290929046368|
++-------------------+------------------+----------------------+---------+----------+------------------+
+```
+Note: all meta fields are null in storage.
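+
+As a quick illustration (continuing the same spark-shell session, and using the standard quickstart incremental
+read as a sketch), an incremental query relies on `_hoodie_commit_time` and is therefore not expected to work
+when virtual keys are enabled:
+
+```
+// sketch: incremental reads depend on the per-record commit time, which is not stored
+// when virtual keys are enabled, so this query is expected to fail rather than return changed records
+val tripsIncrementalDF = spark.read.format("hudi").
+  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
+  option(BEGIN_INSTANTTIME_OPT_KEY, "000").
+  load(basePath)
+```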

Review comment:
       these `Note:` style offhand comments actually interfere a fair bit with the reading flow. :) 

##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@

Review comment:
       I feel we can just show that the fields are null and that incremental queries will fail. Why go over the entire quickstart? It feels like it adds little value, while increasing the length of the blog.



