Posted to commits@hudi.apache.org by bh...@apache.org on 2022/05/17 22:37:31 UTC

[hudi] branch asf-site updated: [DOCS] Add image assets and fix blog post styles (#5613)

This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 3e85f4d80e [DOCS] Add image assets and fix blog post styles (#5613)
3e85f4d80e is described below

commit 3e85f4d80ee8fb5a364e67de56edc066ca2329e3
Author: Bhavani Sudha Saktheeswaran <21...@users.noreply.github.com>
AuthorDate: Tue May 17 15:37:25 2022 -0700

    [DOCS] Add image assets and fix blog post styles (#5613)
    
    Co-authored-by: Bhavani Sudha Saktheeswaran <su...@vmacs.local>
---
 ...-efficient-migration-of-large-parquet-tables.md |  22 ++++++++---------
 ...2020-08-21-async-compaction-deployment-model.md |  14 +++++------
 ...gh-perf-data-lake-with-hudi-and-alluxio-t3go.md |  26 ++++++++++-----------
 website/blog/2021-01-27-hudi-clustering-intro.md   |  20 ++++++++--------
 website/blog/2021-03-01-hudi-file-sizing.md        |   6 ++---
 website/blog/2021-08-18-virtual-keys.md            |  18 +++++++-------
 ...se-concurrency-control-are-we-too-optimistic.md |   9 ++++---
 ...hudi-zorder-and-hilbert-space-filling-curves.md |  10 ++++----
 ...atures-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx |   1 +
 ...efficiency-at-scale-in-big-data-file-format.png | Bin 0 -> 38755 bytes
 ...2022-02-02-onehouse-commitment-to-openness.jpeg | Bin 0 -> 386047 bytes
 .../images/blog/2022-02-03-onehouse_billboard.png  | Bin 0 -> 554823 bytes
 .../2022-02-17-fresher-data-lake-on-aws-s3.png     | Bin 0 -> 96170 bytes
 ...1-low-latency-pipeline-using-msk-flink-hudi.png | Bin 0 -> 40488 bytes
 ...3-09-serverless-pipeline-using-glue-hudi-s3.png | Bin 0 -> 142433 bytes
 .../2022-04-04-halodoc-lakehouse-architecture.png  | Bin 0 -> 251301 bytes
 .../images/blog/2022-05-17-multimodal-index.gif    | Bin 0 -> 607295 bytes
 17 files changed, 63 insertions(+), 63 deletions(-)

diff --git a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
index cd959ce28f..75144dc907 100644
--- a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
+++ b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
@@ -8,14 +8,14 @@ category: blog
 We will look at how to migrate a large parquet table to Hudi without having to rewrite the entire dataset. 
 
 <!--truncate-->
-# Motivation:
+## Motivation:
 
 Apache Hudi maintains per record metadata to perform core operations such as upserts and incremental pull. To take advantage of Hudi’s upsert and incremental processing support, users would need to rewrite their whole dataset to make it an Apache Hudi table.  Hudi 0.6.0 comes with an ***experimental feature*** to support efficient migration of large Parquet tables to Hudi without the need to rewrite the entire dataset.
 
 
-# High Level Idea:
+## High Level Idea:
 
-## Per Record Metadata:
+### Per Record Metadata:
 
 Apache Hudi maintains record-level metadata to perform efficient upserts and incremental pull.
 
@@ -31,11 +31,11 @@ The parts (1) and (3) constitute what we term as  “Hudi skeleton”. Hudi skel
 
 ![skeleton](/assets/images/blog/2020-08-20-skeleton.png)
 
-# Design Deep Dive:
+## Design Deep Dive:
 
  For a deep dive on the internals, please take a look at the [RFC document](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) 
 
-# Migration:
+## Migration:
 
 Hudi supports 2 modes when migrating parquet tables.  We will use the term bootstrap and migration interchangeably in this document.  
 
@@ -45,10 +45,10 @@ Hudi supports 2 modes when migrating parquet tables.  We will use the term boots
 You can pick and choose these modes at the partition level. A common strategy would be to use FULL_RECORD mode for a small set of "hot" partitions which are accessed more frequently and METADATA_ONLY for a larger set of "warm" partitions. 
 
 
-## Query Engine Support:
+### Query Engine Support:
 For a METADATA_ONLY bootstrapped table, Spark data source, Spark-Hive and native Hive query engines are supported. Presto support is in the works.
 
-## Ways To Migrate :
+### Ways To Migrate :
 
 There are 2 ways to migrate a large parquet table to Hudi. 
 
@@ -57,7 +57,7 @@ There are 2 ways to migrate a large parquet table to Hudi.
 
 We will look at how to migrate using both these approaches.
 
-## Configurations:
+### Configurations:
 
 These are bootstrap-specific configurations that need to be set in addition to regular Hudi write configurations.
 
@@ -73,7 +73,7 @@ These are bootstrap specific configurations that needs to be set in addition to
 | hoodie.bootstrap.mode.selector.regex.mode |METADATA_ONLY |No |Bootstrap Mode used when the partition matches the regex pattern in hoodie.bootstrap.mode.selector.regex . Used only when hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
 | hoodie.bootstrap.mode.selector.regex |\.\* |No |Partition Regex used when  hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
 
-## Spark Data Source:
+### Spark Data Source:
 
 Here, we use a Spark Datasource Write to perform bootstrap. 
 Here is an example code snippet to perform METADATA_ONLY bootstrap.
@@ -127,7 +127,7 @@ bootstrapDF.write
       .save(basePath)
 ```
 
-## Hoodie DeltaStreamer:
+### Hoodie DeltaStreamer:
 
 Hoodie DeltaStreamer allows bootstrap to be performed using the --run-bootstrap command line option.
 
@@ -170,6 +170,6 @@ spark-submit --package org.apache.hudi:hudi-spark-bundle_2.11:0.6.0
 --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
 ```
 
-## Known Caveats
+### Known Caveats
 1. Need proper defaults for the bootstrap config: hoodie.bootstrap.full.input.provider. Here is the [ticket](https://issues.apache.org/jira/browse/HUDI-1213)
 1. DeltaStreamer manages checkpoints inside hoodie commit files and expects checkpoints in previously committed metadata. Users are expected to pass checkpoint or initial checkpoint provider when performing bootstrap through deltastreamer. Such support is not present when doing bootstrap using Spark Datasource. Here is the [ticket](https://issues.apache.org/jira/browse/HUDI-1214).
diff --git a/website/blog/2020-08-21-async-compaction-deployment-model.md b/website/blog/2020-08-21-async-compaction-deployment-model.md
index 3ffa1b4508..5e6eec2657 100644
--- a/website/blog/2020-08-21-async-compaction-deployment-model.md
+++ b/website/blog/2020-08-21-async-compaction-deployment-model.md
@@ -7,7 +7,7 @@ category: blog
 
 We will look at different deployment models for executing compactions asynchronously.
 <!--truncate-->
-# Compaction
+## Compaction
 
 For Merge-On-Read tables, data is stored using a combination of columnar (e.g. parquet) + row-based (e.g. avro) file formats. 
 Updates are logged to delta files & later compacted to produce new versions of columnar files synchronously or 
@@ -15,7 +15,7 @@ asynchronously. One of th main motivations behind Merge-On-Read is to reduce dat
 Hence, it makes sense to run compaction asynchronously without blocking ingestion.
 
 
-# Async Compaction
+## Async Compaction
 
 Async Compaction is performed in 2 steps:
 
@@ -24,11 +24,11 @@ slices** to be compacted. A compaction plan is finally written to Hudi timeline.
 1. ***Compaction Execution***: A separate process reads the compaction plan and performs compaction of file slices.
 
   
-# Deployment Models
+## Deployment Models
 
 There are a few ways by which we can execute compactions asynchronously. 
 
-## Spark Structured Streaming
+### Spark Structured Streaming
 
 With 0.6.0, we now have support for running async compactions in Spark 
 Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the 
@@ -60,7 +60,7 @@ import org.apache.spark.sql.streaming.ProcessingTime;
  writer.trigger(new ProcessingTime(30000)).start(tablePath);
 ```
 
-## DeltaStreamer Continuous Mode
+### DeltaStreamer Continuous Mode
 Hudi DeltaStreamer provides a continuous ingestion mode where a single long-running Spark application  
 ingests data to a Hudi table continuously from upstream sources. In this mode, Hudi supports managing asynchronous 
 compactions. Here is an example snippet for running in continuous mode with async compactions:
@@ -78,7 +78,7 @@ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
 --continuous
 ```
 
-## Hudi CLI
+### Hudi CLI
 Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example 
 
 ```properties
@@ -86,7 +86,7 @@ hudi:trips->compaction run --tableName <table_name> --parallelism <parallelism>
 ...
 ```
 
-## Hudi Compactor Script
+### Hudi Compactor Script
 Hudi provides a standalone tool to also execute specific compactions asynchronously. Here is an example
 
 ```properties
diff --git a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
index a58b2e585e..75b87a3fb8 100644
--- a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
+++ b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
@@ -5,18 +5,18 @@ author: t3go
 category: blog
 ---
 
-# Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
+## Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
 [T3Go](https://www.t3go.cn/)  is China’s first platform for smart travel based on the Internet of Vehicles. In this article, Trevor Zhang and Vino Yang from T3Go describe the evolution of their data lake architecture, built on cloud-native or open-source technologies including Alibaba OSS, Apache Hudi, and Alluxio. Today, their data lake stores petabytes of data, supporting hundreds of pipelines and tens of thousands of tasks daily. It is essential for business units at T3Go including Da [...]
 
 In this blog, you will see how we slashed data ingestion time by half using Hudi and Alluxio. Furthermore, data analysts using Presto, Hudi, and Alluxio saw the queries speed up by 10 times. We built our data lake based on data orchestration for multiple stages of our data pipeline, including ingestion and analytics.
 <!--truncate-->
-# I. T3Go data lake Overview
+## I. T3Go data lake Overview
 
 Prior to the data lake, different business units within T3Go managed their own data processing solutions, utilizing different storage systems, ETL tools, and data processing frameworks. Data for each became siloed from every other unit, significantly increasing cost and complexity. Due to the rapid business expansion of T3Go, this inefficiency became our engineering bottleneck.
 
 We moved to a unified data lake solution based on Alibaba OSS, an object store similar to AWS S3, to provide a centralized location to store structured and unstructured data, following the design principles of  _Multi-cluster Shared-data Architecture_; all the applications access OSS storage as the source of truth, as opposed to different data silos. This architecture allows us to store the data as-is, without having to first structure the data, and run different types of analytics to gu [...]
 
-# II. Efficient Near Real-time Analytics Using Hudi
+## II. Efficient Near Real-time Analytics Using Hudi
 
 Our business in smart travel drives the need to process and analyze data in a near real-time manner. With a traditional data warehouse, we faced the following challenges:  
 
@@ -31,21 +31,21 @@ As a result, we adopted Apache Hudi on top of OSS to address these issues. The f
 
 ![architecture](/assets/images/blog/2020-12-01-t3go-architecture.png)
 
-## Enable Near real time data ingestion and analysis
+### Enable Near real time data ingestion and analysis
 
 With Hudi, our data lake supports multiple data sources including Kafka, MySQL binlog, GIS, and other business logs in near real time. As a result, more than 60% of the company’s data is stored in the data lake and this proportion continues to increase.
 
 We are also able to bring the data ingestion time down to a few minutes by introducing Apache Hudi into the data pipeline. Combined with big data interactive query and analysis frameworks such as Presto and SparkSQL, real-time data analysis and insights are achieved.
 
-## Enable Incremental processing pipeline
+### Enable Incremental processing pipeline
 
 With the help of Hudi, it is possible to provide incremental changes to the downstream derived table when the upstream table updates frequently. Even with a large number of interdependent tables, we can quickly run partial data updates. This also effectively avoids updating the full partitions of cold tables in the traditional Hive data warehouse.
 
-## Accessing Data using Hudi as a unified format
+### Accessing Data using Hudi as a unified format
 
 Traditional data warehouses often deploy Hadoop to store data and provide batch analysis. Kafka is used separately to distribute Hadoop data to other data processing frameworks, resulting in duplicated data. Hudi helps effectively solve this problem; we always use Spark pipelines to insert new updates into the Hudi tables, then incrementally read the update of Hudi tables. In other words, Hudi tables are used as the unified storage format to access data.
 
-# III. Efficient Data Caching Using Alluxio
+## III. Efficient Data Caching Using Alluxio
 
 In the early version of our data lake without Alluxio, data received from Kafka in real time was processed by Spark and then written to the OSS data lake using Hudi DeltaStreamer tasks. With this architecture, Spark often suffered high network latency when writing to OSS directly. Since all data is in OSS storage, OLAP queries on Hudi data may also be slow due to the lack of data locality.
 
@@ -57,21 +57,21 @@ Data in formats such as Hudi, Parquet, ORC, and JSON are stored mostly on OSS, c
 
 Specifically, here are a few applications leveraging Alluxio in the T3Go data lake.
 
-## Data lake ingestion
+### Data lake ingestion
 
 We mount the corresponding OSS path to the Alluxio file system and set Hudi’s _“target-base-path”_ parameter value to use the alluxio:// scheme in place of the oss:// scheme. Spark pipelines with Hudi continuously ingest data to Alluxio. After data is written to Alluxio, it is asynchronously persisted from the Alluxio cache to the remote OSS every minute. These modifications allow Spark to write to a local Alluxio node instead of writing to remote OSS, significantly reducing the time f [...]
 
-## Data analysis on the lake
+### Data analysis on the lake
 
 We use Presto as an ad-hoc query engine to analyze the Hudi tables in the lake, co-locating Alluxio workers on each Presto worker node. When Presto and Alluxio services are co-located and running, Alluxio caches the input data locally in the Presto worker which greatly benefits Presto for subsequent retrievals. On a cache hit, Presto can read from the local Alluxio worker storage at memory speed without any additional data transfer over the network.
 
-## Concurrent accesses across multiple storage systems
+### Concurrent accesses across multiple storage systems
 
 In order to ensure the accuracy of training samples, our machine learning team often synchronizes desensitized data in production to an offline machine learning environment. During synchronization, the data flows across multiple file systems, from production OSS to an offline HDFS followed by another offline Machine Learning HDFS.
 
 This data migration process is not only inefficient but also error-prone for modelers because multiple different storages with varying configurations are involved. Alluxio helps in this specific scenario by mounting the destination storage systems under the same filesystem to be accessed by their corresponding logical paths in the Alluxio namespace. By decoupling the physical storage, this allows applications with different APIs to access and transfer data seamlessly. This data access layout [...]
 
-## Microbenchmark
+### Microbenchmark
 
 Overall, we observed the following improvements with Alluxio:
 
@@ -89,12 +89,12 @@ In the stress test shown above, after the data volume is greater than a certain
 
 Based on our performance benchmarking, we found that the performance can be improved by over 10 times with the help of Alluxio. Furthermore, the larger the data scale, the more prominent the performance improvement.
 
-# IV. Next Step
+## IV. Next Step
 
 As T3Go’s data lake ecosystem expands, we will continue facing the critical scenario of compute and storage segregation. With T3Go’s growing data processing needs, our team plans to deploy Alluxio on a larger scale to accelerate our data lake storage.
 
 In addition to the deployment of Alluxio on the data lake computing engine, which currently is mainly SparkSQL, we plan to add a layer of Alluxio to the OLAP cluster using Apache Kylin and an ad-hoc cluster using Presto. The goal is to have Alluxio cover all computing scenarios, with Alluxio interconnected across scenarios to improve the read and write efficiency of the data lake and the surrounding ecosystem.
 
-# V. Conclusion
+## V. Conclusion
 
 As mentioned earlier, Hudi and Alluxio cover all scenarios of Hudi’s near real-time ingestion, near real-time analysis, incremental processing, and data distribution on DFS, among many others, and play the role of a powerful accelerator on data ingestion and data analysis on the lake. With Hudi and Alluxio together,  **our R&D engineers shortened the time for data ingestion into the lake by up to a factor of 2. Data analysts using Presto, Hudi, and Alluxio in conjunction to query data  [...]
diff --git a/website/blog/2021-01-27-hudi-clustering-intro.md b/website/blog/2021-01-27-hudi-clustering-intro.md
index 5f47ffe411..b55d41b33d 100644
--- a/website/blog/2021-01-27-hudi-clustering-intro.md
+++ b/website/blog/2021-01-27-hudi-clustering-intro.md
@@ -5,12 +5,12 @@ author: satish.kotha
 category: blog
 ---
 
-# Background
+## Background
 
 Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades significantly with a lot of small files. Also, during ingestion, data is typically co-l [...]
 <!--truncate-->
 
-# Clustering Architecture
+## Clustering Architecture
 
 At a high level, Hudi provides different operations such as insert/upsert/bulk_insert through its write client API to write data to a Hudi table. To choose a trade-off between file size and ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to configure the smallest allowable file size. Users are able to configure the small file [soft limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to `0` to force new data [...]
 
@@ -22,13 +22,13 @@ Clustering table service can run asynchronously or synchronously adding a new ac
 
   
 
-### Overall, there are 2 parts to clustering
+#### Overall, there are 2 parts to clustering
 
 1.  Scheduling clustering: Create a clustering plan using a pluggable clustering strategy.
 2.  Execute clustering: Process the plan using an execution strategy to create new files and replace old files.
     
 
-### Scheduling clustering
+#### Scheduling clustering
 
 The following steps are followed to schedule clustering.
 
@@ -37,7 +37,7 @@ Following steps are followed to schedule clustering.
 3.  Finally, the clustering plan is saved to the timeline in an avro [metadata format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc).
     
 
-### Running clustering
+#### Running clustering
 
 1.  Read the clustering plan and get the ‘clusteringGroups’ that mark the file groups that need to be clustered.
 2.  For each group, we instantiate the appropriate strategy class with strategyParams (example: sortColumns) and apply that strategy to rewrite the data.
@@ -51,7 +51,7 @@ NOTE: Clustering can only be scheduled for tables / partitions not receiving any
 ![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png)
 _Figure: Illustrating query performance improvements by clustering_
 
-### Setting up clustering
+#### Setting up clustering
 Inline clustering can be set up easily using Spark dataframe options. See the sample below:
 
 ```scala
@@ -83,7 +83,7 @@ df.write.format("org.apache.hudi").
 For more advanced use cases, an async clustering pipeline can also be set up. See an example [here](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob).
 
 
-# Table Query Performance
+## Table Query Performance
 
 We created a dataset from one partition of a known production style table with ~20M records and on-disk size of ~200GB. The dataset has rows for multiple “sessions”. Users always query this data using a predicate on session. Data for a single session is spread across multiple data files because ingestion groups data based on arrival time. The below experiment shows that by clustering on session, we are able to improve the data locality and reduce query execution time by more than 50%.
 
@@ -92,14 +92,14 @@ Query:
 spark.sql("select  *  from table where session_id=123")
 ```
 
-## Before Clustering
+### Before Clustering
 
 Query took 2.2 minutes to complete. Note that the number of output rows in the “scan parquet” part of the query plan includes all 20M rows in the table.
 
 ![Query Plan Before Clustering](/assets/images/blog/clustering/Query_Plan_Before_Clustering.png)
 _Figure: Spark SQL query details before clustering_
 
-## After Clustering
+### After Clustering
 
 The query plan is similar to the one above. But because of improved data locality and predicate pushdown, Spark is able to prune a lot of rows. After clustering, the same query only outputs 110K rows (out of 20M rows) while scanning parquet files. This cuts query time to less than a minute from 2.2 minutes.
 
@@ -118,7 +118,7 @@ Query runtime is reduced by 60% after clustering. Similar results were observed
 
 We expect dramatic speedup for large tables, where the query runtime is almost entirely dominated by actual I/O and not query planning, unlike the example above.
 
-# Summary
+## Summary
 
 Using clustering, we can improve query performance by
 1.  Leveraging concepts such as [space filling curves](https://en.wikipedia.org/wiki/Z-order_curve) to adapt data lake layout and reduce the amount of data read during queries.
diff --git a/website/blog/2021-03-01-hudi-file-sizing.md b/website/blog/2021-03-01-hudi-file-sizing.md
index 3d3049601d..463ee0259b 100644
--- a/website/blog/2021-03-01-hudi-file-sizing.md
+++ b/website/blog/2021-03-01-hudi-file-sizing.md
@@ -11,7 +11,7 @@ manual table maintenance. Having a lot of small files will make it harder to ach
 having to open/read/close files way too many times, to plan and execute queries. But for streaming data lake use-cases, 
 ingests are inherently going to end up having a smaller volume of writes, which might result in a lot of small files if no special handling is done.
 <!--truncate-->
-# During Write vs After Write
+## During Write vs After Write
 
 Common approaches to writing very small files and then later stitching them together solve for system scalability issues posed 
 by small files but might violate query SLAs by exposing small files to them. In fact, you can easily do so on a Hudi table, 
@@ -25,7 +25,7 @@ Hudi has the ability to maintain a configured target file size, when performing
 (Note: bulk_insert operation does not provide this functionality and is designed as a simpler replacement for 
 normal `spark.write.parquet`).
 
-## Configs
+### Configs
 
 For illustration purposes, we are going to consider only the COPY_ON_WRITE table.
 
@@ -41,7 +41,7 @@ would be considered a small file.
 
 If you wish to turn off this feature, set the config value for soft file limit to 0.
 
-## Example
+### Example
 
 Let’s say this is the layout of data files for a given partition.
 
diff --git a/website/blog/2021-08-18-virtual-keys.md b/website/blog/2021-08-18-virtual-keys.md
index c1ce8b5b09..57e44da270 100644
--- a/website/blog/2021-08-18-virtual-keys.md
+++ b/website/blog/2021-08-18-virtual-keys.md
@@ -13,13 +13,13 @@ In addition, it ensures data quality by ensuring unique key constraints are enfo
 But one of the repeated asks from the community is to leverage existing fields and not to add additional meta fields, for simple use-cases where such benefits are not desired or key changes are very rare.  
 <!--truncate-->
 
-# Virtual Key support
+## Virtual Key support
 Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from the data fields. Currently, the meta fields are 
 computed once and stored as per record metadata and re-used across various operations. If one does not need incremental query support, 
 they can start leveraging Hudi's Virtual key support and still go about using Hudi to build and manage their data lake to reduce the storage 
 overhead due to per record metadata. 
 
-## Configurations
+### Configurations
 Virtual keys can be enabled for a given table using the below config. When set to `hoodie.populate.meta.fields=false`, 
 Hudi will use virtual keys for the corresponding table. The default value for this config is `true`, which means all meta fields will be added by default.
 
@@ -36,24 +36,24 @@ would entail reading all fields out of base and delta logs, sacrificing core col
 for users. Thus, we support only simple key generators (the default key generator, where both record key and partition path refer
 to an existing field) for now.
 
-### Supported Key Generators with CopyOnWrite(COW) table:
+#### Supported Key Generators with CopyOnWrite(COW) table:
 SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator. 
 
-### Supported Key Generators with MergeOnRead(MOR) table:
+#### Supported Key Generators with MergeOnRead(MOR) table:
 SimpleKeyGenerator
 
-### Supported Index types: 
+#### Supported Index types: 
 Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index 
 (BLOOM, etc) in future releases. 
 
-## Supported Operations
+### Supported Operations
 All existing features are supported for a hudi table with virtual keys, except incremental 
 queries. This means cleaning, archiving, metadata table, clustering, etc. can be enabled for a hudi table with 
 virtual keys enabled. So, you can merely use Hudi as a transactional table format with all the awesome 
 table service runtimes and platform services, if you wish to do so, without incurring any overheads associated with 
 support for incremental data processing.
 
-## Sample Output
+### Sample Output
 As called out earlier, one has to set `hoodie.populate.meta.fields=false` to enable virtual keys. Let's see the 
 difference between records of a hudi table with and without virtual keys.
 
@@ -99,7 +99,7 @@ And here are some sample records for a hudi table with virtual keys enabled.
 As you can see, all meta fields are null in storage, but all user fields remain intact, similar to a regular table.
 :::
 
-## Incremental Queries
+### Incremental Queries
 Since hudi does not maintain any metadata (like commit time at a record level) for a table with virtual keys enabled,  
 incremental queries are not supported. An exception will be thrown as below when an incremental query is triggered for such
 a table.
@@ -121,7 +121,7 @@ org.apache.hudi.exception.HoodieException: Incremental queries are not supported
   ... 61 elided
 ```
 
-## Conclusion 
+### Conclusion 
 We hope this blog was useful for learning yet another feature in Apache Hudi. If you are interested in 
 Hudi and looking to contribute, do check out [here](https://hudi.apache.org/contribute/get-involved). 
 
diff --git a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
index f45ace997a..1072ace5f7 100644
--- a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
+++ b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
@@ -14,18 +14,17 @@ Having had the good fortune of working on diverse database projects - an RDBMS (
 
 First, let's set the record straight. RDBMS databases offer the richest set of transactional capabilities and the widest array of concurrency control [mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html). Different isolation levels, fine grained locking, deadlock detection/avoidance, and more are possible because they have to support row-level mutations and reads across many tables while enforcing [key constraints](https://dev.mysql.com/doc/refman/8. [...]
 
-# Pitfalls in Lake Concurrency Control
+### Pitfalls in Lake Concurrency Control
 
 Historically, data lakes have been viewed as batch jobs reading/writing files on cloud storage and it's interesting to see how most new work extends this view and implements glorified file version control using some form of "[**Optimistic concurrency control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)" (OCC). With OCC, jobs take a table-level lock to check if they have impacted overlapping files and, if a conflict exists, they abort their operations completely. Without [...]
 
 Imagine a real-life scenario of two writer processes : an ingest writer job producing new data every 30 minutes and a deletion writer job that is enforcing GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap files with random deletes, and the deletion job is almost guaranteed to starve and fail to commit each time. In database speak, mixing long running transactions with optimism leads to disappointment, since the longer the transactions the higher the probabilit [...]
 
 ![concurrency](/assets/images/blog/concurrency/ConcurrencyControlConflicts.png)
-                static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png
 
 So, what's the alternative? Locking? Wikipedia also says - "_However, locking-based ("pessimistic") methods also can deliver poor performance because locking can drastically limit effective concurrency even when deadlocks are avoided."._ Here is where Hudi takes a different approach, that we believe is more apt for modern lake transactions which are typically long-running and even continuous. Data lake workloads share more characteristics with high throughput stream processing jobs, than [...]
 
-# Model 1 : Single Writer, Inline Table Services
+### Model 1 : Single Writer, Inline Table Services
 
 The simplest form of concurrency control is just no concurrency at all. A data lake table often has common services operating on it to ensure efficiency. Reclaiming storage space from older versions and logs, coalescing files (clustering in Hudi), merging deltas (compactions in Hudi), and more. Hudi can simply eliminate the need for concurrency control and maximizes throughput by supporting these table services out-of-box and running inline after every write to the table.
 
@@ -33,13 +32,13 @@ Execution plans are idempotent, persisted to the timeline and auto-recover from
 
 ![concurrency-single-writer](/assets/images/blog/concurrency/SingleWriterInline.gif)
 
-# Model 2 : Single Writer, Async Table Services
+### Model 2 : Single Writer, Async Table Services
 
 Our delete/ingest example above isn't really that simple. While the ingest/writer may just be updating the last N partitions on the table, deletes may even span the entire table. Mixing them in the same job could slow down ingest latency by a lot. But Hudi provides the option of running the table services in an async fashion, where most of the heavy lifting (e.g. actually rewriting the columnar data by the compaction service) is done asynchronously, eliminating any repeated wasteful retr [...]
 
 ![concurrency-async](/assets/images/blog/concurrency/SingleWriterAsync.gif)
 
-# Model 3 : Multiple Writers
+### Model 3 : Multiple Writers
 
 But it's not always possible to serialize the deletes into the same write stream, or SQL-based deletes are required. With multiple distributed processes, some form of locking is inevitable, but, like real databases, Hudi's concurrency model is intelligent enough to differentiate actual writing to the table from table services that manage or optimize the table. Hudi offers similar optimistic concurrency control across multiple writers, but table services can still execute completely lock-fr [...]
 
diff --git a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
index 2ebe72fea6..cda7b5c66e 100644
--- a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
+++ b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
@@ -10,7 +10,7 @@ As of Hudi v0.10.0, we are excited to introduce support for an advanced Data Lay
 
 <!--truncate-->
 
-## Background
+### Background
 
 The Amazon EMR team recently published a [great article](https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/) showcasing how [clustering](https://hudi.apache.org/docs/clustering) your data can improve your _query performance_.
 
@@ -71,7 +71,7 @@ In a similar fashion, Hilbert curves also allow you to map points in a N-dimensi
 
 Now, let's check it out in action!
 
-# Setup
+### Setup
 We will use the [Amazon Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) dataset again, but this time we will use Hudi to Z-Order by the `product_id`, `customer_id` column tuple instead of Clustering or _linear ordering_.
 
 No special preparation is required for the dataset; you can simply download it from [S3](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) in Parquet format and use it directly as an input for Spark, ingesting it into a Hudi table.
@@ -150,7 +150,7 @@ df.write.format("hudi")
 
 
 
-# Testing
+### Testing
 Please keep in mind that each individual test is run in a separate spark-shell to avoid caching getting in the way of our measurements.
 
 ```scala
@@ -300,7 +300,7 @@ scala> runQuery3(dataSkippingSnapshotTableName)
 +-----------+-----------+
 ```
 
-# Results
+### Results
 We've summarized the measured performance metrics below:
 
 | **Query** | **Baseline (B)** duration (files scanned / size) | **Linear Sorting (S)** | **Z-order (Z)** duration (scanned) | **Hilbert (H)** duration (scanned) |
@@ -315,6 +315,6 @@ Which is a very clear contrast with space-filling curves (both Z-order and Hilbe
 
 It's worth noting that the performance gains are heavily dependent on your underlying data and queries. In benchmarks on our internal data we were able to achieve query performance improvements of more than **11x!**
 
-# Epilogue
+### Epilogue
 
 Apache Hudi v0.10 brings the new layout optimization capabilities Z-order and Hilbert curves to open source. Using these industry-leading layout optimization techniques can bring substantial performance improvements and cost savings to your queries!
diff --git a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
index 9fa5b5abb9..538145b3a9 100644
--- a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
+++ b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
@@ -5,6 +5,7 @@ authors:
 - name: Gabriele Cacciola
 - name: Udit Mehrotra
 category: blog
+image: /assets/images/powers/aws.jpg
 ---
 
 import Redirect from '@site/src/components/Redirect';
diff --git a/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
new file mode 100644
index 0000000000..34e818702f
Binary files /dev/null and b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png differ
diff --git a/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
new file mode 100644
index 0000000000..a836cf7365
Binary files /dev/null and b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg differ
diff --git a/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png
new file mode 100644
index 0000000000..86f44ee020
Binary files /dev/null and b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png differ
diff --git a/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
new file mode 100644
index 0000000000..624264dacb
Binary files /dev/null and b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png differ
diff --git a/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
new file mode 100644
index 0000000000..9e95594741
Binary files /dev/null and b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png differ
diff --git a/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
new file mode 100644
index 0000000000..118839e543
Binary files /dev/null and b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png differ
diff --git a/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
new file mode 100644
index 0000000000..134fb77f64
Binary files /dev/null and b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png differ
diff --git a/website/static/assets/images/blog/2022-05-17-multimodal-index.gif b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif
new file mode 100644
index 0000000000..3e705205a6
Binary files /dev/null and b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif differ