Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/26 15:30:28 UTC

[GitHub] [hudi] codope commented on a change in pull request #3525: [HUDI-2346] Async clustering usage blog

codope commented on a change in pull request #3525:
URL: https://github.com/apache/hudi/pull/3525#discussion_r696744033



##########
File path: website/blog/2021-08-23-async-clustering.md
##########
@@ -0,0 +1,153 @@
+---
+title: "Asynchronous Clustering using Hudi"
+excerpt: "How to set up Hudi for asynchronous clustering"
+author: codope 
+category: blog
+---
+
+In a [previous blog post](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro), we introduced a new
+kind of table service called clustering, which reorganizes data for improved query performance without compromising on
+ingestion speed. We learnt how to set up inline clustering. In this post, we will discuss what has changed since then and
+see how asynchronous clustering can be set up using HoodieClusteringJob as well as the DeltaStreamer utility.
+
+## Introduction
+
+At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific
+criteria and then executes the plan. Hudi's [MVCC model](https://hudi.apache.org/docs/concurrency_control) provides
+snapshot isolation between multiple table services, which allows writers to continue with ingestion while clustering
+runs in the background. For a more detailed overview of the clustering architecture, please check out the previous blog
+post.
+
+## Clustering Strategies
+
+As mentioned before, both the clustering plan and its execution depend on configurable strategies. These strategies can be
+broadly classified into three types: clustering plan strategy, execution strategy and update strategy.
+
+### Plan Strategy
+
+This strategy comes into play while creating the clustering plan. It helps to decide which file groups should be clustered.
+Let's look at the different plan strategies that are available with Hudi. Note that these strategies are easily pluggable
+using this [config](https://hudi.apache.org/docs/next/configurations#hoodieclusteringplanstrategyclass).
+
+1. `SparkSizeBasedClusteringPlanStrategy`: It selects file slices based on
+   the [small file limit](https://hudi.apache.org/docs/next/configurations/#hoodieclusteringplanstrategysmallfilelimit)
+   of base files and creates clustering groups up to the maximum file size allowed per group. The max size can be specified using

Review comment:
       Good suggestion. Added a couple of lines.



