Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/07 22:47:17 UTC

[GitHub] [hudi] suryaprasanna commented on issue #4891: Clustering not working on large table and partitions

suryaprasanna commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1061218820

@codope summed it up really well.
I just want to expand on how to use clustering. I have only used clustering on COW tables, so my knowledge is limited to that.

Clustering can run either inline or async. Inline clustering is preferable if you want to cluster the recent changes written by the ingestion job.
Async clustering is the better fit for clustering older partitions (for example, if you want to sort your entire table on a specific column), or if you want to detach clustering from the ingestion job so that you don't overload it with clustering rewrites.
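
To make the two modes concrete, here is a minimal sketch of the writer options involved, written as plain Python dicts. The keys are the ones I believe Hudi uses for inline and async clustering, but please verify them against the config reference for your Hudi version; the sort column `event_ts` and the commit count are made-up values for illustration.

```python
# Sketch only: option keys should be checked against your Hudi version's docs.

# Inline: clustering is scheduled and executed as part of the ingestion write.
inline_clustering_options = {
    "hoodie.clustering.inline": "true",
    # Assumed value: attempt clustering after every 4 commits.
    "hoodie.clustering.inline.max.commits": "4",
}

# Async: clustering is decoupled from the ingestion write.
async_clustering_options = {
    "hoodie.clustering.async.enabled": "true",
    # Hypothetical column name, used when sorting data during clustering.
    "hoodie.clustering.plan.strategy.sort.columns": "event_ts",
}
```

Either dict would be merged into the options of the job that writes or clusters the table.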

Clustering an entire partition is resource intensive, and the failure rate may also be high. So it is better to run it async, so that the ingestion job is not waiting for inline clustering to complete.
Another useful optimization when running clustering for backfills is to run one job per group of partitions instead of one job for all the partitions.
**Example:** if you need to cluster partitions from 2017/01/01 through 2022/02/28, it is better to work out how many partitions you can process in one batch instead of running them all in one go. That number depends on your available resources and how soon you want all partitions sorted.

The grouping of partitions is done with the **PARTITION_FILTER_BEGIN_PARTITION** and **PARTITION_FILTER_END_PARTITION** configs in the HoodieClusteringConfig class.
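
As a rough sketch of the grouping idea, the Python below splits the date range from the example into one group per calendar month and builds the filter options for each group. The helper names are hypothetical, and the option keys are what I believe the two configs above map to (plus the filter mode that enables them), so treat them as assumptions and check them against HoodieClusteringConfig in your Hudi version. It also assumes partition paths of the form `yyyy/MM/dd`.

```python
from datetime import date, timedelta


def monthly_partition_groups(start, end):
    """Split [start, end] into (begin, end) partition paths, one pair per calendar month."""
    groups = []
    cursor = start
    while cursor <= end:
        # First day of the month after the cursor's month.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        # Last day of this group, capped at the overall end date.
        group_end = min(next_month - timedelta(days=1), end)
        groups.append((cursor.strftime("%Y/%m/%d"), group_end.strftime("%Y/%m/%d")))
        cursor = next_month
    return groups


def clustering_filter_options(begin_partition, end_partition):
    """Build partition-filter options for one clustering job (keys are assumptions)."""
    return {
        "hoodie.clustering.plan.partition.filter.mode": "SELECTED_PARTITIONS",
        "hoodie.clustering.plan.strategy.cluster.begin.partition": begin_partition,
        "hoodie.clustering.plan.strategy.cluster.end.partition": end_partition,
    }


groups = monthly_partition_groups(date(2017, 1, 1), date(2022, 2, 28))
first_job_options = clustering_filter_options(*groups[0])
```

One month per group is just an example; pick whatever group size your resources can handle.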

In addition to grouping, you can also schedule multiple clustering jobs in parallel, say one job for all partitions in 2017/01 and another job for 2017/02, and so on. That way you make full use of the async nature of clustering to complete the desired partitions sooner.

**Note:**
To execute async clustering jobs in parallel, you may need to write some extra logic in the service that starts these clustering jobs, and that logic should also handle the failure scenarios.
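
The extra logic could look something like the sketch below: a small Python driver that runs one job per partition group with bounded parallelism, retries failures, and reports the groups that never succeeded. This is not part of Hudi; `job` is a hypothetical callable standing in for however your service actually submits a clustering job (for example, launching a Spark application), and the retry count and parallelism are arbitrary defaults.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(job, group, max_attempts=3):
    """Call job(group), retrying on any exception; return (group, succeeded)."""
    for attempt in range(1, max_attempts + 1):
        try:
            job(group)
            return group, True
        except Exception as exc:
            print(f"group {group}: attempt {attempt} failed: {exc}")
    return group, False


def run_groups_in_parallel(job, groups, max_parallel=2, max_attempts=3):
    """Run one job per partition group; return the sorted list of groups that failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_with_retries, job, g, max_attempts) for g in groups]
        for future in as_completed(futures):
            group, ok = future.result()
            if not ok:
                failed.append(group)
    return sorted(failed)
```

The returned list of failed groups can be logged or fed back into a later run, so a single bad partition group does not block the rest of the backfill.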
