You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/03 15:20:52 UTC

[GitHub] [hudi] codope commented on issue #4891: Clustering not working on large table and partitions

codope commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1058149325


   > How does normally people perform inline clustering or async clustering on partitions with large amount of data? Do you expect driver memory should be larger than the clustering data size?
   
   Currently, [the union](https://github.com/apache/hudi/blob/a4ba0fff07861bdff338074fb5d84726f430883f/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java#L108) will collect hudi records (as per the plan generated for corresponding replacecommit) into the driver.
   
   > What are the configurations I should be using to perform clustering on these large tables?
   > In PROD I will have 1.8 billion records (each record 3-4 KB in memory) on each day, so is it advised to perform clustering frequently (every 10 to 20 commits) or daily?
   
   As you said, you could run clustering more frequently. Configs will also depend on clustering strategy. Looking at your configs, you can adjust the `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to a lower value. You could also tune `hoodie.clustering.plan.strategy.small.file.limit` to a value such that even though each round of clustering is picking just a few partitions but it is able to utilize the memory as much as possible. What's the nature of your workload? Is it like append-only, or frequent upserts but to more recent partitions? 
   
   > Does MOR table supports async clustering with OCC assurance?
   
   Yes, if the lock provider is configured.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org