
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/08 05:29:03 UTC

[GitHub] [hudi] FelixKJose commented on issue #4891: Clustering not working on large table and partitions

FelixKJose commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1061421066


   @codope @suryaprasanna Thank you for the detailed information.
   
   A couple of questions:
   1. Let's say each of my (date) partitions is large (e.g. 6.5 TB of uncompressed data), so frequent async clustering is the suggested approach, right? I am running on r5.4xlarge (meaning 37 GB of driver memory), so what would be the best clustering frequency? And what would be the best value for hoodie.clustering.plan.strategy.small.file.limit?
   2. Which lock provider is advised if I am running on AWS EMR?
   
   Note: Our requirement is to ingest data quickly while also supporting interactive workloads on the query side.

