Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/14 06:22:16 UTC

[GitHub] [hudi] codope commented on issue #4891: Clustering not working on large table and partitions

codope commented on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1246296509

   We had benchmarked this. With multiple Spark jobs, we cannot avoid `union`. The good news is that clustering does not chain `rdd.union` calls; instead it uses `context.union`, which is slightly better (see the first sketch below). The benchmark revealed that writing a Parquet column as a whole incurs high overhead. Another thing that hogs memory is the bytes-to-Avro conversion (second sketch below). More details are in HUDI-2949. The fix entailed changes in [parquet-mr](https://github.com/apache/parquet-mr/commit/06bb358bcf8a0855c54f20122a57a88d9fde16c1). That fix has been merged, but we have not yet upgraded the Parquet version in Hudi. Created HUDI-4840 to track the upgrade.
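   For context, here is a minimal, hypothetical Scala sketch of the difference between the two union styles. It is not Hudi's clustering code and the names are illustrative: chaining `RDD.union` wraps each result in a new `UnionRDD`, so the lineage grows one level per call, while `SparkContext.union` builds a single flat `UnionRDD` over all inputs.

   ```scala
   // Sketch only: contrasts chained rdd.union with context.union.
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.SparkSession

   object UnionLineageSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[*]").appName("union-sketch").getOrCreate()
       val sc = spark.sparkContext

       // Stand-ins for per-clustering-group RDDs (illustrative data).
       val groups: Seq[RDD[Int]] = (1 to 50).map(i => sc.parallelize(Seq(i)))

       // Chained rdd.union: each call nests the previous result inside a
       // new UnionRDD, so the lineage here is ~50 levels deep.
       val chained = groups.reduce(_ union _)

       // context.union: one UnionRDD over all 50 inputs, flat lineage.
       val flat = sc.union(groups)

       println(s"chained lineage lines: ${chained.toDebugString.count(_ == '\n')}")
       println(s"flat lineage lines:    ${flat.toDebugString.count(_ == '\n')}")

       spark.stop()
     }
   }
   ```

   Printing `toDebugString` on both makes the difference visible: the chained version shows a deeply nested DAG, the flat version a single shallow one.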
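   And a minimal sketch of the kind of bytes-to-Avro round trip referred to above, using the plain Apache Avro API with an illustrative schema (again, this is an assumption-laden example, not Hudi's actual code path): each decode allocates a fresh `GenericRecord` and each encode a fresh `byte[]`, which is where memory pressure builds up when this happens per record at scale.

   ```scala
   import java.io.ByteArrayOutputStream
   import org.apache.avro.Schema
   import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
   import org.apache.avro.io.{DecoderFactory, EncoderFactory}

   object AvroRoundTripSketch {
     // Illustrative single-field schema, not a real Hudi record schema.
     val schema: Schema = new Schema.Parser().parse(
       """{"type":"record","name":"Rec","fields":[{"name":"key","type":"string"}]}""")

     // Record -> bytes: serializes with a binary encoder; note the fresh
     // byte[] allocated for every record.
     def toBytes(rec: GenericRecord): Array[Byte] = {
       val out = new ByteArrayOutputStream()
       val encoder = EncoderFactory.get().binaryEncoder(out, null)
       new GenericDatumWriter[GenericRecord](schema).write(rec, encoder)
       encoder.flush()
       out.toByteArray
     }

     // Bytes -> record: decoding allocates a new GenericRecord per call;
     // done per record, this churn is what hogs memory.
     def fromBytes(bytes: Array[Byte]): GenericRecord = {
       val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
       new GenericDatumReader[GenericRecord](schema).read(null, decoder)
     }

     def main(args: Array[String]): Unit = {
       val rec = new GenericData.Record(schema)
       rec.put("key", "k1")
       println(fromBytes(toBytes(rec)).get("key"))  // prints: k1
     }
   }
   ```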

