Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/29 00:31:37 UTC

[GitHub] [hudi] yihua commented on issue #5481: [SUPPORT] Slow Upsert When Reloading Data into Hudi Table

yihua commented on issue #5481:
URL: https://github.com/apache/hudi/issues/5481#issuecomment-1169413106

   @MikeBuh sorry for getting back late.  If you still haven't figured out the right configs, [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) is a more detailed tuning guide for upserts.  At this point, tuning other configs is also worth trying: for example, setting `spark.memory.fraction` and `spark.memory.storageFraction` to small values (e.g., 0.2, as mentioned in the guide) makes the job reliably slow instead of crashing intermittently.
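
As a rough sketch of the memory settings above, the snippet below renders them as `spark-submit` arguments. The 0.2 values are the ones quoted from the guide; the helper name `to_submit_args` is hypothetical, and the right values for a given job are an assumption to validate against your own workload.

```python
# Values taken from the comment above (0.2); a starting point, not a recommendation.
memory_conf = {
    "spark.memory.fraction": "0.2",
    "spark.memory.storageFraction": "0.2",
}


def to_submit_args(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Join the arguments into a single string that can be pasted after `spark-submit`.
print(" ".join(to_submit_args(memory_conf)))
```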
   
   > size of target table: I noticed that reloading the same batches to a near empty table is more successful
   
   Because the upsert first indexes the input against the existing records in the data files, the target table size also matters when choosing the parallelism.  You may try 1000 for the parallelism.
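
One way the suggested parallelism might be expressed as Hudi write options is sketched below. The comment does not say which key is meant; `hoodie.upsert.shuffle.parallelism` and `hoodie.bloom.index.parallelism` are assumed here, and the helper name is hypothetical.

```python
def upsert_parallelism_options(parallelism=1000):
    """Build Hudi write options for upsert and bloom index parallelism.

    Which keys to set is an assumption; the comment only says "parallelism".
    """
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return {
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
        "hoodie.bloom.index.parallelism": str(parallelism),
    }


options = upsert_parallelism_options(1000)
for key, value in options.items():
    print(f"{key}={value}")
```

With PySpark, a dict like this could be passed to the writer with `.options(**options)` alongside the table's other Hudi options.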
   
   > file sizes: maybe having less but larger files in the target table can help when comparing and updating
   
   Fewer, larger files in the target table definitely help the indexing phase during upsert, as fewer bloom filters are read from the file footers.  Yet tuning the memory and fraction settings should still make it work with small files.
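
To put a number on the file-count effect, here is a back-of-the-envelope sketch. The 1 TiB table size and the 16 MiB and 128 MiB file sizes are hypothetical, and the file count is only an upper bound on footers read, since the index may prune some files before checking them. The two `hoodie.parquet.*` keys are an assumption about how one might steer file sizes; the comment does not name them.

```python
import math

MIB = 1024 * 1024


def file_count(table_bytes, avg_file_bytes):
    """Approximate number of data files (and so footers) for a table."""
    return math.ceil(table_bytes / avg_file_bytes)


table_bytes = 1024 * 1024 * MIB  # hypothetical 1 TiB table
small_files = file_count(table_bytes, 16 * MIB)   # 65536 files
large_files = file_count(table_bytes, 128 * MIB)  # 8192 files
print(small_files, large_files)

# Assumed sizing knobs, in bytes; the values are illustrative only.
sizing_options = {
    "hoodie.parquet.max.file.size": str(128 * MIB),
    "hoodie.parquet.small.file.limit": str(100 * MIB),
}
```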
   
   > compaction and cleanup: if these are heavy operations that need lots of memory then perhaps they can be tweaked
   
   To start with, you may disable async compaction and clean services so that they don't interfere with the ingestion.
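
A minimal sketch of switching these services off through write options follows. It assumes a Spark datasource writer and the keys `hoodie.datasource.compaction.async.enable` and `hoodie.clean.async`; other ingestion paths may use different switches, and whether compaction and cleaning then run inline depends on other settings not covered here.

```python
def disable_async_services(options):
    """Return a copy of the write options with async compaction and cleaning off."""
    updated = dict(options)  # copy so the caller's dict is left untouched
    updated["hoodie.datasource.compaction.async.enable"] = "false"
    updated["hoodie.clean.async"] = "false"
    return updated


base = {"hoodie.datasource.write.operation": "upsert"}
print(disable_async_services(base))
```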


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org