Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/23 20:21:37 UTC

[GitHub] [hudi] vinothchandar commented on issue #1757: Slow Bulk Insert Performance [SUPPORT]

vinothchandar commented on issue #1757:
URL: https://github.com/apache/hudi/issues/1757#issuecomment-648396783


   @somebol assuming this is an initial load, and that after it completes you would write incrementally with insert/upsert operations?
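
   A minimal sketch of that pattern via the Spark DataSource writer (the table name, base path, field names, and the `initialDf`/`incrementalDf` DataFrames are placeholders, not taken from your setup):

   ```scala
   // Sketch only: "uuid", "ts", "partition" and the paths are illustrative.
   import org.apache.spark.sql.SaveMode

   val basePath = "/tmp/hudi/my_table"
   val hudiOpts = Map(
     "hoodie.table.name"                           -> "my_table",
     "hoodie.datasource.write.recordkey.field"     -> "uuid",
     "hoodie.datasource.write.precombine.field"    -> "ts",
     "hoodie.datasource.write.partitionpath.field" -> "partition"
   )

   // One-time initial load: bulk_insert sorts the input and writes base files directly.
   initialDf.write.format("org.apache.hudi")
     .options(hudiOpts)
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .mode(SaveMode.Overwrite)
     .save(basePath)

   // Each subsequent batch: switch the operation to upsert (or insert).
   incrementalDf.write.format("org.apache.hudi")
     .options(hudiOpts)
     .option("hoodie.datasource.write.operation", "upsert")
     .mode(SaveMode.Append)
     .save(basePath)
   ```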
   
   At a high level, `bulk_insert` sorts the input and writes out the data. From what I can tell, you have sufficient parallelism, but a bunch of tasks are failing, and the retries are probably adding a lot of time to the runs (stages 2 and 4). Could you look into how skewed the task runtimes within those stages are?
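
   If it is easier than eyeballing the Spark UI, here is a rough, non-Hudi-specific sketch for pulling per-stage task durations with the stock SparkListener API (it assumes the usual `spark` session in the shell; the stage ids are whatever your run shows, e.g. 2 and 4):

   ```scala
   // Register before the bulk_insert runs, then call skewReport() per stage id.
   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
   import scala.collection.mutable

   val durationsByStage = mutable.Map[Int, mutable.ArrayBuffer[Long]]()

   spark.sparkContext.addSparkListener(new SparkListener {
     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
       durationsByStage.synchronized {
         durationsByStage.getOrElseUpdate(taskEnd.stageId, mutable.ArrayBuffer.empty[Long]) +=
           taskEnd.taskInfo.duration // wall-clock millis for this task
       }
   })

   // After the write finishes, compare median vs. max task time in a stage.
   def skewReport(stageId: Int): Unit =
     durationsByStage.get(stageId).foreach { d =>
       val sorted = d.sorted
       println(s"stage $stageId: tasks=${sorted.size} " +
         s"median=${sorted(sorted.size / 2)}ms max=${sorted.last}ms")
     }
   ```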
   
   P.S:  We do incur the cost of the Row -> GenericRecord -> Parquet conversion (@nsivabalan has a branch with a fix that will make it into 0.6.0).

