Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/10 22:48:09 UTC

[GitHub] [incubator-hudi] vinothchandar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

vinothchandar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612252471
 
 
   @vontman @ahmed-elfar First of all, thanks for all the detailed information! 
   
   Answers to the good questions you raised 
   
   > Is that the normal time for initial loading for Hudi tables, or are we doing something wrong?
   
   It's hard to know what "normal" time is, since it depends on schema, machine, and so many other things. But it shouldn't be this far off. I've tried to explain a few things below. 
   
   > Do we need a better cluster/resources to be able to load the data for the first time? Because it is mentioned on the Hudi confluence page that COW bulk_insert should match vanilla parquet writing + sort only.
   
   If you are trying to ultimately migrate a table (using bulk_insert once) and then do updates/deletes, I suggest testing upserts/deletes rather than bulk_insert. If you primarily want to do bulk_insert alone to get the other benefits of Hudi, happy to work with you more and resolve this. Perf is a major push for the next release, so we can definitely collaborate here. Rough sketch of that migrate-then-upsert flow below.
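   
   A minimal sketch of that flow, assuming Spark with the Hudi datasource; the table name, column names (`uuid`, `ts`, `dt`), paths, and `incrementalDF` are placeholders you'd swap for your own:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Common Hudi write options (placeholder table/column names)
   val hudiOptions = Map(
     "hoodie.table.name" -> "my_table",
     "hoodie.datasource.write.recordkey.field" -> "uuid",     // your primary key column
     "hoodie.datasource.write.precombine.field" -> "ts",      // column used to pick the latest record
     "hoodie.datasource.write.partitionpath.field" -> "dt"    // partition column, if any
   )
   
   // One-time migration of the existing parquet data
   spark.read.parquet("s3://bucket/source_parquet")           // placeholder input path
     .write.format("org.apache.hudi")
     .options(hudiOptions)
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .mode(SaveMode.Overwrite)
     .save("s3://bucket/hudi_table")                          // placeholder target path
   
   // Steady-state incremental writes: this is the path worth benchmarking
   // incrementalDF: a DataFrame of new/changed records (placeholder)
   incrementalDF
     .write.format("org.apache.hudi")
     .options(hudiOptions)
     .option("hoodie.datasource.write.operation", "upsert")
     .mode(SaveMode.Append)
     .save("s3://bucket/hudi_table")
   ```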
   
   
   > Does partitioning improve the upsert and/or compaction time for Hudi tables, or does it just improve the analytical queries (partition pruning)?
   
   Partitioning obviously benefits query performance. But for writing itself, the data size matters more, I would say. 
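   
   On the query side, pruning kicks in when you filter on the partition column. A small sketch, reusing the placeholder `dt` partition column from above (the glob depth depends on how many partition levels your table has):
   
   ```scala
   import org.apache.spark.sql.functions.col
   
   // Read the Hudi table through the datasource; the glob matches partition dirs + files
   val hudiDF = spark.read.format("org.apache.hudi")
     .load("s3://bucket/hudi_table/*/*")
   
   // Filtering on the partition column lets Spark skip whole partitions
   hudiDF.filter(col("dt") === "2020-04-01").count()
   ```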
   
   > We have noticed that most of the time is spent in the data indexing (the bulk-insert logic itself) and not in the sorting stages/operations before the indexing, so how can we improve that? Should we provide our own indexing logic?
   
   Nope, you don't have to supply your own indexing or anything. Bulk insert does not do any indexing; it does a global sort (so we can pack records belonging to the same partition into the same files as much as possible) and then writes out the files. 
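   
   Conceptually (not Hudi's actual code), the heavy lifting in bulk_insert is roughly equivalent to this kind of global sort before writing, which is also why skew in the sort keys shows up as long-running tasks; `inputDF`, `dt`, and `uuid` are the placeholder names from the earlier sketch:
   
   ```scala
   import org.apache.spark.sql.functions.col
   
   // Illustrative approximation only: cluster records by (partition path, record key)
   // so that records of the same partition land in the same output files.
   val globallySorted = inputDF.sort(col("dt"), col("uuid"))
   
   // Hudi then writes these out as properly sized files; a plain parquet write of the
   // sorted data is a rough proxy when comparing against spark.write.parquet timings.
   globallySorted.write.parquet("s3://bucket/sorted_baseline")
   ```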
   
   
   **A few observations:** 
   
   - The 47 min job is GC-ing quite a bit, and that can affect throughput a lot. Have you tried tuning the JVM? (There's a starting-point sketch after this list.)
   - I do see a fair bit of skew here from sorting, which may be affecting overall run times. #1149 is also trying to provide a non-sorted mode, which trades off file sizing for potentially faster writing.
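   
   A hedged starting point for the GC tuning; the memory size and GC flags here are illustrative and should be tuned to your executors (the same settings can also be passed as `--conf` flags to spark-submit):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Illustrative GC/JVM settings; adjust sizes to your cluster
   val spark = SparkSession.builder()
     .appName("hudi-bulk-insert")
     .config("spark.executor.memory", "8g")                   // placeholder size
     .config("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") // G1 + GC logs to see what changes
     .getOrCreate()
   ```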
   
   On what could create the difference between bulk_insert and plain spark/parquet writes:
   
   - I would also set `"hoodie.parquet.compression.codec" -> "SNAPPY"`, since Hudi uses gzip compression by default, whereas spark.write.parquet will use snappy (see the sketch after this list). 
   - Hudi currently does an extra `df.rdd` conversion that could affect bulk_insert/insert (upsert/delete workloads are bound by merge costs, so this matters less there). I don't see that in your UI though. 
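   
   For an apples-to-apples comparison against the plain parquet write, something like this (reusing the placeholder `hudiOptions`, `df`, and paths from the earlier sketch):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Switch Hudi's parquet codec to snappy to match spark's default
   df.write.format("org.apache.hudi")
     .options(hudiOptions)
     .option("hoodie.parquet.compression.codec", "SNAPPY")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .mode(SaveMode.Append)
     .save("s3://bucket/hudi_table")
   
   // Plain spark parquet baseline for the same DataFrame (snappy is spark's default codec)
   df.write.parquet("s3://bucket/plain_parquet_baseline")
   ```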
   
   
   
   
