Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/25 15:35:31 UTC

[GitHub] [hudi] MikeBuh commented on issue #3751: [SUPPORT] Slow Write Speeds to Hudi

MikeBuh commented on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-1108731683

   Hi @codope, I did a rework and got a more stable version working; however, I still have some things I would like to clarify. Before going into the details, please also consider the following:
   
   **Hudi Changes**
   - We have recently upgraded to Hudi 0.10.0, and all our tables have been updated accordingly
   - The table(s) in question use the BLOOM index (previously this was GLOBAL_BLOOM)
   - _hoodie.payload.ordering.field_ has been set to the same value as _hoodie.datasource.write.precombine.field_
   - _hoodie.upsert.shuffle.parallelism_ has been set to the same value as _spark.sparkContext.defaultParallelism.toString_ (a sketch of how these options fit together follows this list)
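   
   For reference, a minimal sketch of how the options above are wired into the writer; the table name, base path, and the `id` / `updatedAt` columns are placeholders rather than our exact schema:
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
   
   // Upsert a DataFrame into the Hudi table with the settings described above.
   def writeUpsert(spark: SparkSession, df: DataFrame, basePath: String): Unit = {
     df.write
       .format("hudi")
       .option("hoodie.table.name", "eventTable")                      // placeholder name
       .option("hoodie.datasource.write.operation", "upsert")
       .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key column
       .option("hoodie.datasource.write.precombine.field", "updatedAt")
       // payload ordering field kept aligned with the precombine field
       .option("hoodie.payload.ordering.field", "updatedAt")
       // switched from GLOBAL_BLOOM to BLOOM
       .option("hoodie.index.type", "BLOOM")
       // upsert parallelism derived from the cluster's default parallelism
       .option("hoodie.upsert.shuffle.parallelism",
         spark.sparkContext.defaultParallelism.toString)
       .mode(SaveMode.Append)
       .save(basePath)
   }
   ```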
   
   **Real Time Flow**
   - We have a real-time flow consuming, processing, and persisting data to Hudi using Spark structured streaming
   - In the most common scenario the flow reads 1 or 2 files of Avro data, each around 25MB (compacted via NiFi)
   - This flow has been running successfully for a while, but we think its performance can be improved
   - Our question at this point is whether all the resources listed below are really needed given the small input size we have (a rough sketch of the streaming leg follows these settings)
   > num-executors 3 
   > executor-cores 3 
   > executor-memory 5400m 
   > spark.driver.memoryOverhead=1024m 
   > spark.sql.shuffle.partitions=18 
   > spark.default.parallelism=18 
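   
   For completeness, a rough sketch of how this streaming leg is put together, assuming the NiFi-compacted Avro files land under an incoming directory; the paths, checkpoint location, table name, and the `id` / `updatedAt` columns are placeholders:
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
   import org.apache.spark.sql.streaming.Trigger
   
   val spark = SparkSession.builder().appName("hudi-realtime-flow").getOrCreate()
   
   // Streaming file sources need an explicit schema; infer it once from a static read
   // (assumes spark-avro is on the classpath, as it is for the existing flow).
   val avroSchema = spark.read.format("avro").load("/data/incoming").schema
   
   // Upsert each micro-batch into the Hudi table.
   val upsertBatch: (DataFrame, Long) => Unit = { (batch, _) =>
     batch.write
       .format("hudi")
       .option("hoodie.table.name", "eventTable")
       .option("hoodie.datasource.write.operation", "upsert")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.precombine.field", "updatedAt")
       .mode(SaveMode.Append)
       .save("/data/hudi/eventTable")
   }
   
   spark.readStream
     .format("avro")
     .schema(avroSchema)
     .load("/data/incoming")
     .writeStream
     .foreachBatch(upsertBatch)
     .option("checkpointLocation", "/data/checkpoints/eventTable")
     .trigger(Trigger.ProcessingTime("5 minutes"))
     .start()
     .awaitTermination()
   ```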
   
   **Reload Flow**
   - Apart from the real-time flow, we sometimes need to reload data in a separate flow, pausing the real-time one while it runs
   - The input data for this flow is as originally described (Avro files | ~25MB each | around 250 files / ~6GB per day)
   - In our opinion this flow takes too long to execute given the resources listed below and the size of the data
   - Are there any recommended parameters and/or options we should look into that might drastically improve performance? (a rough sketch of the reload job follows these settings)
   
   > num-executors 5
   > executor-cores 5
   > executor-memory 7900m 
   > spark.driver.memoryOverhead=1020m 
   > spark.sql.shuffle.partitions=50 
   > spark.default.parallelism=50
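   
   And this is roughly how the reload job is structured, again with placeholder paths, table name, and column names; it simply reads one day of the compacted Avro files and upserts them with the parallelism set above:
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   val spark = SparkSession.builder().appName("hudi-reload-flow").getOrCreate()
   
   // One day's worth of the compacted Avro files (~250 files x ~25MB).
   spark.read
     .format("avro")
     .load("/data/archive/2022-04-24/*.avro")
     .write
     .format("hudi")
     .option("hoodie.table.name", "eventTable")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "updatedAt")
     .option("hoodie.payload.ordering.field", "updatedAt")
     .option("hoodie.upsert.shuffle.parallelism", "50")
     .mode(SaveMode.Append)
     .save("/data/hudi/eventTable")
   ```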
   
   
   Should you require any additional details, please reach out to us so that we may provide them. Thanks once again; we look forward to your reply.
   
   
   

