Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/24 03:13:42 UTC

[GitHub] [hudi] venkee14 opened a new issue #1763: Hudi total upsert time is twice than the individual jobs time in Spark UI added together

venkee14 opened a new issue #1763:
URL: https://github.com/apache/hudi/issues/1763


   I have noticed that the individual job runtimes in the Spark UI do not add up to the total upsert time. I am trying to understand where the extra time is spent so that I can reduce it and make the upsert run faster.
   
   We recently increased hoodie.cleaner.commits.retained for this table to a higher value (250). Could the extra time be due to this? We may want to increase this number even more, since we want to be able to run incremental queries going back a few weeks. We do a batch upsert into the Hudi table every 10 minutes.
   
   Spark UI shows total Uptime - 7.6 min
   Upsert Time from logs - 20/06/24 01:32:51 INFO metrics: type=GAUGE, name=AR_PAYMENT_SCHEDULES_ALL.commit.totalUpsertTime, value=488623
   Individual Job times added together - ~3.4 min
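
The gap between these figures can be made explicit with a little arithmetic. This is a minimal sketch that assumes the `totalUpsertTime` gauge is reported in milliseconds (the log line does not state the unit); the variable and function names are illustrative only.

```python
# Figures quoted in the report above.
total_upsert_time_ms = 488623  # commit.totalUpsertTime gauge; unit ASSUMED to be ms
spark_jobs_minutes = 3.4       # individual Spark job times added together (approximate)


def ms_to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60_000


upsert_minutes = ms_to_minutes(total_upsert_time_ms)
unaccounted_minutes = upsert_minutes - spark_jobs_minutes

print(f"upsert time:  {upsert_minutes:.2f} min")
print(f"Spark jobs:   {spark_jobs_minutes:.2f} min")
print(f"unaccounted:  {unaccounted_minutes:.2f} min")
```

Under that assumption, more than half of the reported upsert time falls outside the jobs listed in the Spark UI, which is the time this issue is asking about.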
   
   Env:
   
   EMR Version - 5.28
   Hudi Version - 0.5.1
   Spark Version - 2.2.1
   
   I have attached the upsert job log and a Spark UI screenshot.
   [Uploading logs.txt…]()
   <img width="1136" alt="Screen Shot 2020-06-23 at 7 21 03 PM" src="https://user-images.githubusercontent.com/12746240/85494664-f560cf00-b58d-11ea-92ff-820393e84216.png">
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #1763: [SUPPORT] Hudi total upsert time is twice than the individual jobs time in Spark UI added together

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1763:
URL: https://github.com/apache/hudi/issues/1763#issuecomment-653264952


   @venkee14  any updates on this?





[GitHub] [hudi] vinothchandar commented on issue #1763: [SUPPORT] Hudi total upsert time is twice than the individual jobs time in Spark UI added together

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1763:
URL: https://github.com/apache/hudi/issues/1763#issuecomment-648886584


   @venkee14 can you give 0.5.3 a shot? It has a bunch of perf fixes that might help you.





[GitHub] [hudi] venkee14 commented on issue #1763: [SUPPORT] Hudi total upsert time is twice than the individual jobs time in Spark UI added together

Posted by GitBox <gi...@apache.org>.
venkee14 commented on issue #1763:
URL: https://github.com/apache/hudi/issues/1763#issuecomment-654451640


   The issue was due to a bug in our code, where we had set "hoodie.keep.max.commits" to a very large number, which was not intended.
   
   Once we fixed that archive setting, the upsert time came down. We also tried upgrading to 0.5.3, but did not see a considerable performance improvement.
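
For anyone hitting the same problem, a small pre-flight check on the writer options can catch this kind of misconfiguration. The sketch below is not part of Hudi: `check_retention`, the `ceiling` guard, and all numeric values are hypothetical. It assumes, as we understand it, that the archival bounds should satisfy hoodie.cleaner.commits.retained < hoodie.keep.min.commits < hoodie.keep.max.commits.

```python
def check_retention(options, ceiling=1000):
    """Return a list of problems found in cleaner/archival settings.

    `ceiling` is an arbitrary, hypothetical guard against an accidentally
    huge hoodie.keep.max.commits; it is not a Hudi limit.
    """
    retained = int(options["hoodie.cleaner.commits.retained"])
    min_keep = int(options["hoodie.keep.min.commits"])
    max_keep = int(options["hoodie.keep.max.commits"])

    problems = []
    if min_keep <= retained:
        problems.append("hoodie.keep.min.commits should exceed hoodie.cleaner.commits.retained")
    if max_keep <= min_keep:
        problems.append("hoodie.keep.max.commits should exceed hoodie.keep.min.commits")
    if max_keep > ceiling:
        problems.append("hoodie.keep.max.commits is suspiciously large")
    return problems


# Illustrative values only, not recommendations.
intended = {
    "hoodie.cleaner.commits.retained": "250",
    "hoodie.keep.min.commits": "260",
    "hoodie.keep.max.commits": "270",
}
# Same options with an unintentionally large archival bound.
buggy = dict(intended, **{"hoodie.keep.max.commits": "1000000"})

print(check_retention(intended))
print(check_retention(buggy))
```

Running a check like this before passing the options to the writer would have flagged our unintended value before it reached the table.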





[GitHub] [hudi] venkee14 closed issue #1763: [SUPPORT] Hudi total upsert time is twice than the individual jobs time in Spark UI added together

Posted by GitBox <gi...@apache.org>.
venkee14 closed issue #1763:
URL: https://github.com/apache/hudi/issues/1763


   





[GitHub] [hudi] venkee14 commented on issue #1763: [SUPPORT] Hudi total upsert time is twice than the individual jobs time in Spark UI added together

Posted by GitBox <gi...@apache.org>.
venkee14 commented on issue #1763:
URL: https://github.com/apache/hudi/issues/1763#issuecomment-648916677


   @vinothchandar: Thanks, will try that and report back.

