Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/20 01:59:18 UTC

[GitHub] [hudi] rubenssoto opened a new issue #2463: [SUPPORT] Tuning Hudi Upsert Job

rubenssoto opened a new issue #2463:
URL: https://github.com/apache/hudi/issues/2463


   Hello Guys,
   
   I'm testing Hudi performance in my scenario. The table holds 4 GB of Parquet files, and I'm using 2 executors with 5 cores each and 32 GB of memory. The operation is an upsert because I need deduplication, and only one 200 MB file was rewritten.
   
   <img width="1680" alt="Captura de Tela 2021-01-19 às 22 57 10" src="https://user-images.githubusercontent.com/36298331/105116535-b81fd980-5aa9-11eb-9bfb-3e6d25f5813b.png">
   
   The operation took 2.1 minutes, most of it in the job "Getting small files from partitions". That job is probably writing the Parquet file to S3, but isn't 1.6 minutes too long to write 200 MB?
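
For a sense of scale, the figures quoted above work out to roughly 2 MB/s of effective write throughput. This is only a back-of-the-envelope sketch: it assumes the whole 1.6 minutes went into producing the single 200 MB file, which may not be the case if the stage also covers shuffle or index lookup.

```python
# Figures quoted in the post above.
rewritten_mb = 200   # size of the single rewritten Parquet file, in MB
job_minutes = 1.6    # time attributed to the slow job, in minutes

# Effective throughput, assuming all of that time was spent on the write.
job_seconds = job_minutes * 60
throughput_mb_s = rewritten_mb / job_seconds

print(f"{throughput_mb_s:.2f} MB/s")
```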


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar closed issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
vinothchandar closed issue #2463:
URL: https://github.com/apache/hudi/issues/2463


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-774420998


   @rubenssoto the target file size can be configured using tips here https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles 
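
As a rough illustration of the file sizing knobs the FAQ covers, the sketch below builds the two Parquet sizing options as a plain dictionary. The helper name `file_sizing_options` and the 128 MB / 100 MB values are assumptions chosen for illustration, not recommendations from this thread.

```python
MB = 1024 * 1024

def file_sizing_options(max_file_mb, small_file_limit_mb):
    """Build Hudi write options for Parquet file sizing (hypothetical helper).

    Both options are expressed in bytes and passed as strings.
    """
    if small_file_limit_mb > max_file_mb:
        raise ValueError("small-file limit should not exceed the max file size")
    return {
        # Target upper bound for a Parquet base file.
        "hoodie.parquet.max.file.size": str(max_file_mb * MB),
        # Files below this size are candidates to receive new inserts.
        "hoodie.parquet.small.file.limit": str(small_file_limit_mb * MB),
    }

# Illustrative values only.
sizing = file_sizing_options(max_file_mb=128, small_file_limit_mb=100)
print(sizing)
```

In a Spark job the dictionary would typically be passed along with the other write options, for example `df.write.format("hudi").options(**sizing)`.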





[GitHub] [hudi] vinothchandar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-926188524


   Please reopen as needed





[GitHub] [hudi] bvaradar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-763587396


   Can you attach the .hoodie folder and cat the contents of the last ".commit" file created? The time could also be going to index lookup, where Hudi finds the record location of each incoming record in order to honor updates.
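
A minimal sketch of how the contents of a ".commit" file could be summarized once it has been read as text. It assumes the file is JSON with a `partitionToWriteStats` map whose entries carry `fileId`, `numWrites`, `numUpdateWrites` and `totalWriteBytes`; those field names are an assumption about the commit metadata layout, and the sample below is made up rather than taken from the attached commit.

```python
import json

def summarize_commit(commit_json):
    """Return one summary row per written file from a .commit file's text."""
    metadata = json.loads(commit_json)
    rows = []
    # Assumed layout: {"partitionToWriteStats": {partition: [stat, ...]}}
    for partition, stats in metadata.get("partitionToWriteStats", {}).items():
        for stat in stats:
            rows.append({
                "partition": partition,
                "fileId": stat.get("fileId"),
                "numWrites": stat.get("numWrites", 0),
                "numUpdateWrites": stat.get("numUpdateWrites", 0),
                "totalWriteMB": stat.get("totalWriteBytes", 0) / (1024 * 1024),
            })
    return rows

# Made-up sample for illustration only.
sample = json.dumps({
    "partitionToWriteStats": {
        "2021/01/19": [
            {"fileId": "file-1", "numWrites": 1000, "numUpdateWrites": 10,
             "totalWriteBytes": 200 * 1024 * 1024}
        ]
    }
})

rows = summarize_commit(sample)
for row in rows:
    print(row)
```

A file with many total writes but few update writes, as in the sample, would suggest that a large file was rewritten to apply a small number of updates.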





[GitHub] [hudi] rubenssoto commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-772111789


   No, I think the performance difference I see on upsert comes from the RDD conversion. For example, bulk insert with hoodie.datasource.write.row.writer.enable turned on is much faster at writing Parquet. I read somewhere that there are plans to solve this for upsert too.
   
   I'm trying to tune the file size.
   Is there anything else that can be done?
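
To make the comparison above concrete, here is a sketch of the write options for the two operations. The helper `write_options` and the table name `my_table` are hypothetical; the check that the row writer only goes with bulk insert reflects what this comment describes at the time of writing, not a statement about later Hudi versions.

```python
def write_options(table_name, operation, row_writer=False):
    """Build a minimal set of Hudi write options (hypothetical helper)."""
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
    }
    if row_writer:
        # Per the comment above, the row writer path applies to bulk insert.
        if operation != "bulk_insert":
            raise ValueError("row writer is only sketched here for bulk_insert")
        opts["hoodie.datasource.write.row.writer.enable"] = "true"
    return opts

# "my_table" is a placeholder name.
bulk_opts = write_options("my_table", "bulk_insert", row_writer=True)
upsert_opts = write_options("my_table", "upsert")
print(bulk_opts)
print(upsert_opts)
```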





[GitHub] [hudi] bvaradar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-765206339


   @rubenssoto : The time is going to shuffling the data to route it to the right Parquet file, and to the write itself. I think it would be better if you increase the number of executors (and reduce cores per executor to 2) and try again.
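
One way to read this suggestion is to keep the same total resources but split them across more, smaller executors. The sketch below does that arithmetic; the helper `executor_layout` is hypothetical, and the 64 GB total assumes the 32 GB mentioned in the issue was per executor, which the issue does not state explicitly.

```python
def executor_layout(total_cores, cores_per_executor, total_memory_gb):
    """Split a fixed core and memory budget into Spark executor settings."""
    if total_cores % cores_per_executor != 0:
        raise ValueError("total cores must divide evenly by cores per executor")
    instances = total_cores // cores_per_executor
    # Round memory down to whole gigabytes per executor.
    memory_gb = total_memory_gb // instances
    return {
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{memory_gb}g",
    }

# Original setup from the issue: 2 executors x 5 cores = 10 cores.
# Assumed memory budget: 2 x 32 GB = 64 GB in total.
layout = executor_layout(total_cores=10, cores_per_executor=2, total_memory_gb=64)
print(layout)
```

Each key can be passed to `spark-submit` as a `--conf key=value` pair. Memory overhead is not accounted for here, so the per-executor figure may need to be lowered in practice.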





[GitHub] [hudi] n3nash commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-771408348


   @rubenssoto Did increasing the number of executors help?





[GitHub] [hudi] rubenssoto commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-764683609


   Hi @bvaradar ,
   
   Could you help me with this? :)
   
   thank you





[GitHub] [hudi] rubenssoto commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-763618093


   Sure.
   [20210120003538.commit.zip](https://github.com/apache/hudi/files/5842813/20210120003538.commit.zip)
   
   Does this help?
   
   
   





[GitHub] [hudi] vinothchandar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-926188440


   @rubenssoto do you still need assistance with this?




