Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/25 20:51:41 UTC

[GitHub] [hudi] adnanhb opened a new issue #2207: [SUPPORT]

adnanhb opened a new issue #2207:
URL: https://github.com/apache/hudi/issues/2207


   Hello, this might be a basic question, but I am not able to find any guidance anywhere. We are writing approximately 8 million records (55 columns per record) to a Hudi dataset saved on S3, using copy-on-write. The entire process takes about 4 hours. I am pretty sure the overall time can be optimized, but I am not sure how to go about it. My biggest confusion is whether running the Spark application on multiple executors will speed up the write. From what I have gleaned from several posts, Apache Hudi does not support concurrent writes. Does that mean having multiple executors manipulating the Hudi dataset will not work? Thanks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar closed issue #2207: Performance issue with Dataset write to S3

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #2207:
URL: https://github.com/apache/hudi/issues/2207


   





[GitHub] [hudi] bvaradar commented on issue #2207: Performance issue with Dataset write to S3

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2207:
URL: https://github.com/apache/hudi/issues/2207#issuecomment-716301612


   @adnanhb : Concurrent writes are writes happening to the same dataset from different Spark applications. In your case, without any additional information, I would guess that increasing executors would greatly reduce the write time. Please look at the following FAQ entries for more details -
   
   https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whatperformance/ingestlatencycanIexpectforHudiwriting
   
   https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles
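
   To illustrate the point above: parallelism within a single Spark application comes from the executors and the Hudi shuffle-parallelism configs, not from running multiple writer applications. Below is a minimal PySpark-style sketch of a tuned copy-on-write upsert. The option keys are standard hoodie.* configs; the table name, path, and the specific values (parallelism of 200, 128 MB target file size) are illustrative assumptions that should be tuned to the actual cluster and data volume.

   ```python
   # Assumed knobs for a single-application Hudi copy-on-write upsert.
   # Values here are starting points, not recommendations for every workload.
   hudi_options = {
       "hoodie.table.name": "my_table",                       # hypothetical name
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.datasource.write.operation": "upsert",
       # Shuffle parallelism for the upsert/insert stages: raise these so the
       # work spreads across all executor cores instead of a few tasks.
       "hoodie.upsert.shuffle.parallelism": "200",
       "hoodie.insert.shuffle.parallelism": "200",
       # Target Parquet file size (bytes): fewer, larger files reduce the
       # small-file overhead when writing to S3.
       "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),
   }

   def write_hudi(df, path):
       # One Spark application writes the dataset; scaling executors scales
       # this write. Multiple concurrent writer applications are what Hudi
       # (as of this thread) did not support.
       (df.write.format("hudi")
          .options(**hudi_options)
          .mode("append")
          .save(path))
   ```

   The write_hudi helper is hypothetical; in practice the same options go wherever the application builds its DataFrameWriter.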

