Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/10 02:43:03 UTC

[GitHub] [hudi] RajasekarSribalan commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize

RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671139855


   Thanks @bvaradar for the quick response.
   
   We are doing an initial load of a table that is about 2 TB in size, and each column holds large values (HTML content), though we are not sure of the exact size of each value. During the initial snapshot we don't set limitFileSize, so we leave Hudi to use its default 120 MB file size.
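   
   For reference, here is a rough sketch (Scala, spark-shell style) of how we pass the write options during the initial load; the table name, base path and field names are placeholders, not our real values:
   
       // Bulk insert with an explicit parquet file size limit.
       // "hoodie.parquet.max.file.size" is the config behind limitFileSize (default ~120 MB).
       import org.apache.spark.sql.SaveMode

       sourceDf.write.format("org.apache.hudi").
         option("hoodie.table.name", "my_table").                      // placeholder table name
         option("hoodie.datasource.write.recordkey.field", "id").      // placeholder record key
         option("hoodie.datasource.write.precombine.field", "ts").     // placeholder precombine field
         option("hoodie.datasource.write.partitionpath.field", "dt").  // placeholder partition field
         option("hoodie.datasource.write.operation", "bulk_insert").
         option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString). // the default, shown explicitly
         mode(SaveMode.Overwrite).
         save("/path/to/hudi/base")                                    // placeholder base path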
   
   hoodie.copyonwrite.record.size.estimate - I haven't used this. I'll try it and let you know the outcome.
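   
   If I understand the suggestion correctly, it would just be one more option on the same writer. A minimal sketch, where the 4096-byte estimate is only an illustrative guess for our large HTML records, not a measured number:
   
       // Raise the per-record size estimate (default 1024 bytes) so Hudi's file sizing
       // on the first commit accounts for our large records; 4096 is a placeholder guess.
       val sizingOpts = Map(
         "hoodie.copyonwrite.record.size.estimate" -> "4096",
         "hoodie.parquet.max.file.size"            -> (120 * 1024 * 1024).toString
       )

       sourceDf.write.format("org.apache.hudi").
         options(sizingOpts).
         option("hoodie.table.name", "my_table").   // placeholder table name
         mode("overwrite").
         save("/path/to/hudi/base")                 // placeholder base path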
   
   I get "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead" while during the last phase of hudi i.e., during write. I hope this parameter should solve the issue.
   
   Regarding the bulk insert parallelism, we take the number of partitions of the existing table and set that as the bulk insert parallelism.
   
   In our case, the 2 TB of data spans close to 17000 partitions, so the bulk insert parallelism will be set to 17000 (roughly as sketched below).
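   
   For completeness, this is approximately how we derive and pass that number today; the source table and paths are placeholders:
   
       // Derive bulk insert parallelism from the partition count of the existing Hive table,
       // then pass it to the Hudi writer; ~17000 partitions => ~17000 shuffle tasks.
       val partitionCount = spark.sql("SHOW PARTITIONS source_db.source_table").count()  // placeholder table

       sourceDf.write.format("org.apache.hudi").
         option("hoodie.datasource.write.operation", "bulk_insert").
         option("hoodie.bulkinsert.shuffle.parallelism", partitionCount.toString).
         option("hoodie.table.name", "my_table").   // placeholder table name
         mode("overwrite").
         save("/path/to/hudi/base")                 // placeholder base path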
   
   Please correct me, or suggest if you have further points to add.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org