Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/09 08:18:04 UTC

[GitHub] [hudi] RajasekarSribalan opened a new issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

RajasekarSribalan opened a new issue #1939:
URL: https://github.com/apache/hudi/issues/1939


   **Describe the problem you faced**
   
   As per the Hudi documentation, the size of each parquet file is determined by the limitFileSize configuration. We use the default file size (120 MB), but we see many parquet files being created with sizes exceeding 1 GB:
   
   384.8 M  1.1 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_77-585-137386_20200809071021.parquet
   413.2 M  1.2 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_79-655-153282_20200809073207.parquet
   497.5 M  1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_12-585-137353_20200809071021.parquet
   525.1 M  1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_13-655-153247_20200809073207.parquet
   405.4 M  1.2 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_43-655-153260_20200809073207.parquet
   381.2 M  1.1 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_45-585-137368_20200809071021.parquet
   543.2 M  1.6 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_21-585-137359_20200809071021.parquet
   568.6 M  1.7 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_24-655-153249_20200809073207.parquet
   941.8 M  2.8 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_97-585-137394_20200809071021.parquet
   1012.1 M  3.0 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_99-655-153289_20200809073207.parquet
   481.7 M  1.4 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_91-585-137390_20200809071021.parquet
   522.4 M  1.5 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_93-655-153285_20200809073207.parquet
   776.8 M  2.3 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_64-655-153271_20200809073207.parquet
   743.1 M  2.2 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_65-585-137378_20200809071021.parquet
   352.4 M  1.0 G  /user/xxxx/db290d35-277e-4c96-86eb-5036784ac437-0_76-655-153297_20200809073207.parquet
   437.4 M  1.3 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-585-137397_20200809071021.parquet
   462.1 M  1.4 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-655-153292_20200809073207.parquet
   710.7 M  2.1 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_61-655-153270_20200809073207.parquet
   671.8 M  2.0 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_63-585-137374_20200809071021.parquet
   406.3 M  1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-585-137358_20200809071021.parquet
   426.3 M  1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-655-153248_20200809073207.parquet
   382.9 M  1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_42-655-153276_20200809073207.parquet
   363.0 M  1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_43-585-137366_20200809071021.parque
   
   We use a COW table and stream data from Kafka into the Hudi table via Spark.
   
   2. Hudi jobs are failing with "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead." even though we give higher executor memory. We always get this issue during the Hudi parquet write/rewrite stage. We gave only 1 core with 30 GB per executor and we are still getting this issue. We tried increasing the executor memory but it still fails. Please advise: how can we solve this issue? Is this because we have parquet files with huge sizes?
   
   **Expected behavior**
   
   Parquet file sizes should be less than or equal to 120 MB.
   
   Parquet files with
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * Spark version : 2.2.0
   
   * Hive version : 1.X
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
    
   
   


[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671079032


   Regarding OOM errors, please check which Spark stage is causing the failure. You might need to tune parallelism for this. The size of the parquet files should not be the issue.
   
   Regarding file sizing: how did you create the initial dataset? Did you change the limitFileSize parameter between commits? What is your average record size? During the initial commit, Hudi relies on hoodie.copyonwrite.record.size.estimate to estimate the average record size needed for file sizing. For subsequent commits, it auto-tunes based on the previous commit metadata. Maybe your record size is really large and you need to tune this parameter the first time you write to the dataset.
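
   As a rough illustrative sketch (not from this thread; the table name, key fields, and path below are hypothetical placeholders), these file-sizing knobs can be passed as options on a Spark datasource write:

       // Minimal sketch, not the reporter's actual job: file-sizing configs on a Hudi write.
       import org.apache.spark.sql.{DataFrame, SaveMode}

       def writeWithSizing(df: DataFrame): Unit = {
         df.write
           .format("org.apache.hudi")
           .option("hoodie.table.name", "events")                                // placeholder
           .option("hoodie.datasource.write.operation", "bulk_insert")           // initial load
           .option("hoodie.datasource.write.recordkey.field", "id")              // placeholder
           .option("hoodie.datasource.write.precombine.field", "ts")             // placeholder
           .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString) // limitFileSize, in bytes
           .option("hoodie.copyonwrite.record.size.estimate", "8192")            // avg bytes per record; raise for large rows
           .mode(SaveMode.Append)
           .save("/user/xxxx/events")                                            // placeholder path
       }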


[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671690639


   To understand: are you using bulk insert for the initial loading and upsert for subsequent operations?
   For records with LOBs, it is important to tune hoodie.copyonwrite.record.size.estimate during the initial bootstrap to get the file sizing right.
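
   As a hypothetical sketch (the helper and names below are not part of Hudi), one way to ballpark the average record size from a sample before the initial bulk insert:

       // Hypothetical helper: approximate the average record size (in bytes) from a sample
       // of the input DataFrame, to feed into hoodie.copyonwrite.record.size.estimate.
       import org.apache.spark.sql.DataFrame

       def estimateAvgRecordSizeBytes(df: DataFrame, sampleRows: Int = 10000): Long =
         df.limit(sampleRows).rdd
           .map(row => row.mkString(",").getBytes("UTF-8").length.toDouble) // crude proxy for serialized size
           .mean()
           .toLong

       // e.g. .option("hoodie.copyonwrite.record.size.estimate", estimateAvgRecordSizeBytes(df).toString)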


[GitHub] [hudi] bvaradar closed issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1939:
URL: https://github.com/apache/hudi/issues/1939


   


[GitHub] [hudi] RajasekarSribalan commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671139855


   Thanks @bvaradar for the quick response.
   
   We are initially loading a table which has 2 TB of data, and each column holds large values (HTML content), but we are not sure of the exact size of each value. During the initial snapshot we don't set limitFileSize, so we leave Hudi to use the default 120 MB size.
   
   hoodie.copyonwrite.record.size.estimate - I haven't used this. I'll try it and let you know the outcome.
   
   I get "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead" during the last phase of Hudi, i.e., during the write. I hope this parameter will solve the issue.
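
   (As a hedged illustration, values and names are placeholders, not a recommendation: the overhead the error message refers to is typically raised alongside executor memory, e.g.)

       // Hypothetical session config: leave YARN overhead headroom under the 30 GB container.
       import org.apache.spark.sql.SparkSession

       val spark = SparkSession.builder()
         .appName("hudi-upsert")                                // placeholder name
         .config("spark.executor.memory", "26g")
         .config("spark.yarn.executor.memoryOverhead", "4096")  // in MB (Spark 2.x on YARN setting)
         .getOrCreate()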
   
   Regarding the bulk insert parallelism, we get the number of partitions of the existing table and set it as the bulk insert parallelism.
   
   In our case, the 2 TB of data amounts to close to 17000 partitions, so the bulk insert parallelism is set to 17000.
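
   A minimal sketch of that wiring (placeholder names; assumes the same hypothetical table setup as the earlier sketch):

       // Derive the bulk-insert parallelism from the input's partition count (~17000 here).
       import org.apache.spark.sql.{DataFrame, SaveMode}

       def bulkInsert(df: DataFrame, basePath: String): Unit = {
         val parallelism = df.rdd.getNumPartitions
         df.write
           .format("org.apache.hudi")
           .option("hoodie.table.name", "events")                     // placeholder
           .option("hoodie.datasource.write.operation", "bulk_insert")
           .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
           .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
           .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
           .mode(SaveMode.Append)
           .save(basePath)
       }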
   
   Please correct me or suggest if you have further points to add.
   
   


[GitHub] [hudi] RajasekarSribalan commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671742308


   Yes @bvaradar, we do an initial bulk insert and then upsert for subsequent operations. I configured hoodie.copyonwrite.record.size.estimate to 128 while taking the initial load via bulk insert. But during subsequent upserts we face memory issues as stated above and the streaming jobs fail. We are sure the size of 10 million records is close to 10 GB, and we have given sufficient executor memory (60 GB per executor and 4 cores).
   
   We use DStreams; the number of records in each micro-batch is 10 million and the size of the batch is 10 GB.
   
   We persist the RDD (10 GB) on disk because we reuse the RDD for the upsert and subsequent deletes. What I can see from the Storage tab in Spark is that Hudi persists data internally in memory. I tried configuring hoodie.write.status.storage.level to a disk level to leave more memory for tasks, but Hudi always persists in memory. Any thoughts on this property? Could this be a reason for the memory issue?
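
   As a hedged sketch of both knobs mentioned above (placeholder names; an assumption about how the property is passed as a writer option, not a confirmed fix):

       import org.apache.spark.sql.SaveMode
       import org.apache.spark.storage.StorageLevel

       // Persist the reused micro-batch to disk only, so it does not compete with task memory.
       val batchDF = rawBatchDF.persist(StorageLevel.DISK_ONLY)     // rawBatchDF is a placeholder

       batchDF.write
         .format("org.apache.hudi")
         .option("hoodie.table.name", "events")                     // placeholder
         .option("hoodie.datasource.write.operation", "upsert")
         .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
         .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
         .option("hoodie.write.status.storage.level", "DISK_ONLY")  // interpreted as a Spark StorageLevel name
         .mode(SaveMode.Append)
         .save("/user/xxxx/events")                                 // placeholder path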
   
   
   
   


[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-691739964


   @RajasekarSribalan : Please reopen if you still have any questions.
   
   Thanks,
   Balaji.V


[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sync with limitFileSize

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-678220122


   Sorry for the delay in responding; here is the default storage level config I am seeing:
   
     private static final String WRITE_STATUS_STORAGE_LEVEL = "hoodie.write.status.storage.level";
     private static final String DEFAULT_WRITE_STATUS_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
   
   From the code, I can see that Hudi uses the Spark persist API to manage this cache, so this does not look like the problem.
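
   A small sketch of what that implies (hypothetical names, not Hudi's actual code): the configured string resolves to a Spark storage level and is applied with a plain persist call, and MEMORY_AND_DISK_SER already spills to disk when memory runs short.

       import org.apache.spark.rdd.RDD
       import org.apache.spark.storage.StorageLevel

       // Illustrative only: resolve the configured level name and persist the write-status RDD.
       def persistWriteStatuses[T](statuses: RDD[T], levelName: String = "MEMORY_AND_DISK_SER"): RDD[T] =
         statuses.persist(StorageLevel.fromString(levelName))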

