Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/09 08:18:04 UTC
[GitHub] [hudi] RajasekarSribalan opened a new issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
RajasekarSribalan opened a new issue #1939:
URL: https://github.com/apache/hudi/issues/1939
**Describe the problem you faced**
As per the Hudi documentation, the size of each parquet file should be decided by the limitFileSize configuration. We use the default file size (120MB), but we can see many parquet files being created with sizes exceeding 1GB:
384.8 M 1.1 G /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_77-585-137386_20200809071021.parquet
413.2 M 1.2 G /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_79-655-153282_20200809073207.parquet
497.5 M 1.5 G /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_12-585-137353_20200809071021.parquet
525.1 M 1.5 G /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_13-655-153247_20200809073207.parquet
405.4 M 1.2 G /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_43-655-153260_20200809073207.parquet
381.2 M 1.1 G /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_45-585-137368_20200809071021.parquet
543.2 M 1.6 G /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_21-585-137359_20200809071021.parquet
568.6 M 1.7 G /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_24-655-153249_20200809073207.parquet
941.8 M 2.8 G /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_97-585-137394_20200809071021.parquet
1012.1 M 3.0 G /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_99-655-153289_20200809073207.parquet
481.7 M 1.4 G /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_91-585-137390_20200809071021.parquet
522.4 M 1.5 G /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_93-655-153285_20200809073207.parquet
776.8 M 2.3 G /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_64-655-153271_20200809073207.parquet
743.1 M 2.2 G /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_65-585-137378_20200809071021.parquet
352.4 M 1.0 G /user/xxxx/db290d35-277e-4c96-86eb-5036784ac437-0_76-655-153297_20200809073207.parquet
437.4 M 1.3 G /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-585-137397_20200809071021.parquet
462.1 M 1.4 G /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-655-153292_20200809073207.parquet
710.7 M 2.1 G /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_61-655-153270_20200809073207.parquet
671.8 M 2.0 G /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_63-585-137374_20200809071021.parquet
406.3 M 1.2 G /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-585-137358_20200809071021.parquet
426.3 M 1.2 G /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-655-153248_20200809073207.parquet
382.9 M 1.1 G /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_42-655-153276_20200809073207.parquet
363.0 M 1.1 G /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_43-585-137366_20200809071021.parque
We use a COW table and stream the data from Kafka to the Hudi table via Spark.
2. The Hudi job is failing with "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead." even though we give higher executor memory. We always get this issue during the Hudi parquet write/rewrite stage. We gave only 1 core with 30GB per executor and it still fails; increasing the executor memory does not solve the issue. Please advise how we can solve this. Is it because we have parquet files with huge sizes?
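The error quoted above is about the total YARN container cap (heap plus off-heap overhead), not heap alone, so the usual lever is the overhead setting rather than executor memory. A minimal sketch of the arithmetic — the executor size and 10% multiplier here are illustrative assumptions, not values from this job:

```python
# Hypothetical executor sizing: YARN kills the container when executor memory
# plus overhead exceeds the container limit, so grow the overhead allowance
# rather than (or in addition to) spark.executor.memory.
executor_memory_gb = 24
# Spark's default overhead is roughly max(384MB, 10% of executor memory);
# parquet-heavy writes allocate large off-heap buffers, so raising it helps.
overhead_gb = max(0.384, 0.10 * executor_memory_gb)
spark_confs = {
    "spark.executor.memory": f"{executor_memory_gb}g",
    "spark.yarn.executor.memoryOverhead": f"{int(overhead_gb * 1024)}m",
}
print(spark_confs)
```

Pass these as `--conf` flags to spark-submit; the values shown are a starting point to tune, not a recommendation for this workload.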
**Expected behavior**
Parquet file sizes should be less than or equal to 120MB.
**Environment Description**
* Hudi version : 0.5.2
* Spark version : 2.2.0
* Hive version : 1.X
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : No
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671079032
Regarding the OOM errors, please check which Spark stage is causing the failure; you might need to tune parallelism for it. The size of the parquet files should not be the issue.
Regarding file sizing: how did you create the initial dataset? Did you change the limitFileSize parameter between commits? What is your average record size? During the initial commit, Hudi relies on hoodie.copyonwrite.record.size.estimate to estimate the average record size needed for file sizing. For subsequent commits, it auto-tunes based on the previous commit's metadata. Maybe your record size is really large and you need to tune this parameter the first time you write to the dataset.
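The sizing behaviour described here comes down to simple arithmetic: Hudi decides how many records to pack into a file from the estimated record size, so an estimate far below the true size yields oversized files. A sketch with illustrative numbers (the 10KB "actual" record size is an assumption, not measured from this dataset; 1024 bytes is Hudi's documented default estimate):

```python
def records_per_file(limit_file_size_bytes, record_size_estimate_bytes):
    """Roughly how many records Hudi will pack into one parquet file."""
    return limit_file_size_bytes // record_size_estimate_bytes

LIMIT = 120 * 1024 * 1024                 # default limitFileSize: 120MB
packed = records_per_file(LIMIT, 1024)    # default estimate: 1KB per record
actual_record_size = 10 * 1024            # suppose records are really 10KB
# The file Hudi actually writes is then ~10x the target:
actual_file_size_mb = packed * actual_record_size / (1024 * 1024)
print(actual_file_size_mb)                # 1200.0 MB, i.e. ~1.2GB files
```

This is consistent with the 1GB+ files listed in the issue: large HTML-content records plus the default 1KB estimate overshoot the 120MB target by roughly the same factor.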
[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671690639
To understand: are you using bulk insert for the initial load and upsert for subsequent operations?
For records with LOBs, it is important to tune hoodie.copyonwrite.record.size.estimate during the initial bootstrap to get the file sizing right.
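A hedged sketch of what tuning that estimate at bootstrap might look like as Spark datasource writer options. The option keys are Hudi configs; the 10KB value is an illustrative assumption — measure the real average from a sample of your data:

```python
# Illustrative Hudi writer options for the initial bulk insert; pass them via
# .options(**hudi_options) on a Spark DataFrame writer. Values are assumptions.
avg_record_bytes = 10 * 1024  # measure from a sample; do not guess in production
hudi_options = {
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),   # the limitFileSize
    "hoodie.copyonwrite.record.size.estimate": str(avg_record_bytes),
}
print(hudi_options)
```

After the first commit Hudi refines the estimate from commit metadata, so this mainly matters for the bootstrap write.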
[GitHub] [hudi] bvaradar closed issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1939:
URL: https://github.com/apache/hudi/issues/1939
[GitHub] [hudi] RajasekarSribalan commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671139855
Thanks @bvaradar for quick response.
We are initially loading a table that has 2TB of data, and each column can hold huge values (HTML content), but we are not sure of the exact size of each value. During the initial snapshot we don't set limitFileSize, so we leave Hudi to use the default 120MB size.
hoodie.copyonwrite.record.size.estimate - I haven't used this. I'll try it and let you know the outcome.
I get "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead" during the last phase of Hudi, i.e., during the write. I hope this parameter will solve the issue.
Regarding the bulk insert parallelism, we get the number of partitions of the existing table and set it as the bulk insert parallelism.
In our case, the 2TB of data is close to 17,000 partitions, so the bulk insert parallelism is set to 17000.
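As a quick sanity check on that choice (the 2TB and 17,000 figures are from the comment; the arithmetic is mine): spreading the load evenly gives roughly the default file size per bulk-insert task, so the parallelism itself looks reasonable:

```python
total_bytes = 2 * 1024**4      # 2TB initial load
parallelism = 17000            # bulk insert parallelism used
per_task_mb = total_bytes / parallelism / 1024**2
print(round(per_task_mb))      # ~123MB per task, close to the 120MB target
```

If the data is skewed across partitions, though, individual tasks can still receive far more than this average.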
Please correct me/suggest if you have further points to add.
[GitHub] [hudi] RajasekarSribalan commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
RajasekarSribalan commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671742308
Yes @bvaradar, we do an initial bulk insert and then upsert for subsequent operations! I configured hoodie.copyonwrite.record.size.estimate to 128 while taking the initial load via bulk insert. But during subsequent upserts we face memory issues as stated above, and the streaming jobs fail... We are sure the size of 10 million records is close to 10GB, and we have given sufficient executor memory (60GB per executor and 4 cores).
We use DStreams; each micro batch has 10 million records and a size of 10GB.
We persist the RDD (10GB) to disk because we reuse it for the upsert and subsequent deletes. What I can see from the Storage tab in Spark is that Hudi also persists data internally in memory. I tried configuring hoodie.write.status.storage.level to disk to leave more memory for the tasks, but Hudi always persists in memory? Any thoughts on this property? Could it be a reason for the memory issue?
[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-691739964
@RajasekarSribalan : Please reopen if you still have any questions.
Thanks,
Balaji.V
[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-678220122
Sorry for the delay in responding. Here is the default storage level config I am seeing:
private static final String WRITE_STATUS_STORAGE_LEVEL = "hoodie.write.status.storage.level";
private static final String DEFAULT_WRITE_STATUS_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
From the code, I can see that Hudi uses the Spark persist API to manage this cache, so this does not look like the problem.
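For completeness, a hedged sketch of overriding that default: the config key is the one quoted in the constants above, and the value must be a valid Spark StorageLevel name (the choice of DISK_ONLY here is illustrative, not a recommendation):

```python
# Overriding the write-status cache level via writer options; Hudi hands this
# string to Spark's StorageLevel parsing, so any valid level name is accepted.
hudi_options = {
    "hoodie.write.status.storage.level": "DISK_ONLY",  # default: MEMORY_AND_DISK_SER
}
print(hudi_options["hoodie.write.status.storage.level"])
```

Note that this controls only the write-status RDD cache; other internal caching during the write path is not governed by this key.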