Posted to commits@hudi.apache.org by "phani482 (via GitHub)" <gi...@apache.org> on 2023/01/31 03:20:37 UTC

[GitHub] [hudi] phani482 opened a new issue, #7800: "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

phani482 opened a new issue, #7800:
URL: https://github.com/apache/hudi/issues/7800

   Hello Team,
   
   We are running an AWS Glue streaming job which reads from Kinesis and writes to a Hudi COW table on S3, registered in the Glue catalog.
   The job has been running for about a year without issues. However, we recently started seeing OOM errors like the one below, with very little insight from the logs.
   
   a. I tried moving the [.commits_.archive] files out of the .hoodie folder to reduce its size. This helped for a while, but the issue started to surface again.
   (e.g. s3://<bucket>/prefix/.hoodie/.commits_.archive.1763_1-0-1)
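   
   For reference, the move can be scripted along these lines (illustrative only; the bucket, prefix, and backup location are placeholders, not our real paths):
   
        # Illustrative only: copy the archived-commit files to a backup prefix, then delete the originals.
        import boto3

        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket="<bucket>", Prefix="prefix/.hoodie/.commits_.archive"):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                backup_key = key.replace("prefix/.hoodie/", "prefix/hoodie_archive_backup/")  # placeholder target
                s3.copy_object(Bucket="<bucket>", CopySource={"Bucket": "<bucket>", "Key": key}, Key=backup_key)
                s3.delete_object(Bucket="<bucket>", Key=key)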
   
   b. Here are the write options we are using with Apache Hudi Connector 0.9.0 (a simplified sketch of how we pass them to the writer follows the list):
             "hoodie.datasource.write.operation": "insert",
               "hoodie.insert.shuffle.parallelism": 10,
               "hoodie.bulkinsert.shuffle.parallelism": 10,
               "hoodie.upsert.shuffle.parallelism": 10,
               "hoodie.delete.shuffle.parallelism": 10,
               "hoodie.parquet.small.file.limit": 8 * 1000 * 1000,  # 8MB
               "hoodie.parquet.max.file.size": 10 * 1000 * 1000,  # 10 MB
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.enable": "false",
               "hoodie.datasource.hive_sync.database": "database_name",
               "hoodie.datasource.hive_sync.table": "raw_table_name",
               "hoodie.datasource.hive_sync.partition_fields": "entity_name",
               "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.support_timestamp": "true",
               "hoodie.keep.min.commits": 1450,  
               "hoodie.keep.max.commits": 1500,  
               "hoodie.cleaner.commits.retained": 1449,
   
   Error:
   ###########
   INFO:py4j.java_gateway:Received command  on object id
   INFO:py4j.java_gateway:Closing down callback connection
   INFO:py4j.java_gateway:Callback Connection ready to receive messages
   INFO:py4j.java_gateway:Received command c on object id p0
   INFO:root:Batch ID: 160325 has 110 records
   # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   # -XX:OnOutOfMemoryError="kill -9 %p"
   #   Executing /bin/sh -c "kill -9 7"...
   ###########
   
   Q: We noticed that ".commits_.archive" files are not cleaned up by Hudi by default. Are there any settings we need to enable for this to happen?
   
   Q: Our .hoodie folder was ~1.5 GB in size before we started moving archive files out of it. Is this a huge size for the .hoodie folder? What are the best practices for maintaining the .hoodie folder in terms of size and object count?
   
   Q: The error logs don't give more detail, and even using 20 G.1X DPUs on Glue does not seem to help (executor memory: 10 GB, driver memory: 10 GB, executor cores: 8). Our workload is not huge: we get a few thousand events every hour, roughly 1 million records a day, and the payload size is no more than ~300 KB.
   
   Please let me know if you need any further details
   
   Thanks
   




[GitHub] [hudi] phani482 commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

Posted by "phani482 (via GitHub)" <gi...@apache.org>.
phani482 commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1463030125

   Thanks! @nsivabalan 
   
   Will try it out and see if it fixes our issue, although this could take some time for us to roll out in prod. Will post here once we do the upgrade. Thanks again!
   




[GitHub] [hudi] danny0405 commented on issue #7800: "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1409773708

   Thanks for the feedback @phani482. Sorry to tell you that cleaning of archival files is not supported yet; I have created a JIRA issue to track this: https://issues.apache.org/jira/browse/HUDI-5659
   
   I also noticed that you use the `INSERT` operation, so which Spark stage did you perceive as slow?




[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336739

   But as far as trimming down the number of files goes, we don't have any automatic support as of now, though we will be working on it.
   If you are interested in working on it, let us know and we can guide you.
   




[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336475

   hey @phani482 
   Sorry for the late turnaround.
   Have you enabled sync by any chance? Recently we found an issue where meta sync loads the archived timeline unnecessarily.
   
   https://github.com/apache/hudi/pull/7561
   
   If you can try with 0.13.0 and let us know what you see, that would be nice. Or you can cherry-pick this commit into your internal fork, if you have one.
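   
   To double check on your end, this is roughly what keeping sync fully disabled looks like in the writer options (a minimal sketch; `hoodie.datasource.meta.sync.enable` is the newer flag and is assumed to exist only on recent releases, not on 0.9.0):
   
       # Minimal sketch: sync-related flags to verify in the writer options.
       sync_options = {
           "hoodie.datasource.hive_sync.enable": "false",  # hive sync flag you already set
           "hoodie.datasource.meta.sync.enable": "false",  # newer meta sync flag (assumption: recent releases only)
       }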
   




[GitHub] [hudi] phani482 commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

Posted by "phani482 (via GitHub)" <gi...@apache.org>.
phani482 commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1410740119

   Not slowness; our jobs are failing with the above error during the Hudi write.
   Is it an issue if we remove archive files from the .hoodie folder?
   1. Does Hudi ignore archive files in the .hoodie folder? Will it read archive files into the timeline server?
   2. For a long-running streaming job, what are the best practices to manage the metadata folder (.hoodie) and avoid out-of-memory errors?
   3. Are there any Spark heap settings that need to be tuned? The Hudi documentation is not clear enough on this.
   

