Posted to commits@hudi.apache.org by "ehurheap (via GitHub)" <gi...@apache.org> on 2023/03/16 19:14:25 UTC

[GitHub] [hudi] ehurheap opened a new issue, #8209: [SUPPORT] auto_clean stopped running during ingest

ehurheap opened a new issue, #8209:
URL: https://github.com/apache/hudi/issues/8209

   **Describe the problem you faced**
   
   We are running spark streaming ingestion with the following cleaner configs:
   ```
   (hoodie.clean.automatic -> true)
   (hoodie.clean.max.commits -> 30)
   (hoodie.cleaner.hours.retained -> 24)
   (hoodie.cleaner.parallelism -> 256)
   (hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS)
   ```
   Ingestion commits happen about every 20-30 minutes. However, using the hudi-cli I can see that cleans occur far less frequently, and about 3 weeks ago they stopped happening altogether. When the ingestion was restarted, it stalled on `Generating list of file slices to be cleaned:`; eventually the executors ran out of memory and the job failed.
   
   To allow ingestion to proceed we redeployed with automatic cleaner disabled.
   
   Questions:
   
   - Why did the cleaner stop running?
   - Is it expected that the cleans happen less frequently than commits?
   - Is cleaning impacted by not using the metadata table?
   - What is the best approach to catch up on all the files to be cleaned?
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Deploy ingestion with the above write configs
   2. Observe cleans in hudi-cli with `cleans show`
   3. Redeploy ingestion after cleaner has stopped for some time
   
   
   **Expected behavior**
   The cleaner table service is invoked immediately after each commit.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Additional write configs include:
   ```
   (hoodie.compact.inline -> false)
   (hoodie.compact.schedule.inline -> false)
   (hoodie.datasource.compaction.async.enable -> false)
   (hoodie.metadata.enable -> false)
   (hoodie.write.concurrency.mode -> optimistic_concurrency_control)
   (hoodie.write.lock.dynamodb.partition_key -> key1)
   (hoodie.write.lock.dynamodb.region -> us-east-1)
   (hoodie.write.lock.dynamodb.table -> datalake-locks)
   (hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider)
   ```
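
On the question of whether cleans are expected to run less often than commits: a rough estimate, assuming `hoodie.clean.max.commits` is the number of commits after the last clean before a new clean is attempted (the reading given in the Hudi configuration reference; treat it as an assumption for this version). The commit interval is the 20-30 minutes reported in the issue.

```python
# Rough estimate of how often a clean is even attempted, assuming
# hoodie.clean.max.commits means "commits between clean attempts".
CLEAN_MAX_COMMITS = 30          # from the issue's write config
COMMIT_INTERVAL_MIN = (20, 30)  # "about every 20-30 minutes"


def hours_between_clean_attempts(max_commits, minutes_per_commit):
    """Return the expected gap between clean attempts, in hours."""
    return max_commits * minutes_per_commit / 60


low, high = (hours_between_clean_attempts(CLEAN_MAX_COMMITS, m)
             for m in COMMIT_INTERVAL_MIN)
print(f"clean attempted roughly every {low:.0f}-{high:.0f} hours")
```

If that reading is right, a clean every 10 to 15 hours would be normal for this config, so only the complete stop three weeks ago is the anomaly.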


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1490482315

   Likely you have a very large number of partitions, so jsonifying the results may be causing the OOM.
   
   Can you try disabling the embedded timeline server for a couple of commits and re-enabling the cleaner?
   `hoodie.embed.timeline.server=false`
   
   Hopefully this lets the cleaner get past the issue.
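
A minimal sketch of how that override could be added to a standalone cleaner run. It reuses only the class, jar path, and flags from the `spark-submit` invocation quoted elsewhere in this thread; the helper names `hoodie_conf_args` and `cleaner_command` are hypothetical, and the values are illustrative rather than recommended.

```python
import shlex

# Cleaner settings taken from the spark-submit command in this thread,
# plus the timeline-server override suggested above.
HOODIE_CONF = {
    "hoodie.metadata.enable": "false",
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "2000",
    "hoodie.cleaner.parallelism": "400",
    "hoodie.embed.timeline.server": "false",  # suggested workaround
}


def hoodie_conf_args(conf):
    """Expand a dict into repeated `--hoodie-conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--hoodie-conf", f"{key}={value}"]
    return args


def cleaner_command(jar, base_path, conf):
    """Build the argument list for a HoodieCleaner spark-submit run."""
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.HoodieCleaner", jar,
        "--target-base-path", base_path,
        *hoodie_conf_args(conf),
    ]


cmd = cleaner_command(
    "/usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar",
    "s3://path-to-table/users_changes-v1",
    HOODIE_CONF,
)
print(shlex.join(cmd))
```

Keeping the settings in one dict makes it easy to drop the override again once the backlog has been cleaned.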
   



[GitHub] [hudi] hbgstc123 commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.

hbgstc123 commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1473088369

   Maybe your commits are archived before their committed data files are cleaned?  
   You can check the earliest commit in your hoodie active timeline.



[GitHub] [hudi] nsivabalan closed issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan closed issue #8209: [SUPPORT] auto_clean stopped running during ingest
URL: https://github.com/apache/hudi/issues/8209



[GitHub] [hudi] ehurheap commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "ehurheap (via GitHub)" <gi...@apache.org>.

ehurheap commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1478788893

   I have attempted to run the cleaner as a separate step from the ingestion. The ingestion is now configured with
   ```
   hoodie.clean.automatic -> false
   hoodie.archive.automatic -> false
   ```
   
   Using the hudi-cli I submitted this clean command:
   
   `cleans run --sparkMaster local[8] --sparkMemory 60G --hoodieConfigs "hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS hoodie.cleaner.hours.retained=1920 hoodie.cleaner.parallelism=400"`
   
   After an hour and a half, it failed with 
   ```
   ...INFO S3NativeFileSystem: Opening 's3://path-to-table/users_changes-v1/.hoodie/20221208164706388.savepoint' for reading
   #udi:users_changes->
   # java.lang.OutOfMemoryError: Java heap space
   # -XX:OnOutOfMemoryError="kill -9 %p"
   #   Executing /bin/sh -c "kill -9 1614"...
   Failed to clean hoodie dataset
   ```
   I tried tweaking the hudi-cli.sh script to specify `-Xmx=90G` in the java command, but that did not help.
   
   
   As an alternative I attempted the `spark-submit` version of the cleans command like so:
   ```
   spark-submit --deploy-mode cluster \
   --conf spark.executor.instances=30 \
   --conf spark.executor.cores=2 \
   --conf spark.executor.memory=20G \
   --conf spark.driver.memory=40G \
   --conf spark.app.name=HoodieCleaner_users_changes-v1 \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --class org.apache.hudi.utilities.HoodieCleaner /usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
   --target-base-path s3://path-to-table/users_changes-v1 \
   --hoodie-conf hoodie.metadata.enable=false \
   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
   --hoodie-conf hoodie.cleaner.hours.retained=2000 \
   --hoodie-conf hoodie.cleaner.parallelism=400 \
   --hoodie-conf hoodie.clean.allow.multiple=false \
   --hoodie-conf hoodie.archive.async=false \
   --hoodie-conf hoodie.archive.automatic=false
   ```
   After about an hour the application attempt dies and the driver logs show this:
   ```
   23/03/22 00:43:50 ERROR Javalin: Exception occurred while servicing http-request
   java.lang.OutOfMemoryError: null
   	at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) ~[?:1.8.0_362]
   	at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) ~[?:1.8.0_362]
   	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125) ~[?:1.8.0_362]
   	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596) ~[?:1.8.0_362]
   	at java.lang.StringBuilder.append(StringBuilder.java:195) ~[?:1.8.0_362]
   	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:460) ~[jackson-core-2.13.3.jar:2.13.3]
   	at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:85) ~[jackson-core-2.13.3.jar:2.13.3]
   	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3827) ~[jackson-databind-2.13.3.jar:2.13.3]
   	at org.apache.hudi.timeline.service.RequestHandler.jsonifyResult(RequestHandler.java:198) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(RequestHandler.java:209) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(RequestHandler.java:176) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(RequestHandler.java:384) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:501) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.core.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:23) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet$addHandler$protectedHandler$1.handle(JavalinServlet.kt:128) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:45) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:24) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet$service$1.invoke(JavalinServlet.kt:136) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet$service$2.invoke(JavalinServlet.kt:40) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.http.JavalinServlet.service(JavalinServlet.kt:81) ~[__app__.jar:0.11.1-amzn-0]
   	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
   	at io.javalin.websocket.JavalinWsServlet.service(JavalinWsServlet.kt:51) ~[__app__.jar:0.11.1-amzn-0]
   ```
   I made several attempts, tweaking these parameters for each run:
   ```
   --hoodie-conf hoodie.cleaner.hours.retained
   --conf spark.executor.instances
   --conf spark.executor.cores
   --conf spark.executor.memory
   --conf spark.driver.memory
   ```
   Each run failed with some error. For example, the following two errors occurred in different runs:
   
   ```
   # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   ```
   and
   ```
   23/03/21 20:39:21 ERROR RequestHandler: Got runtime exception servicing request partition=env_id%3D2907378054%2Fweek%3D20221121&basepath=s3%3A%2F%2Fheap-datalake-storage%2Fdata%2Ftables%2Fusers_changes-v1&lastinstantts=20230321200037228&timelinehash=c015e055fa5d5d3f14376d8c4aee8b41e5be8cd928f0c72068646cb95f9365c5
   java.lang.NegativeArraySizeException: null
   	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68) ~[?:1.8.0_362]
   	at java.lang.StringBuilder.<init>(StringBuilder.java:106) ~[?:1.8.0_362]
   	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:455) ~[jackson-core-2.13.3.jar:2.13.3]
   	at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:85) ~[jackson-core-2.13.3.jar:2.13.3]
   	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3827) ~[jackson-databind-2.13.3.jar:2.13.3]
   	at org.apache.hudi.timeline.service.RequestHandler.jsonifyResult(RequestHandler.java:198) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(RequestHandler.java:209) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(RequestHandler.java:176) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(RequestHandler.java:384) ~[__app__.jar:0.11.1-amzn-0]
   	at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:501) ~[__app__.jar:0.11.1-amzn-0]
   	at io.javalin.core.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:23) ~[__app__.jar:0.11.1-amzn-0]
   ```
   
   Is there some way to reduce the size of the clean operation so that we don't run into these errors?
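
One plausible reading of the traces above, offered as an assumption rather than a confirmed diagnosis: both failures occur while Jackson assembles a single Java `String` from the JSON response, and a Java array cannot hold more than 2^31 - 1 elements. In Java 8, `AbstractStringBuilder.hugeCapacity` throws `OutOfMemoryError` when the requested capacity overflows an `int`, and a `NegativeArraySizeException` from the `StringBuilder` constructor is what a wrapped-around length sum would produce. If so, the limit is the size of one response, which would explain why raising heap settings did not help. The sketch below reproduces only the 32-bit arithmetic; the two lengths are made up.

```python
INT_MAX = 2**31 - 1  # largest Java int, and the upper bound on array length


def to_int32(n):
    """Wrap an integer the way Java's 32-bit signed int arithmetic does."""
    return (n + 2**31) % 2**32 - 2**31


def builder_capacity(segment_chars, current_chars):
    """Capacity a Java caller would get from an int sum of two lengths."""
    return to_int32(segment_chars + current_chars)


# Two buffered lengths that are each valid ints but whose sum is not.
capacity = builder_capacity(1_500_000_000, 1_000_000_000)
print(capacity)      # wrapped value of the sum
print(capacity < 0)  # a negative capacity cannot be allocated
```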
   



[GitHub] [hudi] ehurheap commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "ehurheap (via GitHub)" <gi...@apache.org>.

ehurheap commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1490725208

   Hi @nsivabalan - confirming that the cleaner worked with this config setting - THANKS!
   
   Can you comment on why the timeline server does not have this problem during ingestion, only during cleaning?



[GitHub] [hudi] nsivabalan commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.

nsivabalan commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1538989347

   Cool. We identified a performance hit with the timeline server when dealing with too many files: https://github.com/apache/hudi/pull/8480
   We fixed it in the latest master, and the fix could go into 0.13.1.
   
   So I suggested disabling the timeline server to get past the few cleans where replaced file groups are cleaned up. Once that cleanup is done, there may be fewer file groups to deal with, so the timeline server should be able to cope (even without the above fix).
   
   



[GitHub] [hudi] ehurheap commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "ehurheap (via GitHub)" <gi...@apache.org>.

ehurheap commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1476537258

   Prior to the earliest `deltacommit`, we have 23 filegroups with cleans inflight, for example:
   ```
   20221119132417991.clean
   20221119132417991.clean.inflight
   20221119132417991.clean.requested
   ```
   
   Then the earliest `deltacommit` is:
   ```
   20221208164706388.deltacommit
   20221208164706388.deltacommit.inflight
   20221208164706388.deltacommit.requested
   20221208164706388.savepoint
   20221208164706388.savepoint.inflight
   ```
   Following that filegroup there are a few `replacecommit`s:
   ```
   20221208174357204.replacecommit
   20221208174357204.replacecommit.inflight
   20221208174357204.replacecommit.requested
   ```
   etc., followed by more `deltacommit`s.
   
   
   - Are these in conflict somehow?
   - How can we tell what has been archived?
   - Is it possible that the archive/cleaner configs can be in conflict with each other?
   
   (This is from looking at the files in the S3 console - trying to use the hudi-cli commands is very slow, as it can take an hour to list commits for this table.)
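
Since listing commits through hudi-cli is slow for this table, the same questions can be approximated from the file names alone. The sketch below assumes the pre-1.0 timeline layout, in which `<timestamp>.<action>` with no suffix marks a completed instant and `.requested` / `.inflight` are earlier states of the same instant. Under that assumption the `clean` example above would be a completed clean, not an inflight one. The function names are hypothetical, and the input is the listing quoted in this comment.

```python
# File names copied from the .hoodie listing quoted in this comment.
FILES = [
    "20221119132417991.clean",
    "20221119132417991.clean.inflight",
    "20221119132417991.clean.requested",
    "20221208164706388.deltacommit",
    "20221208164706388.deltacommit.inflight",
    "20221208164706388.deltacommit.requested",
    "20221208164706388.savepoint",
    "20221208164706388.savepoint.inflight",
    "20221208174357204.replacecommit",
    "20221208174357204.replacecommit.inflight",
    "20221208174357204.replacecommit.requested",
]


def instant_states(file_names):
    """Map (timestamp, action) to its most advanced state on the timeline."""
    rank = {"requested": 0, "inflight": 1, "completed": 2}
    states = {}
    for name in file_names:
        parts = name.split(".")
        if len(parts) < 2:
            continue  # not an instant file
        ts, action = parts[0], parts[1]
        state = parts[2] if len(parts) > 2 else "completed"
        if state not in rank:
            continue  # unknown suffix, skip it
        key = (ts, action)
        if key not in states or rank[state] > rank[states[key]]:
            states[key] = state
    return states


def earliest_completed(states, action):
    """Return the smallest timestamp with a completed instant of `action`."""
    done = [ts for (ts, act), st in states.items()
            if act == action and st == "completed"]
    # Timestamps are fixed-width digit strings, so string order is time order.
    return min(done) if done else None


states = instant_states(FILES)
for (ts, action), state in sorted(states.items()):
    print(ts, action, state)
print("earliest deltacommit:", earliest_completed(states, "deltacommit"))
```

Feeding it the full `.hoodie` listing from S3 would show which instants are genuinely stuck in `requested` or `inflight`, and where the active timeline begins.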



[GitHub] [hudi] ehurheap commented on issue #8209: [SUPPORT] auto_clean stopped running during ingest

Posted by "ehurheap (via GitHub)" <gi...@apache.org>.

ehurheap commented on issue #8209:
URL: https://github.com/apache/hudi/issues/8209#issuecomment-1479853386

   [This ticket](https://github.com/apache/hudi/issues/8199) seems to describe similar symptoms

