Posted to commits@hudi.apache.org by "gudladona (via GitHub)" <gi...@apache.org> on 2023/03/16 02:48:04 UTC
[GitHub] [hudi] gudladona opened a new issue, #8199: [SUPPORT] OOM during a Sync/Async clean operation
gudladona opened a new issue, #8199:
URL: https://github.com/apache/hudi/issues/8199
**OOM during a Sync or Async clean operation**
ENV:
Hudi version: 0.11.1
Java Version 1.8
Spark Version: 3.1.2
EMR version: 6.4
Clean Policy: KEEP_LATEST_BY_HOURS -- 24 hours(default)
Clean Parallelism: 200 (default)
Metadata: disabled
We have been experiencing consistent OOM errors when running a Hudi delta-streamer job in continuous mode. The OOM occurs during the "Generating list of file slices to be cleaned" phase. The image below shows the heap growth during the clean operation.
Most of the heap growth happens during two API calls that the executors make against the file system view: getReplacedFileGroupsBefore and getAllFileGroups.
<img width="1363" alt="image (7)" src="https://user-images.githubusercontent.com/7864088/225490459-566f1b27-0240-4149-b9a1-0e0cca68347f.png">
We also have JFR files containing memory profiles from a failed clean operation, but GitHub does not allow us to attach them to the issue.
We also tried setting `hoodie.embed.timeline.server.async: true`, which seems to have reduced the heap usage. The reduction appears to come from the single-threaded nature of the async executor. With this setting we see the following heap usage:
<img width="1233" alt="image" src="https://user-images.githubusercontent.com/7864088/225494337-f6bf3c78-8b75-4612-9f50-3ce43183a7df.png">
Flame Graph for the Async clean with async timeline server
<img width="1638" alt="image" src="https://user-images.githubusercontent.com/7864088/225494603-f96fc3b6-08b8-4f7a-ad7b-941cb0b992c1.png">
Flame Graph for Async clean with sync timeline server
<img width="1641" alt="image" src="https://user-images.githubusercontent.com/7864088/225494795-0eb42127-70e8-485e-8967-675e2c3e0abc.png">
**To Reproduce**
Description of the table's s3 partition structure
<s3-prefix>/tenant=[0-9]/date=YYYY-MM-DD
**Environment Description**
* Hudi version : 0.11.1
* Spark version : 3.2.1
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Stacktrace**
```
qtp1729765409-406
at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
at java.lang.StringCoding.encode(Ljava/nio/charset/Charset;[CII)[B (StringCoding.java:350)
at java.lang.String.getBytes(Ljava/nio/charset/Charset;)[B (String.java:941)
at io.javalin.Context.result(Ljava/lang/String;)Lio/javalin/Context; (Context.kt:364)
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(Lio/javalin/Context;Ljava/lang/Object;)V (RequestHandler.java:210)
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(Lio/javalin/Context;Ljava/lang/Object;)V (RequestHandler.java:176)
at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(Lio/javalin/Context;)V (RequestHandler.java:384)
at org.apache.hudi.timeline.service.RequestHandler$$Lambda$2356.handle(Lio/javalin/Context;)V (Unknown Source)
at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(Lio/javalin/Context;)V (RequestHandler.java:501)
at io.javalin.security.SecurityUtil.noopAccessManager(Lio/javalin/Handler;Lio/javalin/Context;Ljava/util/Set;)V (SecurityUtil.kt:22)
at io.javalin.Javalin$$Lambda$2336.manage(Lio/javalin/Handler;Lio/javalin/Context;Ljava/util/Set;)V (Unknown Source)
at io.javalin.Javalin.lambda$addHandler$0(Lio/javalin/Handler;Ljava/util/Set;Lio/javalin/Context;)V (Javalin.java:606)
at io.javalin.Javalin$$Lambda$2340.handle(Lio/javalin/Context;)V (Unknown Source)
at io.javalin.core.JavalinServlet$service$2$1.invoke()V (JavalinServlet.kt:46)
at io.javalin.core.JavalinServlet$service$2$1.invoke()Ljava/lang/Object; (JavalinServlet.kt:17)
at io.javalin.core.JavalinServlet$service$1.invoke(Lkotlin/jvm/functions/Function0;)V (JavalinServlet.kt:143)
at io.javalin.core.JavalinServlet$service$2.invoke()V (JavalinServlet.kt:41)
at io.javalin.core.JavalinServlet.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (JavalinServlet.kt:107)
at io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (JettyServerUtil.kt:72)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:203)
at org.apache.hudi.org.eclipse.jetty.servlet.ServletHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ServletHandler.java:480)
at org.apache.hudi.org.eclipse.jetty.server.session.SessionHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (SessionHandler.java:1668)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:201)
at org.apache.hudi.org.eclipse.jetty.server.handler.ContextHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ContextHandler.java:1247)
at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:144)
at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerList.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (HandlerList.java:61)
at org.apache.hudi.org.eclipse.jetty.server.handler.StatisticsHandler.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (StatisticsHandler.java:174)
at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerWrapper.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (HandlerWrapper.java:132)
at org.apache.hudi.org.eclipse.jetty.server.Server.handle(Lorg/apache/hudi/org/eclipse/jetty/server/HttpChannel;)V (Server.java:502)
at org.apache.hudi.org.eclipse.jetty.server.HttpChannel.handle()Z (HttpChannel.java:370)
at org.apache.hudi.org.eclipse.jetty.server.HttpConnection.onFillable()V (HttpConnection.java:267)
at org.apache.hudi.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded()V (AbstractConnection.java:305)
at org.apache.hudi.org.eclipse.jetty.io.FillInterest.fillable()Z (FillInterest.java:103)
at org.apache.hudi.org.eclipse.jetty.io.ChannelEndPoint$2.run()V (ChannelEndPoint.java:117)
at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(Ljava/lang/Runnable;)V (EatWhatYouKill.java:333)
at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(Z)Z (EatWhatYouKill.java:310)
at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(Z)V (EatWhatYouKill.java:168)
at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()V (EatWhatYouKill.java:126)
at org.apache.hudi.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()V (ReservedThreadExecutor.java:366)
at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Ljava/lang/Runnable;)V (QueuedThreadPool.java:765)
at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool$2.run()V (QueuedThreadPool.java:683)
at java.lang.Thread.run()V (Thread.java:750)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation
Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1542733620
We attempted a fix in https://github.com/apache/hudi/pull/8480.
Let us know if this helps solve the issue.
[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation
Posted by "gudladona (via GitHub)" <gi...@apache.org>.
gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1474500670
We may have some indicators of what is causing this problem.
We have a small file limit of 100MB. This appears to work well (it produces larger files and cleans up smaller ones) for an average partition that meets the size requirements.
However, for a very busy, high-volume partition, it seems to over-bucket the inserts into many files: based on the average record size and the size of the new inserts, it always exceeds the file size limit, which causes it to write to a new file group.
For example, here is the number of file groups written per instant (commit) in this partition:
```
aws s3 ls s3://<prefix>/<table>/<tenant>/date=20230316/ | awk -F _ '{print $3}' | sort | uniq -c | sort -nk1 | tail
167 20230316203454183.parquet
168 20230316195218670.parquet
168 20230316201208079.parquet
170 20230316200728433.parquet
175 20230316210557345.parquet
180 20230316130454342.parquet
182 20230316212237421.parquet
211 20230316192405566.parquet
245 20230316210251305.parquet
263 20230316204926437.parquet
```
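The same per-instant count can be reproduced offline from a saved listing. The sketch below is a rough Python equivalent of the awk pipeline above, assuming base file names of the form `<fileId>_<writeToken>_<instantTime>.parquet` (the layout the awk command relies on when it prints field 3). The function name and the sample file names are made up for illustration.

```python
from collections import Counter


def count_files_per_instant(file_names):
    """Count base files per commit instant.

    Assumes each name looks like <fileId>_<writeToken>_<instantTime>.parquet.
    Names that do not match this shape are skipped.
    """
    counts = Counter()
    for name in file_names:
        base = name.rsplit("/", 1)[-1]  # drop any key prefix
        parts = base.split("_")
        if len(parts) != 3 or not parts[2].endswith(".parquet"):
            continue  # skip log files, metadata, and anything unexpected
        instant = parts[2][: -len(".parquet")]
        counts[instant] += 1
    return counts


# Hypothetical file names, for illustration only.
sample = [
    "a1b2-0_1-10-100_20230316204926437.parquet",
    "c3d4-0_2-10-101_20230316204926437.parquet",
    "e5f6-0_1-12-120_20230316210251305.parquet",
    ".hoodie_partition_metadata",
]

# Print instants sorted by file count, smallest first, like `sort -nk1`.
for instant, n in sorted(count_files_per_instant(sample).items(), key=lambda kv: kv[1]):
    print(n, instant)
```

Feeding it the last column of the `aws s3 ls` output should give the same counts as the shell pipeline, without the `.parquet` suffix on each instant.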
As we can see, the sheer number of small files in this partition causes a HUGE JSON response from the driver, thereby triggering the OOM errors.
We need help figuring out how to tune this.
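One rough way to reason about the over-bucketing described above is a simplified model in which every new file group holds at most a maximum file size worth of records, sized by the estimated average record size. This is a sketch of that arithmetic only, not Hudi's partitioner code; the function name, the 120MB maximum file size, and the record counts are all assumptions chosen for illustration.

```python
import math


def estimate_new_file_groups(num_inserts, avg_record_bytes, max_file_bytes):
    """Simplified model of insert bucketing (assumption, not Hudi's code).

    Each new file group is assumed to hold max_file_bytes // avg_record_bytes
    records, so the number of new file groups is the insert count divided by
    that capacity, rounded up.
    """
    records_per_file = max(1, max_file_bytes // avg_record_bytes)
    return math.ceil(num_inserts / records_per_file)


# Hypothetical inputs: 10 million inserts into one partition in one commit.
MAX_FILE_BYTES = 120 * 1024 * 1024  # assumed maximum file size

for avg_record_bytes in (1024, 4096):
    groups = estimate_new_file_groups(10_000_000, avg_record_bytes, MAX_FILE_BYTES)
    print(avg_record_bytes, groups)
```

Under this model, an average record size estimate that is four times too large produces roughly four times as many file groups for the same batch, which is one way a busy partition could end up with hundreds of files per commit.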
[GitHub] [hudi] nfarah86 commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation
Posted by "nfarah86 (via GitHub)" <gi...@apache.org>.
nfarah86 commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1481697918
what are the memory configs you're using?
[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation
Posted by "gudladona (via GitHub)" <gi...@apache.org>.
gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1494817669
> what are the memory configs you're using?
This failed on the driver even with a max heap of 32G. The failure threshold appears to be relative to the number of files in the partition.
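To illustrate why the threshold would scale with the file count, here is a back-of-the-envelope sketch. It assumes the whole response is built as one in-memory string and then encoded to bytes, as the `String.getBytes` frame in the stack trace suggests, and that Java 8 strings use two bytes per character. The function name and every input value are hypothetical; the per-slice JSON size in particular was not measured.

```python
def estimate_response_heap_bytes(num_file_slices, json_chars_per_slice,
                                 concurrent_requests=1):
    """Rough heap needed to hold JSON responses as in-memory strings.

    Assumptions: 2 bytes per character for the string itself (UTF-16 char
    array on Java 8) plus about 1 byte per character for the encoded byte[]
    copy of mostly-ASCII JSON. Temporary buffers are ignored.
    """
    chars = num_file_slices * json_chars_per_slice
    per_request = chars * 2 + chars  # string + encoded copy
    return per_request * concurrent_requests


# Hypothetical inputs: 50,000 file slices in the partition, 1,000 JSON
# characters per slice, 20 executors requesting the view at once.
estimate = estimate_response_heap_bytes(50_000, 1_000, concurrent_requests=20)
print(f"{estimate / 1024 ** 3:.1f} GiB")
```

The estimate grows linearly with both the number of file slices and the number of concurrent requests, so a single very large partition can dominate the driver heap regardless of the configured maximum.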