Posted to commits@hudi.apache.org by "gudladona (via GitHub)" <gi...@apache.org> on 2023/03/16 02:48:04 UTC

[GitHub] [hudi] gudladona opened a new issue, #8199: [SUPPORT] OOM during a Sync/Async clean operation

gudladona opened a new issue, #8199:
URL: https://github.com/apache/hudi/issues/8199

   
   
   **OOM during a Sync or Async clean operation**
   
   ENV:
   
   Hudi version: 0.11.1
   Java Version 1.8
   Spark Version: 3.1.2
   EMR version: 6.4
   Clean Policy: KEEP_LATEST_BY_HOURS -- 24 hours(default)
   Clean Parallelism: 200 (default)
   Metadata: disabled
   
   We have been experiencing consistent OOM errors when running a Hudi DeltaStreamer job in continuous mode. The OOM occurs during the "Generating list of file slices to be cleaned" phase. The image below shows the heap growth during the clean operation.
   The heap growth happens mainly during two API calls that the executors make on the file system view: getReplacedFileGroupsBefore and getAllFileGroups.
   
   <img width="1363" alt="image (7)" src="https://user-images.githubusercontent.com/7864088/225490459-566f1b27-0240-4149-b9a1-0e0cca68347f.png">
   
   We also have JFR files containing memory profiles from a failed clean operation, but GitHub does not allow us to attach them to the issue.
   
   
   We also tried setting `hoodie.embed.timeline.server.async: true`, which seems to have reduced the heap usage. We suspect this is due to the single-threaded nature of the async executor. With this setting we see the following heap usage:
   
   <img width="1233" alt="image" src="https://user-images.githubusercontent.com/7864088/225494337-f6bf3c78-8b75-4612-9f50-3ce43183a7df.png">
   
   Flame Graph for the Async clean with async timeline server
   
   <img width="1638" alt="image" src="https://user-images.githubusercontent.com/7864088/225494603-f96fc3b6-08b8-4f7a-ad7b-941cb0b992c1.png">
   
   Flame Graph for Async clean with sync timeline server
   
   <img width="1641" alt="image" src="https://user-images.githubusercontent.com/7864088/225494795-0eb42127-70e8-485e-8967-675e2c3e0abc.png">
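
For reference, the cleaner and timeline-server settings described above could be written as writer properties roughly as follows. This is only a sketch: the key names are assumed to be the standard Hudi 0.11.x configuration keys, and `to_properties` is a hypothetical helper, not part of Hudi.

```python
# Cleaner / timeline-server settings from the ENV section, written as Hudi
# writer properties. Key names are assumed to match Hudi 0.11.x.
HUDI_CLEAN_OPTS = {
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "24",
    "hoodie.cleaner.parallelism": "200",
    "hoodie.metadata.enable": "false",
    # The setting we tried in order to reduce heap usage on the driver.
    "hoodie.embed.timeline.server.async": "true",
}


def to_properties(opts):
    """Render options as key=value lines (hypothetical helper)."""
    return "\n".join(f"{key}={value}" for key, value in opts.items())


if __name__ == "__main__":
    print(to_properties(HUDI_CLEAN_OPTS))
```

The output can be appended to the properties file passed to the job, or the same keys can be supplied one by one as `--hoodie-conf key=value` arguments.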
   
   
   **To Reproduce**
   
   Description of the table's S3 partition structure:
   
   <s3-prefix>/tenant=[0-9]/date=YYYY-MM-DD
   
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Stacktrace**
   
   ```
   qtp1729765409-406
     at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
     at java.lang.StringCoding.encode(Ljava/nio/charset/Charset;[CII)[B (StringCoding.java:350)
     at java.lang.String.getBytes(Ljava/nio/charset/Charset;)[B (String.java:941)
     at io.javalin.Context.result(Ljava/lang/String;)Lio/javalin/Context; (Context.kt:364)
     at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(Lio/javalin/Context;Ljava/lang/Object;)V (RequestHandler.java:210)
     at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(Lio/javalin/Context;Ljava/lang/Object;)V (RequestHandler.java:176)
     at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(Lio/javalin/Context;)V (RequestHandler.java:384)
     at org.apache.hudi.timeline.service.RequestHandler$$Lambda$2356.handle(Lio/javalin/Context;)V (Unknown Source)
     at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(Lio/javalin/Context;)V (RequestHandler.java:501)
     at io.javalin.security.SecurityUtil.noopAccessManager(Lio/javalin/Handler;Lio/javalin/Context;Ljava/util/Set;)V (SecurityUtil.kt:22)
     at io.javalin.Javalin$$Lambda$2336.manage(Lio/javalin/Handler;Lio/javalin/Context;Ljava/util/Set;)V (Unknown Source)
     at io.javalin.Javalin.lambda$addHandler$0(Lio/javalin/Handler;Ljava/util/Set;Lio/javalin/Context;)V (Javalin.java:606)
     at io.javalin.Javalin$$Lambda$2340.handle(Lio/javalin/Context;)V (Unknown Source)
     at io.javalin.core.JavalinServlet$service$2$1.invoke()V (JavalinServlet.kt:46)
     at io.javalin.core.JavalinServlet$service$2$1.invoke()Ljava/lang/Object; (JavalinServlet.kt:17)
     at io.javalin.core.JavalinServlet$service$1.invoke(Lkotlin/jvm/functions/Function0;)V (JavalinServlet.kt:143)
     at io.javalin.core.JavalinServlet$service$2.invoke()V (JavalinServlet.kt:41)
     at io.javalin.core.JavalinServlet.service(Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (JavalinServlet.kt:107)
     at io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (JettyServerUtil.kt:72)
     at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:203)
     at org.apache.hudi.org.eclipse.jetty.servlet.ServletHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ServletHandler.java:480)
     at org.apache.hudi.org.eclipse.jetty.server.session.SessionHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (SessionHandler.java:1668)
     at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:201)
     at org.apache.hudi.org.eclipse.jetty.server.handler.ContextHandler.doScope(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ContextHandler.java:1247)
     at org.apache.hudi.org.eclipse.jetty.server.handler.ScopedHandler.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (ScopedHandler.java:144)
     at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerList.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (HandlerList.java:61)
     at org.apache.hudi.org.eclipse.jetty.server.handler.StatisticsHandler.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (StatisticsHandler.java:174)
     at org.apache.hudi.org.eclipse.jetty.server.handler.HandlerWrapper.handle(Ljava/lang/String;Lorg/apache/hudi/org/eclipse/jetty/server/Request;Ljavax/servlet/http/HttpServletRequest;Ljavax/servlet/http/HttpServletResponse;)V (HandlerWrapper.java:132)
     at org.apache.hudi.org.eclipse.jetty.server.Server.handle(Lorg/apache/hudi/org/eclipse/jetty/server/HttpChannel;)V (Server.java:502)
     at org.apache.hudi.org.eclipse.jetty.server.HttpChannel.handle()Z (HttpChannel.java:370)
     at org.apache.hudi.org.eclipse.jetty.server.HttpConnection.onFillable()V (HttpConnection.java:267)
     at org.apache.hudi.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded()V (AbstractConnection.java:305)
     at org.apache.hudi.org.eclipse.jetty.io.FillInterest.fillable()Z (FillInterest.java:103)
     at org.apache.hudi.org.eclipse.jetty.io.ChannelEndPoint$2.run()V (ChannelEndPoint.java:117)
     at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(Ljava/lang/Runnable;)V (EatWhatYouKill.java:333)
     at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(Z)Z (EatWhatYouKill.java:310)
     at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(Z)V (EatWhatYouKill.java:168)
     at org.apache.hudi.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()V (EatWhatYouKill.java:126)
     at org.apache.hudi.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()V (ReservedThreadExecutor.java:366)
     at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Ljava/lang/Runnable;)V (QueuedThreadPool.java:765)
     at org.apache.hudi.org.eclipse.jetty.util.thread.QueuedThreadPool$2.run()V (QueuedThreadPool.java:683)
     at java.lang.Thread.run()V (Thread.java:750)
   ```
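
Reading the trace from the top, the request handler has already serialized the result to a String (`writeValueAsStringSync`), and the OOM is thrown when `String.getBytes` allocates the encoded copy of that String. The Python sketch below only imitates that pattern to show why the footprint grows with the number of file slices. The record fields are invented for illustration and are not the actual response format of the timeline server.

```python
import json


def response_footprint(num_file_slices):
    """Return (chars in JSON string, bytes in encoded body) for a fake response.

    The record fields below are invented; they only stand in for whatever
    the timeline server serializes per file slice.
    """
    slices = [
        {"partition": "tenant=1/date=2023-03-16",
         "fileId": f"file-{i:08d}",
         "baseInstantTime": "20230316204926437"}
        for i in range(num_file_slices)
    ]
    payload = json.dumps(slices)      # whole response as one string
    body = payload.encode("utf-8")    # second full copy, as bytes
    return len(payload), len(body)


if __name__ == "__main__":
    chars, size = response_footprint(100_000)
    print(f"{chars} chars in the string, {size} bytes in the encoded copy")
```

Both the string and the byte array are alive at the same time, so the peak memory per request is roughly the sum of the two, and several concurrent requests from executors multiply that.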
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1542733620

   We attempted a fix in https://github.com/apache/hudi/pull/8480.
   Let us know if it helps solve the issue.




[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

Posted by "gudladona (via GitHub)" <gi...@apache.org>.
gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1474500670

   We may have some indicators of what is causing this problem.
   
   We have a small-file limit of 100 MB. This appears to work well (it produces larger files and cleans up smaller ones) for an average partition that meets the size requirements.
   
   However, for a very busy, high-volume partition, it seems to over-bucket the inserts into many files: based on the average record size and the size of the new inserts, it always exceeds the file size limit, which causes it to write to new file groups.
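
One way to reason about the over-bucketing is the simplified model below. It assumes each new file is filled up to a maximum size using an estimated average record size; it is not Hudi's actual code, and all numbers are hypothetical.

```python
import math


def estimated_insert_buckets(num_inserts, avg_record_bytes, max_file_bytes):
    """Simplified model: how many new file groups a batch of inserts needs.

    Assumes each new file is filled up to max_file_bytes using the
    estimated average record size. This is not Hudi's actual code.
    """
    records_per_file = max(1, max_file_bytes // avg_record_bytes)
    return math.ceil(num_inserts / records_per_file)


if __name__ == "__main__":
    mb = 1024 * 1024
    # Hypothetical numbers, chosen only for illustration.
    print(estimated_insert_buckets(2_000_000, 1024, 120 * mb))
    print(estimated_insert_buckets(2_000_000, 16 * 1024, 120 * mb))
```

In this model, a larger record-size estimate for the same batch of inserts directly multiplies the number of new file groups.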
   
   For example, here is the number of file groups written per instant (commit) in this partition:
   
   ```
   aws s3 ls s3://<prefix>/<table>/<tenant>/date=20230316/ | awk -F _ '{print $3}' | sort | uniq -c | sort -nk1  | tail
    167 20230316203454183.parquet
    168 20230316195218670.parquet
    168 20230316201208079.parquet
    170 20230316200728433.parquet
    175 20230316210557345.parquet
    180 20230316130454342.parquet
    182 20230316212237421.parquet
    211 20230316192405566.parquet
    245 20230316210251305.parquet
    263 20230316204926437.parquet
   ```
   
   As we can see, the sheer number of small files in this partition causes a huge JSON response from the driver, thereby triggering the OOM errors.
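
The same per-instant count can be sketched in Python for use in a monitoring script. It assumes file names shaped like `<fileId>_<writeToken>_<instantTime>.parquet`, which is the layout the awk command above relies on when it prints the third `_`-separated field. The sample names are made up.

```python
from collections import Counter


def files_per_instant(file_names):
    """Count base files per commit instant.

    Assumes names shaped like <fileId>_<writeToken>_<instantTime>.parquet,
    the same third "_"-separated field the awk command above prints.
    """
    counts = Counter()
    for name in file_names:
        parts = name.split("_")
        if len(parts) < 3 or not name.endswith(".parquet"):
            continue  # skip anything that is not a base file
        instant = parts[2].split(".")[0]
        counts[instant] += 1
    return counts


if __name__ == "__main__":
    # Made-up file names for illustration only.
    sample = [
        "a1_0-1-0_20230316204926437.parquet",
        "b2_0-1-1_20230316204926437.parquet",
        "c3_0-2-0_20230316210251305.parquet",
        ".hoodie_partition_metadata",
    ]
    for instant, n in files_per_instant(sample).most_common():
        print(n, instant)
```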
   
   We need help figuring out how to tune this.




[GitHub] [hudi] nfarah86 commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

Posted by "nfarah86 (via GitHub)" <gi...@apache.org>.
nfarah86 commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1481697918

   what are the memory configs you're using?




[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

Posted by "gudladona (via GitHub)" <gi...@apache.org>.
gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1494817669

   > what are the memory configs you're using?
   
   This failed on the driver even with a max heap of 32 GB. The failure threshold appears to be relative to the number of files in the partition.

