Posted to server-dev@james.apache.org by "Jean Helou (Jira)" <se...@james.apache.org> on 2022/08/03 20:47:00 UTC

[jira] [Comment Edited] (JAMES-3793) OOM when loading a very large object from S3?

    [ https://issues.apache.org/jira/browse/JAMES-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574919#comment-17574919 ] 

Jean Helou edited comment on JAMES-3793 at 8/3/22 8:46 PM:
-----------------------------------------------------------

+1 for a hard crash on non-recoverable errors.

The Scala standard library provides an [interesting definition of what a NonFatal error|https://github.com/scala/scala/blob/2.13.x/src/library/scala/util/control/NonFatal.scala] is; it definitely doesn't include OOM :)
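
For readers more at home in Java, here is a rough sketch mirroring that definition (my own illustration, not code from Scala or James); the point is that VirtualMachineError, which includes OutOfMemoryError, is treated as fatal:

{code:java}
// Rough Java mirror of scala.util.control.NonFatal (illustrative sketch only).
// Fatal throwables should be rethrown / crash the JVM, never swallowed by handlers.
final class NonFatal {
    private NonFatal() {}

    static boolean isNonFatal(Throwable t) {
        return !(t instanceof VirtualMachineError   // includes OutOfMemoryError, StackOverflowError
              || t instanceof ThreadDeath
              || t instanceof InterruptedException
              || t instanceof LinkageError);        // Scala additionally excludes ControlThrowable
    }
}
{code}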


was (Author: jeantil):
+1 for a hard crash on non recoverable errors.

The Scala standard library provides an [interesting definition of what a NonFatal error|https://github.com/scala/scala/blob/2.13.x/src/library/scala/util/control/NonFatal.scala] is (see scala.util.control.NonFatal); it definitely doesn't include OOM :)

> OOM when loading a very large object from S3?
> ---------------------------------------------
>
>                 Key: JAMES-3793
>                 URL: https://issues.apache.org/jira/browse/JAMES-3793
>             Project: James Server
>          Issue Type: Bug
>            Reporter: Benoit Tellier
>            Priority: Major
>
> h2. What?
> We encountered recurring OutOfMemory exceptions on one of our production deployments.
> Memory dump analysis was inconclusive, which tends to rule out an explanation based on a memory leak (only 300MB of objects on the heap a few minutes after the OOM).
> A careful log analysis led to what seems to be the "original OOM":
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> at java.base/java.util.Arrays.copyOf(Unknown Source)
> at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
> at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
> at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
> at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
> at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>  
> Following this OOM the application is in a zombie state: unresponsive, throwing OOMs without stacktraces, with Cassandra queries that never finish, unable to obtain a RabbitMQ connection, and having issues within the S3 driver... This sounds like a limitation of reactive programming that prevents the Java platform from handling the OOM like it should (crash the app, take a dump, etc.).
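> For reference (an aside, not part of the original report), standard HotSpot options can force the hard crash plus heap dump on OOM instead of this zombie state; the dump path below is just an example:
> {code}
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/var/log/james/heapdump.hprof
> {code}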
> A quick, partial audit of our dataset found several emails/attachments exceeding 100MB (we might very well have some larger data!).
> Thus the current explanation is that somehow we successfully saved a very big mail in S3 and now get OOMs whenever something tries to read it (as the S3 blob store DAO makes defensive copies).
> h2. Possible actions
> This is an ongoing event, so our understanding of it may still evolve; yet since it raises interesting fixes that are hard to understand without the related context, I decided to share it here anyway. I will report upcoming developments here.
> Our first action is to confirm the current diagnosis:
>   - Further audit our datasets to find large items
>   - Deploy a patched version of James that rejects and logs S3 objects larger than 50MB
> Yet our current understanding leads to interesting questions...
> *Is it a good idea to load big objects from S3 into our memory?*
> As a preliminary answer, upon email reads we are using `byte[]` for simplicity (no resource management, full view of the data). Changing this is not in the scope of this ticket as it is likely a major rework with many unforeseen impacts. (I don't want to open that Pandora's box...)
> SMTP, IMAP, JMAP, and the mailet container all have configuration preventing sending/saving/receiving/uploading too big of a mail/attachment/blob, so we likely have a convincing line of defense at the protocol level. Yet this can be defeated by bad configuration (in our case JMAP was not checking the size of sent emails...), by history (the rules were not the same in the past, so we may have ingested too big of a mail back then), or by 'malicious action' (if all it takes to crash James is to replace a 1 MB mail by a 1 GB mail...). It thus sounds interesting to me to have additional protection at the data access layer, and to be able to (optionally) configure the S3 blob store to not load objects larger than, say, 50 MB. This can be added within the blob.properties file.
> Something like:
> {code:java}
> # Maximum size of blobs allowed to be loaded as a byte array. Allows preventing too-large objects from being loaded into memory (which can cause an OutOfMemoryError).
> # Optional, defaults to no limit being enforced. This is a size in bytes. Supported units are B, K, M, G, T (defaulting to B).
> max.blob.inmemory.size=50M
> {code}
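> To make the proposal concrete, here is a minimal sketch of how such a guard could look in the S3 read path (class and method names are illustrative assumptions, not existing James code):
> {code:java}
> import reactor.core.publisher.Mono;
> import software.amazon.awssdk.core.BytesWrapper;
> import software.amazon.awssdk.core.async.AsyncResponseTransformer;
> import software.amazon.awssdk.services.s3.S3AsyncClient;
> 
> // Hypothetical guard: check the object size with a HEAD request before
> // materializing the blob as a byte[] in memory.
> class SizeGuardedS3Reader {
>     private final S3AsyncClient client;
>     private final long maxBlobInMemorySize; // e.g. 50 * 1024 * 1024 for "50M"
> 
>     SizeGuardedS3Reader(S3AsyncClient client, long maxBlobInMemorySize) {
>         this.client = client;
>         this.maxBlobInMemorySize = maxBlobInMemorySize;
>     }
> 
>     Mono<byte[]> readBytes(String bucket, String key) {
>         return Mono.fromFuture(() -> client.headObject(b -> b.bucket(bucket).key(key)))
>             .flatMap(head -> head.contentLength() > maxBlobInMemorySize
>                 ? Mono.<byte[]>error(new IllegalArgumentException(
>                     "Blob " + key + " weighs " + head.contentLength()
>                         + " bytes, above the configured in-memory limit of " + maxBlobInMemorySize))
>                 : Mono.fromFuture(() -> client.getObject(b -> b.bucket(bucket).key(key),
>                         AsyncResponseTransformer.toBytes()))
>                     .map(BytesWrapper::asByteArray));
>     }
> }
> {code}
> Failing fast here turns a would-be OutOfMemoryError into an ordinary, recoverable error that can be logged and reported.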
> As an operator this would give me some peace of mind, knowing that James won't attempt to load gigabyte-sized emails into memory and would instead fail early, without heading into OOM territory and all the related stability issues it brings.
> Also, the incriminated code path (`BytesWrapper::asByteArray`) does a defensive copy, but there is an alternative: BytesWrapper::asByteArrayUnsafe. The S3 driver guarantees not to mutate the byte[], which sounds good enough given that James doesn't do it either. Preventing needless copies of multi-MB mails won't solve the core issue but will definitely give a nice performance boost, as well as decrease the impact of handling very large emails...
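> A hedged illustration of the difference (simplified, not the actual S3BlobStoreDAO code):
> {code:java}
> import software.amazon.awssdk.core.ResponseBytes;
> import software.amazon.awssdk.services.s3.model.GetObjectResponse;
> 
> class CopyVsNoCopy {
>     static byte[] withCopy(ResponseBytes<GetObjectResponse> bytes) {
>         // asByteArray() clones the underlying buffer: safe, but for a 100MB+ mail
>         // it momentarily doubles the memory needed at read time.
>         return bytes.asByteArray();
>     }
> 
>     static byte[] withoutCopy(ResponseBytes<GetObjectResponse> bytes) {
>         // asByteArrayUnsafe() returns the SDK's internal array without copying;
>         // acceptable only because neither the SDK nor James mutates it afterwards.
>         return bytes.asByteArrayUnsafe();
>     }
> }
> {code}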



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org