Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2022/08/26 01:54:00 UTC

[jira] [Closed] (JAMES-3810) Performance issues with RabbitMQMailQueue browsing

     [ https://issues.apache.org/jira/browse/JAMES-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Tellier closed JAMES-3810.
---------------------------------
    Resolution: Fixed

> Performance issues with RabbitMQMailQueue browsing
> --------------------------------------------------
>
>                 Key: JAMES-3810
>                 URL: https://issues.apache.org/jira/browse/JAMES-3810
>             Project: James Server
>          Issue Type: Improvement
>          Components: Queue, rabbitmq
>    Affects Versions: 3.7.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. What
> Upon incidents, the browse start update process can be suspended if some old mails are stuck in the mail queue.
> Given enough time, browsing and browse start updates will fail (too much data loaded at once).
> Periodic browse start updates would then pile up, eventually creating performance issues.
> h3. The fix
>  - Allow better resilience upon browsing large amounts of data.
>  - Handle Cassandra exceptions correctly upon dequeue (nack the items)
>  - And finally provide a health check to WARN when the browse start is out of date (see the sketch below)
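> A minimal sketch of the threshold logic such a health check could apply (the class and method names here are hypothetical, not the actual James health check API):
> {code:java}
> import java.time.Clock;
> import java.time.Duration;
> import java.time.Instant;
> 
> public class BrowseStartFreshness {
>     // Assumption: warn once the browse start lags more than one week behind "now".
>     private static final Duration MAX_LAG = Duration.ofDays(7);
> 
>     private final Clock clock;
> 
>     public BrowseStartFreshness(Clock clock) {
>         this.clock = clock;
>     }
> 
>     // browseStart is the value read from the Cassandra mail queue view.
>     public boolean isOutOfDate(Instant browseStart) {
>         return Duration.between(browseStart, clock.instant()).compareTo(MAX_LAG) > 0;
>     }
> }
> {code}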
> h3. What to do when the browse start is out of date
>  - Identify the items still present in the queue
>  - Remove them from the queue
>  - and let everything catch up.
> Alternatively, if old content can be discarded, an aggressive manual update of the browse start can be an option.
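> For instance, a rough sketch of such a manual update through the Cassandra Java driver (keyspace, table and column names are assumptions and must be checked against the deployed schema; plain cqlsh works just as well):
> {code:java}
> import java.time.Instant;
> import java.time.temporal.ChronoUnit;
> 
> import com.datastax.oss.driver.api.core.CqlSession;
> 
> public class BrowseStartJump {
>     public static void main(String[] args) {
>         // Assumption: Cassandra reachable on localhost:9042 and a keyspace named "apache_james".
>         try (CqlSession session = CqlSession.builder()
>                 .withKeyspace("apache_james")
>                 .build()) {
>             // Jump the browse start to roughly one week ago. Do NOT touch contentStart:
>             // the content cleanup catches up on its own afterwards.
>             Instant newBrowseStart = Instant.now().minus(7, ChronoUnit.DAYS);
>             session.execute(
>                 "UPDATE browsestart SET browsestart = ? WHERE queuename = ?",
>                 newBrowseStart,
>                 "outgoing");
>         }
>     }
> }
> {code}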
> h3. Alternative
> Disabling the Cassandra part of the RabbitMQ mail queue is OK for an MDA, and the RabbitMQ mailQueue is unfit for an MX.
> Thus it would sound legit to just disable it.
> h3. Interesting Gitter discussion
> I leave it here as this is a very interesting discussion, explaining causes, effects, solutions, quick fixes, etc...
> {code:java}
> @iljaweis
> Hello, I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4. From reading https://github.com/apache/james-project/blob/master/src/adr/0031-distributed-mail-queue.md and https://github.com/apache/james-project/blob/master/src/adr/0032-distributed-mail-queue-cleanup.md it is my understanding that the content of the tables "contentstart" and "browsestart" indicates the oldest dates where we should start processing entries from enqueuedmailsv4. However, the content of both contentstart and browsestart for all queues is stuck at "2022-07-12 14:00:00" and does not change. The result is that we're querying lots of "old" data from enqueuedmailsv4, but our queues (judging from RabbitMQ) appear to be basically empty all the time. The value of updateBrowseStartPace is 1000, the default. We're using distributed-pop3-app. Is this behaviour expected, or if not, is there something I could look at? Thanks.
> @iljaweis
> Hmm. Looks like James uses LIMIT 5000, but we're already at 17435 rows for that timerangestart. Looks like we need to increase bucketCount, but is there a way to deal with these old rows?
> @chibenwa
> Hello.
> Emails between contentStart and browseStart are eligible for cleanup (meaning deletion from the objectStore, then from the Cassandra mailQueue view). BrowseStart matches (modulo the duration of slices) the oldest mail still in the queue. ContentStart matches the first email whose content is not deleted. So contentStart <= browseStart.
> > Looks like James uses LIMIT 5000
> I bet you are looking at logs on the Cassandra server, most likely upon massive query rates. That limit does not correspond to any code on the James side (cf. EnqueuedMailsDAO), but it is nothing to worry about as this is the default page size (the driver requests large reads in packets of 5000).
> > I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4.
> This is rather insufficient as a description for me to understand what happens, but (knowing the design of the underlying solution) let me take a bet:
> You are having performance issues as massive read queries are done by the RabbitMQ mail queue. This massive load is due to a full read of the mail queue content every time the browse start is read (it reads everything from 12/07, so slices * buckets, and returns loads of data).
> The root cause is the browseStart being "stuck" on 12/07. This could be because an email from 12/07 had not been properly processed.
> Things I would check includes:
> > Dead letter queues on the RabbitMQ management API. If they are not empty, audit them, and consider re-routing their content to resume delivery.
> (Deserialization failures, or a missing object in S3, can be the root cause; thus messages that will always fail to get dequeued are discarded into a dead letter queue to prevent an infinite loop while not losing data).
> > Check the content of the mail queue by browsing it. If this is not doable, audit the content of the oldest slice in enqueuedMailV4 and check it against deletedMailsV2. We are looking for content within enqueuedMailV4 that is not being marked as deleted.
> > The value of updateBrowseStartPace is 1000
> During the audit phase, until you know why "browseStart" is stuck, I recommend increasing it to an unreasonably high value to effectively disable browseStart updates. This will prevent the high query rates that are hurting you from happening.
> You could consider a higher value based on your traffic too. Are you delivering more than 1000 emails per slicing period?
> > Looks like we need to increase bucketCount
> Won't help. Unless you scale out your Cassandra cluster and want to spread the write load of the mail queue.
> Here the way to go is to audit why browse start updates are stuck.
> > Is this behaviour expected, or if not, is there something I could look at? Thanks.
> As explained before, that is a "not surprising" edge case. I'm interested in confirming the diagnosis, and also in finding the root cause.
> Regards
> Benoit
> @chibenwa
> Had a look. By default the browse start update does 256 slice reads, which is likely too much. We could decrease this number for it to be less aggressive.
> @chibenwa
> Also we could think of a healthcheck warning when the browseStart is older than say 1 week.
> @iljaweis
> Hello, yes, you guessed exactly right what the situation is. Sorry for being slightly unclear. Browsing the queue (using the web admin endpoint) timed out, so I checked for enqueueids from enqueuedmailsv4 that are not in deletedmailsv2 and found a lot of them. That day we actually had a problem with a backend, and a lot of mails suffered exceptions during queue handling. The Mails were moved to /var/mail/error, but also a bounce was created. The mails are still sitting in enqueuedmailsv4 with state=error.
> @iljaweis
> There isn't anything in any of the dead letter queues
> @iljaweis
> Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> @chibenwa
> Ok, so... I would manually "jump" the browse start past them (say to ~ 1 week ago).
> Don't touch the contentStart.
> Then the content cleanup will kick in and clean up dangling data.
> @chibenwa
> > Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> Speak to me more about this outage...
> Like: massive disturbance in the force on the Cassandra cluster side?
> ```
>    Flux<? extends MailQueue.MailQueueItem> deQueue() {
>         return Flux.using(receiverProvider::createReceiver,
>                 receiver -> receiver.consumeManualAck(this.name.toWorkQueueName().asString(), new ConsumeOptions().qos(this.prefetchCount.asInt())),
>                 Receiver::close)
>             .filter(getResponse -> getResponse.getBody() != null)
>             .flatMapSequential(this::loadItem)
>             .concatMap(this::filterIfDeleted);
>     }
>     private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
>         return mailQueueView.isPresent(item.getEnqueueId())
>             .handle((isPresent, sink) -> {
>                 if (isPresent) {
>                     sink.next(item);
>                 } else {
>                     item.done(true);
>                     sink.complete();
>                 }
>             });
>     }
> ```
> That would cause the "filterIfDeleted" method to fail?
> And if this fails, it does not look like the error handling is in place to correctly nack the corresponding message. That message will thus sit unacknowledged, and once the unacknowledged count exceeds the prefetch of 20, no subsequent deliveries are made until James is restarted....
> Sounds like a known story
> That does not explain "emails being dropped" though
> @iljaweis
> Would it be possible to just "skip" the bad timeslice by manually adjusting the browsestart/contentstart? We actually don't care about any of those old mails any more.
> @iljaweis
> Ah, sorry, didn't see you already answered.
> @iljaweis
> The outage was a problem with our S3 backend. We got a large number of "Exception calling LocalDelivery: org.apache.james.queue.api.MailQueue$MailQueueException: Timeout enqueueing" during the exact period. I checked some of the mails from enqueuedmailsv4 not in deletedmailsv2 and they ran into exactly that.
> @iljaweis
> I will skip the nasty day by adjusting the browsestart table.
> @iljaweis
> About updateBrowseStartPace: Slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> @chibenwa
> > I will skip the nasty day by adjusting the browsestart table.
> Update to something reasonably recent or you won't be able to update the browse start automatically. 1 week to one month.
> > About updateBrowseStartPace: Slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> That should be ok: you then waste resources trying to update the browse start too many times during that period. No other harmful consequences.
> @chibenwa
> WIP PR apache/james-project#1142
> @iljaweis
> I've set the values in browsestart to '2022-08-12 06:00:00.000000+0000' and let it run over night.
> @iljaweis
> Spool there is now at '2022-08-23 05:00:00' but outgoing still at '2022-08-12 06:00:00'. No progress for contentstart
> @iljaweis
> Can't see any obvious exception. I could try and add some glowroot instrumentation, would org.apache.james.queue.rabbitmq.view.cassandra be the correct package?
> @iljaweis
> Ah, now both browsestart and contentstart for "spool" have caught up, but no movement for "outgoing". Perhaps it's caused by https://issues.apache.org/jira/browse/JAMES-3791, because we have ongoing problems with syntactically broken recipient addresses. We'll deploy a fix for that the next days.
> @chibenwa
> contentStart catches up on browseStart with a probability of 1/updatePace on every mail.
>  - It is random
>  - Spool likely has more traffic than outgoing.
> I'd just take a coffee and relax: if it worked for your spool, when triggered it will work for your outgoing queue too.
> @iljaweis
> Will do that coffee and relax thing, and wait for our bugfix deployment anyway. Thank you for your help! :-)
> @iljaweis
> That coffee thing worked. We're up to date with contentstart.
> @chibenwa
> I'm opening a JIRA to preserve this discussion and contextualize the above proposed PR.
> {code}
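> For reference, the dequeue error handling gap discussed above could be closed along these lines. This is a hedged sketch against the snippet quoted in the discussion (assuming item.done(false) nacks the message, as suggested by the quoted item API), not the actual change of the linked PR:
> {code:java}
>     private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
>         return mailQueueView.isPresent(item.getEnqueueId())
>             .<RabbitMQMailQueueItem>handle((isPresent, sink) -> {
>                 if (isPresent) {
>                     sink.next(item);
>                 } else {
>                     item.done(true);
>                     sink.complete();
>                 }
>             })
>             .onErrorResume(e -> {
>                 // If the Cassandra presence check fails, nack the message so it is redelivered
>                 // instead of sitting unacknowledged until the prefetch (20) is exhausted.
>                 item.done(false);
>                 return Mono.error(e);
>             });
>     }
> {code}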



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org