Posted to server-dev@james.apache.org by Benoit TELLIER <bt...@apache.org> on 2022/05/18 11:28:37 UTC

Improving the reliability of processing in James

*# Context*

James does mostly two kinds of processing:

  - Mail processing: when a mail is received over SMTP, it is enqueued 
and the mailet processing is then executed. Mailets and matchers are 
called against it and a series of decisions can be made: store the mail 
in a user mailbox, forward it, bounce it, ignore it, etc. (see the 
mailet sketch after this list)
  - Event processing: once actions are taken in a user mailbox, an event 
is emitted on the event bus. Listeners are then called to "decorate" 
features of the mailbox manager in a non-invasive way: ElasticSearch 
indexing, quota management, JMAP projections and state, IMAP/JMAP 
notifications...
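
To make the mail processing side concrete, here is a minimal, 
illustrative mailet sketch. Only GenericMailet, Mail and the GHOST state 
come from the Mailet API; the class name and the heuristic are made up 
for the example (and depending on the James version, javax.mail may be 
jakarta.mail):

import javax.mail.MessagingException;

import org.apache.mailet.Mail;
import org.apache.mailet.base.GenericMailet;

/**
 * Illustrative mailet: silently drops mails it considers suspicious and
 * lets the rest continue through the current processor. Any exception
 * thrown here sends the mail to the error handling described below.
 */
public class DropSuspiciousMail extends GenericMailet {

    @Override
    public void service(Mail mail) throws MessagingException {
        if (looksSuspicious(mail)) {
            // GHOST aborts processing: the mail is dropped
            mail.setState(Mail.GHOST);
        }
        // otherwise the next mailet/matcher pair of the processor runs
    }

    private boolean looksSuspicious(Mail mail) throws MessagingException {
        // made-up heuristic, stands in for real matcher/mailet logic
        String subject = mail.getMessage().getSubject();
        return subject != null && subject.contains("VIAGRA");
    }
}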

Of course, each of these processing steps can fail, and error management then applies.

Regarding mail processing, the failing mail is stored in /var/mail/error.
To detect incidents:
   - ERROR logs during processing
   - Webadmin calls show a non-zero size for the /var/mail/error repository
To fix this incident:
   - Explicit admin action is required, and if needed a reprocessing can 
be attempted through webadmin (example below)
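
For reference, such a reprocessing can be triggered through the webadmin 
mail repository endpoints. Here is a sketch using the JDK HTTP client; 
the host, port and repository path are assumptions, and the exact 
endpoint should be double-checked against the webadmin documentation of 
your James version:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReprocessErrorRepository {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // "var/mail/error" URL-encoded; adapt to your repository path
        String repository = "var%2Fmail%2Ferror";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:8000/mailRepositories/"
                + repository + "/mails?action=reprocess"))
            .method("PATCH", HttpRequest.BodyPublishers.noBody())
            .build();

        // webadmin answers with the id of the asynchronous task it creates
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}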

Regarding event processing, listener execution is retried several times 
with a delay. If it keeps failing, the event is eventually stored in 
dead letter.
To detect incidents:
  - ERROR logs during processing
  - WebAdmin reports a non-zero size for dead letter
  - A health check, which eventually emits a recurring WARNING log that 
cannot be missed
To fix this incident:
   - Explicit admin action is required, and if needed a redelivery can 
be attempted through webadmin (example below)
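
Similarly, redelivery of dead-lettered events can be triggered through 
webadmin. Same caveats as above: host and port are assumptions, and the 
endpoint should be checked against your version's documentation:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RedeliverDeadLetterEvents {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // asks James to retry every event currently stored in dead letter
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:8000/events/deadLetter?action=reDeliver"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();

        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}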

*# Problem statement*

Most users miss this critical part of error management in James.

Actions are never taken, and problems pile up.

While major incidents, with thousands of failed items, would 
understandably still call for admin intervention, I would like small 
incidents to self-recover without human intervention.

In practice, none of my clients (me included) managed to set up a 
reliable action plan regarding processing failures. Problems could be 
tackled months after they arise, needlessly escalating into major issues.

*# Proposal*

Add a recurring job that takes care of reprocessing/redelivery.

In order to avoid a retry storm, I propose an upper bound on the count 
of entities reprocessed/redelivered per run. Such a retry storm could 
otherwise occur upon large-scale incidents or persistent failures.
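
A minimal sketch of what such a recurring job could look like. The 
ReprocessingService interface and its limit parameter are assumptions 
made for this example, not existing James APIs; the point is the fixed 
schedule combined with the upper bound:

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecurringReprocessingJob {

    /** Hypothetical abstraction over mail reprocessing / event redelivery. */
    public interface ReprocessingService {
        // reprocess at most 'limit' entities, return how many were picked up
        int reprocessOldest(int limit);
    }

    private static final int MAX_ENTITIES_PER_RUN = 10; // guards against retry storms
    private static final Duration PERIOD = Duration.ofHours(1);

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final ReprocessingService reprocessingService;

    public RecurringReprocessingJob(ReprocessingService reprocessingService) {
        this.reprocessingService = reprocessingService;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(
            () -> reprocessingService.reprocessOldest(MAX_ENTITIES_PER_RUN),
            PERIOD.toMinutes(), PERIOD.toMinutes(), TimeUnit.MINUTES);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}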

*# Impact*

In practice, one reprocessing/redelivery run per hour, capped at 10 
elements, would enable auto-recovery from minor incidents without 
endangering overall performance, and would greatly benefit overall 
stability.

We would thus achieve auto-recovery of minor incidents.

Major incidents / persistent failures would still require admin action.

*# Alternative*

An upper bound on redelivery/reprocessing exposed by webadmin would 
enable the same kind of behaviour with an external CRON.

Benefits:
  - Simpler code in the James code base
Drawbacks:
  - Too easy for an admin to forget to configure this CRON

If the main proposal is deemed too complex, then this one would still 
fit my needs. I would just be happy to sell consulting telling users to 
configure these CRONs ;-)

*# Related topics*

1. We would benefit from having the following metrics:

  - Size of deadletter over time
  - Size of mail repository over time

It would enable setting up Grafana boards, allowing issues to be 
detected without grokking the logs.
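
As a sketch of what exposing such metrics could look like, using plain 
Dropwizard Metrics gauges (James actually goes through its own metric 
abstraction, and the size suppliers here are hypothetical):

import java.util.function.Supplier;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class ProcessingBacklogMetrics {

    public static void register(MetricRegistry registry,
                                Supplier<Long> errorRepositorySize, // size of /var/mail/error
                                Supplier<Long> deadLetterSize) {    // count of dead-lettered events
        registry.register("mailRepository.error.size",
            (Gauge<Long>) errorRepositorySize::get);
        registry.register("eventBus.deadLetter.size",
            (Gauge<Long>) deadLetterSize::get);
    }
}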

2. We could have a health check checking the /var/mail/error size and 
emitting a recurring WARNING.

In practice this proved useful to diagnose dead letter issues. I think 
the error mail repository would benefit from it too.
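
A rough sketch of what such a check could look like. It is simplified 
and hypothetical (it does not reproduce the actual James health check 
API), but it shows the intent: keep warning as long as the error mail 
repository is not empty, mirroring what already exists for dead letter:

import java.util.function.Supplier;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ErrorRepositoryHealthCheck {
    private static final Logger LOGGER =
        LoggerFactory.getLogger(ErrorRepositoryHealthCheck.class);

    // hypothetical accessor for the size of /var/mail/error
    private final Supplier<Long> errorRepositorySize;

    public ErrorRepositoryHealthCheck(Supplier<Long> errorRepositorySize) {
        this.errorRepositorySize = errorRepositorySize;
    }

    public boolean check() {
        long size = errorRepositorySize.get();
        if (size > 0) {
            LOGGER.warn("{} mail(s) accumulated in the error mail repository, "
                + "reprocessing or admin action is needed", size);
            return false;
        }
        return true;
    }
}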

*# Follow up*

I would be happy to open an ADR following this mailing thread. I can 
open JIRAs for the selected solutions.

As we plan to do a Polish Sprint in June, Linagora would have some 
bandwidth to work on the topic then.

----------------

Best regards,

Benoit