Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2022/08/20 06:16:00 UTC
[jira] [Closed] (JAMES-3784) Ease mail repository / event dead letter operation
[ https://issues.apache.org/jira/browse/JAMES-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier closed JAMES-3784.
---------------------------------
Resolution: Fixed
Done
> Ease mail repository / event dead letter operation
> --------------------------------------------------
>
> Key: JAMES-3784
> URL: https://issues.apache.org/jira/browse/JAMES-3784
> Project: James Server
> Issue Type: Improvement
> Components: mailbox, MailStore & MailRepository
> Affects Versions: master
> Reporter: Benoit Tellier
> Priority: Major
> Fix For: 3.8.0
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> h3. Mailing list thread
> https://www.mail-archive.com/server-dev@james.apache.org/msg72012.html
> h3. context
> James mostly does two kinds of processing:
> - Mail processing: when a mail is received over SMTP, it is enqueued and then mailet processing is executed. Mailets/matchers are called against it and a series of decisions can be made: store this mail in a user mailbox, forward it, bounce it, ignore it, etc.
> - Event processing: once actions are taken in a user mailbox, an event is emitted to the event bus. Listeners are then called to "decorate" features of the mailbox manager in a non-invasive way: ElasticSearch indexing, quota management, JMAP projections and state, IMAP/JMAP notifications...
> Of course, each of these kinds of processing can fail, and error management is applied.
> Regarding mail processing, a mail whose processing failed is stored in /var/mail/error.
> To detect incidents:
> - ERROR logs during processing
> - WebAdmin calls show a non-zero /var/mail/error repository size
> To fix this incident:
> - Explicit admin action is required, and if needed a reprocessing can be attempted (webadmin)
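The repository-size check above can be scripted. The following is a minimal sketch, not taken from the James code base: the base URL is hypothetical, and the `/mailRepositories/{encodedPath}` route and the `size` field of its JSON response are assumptions that should be verified against the WebAdmin documentation of the deployed version.

```python
import json
from urllib.parse import quote


def repository_url(base_url: str, repository_path: str) -> str:
    """Build the URL describing a mail repository (route is an assumption)."""
    # The repository path contains slashes, so it is percent-encoded
    # to fit into a single URL path segment.
    encoded = quote(repository_path, safe="")
    return f"{base_url.rstrip('/')}/mailRepositories/{encoded}"


def needs_attention(response_body: str) -> bool:
    """Return True when the repository holds at least one mail.

    Assumed response shape: {"repository": ..., "path": ..., "size": <int>}
    """
    return json.loads(response_body).get("size", 0) > 0


# Hypothetical WebAdmin address and a hand-written sample response.
url = repository_url("http://localhost:8000", "var/mail/error/")
sample = '{"repository": "var/mail/error/", "path": "var/mail/error/", "size": 3}'
print(url, needs_attention(sample))
```

Keeping URL construction and response parsing separate from the HTTP call makes the check easy to test without a running server.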
> Regarding event processing, listener execution is retried several times with a delay. If it keeps failing, the event is eventually stored in dead letter.
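To make the retry-then-dead-letter behaviour concrete, here is a small illustrative sketch. It is not James's implementation: the function name, the retry count, the fixed delay and the in-memory dead-letter list are all hypothetical stand-ins for the real event bus.

```python
import time
from typing import Any, Callable, List


def dispatch(listener: Callable[[Any], None], event: Any,
             dead_letters: List[Any], max_retries: int = 3,
             delay_seconds: float = 0.0) -> bool:
    """Run the listener, retrying on failure; park the event if all attempts fail."""
    attempts = 1 + max_retries  # one initial attempt plus the retries
    for attempt in range(attempts):
        try:
            listener(event)
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    # Every attempt failed: keep the event so it can be redelivered later.
    dead_letters.append(event)
    return False
```

An event that ends up in `dead_letters` is what an administrator would later redeliver.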
> To detect incidents:
> - ERROR logs during processing
> - WebAdmin reports a non-zero size for deadletter
> - Health check, which eventually emits a recurring WARNING log that cannot be missed.
> To fix this incident:
> - Explicit admin action is required, and if needed a redelivery can be attempted (webadmin)
> h3. Problem statement
> Most users miss this critical part of error management in James.
> Actions are never taken, and problems pile up.
> While major incidents with thousands of failures would clearly benefit from an admin intervention, I would like small incidents to recover on their own, without human intervention.
> In practice, none of my clients (me included) managed to set up a reliable action plan regarding processing failures. Problems may only be tackled months after they arise, needlessly escalating into major issues.
> h3. Proposed solution
> - Implement a health check that verifies /var/mail/error is empty
> - An upper bound on redelivery/reprocessing exposed by webadmin
> The goal of this limit is to prevent unbounded processing that could consume unbounded resources. Auto-healing could be budgeted for (e.g. 10 mails/min).
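A quick back-of-the-envelope helper shows what such a budget implies. This is only a sketch: the function name is hypothetical, and it assumes every run processes exactly `limit_per_run` items until the backlog is empty.

```python
def runs_to_drain(backlog: int, limit_per_run: int) -> int:
    """Number of bounded runs needed to process the whole backlog."""
    if limit_per_run <= 0:
        raise ValueError("limit_per_run must be positive")
    # Ceiling division using integers only.
    return (backlog + limit_per_run - 1) // limit_per_run


# With one run per minute and a budget of 10 items per run:
small_incident = runs_to_drain(25, 10)
large_outage = runs_to_drain(5000, 10)
print(small_incident, large_outage)
```

At one run per minute, a backlog of 25 items drains in 3 minutes, while 5000 items would take 500 minutes. A small incident heals quickly within the budget, whereas a large outage would take hours.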
> A human intervention is still needed in some cases:
> - Massive outages that require a full redelivery/reprocessing
> - Bugs that cause recurring failure.
> The goal is to have auto-healing in place, provided those tasks are triggered by CRONs.
> CRONs remove the need for extra James-based developments that add complexity.
> h3. Proposed changes
> Add a `limit` parameter to reprocessing/redelivery.
> If specified, it limits the number of elements reprocessed/redelivered. If unspecified, the number of processed elements is unbounded (as it is today).
> Endpoints to modify:
> - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_reprocessing_mails_from_a_mail_repository
> - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_all_events
> - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_group_events
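The optional `limit` parameter could be wired into client-side calls as sketched below. The base URL is hypothetical, and the route shapes and action names (`reprocess`, `reDeliver`) are assumptions based on the WebAdmin pages linked above; they should be checked against the documentation of the deployed version.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def action_url(base_url: str, path: str, action: str,
               limit: Optional[int] = None) -> str:
    """Build an action URL; `limit` is only sent when specified."""
    params = {"action": action}
    if limit is not None:
        params["limit"] = limit  # omitted means unbounded, as today
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


def reprocess_url(base_url: str, repository_path: str,
                  limit: Optional[int] = None) -> str:
    # The repository path is percent-encoded into one path segment.
    encoded = quote(repository_path, safe="")
    return action_url(base_url, f"/mailRepositories/{encoded}/mails",
                      "reprocess", limit)


def redeliver_all_url(base_url: str, limit: Optional[int] = None) -> str:
    return action_url(base_url, "/events/deadLetter", "reDeliver", limit)


def redeliver_group_url(base_url: str, group: str,
                        limit: Optional[int] = None) -> str:
    encoded = quote(group, safe="")
    return action_url(base_url, f"/events/deadLetter/groups/{encoded}",
                      "reDeliver", limit)
```

Leaving `limit` out of the query string when it is `None` preserves the current unbounded behaviour for existing callers.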
> We also need:
> - to update webadmin documentation accordingly.
> - to recommend a CRON of e.g. 10 redeliveries/reprocessings per minute in our operation guides. (https://james.apache.org/server/manage-guice-distributed-james.html + https://james.staged.apache.org/james-distributed-app/3.7.0/operate/guide.html)
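A CRON-driven auto-healing job along these lines might look like the sketch below. Everything specific in it is an assumption for illustration: the WebAdmin address, the routes and HTTP methods, and the absence of authentication (a secured WebAdmin would need extra headers).

```python
import urllib.request
from typing import Callable, List

BASE_URL = "http://localhost:8000"  # hypothetical WebAdmin address
LIMIT = 10                          # budget per run, per the recommendation

# (HTTP method, route) pairs; routes and methods are assumptions.
TASKS = [
    ("PATCH", "/mailRepositories/var%2Fmail%2Ferror%2F/mails?action=reprocess"),
    ("POST", "/events/deadLetter?action=reDeliver"),
]


def build_requests(base_url: str, limit: int) -> List[urllib.request.Request]:
    """Prepare one bounded request per healing task."""
    return [
        urllib.request.Request(f"{base_url}{route}&limit={limit}", method=method)
        for method, route in TASKS
    ]


def heal_once(send: Callable = urllib.request.urlopen,
              base_url: str = BASE_URL, limit: int = LIMIT) -> None:
    """Send every prepared request; `send` is injectable for dry runs."""
    for request in build_requests(base_url, limit):
        send(request)
```

Saved as a script that ends with a call to `heal_once()`, this could be scheduled every minute with a crontab entry such as `* * * * * python3 /path/to/heal_once.py` (path hypothetical).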
--
This message was sent by Atlassian Jira
(v8.20.10#820010)